Background
For those not in the know, a character set encoding determines which byte or bytes represent each character in a set. This has been a big mess historically, even if one restricts one's view to the United States. Two of the primary standards in this area used to be ASCII and EBCDIC. Even with these standards, many values for a byte were left unspecified and this led to a proliferation of "code pages" and other such yuck.
Fast forward to today. Computers are in use all around the world and people are using them in their own languages. We have character sets that are much larger than a single byte can hold (256 values representable) and even character sets that are not representable in two bytes (65536 values representable). Yes, there _are_ that many characters in Chinese, although the vast majority of them are not in common use, and a number of them you will only find in scholarly texts.
Common encodings today include the ISO-8859-x standards, Big-5, Shift-JIS, EUC-KR, GB2312 and others. Just pull down your web browser's encodings menu to see. Thankfully a nice big committee came together and decided what was best for everyone and now we have Unicode. It's not the end of the story though, because there are several ways to map Unicode into bytes.
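To make the "several ways to map Unicode into bytes" point concrete, here is a small sketch in Python (chosen only for illustration; it is not part of BeOS or GE) that encodes the same three characters as UTF-8, UTF-16 and UTF-32. The sample string is an arbitrary assumption.

```python
# Encode one short string under three different Unicode encoding forms.
# The sample text is arbitrary: 'A', e-acute (U+00E9) and a CJK ideograph (U+4E2D).
text = "A\u00e9\u4e2d"

utf8 = text.encode("utf-8")       # variable width, 1 to 4 bytes per character
utf16 = text.encode("utf-16-be")  # 2 or 4 bytes per character, big-endian, no BOM
utf32 = text.encode("utf-32-be")  # always 4 bytes per character, big-endian, no BOM

for name, data in (("UTF-8", utf8), ("UTF-16", utf16), ("UTF-32", utf32)):
    print(name, len(data), "bytes:", data.hex())
```

The characters are identical in all three cases; only the byte sequences differ, which is why a file has to carry (or imply) its encoding somehow.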
Comments on Existing Support
BeOS was rather forward thinking in this arena and used a mapping of Unicode referred to as UTF-8. Any file that is pure 7-bit ASCII is already valid UTF-8; only files that use the upper, code-page half of the byte range are not. It is generally space efficient (for roman-alphabet based languages) since most characters can be represented as a single byte. Only when one reaches out into the extended characters does one have to use multiple bytes: up to three for anything in Unicode's first 65536 code points, and four for the characters beyond that. Although UTF-8 is a sacrifice to legacy in some ways (heresy!), because of its space properties and backwards compatibility, it is an excellent choice for GE.
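The space and compatibility properties of UTF-8 are easy to check. The following is a minimal Python sketch; the helper names `utf8_length` and `is_valid_utf8` and the sample characters are assumptions made up for this illustration.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes UTF-8 uses for a single character."""
    return len(ch.encode("utf-8"))


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 'A', e-acute (U+00E9), a common CJK ideograph (U+4E2D),
# and a rare ideograph outside the first 65536 code points (U+20000).
samples = ["A", "\u00e9", "\u4e2d", "\U00020000"]
lengths = [utf8_length(ch) for ch in samples]
print(lengths)

# Pure 7-bit ASCII is already valid UTF-8 ...
print(is_valid_utf8(b"plain old ASCII"))
# ... but a lone high-bit byte from an 8-bit code page is not.
print(is_valid_utf8(b"caf\xe9"))
```

The last case is the reason for the "nearly all" caveat: a file that calls itself ASCII but uses code-page bytes above 0x7F has to be converted before it can be treated as UTF-8.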
So using UTF-8 is great, but what GE really needs is a way to interoperate with other character sets. BeOS had a mechanism for doing this, which was used in the Net+ browser, for example. It was implemented in a library called libtextencodings.so. While this was great and nice and forward looking, it had a drawback: it was impossible to add new encodings (or update existing ones) without recompiling and creating a replacement library.
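What "interoperating with other character sets" boils down to is conversion to and from UTF-8 at the edges of the system. As a rough sketch only, here is that round trip in Python using its built-in codecs; the function names `to_utf8` and `from_utf8` are hypothetical and do not reflect the libtextencodings.so interface.

```python
def to_utf8(data: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    return data.decode(source_encoding).encode("utf-8")


def from_utf8(data: bytes, target_encoding: str) -> bytes:
    """Convert UTF-8 bytes to a legacy encoding.

    Characters the target cannot represent are replaced with '?',
    which is one possible policy among several.
    """
    return data.decode("utf-8").encode(target_encoding, errors="replace")


# Three kanji (U+65E5 U+672C U+8A9E) as they would arrive in a Shift-JIS file.
sjis_bytes = "\u65e5\u672c\u8a9e".encode("shift_jis")
utf8_bytes = to_utf8(sjis_bytes, "shift_jis")
print(len(sjis_bytes), "bytes in Shift-JIS,", len(utf8_bytes), "bytes in UTF-8")

# Going the other way can lose information if the target set is smaller.
print(from_utf8("caf\u00e9".encode("utf-8"), "ascii"))
```

Note that the outbound direction needs a policy for characters the target encoding lacks; replacement with '?' is used here purely as an example.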
There also exists a BeOS application known as Netpositive-CJK by the great Takayuki ITO. This application hacked the executables for Net+ and StyledEdit so they'd link with an alternative library "libtextwrappers.so" instead of libtextencodings.so. In order to do this it overwrote some existing encoding support. This was necessary under the constraints at the time but it is obviously not optimal for GE.
GE needs to address the various ways in which files may enter BeOS in other encodings or leave BeOS for lands where other encodings are expected. Some of these areas include:
- CharacterSetEncodingsForBFS - how to keep track of and monitor character set encodings for different files on BFS
- CharacterSetEncodingsForEmail - how to handle incoming/outgoing messages in different encodings
- CharacterSetEncodingsForWeb - how to handle web pages in different encodings
- CharacterSetEncodingsForFFS - how to handle filenames and file contents from foreign filesystems that are in different encodings
Here are some possible goals for the encodings mechanism in GE:
- The encodings mechanism should support add-ons for new encodings.
- Encodings support should be provided to applications in a transparent manner. An application should be able to request a decoder for a text file similar to the way that it requests a decoder for a video file, and vice versa for encoding.
- CharacterSetConversions should be easy. This could be a bundled application; it is nearly a natural fit for a Tracker add-on.
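The first two goals could be sketched as a registry that add-ons populate and applications query by encoding name. This is only an illustration in Python of the general shape, not a proposed GE API; `register_decoder`, `get_decoder` and the `Decoder` type are hypothetical names.

```python
from typing import Callable, Dict

# A decoder turns bytes in some encoding into text (hypothetical type alias).
Decoder = Callable[[bytes], str]

_decoders: Dict[str, Decoder] = {}


def register_decoder(name: str, decoder: Decoder) -> None:
    """Called by an add-on to make a new encoding available at run time."""
    _decoders[name.lower()] = decoder


def get_decoder(name: str) -> Decoder:
    """Called by an application; it never needs to know which add-on answers."""
    try:
        return _decoders[name.lower()]
    except KeyError:
        raise LookupError(f"no decoder add-on registered for {name!r}") from None


# Two stand-in "add-ons", backed here by Python's built-in codecs.
register_decoder("ISO-8859-1", lambda data: data.decode("iso8859_1"))
register_decoder("EUC-KR", lambda data: data.decode("euc_kr"))

decode = get_decoder("iso-8859-1")
text = decode(b"caf\xe9")
print(text)
```

Because the lookup happens at run time, a new encoding can be added by registering one more decoder, with no need to rebuild the library that applications link against.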
References
- developerWorks : Unicode : Adding internationalization support to the base standard for JavaScript
- czyborra.com
- Google utf8 utf16 unicode search
Relevant Standards
- Unicode
- ISO 639 Language Codes
- ECMA Standards Index
- ISO ICS 35.040 Character sets and information coding
Resources
- libiconv library - LGPL