CharacterSetEncodingsForMarkup
A markup file can contain text in several encodings, with tags delimiting each encoded section. When doing MarkupParsing, each section has to be decoded with the appropriate text decoder. A markup parser that is aware of encodings could use the CharacterSetEncodings support directly by specifying a character decoder.
However, if the parser is not aware of character set encodings, there are a few options:
- The text decoder used for the whole file could also decode the contained text. Since markup files are text, the tags themselves are encoded as well, so the decoder that reads the tags is a reasonable default for the text they contain.
- Alternatively, the text encoding could be auto-detected for each subsection of text, with some sort of priority scheme to break ties. For example, ASCII and UTF-8 encode code points below 128 identically, so text that lies entirely in this range fits both. In that case priority could be given to UTF-8.
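
The two options above can be sketched in Python. This is only an illustration, not part of any existing parser: the names `PRIORITY`, `matching_encodings`, `detect_encoding`, and `decode_section` are hypothetical, and the candidate list is an assumed example. Only the standard `bytes.decode` call is relied on.

```python
# Assumed priority order: earlier entries win when several decoders fit.
PRIORITY = ("utf-8", "ascii", "latin-1")


def matching_encodings(raw: bytes, candidates=PRIORITY) -> list:
    """Return every candidate encoding that decodes `raw` without error."""
    matches = []
    for name in candidates:
        try:
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        matches.append(name)
    return matches


def detect_encoding(raw: bytes, candidates=PRIORITY) -> str:
    """Break ties by priority: pick the first candidate that fits."""
    matches = matching_encodings(raw, candidates)
    if not matches:
        raise ValueError("no candidate encoding fits this section")
    return matches[0]


def decode_section(raw: bytes, file_encoding=None) -> str:
    """Decode one tagged section of a markup file.

    Option 1: reuse the whole-file decoder when one is supplied.
    Option 2: otherwise auto-detect the encoding for this section.
    """
    encoding = file_encoding or detect_encoding(raw)
    return raw.decode(encoding)
```

Text that stays below 128 fits every candidate in the list, so the priority order alone decides the result (UTF-8 here). A section containing the single byte `0xE9` is not valid UTF-8 or ASCII, so the sketch falls through to Latin-1. Note that Latin-1 accepts any byte sequence, so placing it last makes it a catch-all rather than a real detection.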