How does one store text from multiple natural languages, such as Japanese and Chinese, in one XML file which has 'English' as the main language?

David Smith

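XML's "natural" encoding is Unicode: a parser assumes UTF-8 (or UTF-16, if it sees a byte order mark) when a document does not declare any other encoding. This means that you can mix characters from different languages freely within an XML document. You will, however, need an editor that can save files as UTF-8.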

In a project I'm currently working on, we mix English and Japanese text in the same documents, with English sections enclosed in <english> </english> tags and Japanese sections in <japanese> </japanese> tags. The only reason we tag the languages is so that we can choose which to display; it makes no difference to the XML parser. It is perfectly valid to mix any characters together while using Unicode.
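As a rough illustration of choosing which language to display, here is a minimal Java sketch that uses the JDK's built-in DOM parser. The class name `LanguagePicker`, the `<doc>` wrapper element, and the sample text are assumptions made for the example; only the <english> and <japanese> element names come from the description above. The Japanese text is written with Unicode escapes so the source file compiles whatever encoding it is saved in.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

public class LanguagePicker {
    // Hypothetical sample document; the escapes spell "konnichiwa" in hiragana.
    static final String SAMPLE =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
      + "<doc>\n"
      + "  <english>Hello</english>\n"
      + "  <japanese>\u3053\u3093\u306b\u3061\u306f</japanese>\n"
      + "</doc>\n";

    // Returns the concatenated text of every element named languageTag.
    static String textFor(String xml, String languageTag) {
        try {
            // Encode as UTF-8 so the bytes match the XML declaration.
            byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
            NodeList nodes = doc.getElementsByTagName(languageTag);
            StringBuilder out = new StringBuilder();
            for (int i = 0; i < nodes.getLength(); i++) {
                out.append(nodes.item(i).getTextContent());
            }
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textFor(SAMPLE, "english"));
        System.out.println(textFor(SAMPLE, "japanese"));
    }
}
```

The parser treats both elements identically; the selection happens entirely in the application code, which is the point made above.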

It is also possible to encode XML documents using a character set other than UTF-8, as long as the XML declaration names it. In Japan there are pre-existing character sets such as EUC and JIS (for example EUC-JP and Shift_JIS) which are in common use. These character sets also encompass ASCII, so it is easy to mix English and Japanese. However, if we needed a third language whose characters fall outside those sets, we could not write it directly in the same document, since XML allows only one encoding for the entire document. For that reason, it's best to convert to Unicode as soon as you can. Using Java and Xerces, it is fairly easy to convert character sets into and out of Unicode.
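One way the conversion to Unicode might look in Java is sketched below. It relies on the JAXP parser and serializer bundled with the JDK (which derive from Xerces and Xalan) rather than on a standalone Xerces download, and it assumes the runtime provides the EUC-JP charset. The class name `XmlToUtf8` and its helper methods are hypothetical. The parser reads the encoding named in the XML declaration, and the serializer writes the document back out as UTF-8 with an updated declaration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;

public class XmlToUtf8 {
    // Parses raw bytes; the parser honors the encoding in the XML declaration.
    private static Document parse(byte[] xmlBytes) throws Exception {
        return DocumentBuilderFactory.newInstance()
            .newDocumentBuilder()
            .parse(new ByteArrayInputStream(xmlBytes));
    }

    // Re-serializes an XML document (in any declared encoding) as UTF-8.
    static byte[] toUtf8(byte[] xmlBytes) {
        try {
            Transformer identity = TransformerFactory.newInstance().newTransformer();
            identity.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            identity.transform(new DOMSource(parse(xmlBytes)), new StreamResult(out));
            return out.toByteArray();
        } catch (Exception e) {
            throw new IllegalStateException("Conversion failed", e);
        }
    }

    // Text content of the root element, used here to check the round trip.
    static String rootText(byte[] xmlBytes) {
        try {
            return parse(xmlBytes).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XML", e);
        }
    }

    // Hypothetical EUC-JP input: a one-element document with Japanese text.
    static byte[] sampleEucJp() {
        String xml = "<?xml version=\"1.0\" encoding=\"EUC-JP\"?>"
            + "<japanese>\u3053\u3093\u306b\u3061\u306f</japanese>";
        return xml.getBytes(Charset.forName("EUC-JP"));
    }

    public static void main(String[] args) {
        byte[] utf8 = toUtf8(sampleEucJp());
        System.out.println(new String(utf8, StandardCharsets.UTF_8));
    }
}
```

Going through the parser, instead of transcoding the bytes directly, keeps the XML declaration consistent with the new encoding.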
