For each national language supported by Unicode, what is the maximum number of bytes used to represent any character of that language when using UTF-8?

Joe Sam Shirah

I don't know of a ready-made table like that offhand, but you can tell from any character's code point how many bytes its UTF-8 encoding will take. For UCS-2, or 16-bit Unicode, characters in the range hex 0000 through 007F ( 0 - 127 ) take 1 byte, those in hex 0080 through 07FF ( 128 - 2047 ) take 2 bytes, and those in hex 0800 through FFFF ( 2048 - 65535 ) take 3 bytes. Characters beyond the 16-bit range, hex 10000 through 10FFFF, take 4 bytes. For more details, see RFC 2279 and its successor, RFC 3629.
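As a rough illustration of those ranges, here is a minimal Python sketch. The function name `utf8_length` is made up for this example, and the 4-byte branch covers code points above hex FFFF, which RFC 3629 defines. The loop simply cross-checks the range rule against Python's built-in encoder.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 uses for one Unicode code point.

    Hypothetical helper for illustration; it only applies the range rule.
    """
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("not a Unicode code point")
    if 0xD800 <= code_point <= 0xDFFF:
        # Surrogates are UTF-16 machinery and have no UTF-8 encoding.
        raise ValueError("surrogate code points cannot be encoded in UTF-8")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4


# Cross-check the rule against the built-in encoder for a few characters.
for ch in ("A", "\u00e9", "\u20ac", "\U0001F600"):
    by_rule = utf8_length(ord(ch))
    by_encoder = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X}: rule={by_rule} encoder={by_encoder}")
```

The point of the comparison is that you never need a lookup table for a single character: the code point alone determines the byte count.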

If you feel that you must have a table of languages and byte sizes, you can get the language data and code ranges needed to create one at Unicode Online Data. Such a table will work for maximums, but obviously it will not account for foreign words incorporated into a language, nor for ASCII symbols and the character representations of numeric values ( 1, 47, 312 and so on ), which take 1 byte each regardless of the surrounding language.
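If you do want to build such a table, one possible approach is sketched below in Python. The `BLOCKS` dictionary is a small hand-picked sample of Unicode block ranges, not a complete list, and the names `BLOCKS`, `max_utf8_bytes` and `table` are invented for this example. Treating one block as one language is a simplifying assumption, since many languages draw on several blocks.

```python
# A few Unicode block ranges, chosen only as examples (first, last code point).
BLOCKS = {
    "Basic Latin": (0x0000, 0x007F),
    "Cyrillic": (0x0400, 0x04FF),
    "Hebrew": (0x0590, 0x05FF),
    "Devanagari": (0x0900, 0x097F),
    "Hiragana": (0x3040, 0x309F),
}


def max_utf8_bytes(first: int, last: int) -> int:
    """Largest UTF-8 byte count of any code point in first..last inclusive.

    Assumes the range contains no surrogates (hex D800 - DFFF), which
    cannot be encoded. Unassigned code points are encoded too, which is
    harmless when only the maximum matters.
    """
    return max(len(chr(cp).encode("utf-8")) for cp in range(first, last + 1))


table = {name: max_utf8_bytes(first, last) for name, (first, last) in BLOCKS.items()}

for name, size in table.items():
    print(f"{name}: at most {size} byte(s) per character")
```

Because the byte count never decreases as the code point grows, the last code point of a range would be enough to find the maximum. The loop over the whole range is kept only to make the intent obvious.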
