Which UTF encoding provides support for multiple languages?
Posted By:   Abhishek_Jain
Posted On:   Friday, February 18, 2005 01:08 AM

::PROBLEM::

I have to support multiple languages such as Chinese, Russian, French, German, Italian, and many more in my software. Please advise me on which UTF encoding would be better (UTF-16, UTF-8, or another).



These are the observations I found after researching UTF-16 and UTF-8 on the net:



1) UTF-16 represents a large set of characters (i.e. 2^16), whereas UTF-8 represents a small set (i.e. 2^7).



2) UTF-8 is efficient if you use a lot of ASCII, e.g. if you're an English speaker and all you use is ASCII, but it's more bytes per character than UTF-16 for a whole lot of other scripts (plus it's more bytes per character than a lot of current script-specific encodings).



3) UTF-8 was designed far better than UTF-16 when it comes to all aspects of interoperability. Thus it should be the preferred encoding for all transport protocols and all interface points between systems from different vendors.



4) All the UTF-16 APIs in Windows and MacOS are a huge barrier to deployment of Unicode on those platforms since all the code has to be rewritten (and most of it never is). If they had instead retro-fitted UTF-8 into the existing 8-bit APIs we'd have much better Unicode deployment.



5) UTF-8 is intended for cases where English is the dominant language, in which case it is more space efficient, or where full compatibility with 7-bit ASCII is a must.



6) The way UTF-8 was designed, old configuration files, shell scripts, and even lots of age-old software can function properly with Unicode text, even though Unicode was invented years after they came to be.



7) Windows also uses UTF-16 internally.
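The size trade-off mentioned in point 2 is easy to measure directly. A minimal Java sketch (Java chosen only for illustration; the sample strings are arbitrary) comparing encoded sizes of the same text in both encodings:

```java
import java.nio.charset.StandardCharsets;

public class EncodedSizes {
    public static void main(String[] args) {
        String english = "Hello";    // 5 ASCII characters
        String russian = "Привет";   // 6 Cyrillic characters
        String chinese = "中文";     // 2 CJK characters

        System.out.println(sizes(english)); // UTF-8: 5,  UTF-16: 10
        System.out.println(sizes(russian)); // UTF-8: 12, UTF-16: 12
        System.out.println(sizes(chinese)); // UTF-8: 6,  UTF-16: 4
    }

    static String sizes(String s) {
        // UTF_16BE is used instead of StandardCharsets.UTF_16, which
        // prepends a 2-byte byte-order mark and would skew the counts.
        return "UTF-8: " + s.getBytes(StandardCharsets.UTF_8).length
             + ", UTF-16: " + s.getBytes(StandardCharsets.UTF_16BE).length;
    }
}
```

As the counts show, neither encoding wins universally: UTF-8 is half the size for ASCII, UTF-16 is smaller for CJK text, and Cyrillic comes out even.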




Re: Which UTF encoding provides support for multiple languages?

Posted By:   Anonymous  
Posted On:   Monday, February 21, 2005 01:34 AM


If by 'support' you mean integration with other systems using different character sets, then UTF-16 is the way to go. Because characters in the Basic Multilingual Plane are a fixed two bytes each (only characters outside the BMP need a four-byte surrogate pair), it is faster to convert to and from than UTF-8 is.



However, if all you need is to store data in a database and display it on a front end, then UTF-8 would be better because it will use less space, both in the database and over the transport protocol.
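One caveat to the fixed-width claim above: UTF-16 is only fixed-width inside the Basic Multilingual Plane. A short Java sketch (the specific characters are arbitrary choices for illustration) showing that a supplementary-plane character takes four bytes in UTF-16 as well:

```java
import java.nio.charset.StandardCharsets;

public class Utf16Width {
    public static void main(String[] args) {
        String bmp = "中";                                             // U+4E2D, inside the BMP
        String supplementary = new String(Character.toChars(0x20000)); // U+20000, outside the BMP

        // UTF_16BE avoids the byte-order mark that StandardCharsets.UTF_16 prepends.
        System.out.println(bmp.getBytes(StandardCharsets.UTF_16BE).length);           // 2
        System.out.println(supplementary.getBytes(StandardCharsets.UTF_16BE).length); // 4 (surrogate pair)
    }
}
```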

Re: Which UTF encoding provides support for multiple languages?

Posted By:   Stephen_Ostermiller  
Posted On:   Friday, February 18, 2005 05:32 PM

I would use UTF-8 because I speak English. ;-)



  1. Both UTF-16 and UTF-8 are byte representations of characters in the Unicode character set. Both UTF-16 and UTF-8 can represent all Unicode characters, which covers the vast majority of languages on earth.


  2. You are correct about the relative efficiency of UTF-8 and UTF-16 for English vs. non-English text. UTF-8 uses one byte for each ASCII character but three bytes for most Chinese characters (and up to four bytes for characters outside the Basic Multilingual Plane). UTF-16 uses two bytes for most characters, and four bytes for characters outside the BMP.


  3. UTF-8 was designed to work passably well with programs that generally accept ASCII input. This means that you can use standard Unix tools like grep on UTF-8 files. However, such interoperability can come at a price: several security vulnerabilities were introduced when older programs tried to filter text from a UTF-8 file that was no longer ASCII encoded.


  4. It is certainly possible to write unicode aware applications for Windows and Mac these days.


  5. UTF-8 is probably more efficient for almost all European languages, as their text is mostly ASCII.


  6. Old programs can still function, but feeding Unicode into a program that expects ASCII can introduce bugs, including security problems.


  7. Java uses UTF-16 internally. :-)
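Point 7 has a practical consequence worth knowing: Java's `char` is a UTF-16 code unit, not a full Unicode character. A minimal sketch (U+20000 is an arbitrary example of a character outside the BMP):

```java
public class JavaChars {
    public static void main(String[] args) {
        // A supplementary-plane character occupies two chars in a
        // Java String, because it is stored as a UTF-16 surrogate pair.
        String s = new String(Character.toChars(0x20000)); // U+20000, outside the BMP
        System.out.println(s.length());                      // 2 (UTF-16 code units)
        System.out.println(s.codePointCount(0, s.length())); // 1 (Unicode code point)
    }
}
```

So code that equates `String.length()` with "number of characters" quietly miscounts for characters outside the BMP; use `codePointCount` when the distinction matters.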


