If an HTML form encoded in UTF-8 is presented, users...

Posted By:   Jonathan_Asbell
Posted On:   Sunday, March 4, 2001 09:12 AM

If an HTML form is served as UTF-8, users may enter any character their computer allows into the form. This includes "high ASCII" characters (byte values above 127), which the browser escapes as %NN. What must I do to properly decode the characters they intended? For example, the Euro symbol is hex value "80" on the PC but hex value "DB" on the Mac. High ASCII values also differ between international versions of Windows (the CPXXXX code pages). As a side question, does the browser reliably take care of any conversion for me?

Re: If an HTML form encoded in UTF-8 is presented, users...

Posted By:   Anonymous  
Posted On:   Friday, March 30, 2001 08:12 PM

Note: This answer was actually provided by Jonathan Asbell.


Here is the answer according to Hans Bergsten of Gefion Software (http://www.gefionsoftware.com), author of JavaServer Pages (O'Reilly).

(I would also like to thank him, as he has been very generous in taking the time to help me resolve this problem.)





When a browser sends a form parameter, it first converts each character to bytes using the encoding of the page (e.g. UTF-8) and then escapes each byte as a %NN hexadecimal string. At the server, however, the part of the container that interprets these values always assumes they are ISO 8859-1 byte values. So it creates a Unicode string in which every byte has become one character, as if the bytes were 8859-1. Since the 8859-1 assumption is made by the container and not by the operating system, the hack (read "fix") shown further down works independently of the platform you run it on.
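To make the mismatch concrete, here is a minimal sketch that imitates both sides outside of any servlet container. It assumes Java 10 or later (for the Charset overloads of URLEncoder and URLDecoder); the class and method names are made up for illustration and are not part of the Servlet API.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class: imitates the browser and the container by hand.
public class MisdecodeDemo {

    // Roughly what a browser does for a UTF-8 page:
    // characters -> UTF-8 bytes -> %NN escapes.
    public static String browserEncode(String text) {
        return URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    // Roughly what a Servlet 2.2 container does:
    // unescape %NN and read every byte as one ISO 8859-1 character.
    public static String containerDecode(String escaped) {
        return URLDecoder.decode(escaped, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String euro = "\u20AC";                      // the Euro sign
        String escaped = browserEncode(euro);        // three escaped UTF-8 bytes
        String misread = containerDecode(escaped);   // three characters instead of one

        System.out.println(escaped);
        System.out.println(misread.length());
    }
}
```

The single Euro character travels as three bytes, and the 8859-1 assumption turns those three bytes into three unrelated characters. No information is lost, though: each character still carries exactly one of the original byte values, which is what makes the fix possible.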




In the Servlet 2.2 API, the methods that parse parameter input (getParameter() et al.) always assume that it was sent as ISO 8859-1. So they create a String that preserves the original byte values, one per character, but interprets them with the wrong charset.



If you know what the charset is, you can convert the bytes to a string using the correct charset:



new String(value.getBytes("8859_1"), "utf-8")
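The one-liner above can be wrapped in a small helper. This is only a sketch: the class name, the method name, and the null handling are assumptions added for illustration, and it uses the StandardCharsets constants (Java 7 and later) in place of the charset name strings so that no checked exception has to be handled.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of the Servlet API.
public class ParameterRecoder {

    // Re-interprets a String that the container decoded as ISO 8859-1
    // as the UTF-8 text the browser actually sent.
    public static String recodeAsUtf8(String value) {
        if (value == null) {
            return null; // getParameter() returns null for a missing parameter
        }
        // Step 1: recover the original bytes (8859-1 maps each char back to one byte).
        byte[] raw = value.getBytes(StandardCharsets.ISO_8859_1);
        // Step 2: decode those bytes with the charset the page was really using.
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // What the container hands back for a Euro sign typed into a UTF-8 form.
        String misread = "\u00E2\u0082\u00AC";
        String fixed = recodeAsUtf8(misread);
        System.out.println(fixed.equals("\u20AC"));
    }
}
```

In a servlet this would be called as recodeAsUtf8(request.getParameter("name")), where "name" stands for whatever the form field is called. Note that it is only correct when the page really was submitted as UTF-8 and the container really did decode it as 8859-1; applying it to an already correct string would corrupt any non-ASCII characters.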



ISO 8859-1 is the default character set of HTTP when none is specified, which is why the container falls back to it.
