jGuru

Question How to get international Unicode characters from a form input field/servlet parameter into a string?
Derived from A question posed by Martin Kultermann
Topics Tools:AppServer:WebServer:Tomcat, Java:API:Servlets:Internationalization and Localization
Author charief gael
Created Nov 25, 2002


Answer

[ I have a servlet-based app that generates and processes HTML forms. I would like to support multiple languages/character sets that will be stored as Unicode (UTF-8) in the database. I am setting the following tags, for example:

<meta content="text/html; charset=Shift_JIS" http-equiv="Content-Type"> and
<form accept-charset="Shift_JIS"...

I am also setting the locale and content type on the HttpServletResponse to "ja" and "text/html; charset=Shift_JIS" respectively.

Unfortunately, the CharacterEncoding on the HttpServletRequest from the post is always null even though it's set properly in the browser.

Does anyone have a sample showing how to get international Unicode characters from a form input field/servlet parameter into a string, and from a Java string into a form input or text field? ]

Answer:

If request.getCharacterEncoding() returns null, the container decodes the parameter bytes with its default encoding, ISO-8859-1.

See:
http://w6.metronet.com/~wjm/tomcat/2001/May/msg00433.html

So if the browser actually sent UTF-8 and you want a correct Unicode String, you have to do something like this:


 String myparam = request.getParameter("myparamname");
 if (myparam != null)
   myparam = new String(myparam.getBytes("8859_1"), "UTF8");
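To see why this conversion works, here is a minimal sketch that simulates the situation without a servlet container. It assumes the browser sent UTF-8 bytes and the container decoded them as ISO-8859-1; the class and method names are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class Latin1RoundTrip {
    // Simulates a container that decoded UTF-8 bytes as ISO-8859-1.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // The workaround from the answer: recover the raw bytes, then decode them as UTF-8.
    static String repair(String misdecoded) {
        byte[] raw = misdecoded.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u65E5\u672C\u8A9E"; // three Japanese characters, written as escapes
        String garbled = misdecode(original);
        System.out.println(garbled.length());                 // 9: one char per UTF-8 byte
        System.out.println(repair(garbled).equals(original)); // true
    }
}
```

The round trip is lossless only because ISO-8859-1 maps every byte value to exactly one character, so no byte is lost in the wrong decoding step.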



Comments and alternative answers

How to get international Unicode characters from a form input field/servlet parameter into a string?
Arthur Tang, Dec 13, 2002
myparam.getBytes("8859_1") will definitely destroy double-byte strings if the page encoding is not 8859_x (Latin).

I found that all of the big three browsers (IE, NS/M, O) have very poor support for sending encoding information with their requests, so you will not get the encoding from the request anyway.

I took the approach of setting one uniform encoding, such as UTF-8, on all pages in the app (and hoping users do not change the browser's encoding between pages or before submitting; fortunately, most users do not know what it is and won't change it). The request will then be encoded as you specified (UTF-8). Most important of all, set the request character encoding (setCharacterEncoding()) to your encoding (UTF-8) before you read any parameter (getParameter()). This makes the container interpret the submitted request as UTF-8. Otherwise, getParameter() splits each double-byte character into several characters that can no longer be read as the original text.

This works for any language, for mixed languages on the same page, and even for mixed languages in a single form field (as long as the user can type it in).
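The effect of choosing the request encoding before reading parameters can be sketched outside a container with java.net.URLDecoder, which decodes a percent-encoded form value with a named charset. This is only an illustration of the decoding step, not what any particular servlet engine does internally; the class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class FormBodyDecoding {
    // Decodes one percent-encoded form value using the given charset name.
    static String decode(String rawValue, String charsetName) {
        try {
            return URLDecoder.decode(rawValue, charsetName);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException("Unknown charset: " + charsetName, e);
        }
    }

    public static void main(String[] args) {
        String raw = "%E6%97%A5%E6%9C%AC"; // UTF-8 bytes of U+65E5 U+672C, as a browser would send them
        System.out.println(decode(raw, "UTF-8").length());      // 2: two real characters
        System.out.println(decode(raw, "ISO-8859-1").length()); // 6: each byte became its own char
    }
}
```

The same bytes produce two characters or six depending only on the charset chosen at decode time, which is why the encoding has to be set before the first getParameter() call.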



JSP and UTF8 is simple
dom sir, Dec 21, 2003  [replies:1]
I don't know why so many people suggest things like "myparam = new String(myparam.getBytes("8859_1"),"UTF8");". You don't have to do that. All you need is to put the following at the top of your JSPs:

<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=UTF-8">
<%@ page pageEncoding="UTF-8" %>
<%@ page language="java" contentType="text/html;charset=UTF-8" %>

Re: JSP and UTF8 is simple
june Quin, Jan 29, 2004
Hi, the reason is that the settings you mentioned don't solve the problem. I tried them, and after getting a double-byte character value with request.getParameter("name") and printing it out, the value is corrupted. If I do what many people suggested, the value is not corrupted.

Reading non-Latin form data from forms on UTF-8 pages
Malcolm McMahon, Mar 3, 2004

I've found this approach for myself:

parm = new String(param.getBytes("ASCII"), "UTF-8");

I'm uneasy using it, however, because sooner or later a version of Tomcat or some other servlet engine is going to fix the problem and do the stream-to-string conversion based on the character set specified in the headers of the message from the browser.

When that happens, existing applications with this code are going to screw up a treat.
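The hazard can be shown with a short sketch, assuming the container has already decoded the parameter correctly. The sketch uses ISO-8859-1 for the byte step; Java's US-ASCII encoder replaces every character above 0x7F with '?', so the second method shows what an "ASCII" byte step does to a mis-decoded value. The class and method names are invented for illustration.

```java
import java.nio.charset.StandardCharsets;

final class DoubleConversionHazard {
    // The manual workaround, using ISO-8859-1 to recover the bytes.
    static String reDecode(String value) {
        return new String(value.getBytes(StandardCharsets.ISO_8859_1), StandardCharsets.UTF_8);
    }

    // The same idea with US-ASCII as the byte step.
    static String reDecodeViaAscii(String value) {
        return new String(value.getBytes(StandardCharsets.US_ASCII), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String alreadyCorrect = "\u03B1\u03B2\u03B3";           // Greek alpha, beta, gamma
        System.out.println(reDecode(alreadyCorrect));            // ??? (the text is destroyed)

        String misdecodedAlpha = "\u00CE\u00B1";                 // alpha's UTF-8 bytes read as ISO-8859-1
        System.out.println(reDecode(misdecodedAlpha).equals("\u03B1")); // true
        System.out.println(reDecodeViaAscii(misdecodedAlpha));   // ??
    }
}
```

Once a container starts decoding correctly, the unconditional conversion turns good text into question marks, which is the breakage this comment anticipates.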



Non-Latin form data
Malcolm McMahon, Mar 3, 2004  [replies:3]
Just tried it another way and it seems to work (at least for a UTF-8 form).
 if (request.getCharacterEncoding() == null)
   request.setCharacterEncoding("UTF-8");
 address = request.getParameter("address");
This has successfully input Greek text without the need for explicit conversion and should be more future-proof (since it doesn't depend on the servlet engine not doing the translation automatically.)

Re: Non-Latin form data
K Fl, Aug 6, 2005  [replies:2]
And just make sure you set the method="post" in your form declaration! GETs do not seem to work with non-latin character sets. K

Re[2]: Non-Latin form data
stgma stgma, Sep 29, 2005  [replies:1]

Thanks, that was very valuable information. I couldn't make it work because I was using GET.

How does request.getParameter("query") know whether query comes from GET or POST? So in case it comes from GET I do the new String(query.getBytes("ISO-8859-1"), "UTF-8") style conversion.

That's because sometimes it's useful to have the query in the URL so you can bookmark it or send it to others (e.g. results.jsp?query=%CE%B1%CE%B2%CE%B3, which is query=αβγ, the Greek letters alpha, beta, gamma).
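For reference, a bookmarkable URL like the one above can be produced with java.net.URLEncoder. This is a small sketch; results.jsp and the query parameter name come from the example above, and the class name is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

final class BookmarkableQuery {
    // Builds a link to results.jsp with the query percent-encoded as UTF-8.
    static String resultsUrl(String query) {
        try {
            return "results.jsp?query=" + URLEncoder.encode(query, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is required on every Java platform, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Greek alpha, beta, gamma written as escapes to keep the source ASCII-only.
        System.out.println(resultsUrl("\u03B1\u03B2\u03B3")); // results.jsp?query=%CE%B1%CE%B2%CE%B3
    }
}
```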

thanks



Re[3]: Non-Latin form data
KJG Swenson, Dec 3, 2005
This test program demonstrates the problem:

<%@page contentType="text/html;charset=UTF-8" pageEncoding="ISO-8859-1"
%><%@page import="java.net.URLEncoder"
%><%
request.setCharacterEncoding("UTF-8");
response.setContentType("text/html;charset=UTF-8"); //this is redundant

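String qParam = request.getParameter("q");
if (qParam == null) qParam = ""; // first visit has no q parameter; avoids a NullPointerException in URLEncoder.encode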
String queryString = request.getQueryString();

%>
<HTML>
<BODY>
Query is: <%=queryString%><br>
Parameter q is: <%=qParam%><br>
Parameter q URL encoded: <%=URLEncoder.encode(qParam, "UTF-8")%>
<form action="queryParameterTest.jsp" method="get">
<input type="text" name="q" value="<%=qParam%>">
<input type="submit">
</form>
<font size="-2">To keep this test simple, proper handling for
&amp; &lt; &gt; and &quot; is not included in this program,
so avoid typing those characters into the box above</font>
</BODY>
</HTML>


This program has a form with a single input and submits it back to the same page. When everything works correctly, you should be able to put ANY Unicode string into the edit box (except special HTML characters, please) and get the same thing back. My experience is that Tomcat 5.5.9 fails to decode the query parameters correctly for non-ASCII characters. It simply does not follow the W3C standard for handling query parameters.

I have the content encoding set correctly (twice), and I have the request character encoding set correctly. You can type in a correct URL like

queryParameterTest.jsp?q=CD%C3%89FG

And you should get CDÉFG, but I don't on my Tomcat. And, yes, I require the GET method, not POST.
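One container-independent workaround, offered here only as a sketch, is to ignore the container's parameter decoding for GET requests and decode the raw query string yourself as UTF-8. The parser below is a hypothetical helper, not part of the servlet API; it keeps the first value when a name is repeated.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

final class QueryStringParser {
    // Splits a raw query string (as returned by getQueryString()) and decodes it as UTF-8.
    static Map<String, String> parse(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.putIfAbsent(decode(name), decode(value)); // first value wins
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always supported
        }
    }

    public static void main(String[] args) {
        // The test URL from the comment above; the expected value is C, D, E-acute, F, G.
        System.out.println(parse("q=CD%C3%89FG").get("q").equals("CD\u00C9FG")); // true
    }
}
```

In the test JSP this would be called as QueryStringParser.parse(request.getQueryString()).get("q") in place of request.getParameter("q").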

Unicode, forms, etc. and MySQL
Deborah Lee Soltesz, Nov 4, 2005  [replies:2]

For those working with JSP and databases who are still having problems...

I'm working with a MySQL database, JSP, and forms and was having similar problems to others here. The answers here apparently work, but in my case, the key was the database connection. I found an article on JavaWorld describing a similar set-up to my own, with the following key element:

Connection db =
DriverManager.getConnection(
"jdbc:mysql:///" + dbname +
"?requireSSL=false&useUnicode=true&characterEncoding=UTF-8",
"root", "");

As soon as I added the "&useUnicode=true&characterEncoding=UTF-8" to my connection string, all my problems were solved.

See the article (page 2)

On page one of the article, he describes explicitly setting the encoding for the db tables and fields to UTF-8 when creating them. Somewhere else I'd read an article simply advising to use VARCHAR for fields that would contain Unicode. I didn't explicitly set my table fields to UTF-8, and that's working fine for me.
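As an alternative to appending the options to the URL, the same two settings can be collected in a java.util.Properties object. The sketch below only builds the URL and the properties; it assumes the MySQL driver reads useUnicode and characterEncoding from properties the same way it reads them from the connection string, and the class name is hypothetical.

```java
import java.util.Properties;

final class MysqlUnicodeConfig {
    // JDBC URL in the same form as the connection string above, without the query part.
    static String url(String dbname) {
        return "jdbc:mysql:///" + dbname;
    }

    // Credentials plus the two Unicode-related settings from the connection string above.
    static Properties connectionProperties(String user, String password) {
        Properties props = new Properties();
        props.setProperty("user", user);
        props.setProperty("password", password);
        props.setProperty("useUnicode", "true");
        props.setProperty("characterEncoding", "UTF-8");
        return props;
    }

    public static void main(String[] args) {
        System.out.println(url("mydb")); // jdbc:mysql:///mydb
        System.out.println(connectionProperties("root", "").getProperty("characterEncoding")); // UTF-8
    }
}
```

With the driver on the classpath, the pair would be passed to DriverManager.getConnection(url, props).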



Re: Unicode, forms, etc. and MySQL
Brian Lai, Feb 17, 2007  [replies:1]

After spinning around this same problem for one full day and trying out the various solutions that all of you provided, I decided to dig into Tomcat's source code and see what's happening behind the scenes.

FYI, I'm working with JBoss 4.0.3SP1, which ships with a copy of Tomcat 5.5.9:

In org.apache.coyote.Request line#283, inside of a function called getCharacterEncoding(), I found:

   charEncoding = ContentType.getCharsetFromContentType(getContentType());

Step into org.apache.tomcat.util.ContentType, and here comes the evil:

    // Basically return everything after ";charset="
    // If no charset specified, use the HTTP default (ASCII) character set.
    public static String getCharsetFromContentType(String type) {
        if (type == null) {
            return null;
        }
        int semi = type.indexOf(";");
        if (semi == -1) {
            return null;
        }
        int charsetLocation = type.indexOf("charset=", semi);
        if (charsetLocation == -1) {
            return null;
        }
        String afterCharset = type.substring(charsetLocation + 8);
        // The charset value in a Content-Type header is allowed to be quoted
        // and charset values can't contain quotes.  Just convert any quote
        // chars into spaces and let trim clean things up.
        afterCharset = afterCharset.replace('"', ' ');
        String encoding = afterCharset.trim();
        return encoding;
    }

Huh?! Most browsers submit a <form> with Content-Type: application/x-www-form-urlencoded, and there will be no ";" or "charset=..." after that! So if you are submitting some sort of <form>, this function will always return null.
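The behaviour is easy to reproduce in isolation. The following is a condensed restatement of the parsing logic quoted above, written as a standalone class for experimentation; it is not Tomcat's actual class, and its name is made up.

```java
final class ContentTypeCharset {
    // Returns whatever follows "charset=" after the first ';', or null if there is none.
    static String charsetOf(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semi = contentType.indexOf(';');
        if (semi == -1) {
            return null;
        }
        int at = contentType.indexOf("charset=", semi);
        if (at == -1) {
            return null;
        }
        // Quotes around the value are turned into spaces and trimmed away.
        return contentType.substring(at + "charset=".length()).replace('"', ' ').trim();
    }

    public static void main(String[] args) {
        // What a typical form post sends: no charset parameter at all.
        System.out.println(charsetOf("application/x-www-form-urlencoded"));                // null
        System.out.println(charsetOf("application/x-www-form-urlencoded; charset=UTF-8")); // UTF-8
    }
}
```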

This "null" will then be picked up by function parseParameters() in org.apache.catalina.connector.Request, and this is how it will handle our "null":

    protected void parseParameters() {

        parametersParsed = true;

        Parameters parameters = coyoteRequest.getParameters();

        // getCharacterEncoding() may have been overridden to search for
        // hidden form field containing request encoding
        String enc = getCharacterEncoding();

        boolean useBodyEncodingForURI = connector.getUseBodyEncodingForURI();
        if (enc != null) {
            parameters.setEncoding(enc);
            if (useBodyEncodingForURI) {
                parameters.setQueryStringEncoding(enc);
            }
        } else {
            parameters.setEncoding
                (org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING);
            if (useBodyEncodingForURI) {
                parameters.setQueryStringEncoding
                    (org.apache.coyote.Constants.DEFAULT_CHARACTER_ENCODING);
            }
        }
        ...
        ...
    }

Guys, did you see those two lines of comments?

    // getCharacterEncoding() may have been overridden to search for
    // hidden form field containing request encoding

This is how Tomcat expects us to solve our problem. Or, if you don't want to extend this whole class just to solve your problem, you can add these few lines after the line:

    String enc = getCharacterEncoding();

    //add these lines
    if (enc == null) {
        enc = connector.getURIEncoding();
    }

Now use the build.xml that comes with the Tomcat source to rebuild Tomcat, and the problem is solved.



Re[2]: Unicode, forms, etc. and MySQL
parampreet sethi, Aug 3, 2009
Hi guys,

This is regarding an issue I am facing while sending UTF-8 characters with the GET method to a servlet directly from the browser. I have made the following settings:

1. Created a CharsetFilter, which sets the encoding for each request to UTF-8.
2. Applied this filter in web.xml ahead of all requests.
3. In my servlet, while writing the response, set response.setContentType to text/html;charset=utf-8.

With these settings, accented characters like ÀÁÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõöøùúûüýþÿ work correctly, but Chinese characters, Arabic characters, etc. do not. However, if in addition I change server.xml to have useBodyEncodingForURI="true" and/or URIEncoding="UTF-8" in the connector tag, Chinese and Arabic characters work fine but accented characters no longer do. I have tried every combination of these settings, but somehow only one of the two situations works at a time.

Has anybody come across this problem? Any pointers would be great. I cannot use a POST request, as my servlet is the entry point to my application.

Thanks,
Param
