Wednesday, February 9, 2011

Posting non-ASCII characters in web forms

I just hit this issue: you type some non-ASCII text (e.g. Cyrillic) into a form, and after you submit it the text arrives garbled on the server side.

It turns out this is a well-known glitch in web development, especially in Java. The main reasons for the mess are:
  1. Web browsers do not specify the encoding of posted data.
  2. The Java Servlet specification says the default request encoding is ISO-8859-1, in contrast to UTF-8, which is used almost universally nowadays.
A good description of the issue is in "HTTP Form Character Sets and Related Problems".
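
To make the two points above concrete, here is a minimal sketch in plain Java (no servlet container involved) of what happens to a Cyrillic value. It assumes Java 10 or later for the `Charset` overloads of `URLEncoder.encode` and `URLDecoder.decode`; the class and method names are made up for illustration only.

```java
import java.net.URLDecoder;
import java.net.URLEncoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of any framework.
class FormEncodingDemo {

    // Roughly what a browser does on a UTF-8 page:
    // percent-encode the UTF-8 bytes of the typed value.
    public static String encodeAsBrowser(String value) {
        return URLEncoder.encode(value, StandardCharsets.UTF_8);
    }

    // Roughly what the server does: percent-decode the value
    // using whatever charset it assumes the request is in.
    public static String decodeAsServer(String encoded, Charset assumed) {
        return URLDecoder.decode(encoded, assumed);
    }

    public static void main(String[] args) {
        // "Privet" written in Cyrillic, as Unicode escapes.
        String typed = "\u041f\u0440\u0438\u0432\u0435\u0442";

        String posted = encodeAsBrowser(typed);
        System.out.println(posted); // %D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82

        // Servlet default: every UTF-8 byte becomes its own Latin-1 character.
        String garbled = decodeAsServer(posted, StandardCharsets.ISO_8859_1);
        System.out.println(garbled.equals(typed)); // false

        // Decoding with the charset the browser actually used restores the text.
        String correct = decodeAsServer(posted, StandardCharsets.UTF_8);
        System.out.println(correct.equals(typed)); // true
    }
}
```

The posted bytes are the same in both cases; only the charset the server assumes differs, and nothing in the request tells it which one to pick.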

The Tomcat FAQ recommends creating a filter that sets the request encoding.

With Wicket there is no such issue, because it fixes the request encoding to UTF-8, as described in "How to change the character encoding".

Unfortunately the Sling guys did it another way (SLING-508), which requires me to put a hidden input named _charset_ with the value UTF-8 in all my forms. So they have adopted the ugly IE hack. :(
This is also described in the Sling documentation.
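
For illustration, here is a rough sketch of how a server could honor such a _charset_ field when decoding a form body. This is not Sling's actual code: the class `CharsetFieldDemo` and the method `parseForm` are hypothetical, the fallback to ISO-8859-1 is an assumption that mirrors the servlet default, and Java 10 or later is assumed for `URLDecoder.decode(String, Charset)`.

```java
import java.net.URLDecoder;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the _charset_ convention, not Sling's implementation.
class CharsetFieldDemo {

    private static final String CHARSET_FIELD = "_charset_=";

    // Parses an application/x-www-form-urlencoded body into name/value pairs.
    public static Map<String, String> parseForm(String body) {
        String[] pairs = body.split("&");

        // First pass: look for the hidden field. Its value is plain ASCII,
        // so it can be read before the real charset is known.
        Charset charset = StandardCharsets.ISO_8859_1; // assumed fallback
        for (String pair : pairs) {
            if (pair.startsWith(CHARSET_FIELD)) {
                charset = Charset.forName(pair.substring(CHARSET_FIELD.length()));
            }
        }

        // Second pass: decode every name and value with the chosen charset.
        Map<String, String> params = new LinkedHashMap<>();
        for (String pair : pairs) {
            int eq = pair.indexOf('=');
            String name = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(URLDecoder.decode(name, charset),
                       URLDecoder.decode(value, charset));
        }
        return params;
    }

    public static void main(String[] args) {
        String field = "title=%D0%9F%D1%80%D0%B8%D0%B2%D0%B5%D1%82";
        String expected = "\u041f\u0440\u0438\u0432\u0435\u0442";

        // Without the hidden field the value is decoded as ISO-8859-1.
        System.out.println(parseForm(field).get("title").equals(expected)); // false

        // With the hidden field the value is decoded as UTF-8.
        System.out.println(parseForm(field + "&_charset_=UTF-8").get("title").equals(expected)); // true
    }
}
```

The sketch also shows why the approach is annoying: the server only gets it right if every single form remembers to send the extra field.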

I wish Sling had a way to set the encoding to UTF-8 in one place, so I could get rid of the hidden field.

Sometimes web development is so frustrating.

Update: Mar 2nd
After this discussion was picked up on the Sling mailing list, the guys there decided to make this configurable after all; see SLING-1998.
Great! My forms now work without the _charset_ hack.