Java UTF-8 international character support with Tomcat and Oracle
Kieran's blog, March 26, 2007

Introduction

I've spent the last few days getting proper international character support working in our Files.Warwick application. At E-Lab we've never been that great at internationalisation support. BlogBuilder does a pretty good job of it, as can be seen from quite a lot of our bloggers writing in Chinese, Korean and Japanese. However, it's a bit of a kludge and doesn't work everywhere.

It didn't take long for someone to upload a file to Files.Warwick with an "é" in the file name. Due to our previous lack of thought in this area, this swiftly turned into a ? :(

So... how do you get your app to support international characters throughout?

What is international character support?

You'll hear all sorts of jargon regarding internationalisation support. Here is a little explanation of what it is all about.

What I do NOT mean is i18n support, which is making the application support multiple languages in the interface so that you can read help pages and admin links in French or Chinese. What I mean by internationalisation support is being able to accept user input in any language or character set.

Tim Bray has a really good explanation of some of the issues surrounding ASCII/Unicode/UTF-8.

UTF-8 all the way through the stack

We need to look at UTF-8 support in the following areas:

URLs
Apache
HTML
JavaScript
POST data
File download (Content-Disposition)
JSPs
Java code
Tomcat
Oracle
File system

I'll go through each of these areas and explain how well they are supported by default and what changes you might need to make to support UTF-8 in each.

URLs

URLs should only contain ASCII characters.
The ASCII character set is quite restrictive if you want to use Chinese characters, for instance, so some encoding is needed here. If you've got a file with a Chinese character in its name and you want to link to it, you need to do this:

"中.doc" -> "%E4%B8%AD.doc"

Thankfully this can be done with a bit of Java:

java.net.URLEncoder.encode("中.doc", "UTF-8");

So, whenever you need to generate something for the address bar, a redirect or something like that, you must URL encode the data. You don't have to detect whether it is needed: encoding links that are just plain old ASCII letters and digits does no harm because they come through unchanged, as you can see with the ".doc" ending in the above example.

Apache

Generally you don't need to worry about Apache, as it shouldn't be messing with your HTML or URLs. However, if you are doing some proxying with mod_proxy then you might need to have a think about this. We use mod_proxy to proxy from Apache through to Tomcat. If you've got encoded characters in a URL that you need to convert into a query string for your underlying app, then you're going to have a strange little problem.

Say you have a URL coming into Apache that looks like this:

http://mydomain/%E4%B8%AD.doc

and a mod_rewrite/proxy rule like this:

RewriteRule ^/(.*) http://mydomain:8080/filedownload/?filename=$1 [QSA,L,P]

Unfortunately the $1 is going to get mangled during the rewrite. QSA (query string append) actually deals with these characters just fine and sends the existing query string through untouched. But when you grab a bit of the URL, such as my $1 here, Apache unescapes it on its own and treats the result as ISO-8859-1. The bytes are really UTF-8, so the characters come out wrong.

So, to keep our special characters in UTF-8, we escape the captured value back again:

RewriteMap escape int:escape
RewriteRule ^/(.*) http://mydomain:8080/filedownload/?filename=${escape:$1} [QSA,L,P]

Take a look at your rewrite logs to see if this is working.
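The effect of reading UTF-8 escapes as ISO-8859-1 is easy to reproduce in plain Java. The following is only a sketch for illustration, not code from Files.Warwick: the class and method names (UrlCharsetDemo, encode, decode) are made up, and decoding with ISO-8859-1 is merely assumed to resemble what happens to $1 inside the rewrite rule. It percent-encodes the file name as UTF-8, then decodes it once as UTF-8 and once as ISO-8859-1.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.net.URLEncoder;

// Hypothetical demo class: shows how the same percent-escapes decode
// differently depending on the charset the decoder assumes.
final class UrlCharsetDemo {

    // Percent-encode a string using UTF-8 bytes.
    static String encode(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Decode percent-escapes, interpreting the bytes in the given charset.
    static String decode(String s, String charset) {
        try {
            return URLDecoder.decode(s, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // A Unicode escape keeps this source file pure ASCII; it is the
        // Chinese character from the example file name.
        String name = "\u4E2D.doc";
        String escaped = encode(name);

        System.out.println(escaped);                                // %E4%B8%AD.doc
        System.out.println(decode(escaped, "UTF-8").equals(name));  // true
        // Wrong charset: each of the three bytes becomes its own character.
        System.out.println(decode(escaped, "ISO-8859-1").length()); // 7
    }
}
```

The correct decode gives back the five-character name, while the ISO-8859-1 decode gives seven characters: three junk characters followed by ".doc".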

HTML

HTML support for UTF-8 is good; you just need to make sure you set the character encoding properly on your pages. This should be as simple as a meta declaration in the HEAD of your page, such as:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

You should then be able to write real UTF-8 characters straight into the page without any special encoding.

JavaScript

JavaScript supports UTF-8 characters very well, so as long as you don't use escape() (encodeURIComponent() is the one that percent-encodes as UTF-8), the characters your users enter shouldn't get mangled. We also use AJAX to do some functions in our application, so you need to think about that as well, but again, it should just work. All of the above only holds true if you set the character encoding right on your surrounding HTML.

POST data

Getting POST data from the user in the right format is simple too. As long as your HTML has the right encoding, you should be OK.

File download (Content-Disposition)

If you want to serve files for download from your app, as we obviously do with Files.Warwick, then you'll need to understand how browsers deal with non-ASCII characters in file names when downloading. Unfortunately the standard is not exactly well defined, as no one really thought about UTF-8 file names until recently. Internet Explorer supports URL encoded file names, but Firefox supports a rather strange Base64 encoded value for high byte file names (the RFC 2047 "encoded-word" format used in email headers), so something like this should do the job:

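// Assumes imports of java.net.URLEncoder and org.apache.commons.codec.binary.Base64.
String userAgent = request.getHeader("User-Agent");
String fileName = node.getName();
String encodedFileName;
// The User-Agent header is optional, so guard against null before sniffing it.
if (userAgent != null && (userAgent.contains("MSIE") || userAgent.contains("Opera"))) {
    // IE and Opera take a URL encoded name. URLEncoder writes spaces as "+", so turn them into %20.
    encodedFileName = URLEncoder.encode(fileName, "UTF-8").replace("+", "%20");
} else {
    // Everything else (Firefox) gets the Base64 form: =?UTF-8?B?...?=
    encodedFileName = "=?UTF-8?B?" + new String(Base64.encodeBase64(fileName.getBytes("UTF-8")), "UTF-8") + "?=";
}
response.setHeader("Content-Disposition", "attachment; filename=\"" + encodedFileName + "\"");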
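
To see what each branch of that code puts into the header, here is a small standalone sketch. It is an illustration under assumptions rather than the Files.Warwick code: the class and method names (ContentDispositionNames, urlEncoded, base64Word) are hypothetical, and it swaps the commons-codec Base64 class for java.util.Base64 (Java 8 and later) so that it runs without extra libraries.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper: builds the two file name encodings discussed above.
final class ContentDispositionNames {

    // URL encoded form, as sent to IE and Opera in the servlet code.
    static String urlEncoded(String fileName) {
        try {
            // URLEncoder writes a space as "+", so convert those to %20.
            // A literal "+" in the name is already escaped as %2B.
            return URLEncoder.encode(fileName, "UTF-8").replace("+", "%20");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Base64 "encoded-word" form, as sent to other browsers.
    static String base64Word(String fileName) {
        byte[] utf8 = fileName.getBytes(StandardCharsets.UTF_8);
        return "=?UTF-8?B?" + Base64.getEncoder().encodeToString(utf8) + "?=";
    }

    public static void main(String[] args) {
        // A Unicode escape keeps this source file pure ASCII.
        String name = "\u4E2D.doc";
        System.out.println(urlEncoded(name)); // %E4%B8%AD.doc
        System.out.println(base64Word(name)); // =?UTF-8?B?5LitLmRvYw==?=
    }
}
```

Either string then goes between the quotes of the filename parameter in the Content-Disposition header.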

Obviously the user agent detection in the download code can be made a bit smarter than a couple of contains() checks.

JSPs

UTF-8 support in JSPs is pretty much a one-liner:

<%@ page language="java" pageEncoding="utf-8" contentType="text/html;charset=utf-8" %>

Include that at the top of every single JSP, perhaps via a prelude.jsp file, and you're away.

Java code

As long as your source strings are properly encoded, you can generally rely on Java to keep your UTF-8 encoded input intact. However, be careful which String functions you perform on your UTF-8 data. Be sure to do things like this:

myStr.getBytes("UTF-8")

rather than just

myStr.getBytes()

If you don't, you get bytes in the JVM's platform default encoding, which will most likely be ISO-8859-1 or something similar rather than UTF-8.

If for some reason you cannot get your input data to be UTF-8 and it is coming in with a different encoding, you could do something like this to convert it to UTF-8:

String myUTF8 = new String(my8859.getBytes("ISO-8859-1"), "UTF-8");

Debugging can be fun with high byte characters, as logging to a console generally isn't going to show you the characters you are expecting. If you did this:

System.out.println(new String(new byte[] { -28, -72, -83 }, "UTF-8"));

then you'd probably just see a ? rather than the Chinese character that it really should be.

However, you can make log4j log UTF-8 messages. Just add

<param name="Encoding" value="UTF-8"/>

to the appender in your log4j.xml config, or this:

log4j.appender.myappender.Encoding=UTF-8

to your log4j.properties file. Even then, you will only see the UTF-8 data properly if you view the log file in an editor or viewer that can display UTF-8 (Windows Notepad is OK, for instance).

Tomcat

By default Tomcat will treat everything as ISO-8859-1. You can in theory override this by setting the incoming encoding of the HttpServletRequest to UTF-8, but the encoding is fixed once any part of the request has been read, so chances are you won't be able to call

request.setCharacterEncoding("UTF-8")

early enough to have an effect. So instead you can tell Tomcat to run in UTF-8 mode by default. Just add the following attribute to the Connector you want UTF-8 on in Tomcat's server.xml config file:

URIEncoding="UTF-8"

Not doing this has the fun quirk that for a request like this:

/test.htm?highByte=%E4%B8%AD

request.getQueryString() gives you the raw string "highByte=%E4%B8%AD", but request.getParameter("highByte") gives you the value decoded as ISO-8859-1, which is not right. Sigh.

Oracle

You could just URL encode all of your data and put it into the database in ASCII like you always used to. However, that doesn't make for very readable data. There are two options here, although I've only tried the one.

1. Set the default character encoding of your Oracle database to UTF-8. However, it is set on a per server basis, not a per schema basis, so your whole server would be affected.
2. Use NVARCHAR2 fields instead of VARCHAR2 fields, and you can store real UTF-8 data.

We went for option 2, as we have a shared Oracle server.

First of all, convert all fields that you want to store UTF-8 data in from VARCHAR2 to NVARCHAR2. Be careful, as I don't think you can change back! You then need to tell your JDBC code somehow that it needs to send data that the NVARCHAR2 fields can understand.
There are a couple of ways of doing this too:

1. Set the defaultNChar property on the connection to true.
2. Use the setFormOfUse() method, which is an Oracle-specific extension to the PreparedStatement.

I went for option 1, as the problem with option 2 is that you have to somehow get at the Oracle-specific connection or prepared statement within your Java code. This is not fun, as you'll often be using a connection pool that hides these details away.

File system

File system support for UTF-8 characters is again pretty good, but you are sometimes going to have issues with viewing file listings. I just couldn't get a UTF-8 file name to display properly over a PuTTY SSH connection. Through a simple Java test program I could write and read back a UTF-8 file name on our Solaris 10 box, but all I could ever actually read when doing an "ls" was ?????.doc.

So, for the sake of maintainability of the file system, I went for a URL encoded version of the file name. This isn't ideal, but it works.

Conclusion

As you can see, there is quite a lot of work involved in supporting UTF-8 throughout. A lot of my time was spent researching, as my understanding of encoding issues wasn't great. Now that I've put together this guide, I hope all of our apps can start to work towards full UTF-8 support.

Of course the above guide is quite specific to my experience with the app I was dealing with and the environment I work in, so your experiences might be more or less painful :)

Posted 26 Mar 2007 17:07 | Tags: Encoding, Java, JSP, Oracle, Programming, Tomcat, Unicode, Utf-8, Work

Comments (15)

Update: I forgot to mention UTF-8 in emails. With a MimeMessage, instead of just doing

message.setText(utf8Body);
message.setSubject(utf8Subject);

do this:

message.setText(utf8Body, "UTF-8");
message.setSubject(utf8Subject, "UTF-8");

If you are using Spring and the SimpleMailMessage, you don't get access to these methods.
Using the JavaMailSender interface, you can do this:

MimeMessagePreparator prep = new MimeMessagePreparator() {
    public void prepare(final MimeMessage message) throws Exception {
        message.setSubject(utf8Subject, "UTF-8");
        message.setText(utf8Body, "UTF-8");
    }
};
mailSender.send(prep);

27 Mar 2007, 16:01

Nick Howes

"all I could ever actually read when doing an "ls" was ?.doc"

I just managed to get a PuTTY session working with UTF-8 characters. You have to add this stuff to the bash user's .inputrc file (stolen from this guide to utf-8 in bash):

# Enable 8bit input
set meta-flag on
set input-meta on
# Turns off 8th bit stripping
set convert-meta off
# Keep the 8th bit for display
set output-meta on

Then in PuTTY, in the Translation tab, you can select UTF-8 from the dropdown.

28 Mar 2007, 10:17

Megha

This post was super helpful! Thanks. Just a note: the encoding for the response must be set along with that of the request to display characters from the DB on a JSP.

response.setCharacterEncoding("UTF8");

04 Apr 2007, 21:07

Dubs

Thanks!!!!!!!!!!

05 Apr 2007, 06:43

DIY Lover

You've missed the most important part IMO: XML files and (X)HTML file saving protocols. The meta declaration you make is good, but should be mirrored in your file encoding. If you're not sure how your files are saved then you're probably not saving them in UTF-8! In Notepad, go to Save As and check the encoding options, and you'll find UTF-8 there.

On XML files? If it's not encoded in UTF-8 it won't be correctly parsed. End of story. A handy one to keep track of!

04 May 2007, 14:46

Ski Snow Trains Bum

Good point: this caused me no end of problems a few months ago.

04 May 2007, 14:50

Robert

Thanks for posting this great article! Here are a couple more tips for using UTF-8 in Java:

You point out the problem with using req.setCharacterEncoding("UTF-8") (http://java.sun.com/javaee/5/docs/api/javax/servlet/ServletRequest.html#setCharacterEncoding(java.lang.String)): it comes too late. Sun recommends setting this with a filter. Tomcat comes with an example (filter.SetCharacterEncodingFilter.java) that works right out of the box. You just have to add it to your classpath and register it in your web.xml file.

Another problem is with resource bundles. Properties files are read as ISO-8859-1. To include Unicode characters, you must use \uXXXX escapes. (Learn to use Java's native2ascii tool.) Java 6 has a new ResourceBundle.Control class that allows you to read XML resource bundles in UTF-8. "Core Java Technologies Tech Tips (October 18, 2005)" (http://java.sun.com/developer/JDCTechTips/2005/tt1018.html) discusses this.

Thanks again!
27 May 2007, 01:24

OnTheLookOut

Here is a nice article written about Java's broken implementation of UTF-8 (writing): http://tripoverit.blogspot.com/2007/04/javas-utf-8-and-unicode-writing-is.html

15 Jun 2007, 05:29

Suresh

Hi, I am facing a problem with French characters. I am using Teradata as the database and JDBC to insert and fetch data. Teradata supports French characters, and before creating the statement and executing the executeUpdate() method the French characters are fine. But after executeUpdate() inserts into the DB, characters like ? are getting inserted. I have tried each and every option in the JSPs and the Java code too. But... Please help me find a solution for this. Thanks in advance, SD

09 Jul 2007, 08:11

Frank

Hi Kieran! First of all, thanks for your very good posting here. It really helped out and covers nearly everything mankind needs for a UTF-8 conversion. How about some MySQL facts I found out? (All tested with MySQL version 5.0.)

1. First of all you need your databases and tables to be in the utf8 character set (well, of course ;)
2. Your MySQL instance has to run with utf8 as its default character set.

To accomplish the first task you can simply ALTER your database with an SQL statement like this:

ALTER DATABASE db_name CHARACTER SET utf8 COLLATE utf8_general_ci;

For the second task, start your MySQL daemon from the command line with the following option:

path_to_mysql\bin\mysqld --character-set-server=utf8

For more information on the options see http://dev.mysql.com/doc/refman/5.0/en/server-options.html

10 Jul 2007, 17:31

specialbrandk

Thank you for assembling this excellent note. It is super useful.

08 Aug 2007, 04:00

ekSi

Oh my god, I'm so thankful to find your writings, your solution works like a charm... Thank You sooo much :)

16 Aug 2007, 03:48

Will

Hey Kieran, did you know this post comes up when doing a Google search for 'JSP internationalisation'? Interesting stuff anyway.

05 Oct 2007, 10:25

Mathew Mannion

You are my hero.
18 May 2010, 11:50

Mathew Mannion

One thing that was glossed over is that if you use Spring, there is a filter you can put in your web.xml to ensure both the request AND the response have UTF-8 encoding by default:

<filter>
    <filter-name>characterEncodingFilter</filter-name>
    <filter-class>org.springframework.web.filter.CharacterEncodingFilter</filter-class>
    <init-param>
        <param-name>encoding</param-name>
        <param-value>UTF-8</param-value>
    </init-param>
    <init-param>
        <param-name>forceEncoding</param-name>
        <param-value>true</param-value>
    </init-param>
</filter>

Then you can just put it at the top of your filter stack.

18 May 2010, 12:32