Thursday, January 12, 2012

ISO-8859-1 vs UTF-8

This is a developer note. Just ignore it if you only visit here for crossword-y stuff.

Our earliest crosswords had typical letters in the clues, no peculiarities like diacritics. For example, in the "trick or treat" puzzle, 1D is clued with "Snag". A perfectly ordinary English word, no special characters.

In next week's puzzle, themed "chocolate", 43D is clued "Dalí contemporary". With that accent on the "i", we found a bug in the site.

The clue looks fine when querying the database, but that doesn't really tell you anything. For all you know, your database client is encoding the characters differently from your web server.
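To see why a correct-looking query result proves little, it helps to compare bytes rather than glyphs. Here is a minimal Python sketch (Python is not part of this site; I'm using it only for illustration) showing that the same clue has different bytes depending on the encoding:

```python
# The same text has a different byte representation in each encoding.
clue = "Dalí contemporary"

latin1_bytes = clue.encode("latin-1")  # "í" becomes the single byte 0xED
utf8_bytes = clue.encode("utf-8")      # "í" becomes the two bytes 0xC3 0xAD

print(latin1_bytes)
print(utf8_bytes)
print(len(latin1_bytes), len(utf8_bytes))
```

Two programs can both show you "í" while disagreeing about which of these byte sequences they are actually passing around.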

And indeed, when previewing this puzzle, I could see that the clue looked wrong on the website. The "í" looked like "Ã­".

I'm familiar with this problem from previous work with PHP-based websites, so I knew what to look for. I was pretty sure the database encoding was fine, and that the problem was in the way the web page was being displayed. Eventually I got to this page, which explained the solution really well.

I looked at the response headers in Firebug and found that the page was being served as ISO-8859-1 (aka Latin-1). I thought that should be OK, because "í" is part of the ISO-8859-1 character set. However, somewhere along the way the character is apparently being encoded as UTF-8 (two bytes), while the browser decodes the page as ISO-8859-1 and shows each byte as a separate character. At least I think this is what's happening.
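The suspected mix-up is easy to reproduce outside the site. This is a rough Python sketch, not the site's code; the `repair` helper is a hypothetical name I made up for the illustration:

```python
# Reproduce the symptom: UTF-8 bytes read back as ISO-8859-1 (Latin-1).
original = "í"
utf8_bytes = original.encode("utf-8")    # two bytes: 0xC3 0xAD
garbled = utf8_bytes.decode("latin-1")   # "Ã" followed by U+00AD (soft hyphen)


def repair(text: str) -> str:
    """Undo one round of UTF-8-read-as-Latin-1 garbling (hypothetical helper)."""
    return text.encode("latin-1").decode("utf-8")


print(repr(garbled))
print(repair(garbled) == original)
```

The second character of the garbled string is a soft hyphen, which most browsers don't draw, so on screen you mostly just notice the "Ã".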

I also found a page at Blue Box which indicates you've got problems if your database is using latin1 encoding. I am in that situation and really don't have control over it (not my db server). Here's the related query:

mysql> show variables like 'char%';
+--------------------------+----------------------------------------------+
| Variable_name            | Value                                        |
+--------------------------+----------------------------------------------+
| character_set_client     | latin1                                       |
| character_set_connection | latin1                                       |
| character_set_database   | utf8                                         |
| character_set_filesystem | binary                                       |
| character_set_results    | latin1                                       |
| character_set_server     | latin1                                       |
| character_set_system     | utf8                                         |
| character_sets_dir       | /data/mysql/brackemyre/share/mysql/charsets/ |
+--------------------------+----------------------------------------------+
8 rows in set (0.01 sec)

OK, that's not nice, but there's not much I can do about it.
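To make the mismatch in that output explicit, here is a small Python sketch that lists which settings disagree with the database's character set. The values are copied from the query output above; the `mismatched` function is a hypothetical helper, and the sketch only reports the disagreement, it doesn't prove which setting causes the bug:

```python
# Values copied from the SHOW VARIABLES output above.
charset_vars = {
    "character_set_client": "latin1",
    "character_set_connection": "latin1",
    "character_set_database": "utf8",
    "character_set_results": "latin1",
    "character_set_server": "latin1",
    "character_set_system": "utf8",
}


def mismatched(variables: dict, reference: str = "character_set_database") -> list:
    """Return the names whose charset differs from the reference variable's."""
    expected = variables[reference]
    return sorted(name for name, value in variables.items() if value != expected)


print(mismatched(charset_vars))
```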

In the end, I added this line to the top of my PHP file, before any output is sent:
header('Content-Type: text/html; charset=utf-8');

I also added a meta tag to my HTML: 
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

This fixed the problem, as you can see from the screenshot below:
The fix was remarkably easy, whew. Internationalization can sometimes be a real pain.
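For the curious, here is a hedged Python sketch of why the declared charset matters. It reads the charset out of a Content-Type value with the standard library and decodes the same bytes both ways. The `declared_charset` helper is a hypothetical name, and assuming the page body is UTF-8 is my guess about what the site sends:

```python
from email.message import Message
from typing import Optional


def declared_charset(content_type: str) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lowercased, or None if absent


# Assumption: the server is actually sending UTF-8 bytes.
body = "Dalí contemporary".encode("utf-8")

for header in ("text/html; charset=ISO-8859-1", "text/html; charset=utf-8"):
    charset = declared_charset(header)
    # The bytes never change; only the declared charset does.
    print(charset, repr(body.decode(charset)))
```

With the old header the clue comes out garbled, and with the new one it reads correctly, even though the bytes are identical.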
