Character sets are a mystery to many native-English coders, and if you’re not aware of them then you’ll occasionally find that things break, or go weird. Unicode was designed to replace all existing character sets with a single universal one. This article explains how to use the UTF-8 encoding of Unicode throughout your project.
One “charming quirk” of PHP is that while it’s pretty good about handling character sets, it’ll let you blithely carry on without even knowing what they are, which can mean you end up spitting out some pretty crazy weirdies that leave you scratching your head, or just getting annoyed. Increasingly, web editing tools are producing UTF-8, while PHP’s default output is the catchily named ISO-8859-1. So what happens is that after writing PHP for years, you finally find out that you’re supposed to use htmlspecialchars() on everything you send to the browser, and when you do, you find it’s mangled your € and £ symbols. You have to tell it to behave. Here’s how.
UTF-8 is one way of writing (encoding) the Unicode character set. Unicode aims to solve the problem of character sets by containing every possible character and symbol from every language. If you write your pages in Unicode, you can display any language and character without using HTML entities (like `&copy;`). It’s one huge thing less to think about. The UTF-8 encoding is very well-supported by both development tools and browsers, and it’s quite space-efficient: plain ASCII text still takes one byte per character, which means your mostly-English files won’t get bigger.
Make sure your web page is UTF-8
There are two parts to this. First, your HTML editing tool (Dreamweaver, Emacs, Textmate, whatever) needs to save the file with the UTF-8 encoding. I mean, obviously, you don’t want to be telling everyone you’re using UTF-8 if you aren’t. If you’ve got a Mac, this is usually the default because that’s what the operating system uses. Notepad’s a bit of a bitch about UTF-8 because it squirts in a few invisible bytes (a byte order mark, aptly abbreviated to BOM) at the start of the file that sometimes don’t turn out to be invisible at all, so go out and get yourself a slightly better text editor. I’ll wait.
Part two is making sure that the web browser knows that the document is UTF-8.
This is done by adding a tag to your document’s HEAD section.
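As a rough sketch, here is what that tag looks like in a page generated by PHP. The header() call is an extra that isn't strictly part of the step above: it is an assumption that you also want the charset in the HTTP header, which browsers give priority over the tag when the two disagree. The title and body text are placeholders.

```php
<?php
// Optional extra: declare the encoding in the HTTP header too.
// This must run before any output is sent.
header('Content-Type: text/html; charset=UTF-8');
?>
<!DOCTYPE html>
<html>
<head>
  <!-- The tag that tells the browser this document is UTF-8 -->
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
  <title>Example page</title>
</head>
<body>
  <p>€ and £ without entities.</p>
</body>
</html>
```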
Making PHP output UTF-8 too
This part is pretty straightforward. You just
echo. Except that you don’t, because you use
htmlspecialchars() so that no one can take over your script and use it to steal stuff from your visitors, right? So all you have to
do is fill in the encoding argument of that function.
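A minimal example of filling in that argument, assuming a made-up `$comment` string standing in for whatever you are about to print. Recent PHP versions default to UTF-8 here anyway, but spelling it out keeps the behaviour the same on older installs.

```php
<?php
// Hypothetical user input with reserved HTML characters and non-ASCII symbols.
$comment = '<b>Tea & cake</b> costs £3 or €4';

// The third argument names the encoding of the input string.
// Only <, >, &, and quotes are escaped; £ and € pass through untouched.
echo htmlspecialchars($comment, ENT_QUOTES, 'UTF-8');
```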
If you get tired of writing that every time, wrap it in a small function of your own. Use htmlspecialchars() instead of htmlentities() because you don’t need to encode any characters other than the reserved HTML characters when you’re using Unicode. This makes your HTML source easier to read and more compact.
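One way to save the typing is a tiny wrapper. The name `h()` is just an illustrative choice, not anything built into PHP.

```php
<?php
/**
 * Escape a string for HTML output, assuming it is UTF-8.
 * The name h() is a hypothetical shorthand.
 */
function h(string $text): string
{
    return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Only the reserved characters are escaped; £ and € stay as they are.
echo '<p>' . h('Price: <£5 & >€2') . '</p>';
```

By contrast, htmlentities() would turn the £ into `&pound;`, which is harmless but unnecessary once the page really is UTF-8.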
Talking UTF-8 to your database
Well, I say database, but really I mean MySQL, which is more or less the same thing except that people who manage real databases for a living will look down their noses at you. If you aren’t using MySQL, the method varies and you’ll
have to look it up for yourself, sorry.
Everyone in the world writes UTF-8 like `UTF-8`. Except for MySQL. They write it `utf8`. No hyphen. Tsk.
If you want to be really good about the whole thing, you make the database use UTF-8 internally too (start with
CREATE DATABASE mydb CHARACTER SET utf8;), but really it doesn’t matter as long as the database converts to UTF-8 when
it’s talking to PHP.
All you have to do is send an SQL statement when you connect to your database:
SET NAMES utf8. How you do this depends on how you connect to your database.
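Here is a sketch for the two common ways of connecting, PDO and mysqli. The function names, host, database name, and credentials are all placeholders. One caveat worth knowing: MySQL’s `utf8` only stores characters of up to three bytes, so if you need emoji or other four-byte characters, MySQL’s name for the full thing is `utf8mb4`.

```php
<?php
/**
 * Open a MySQL connection with PDO and switch it to UTF-8.
 * All four arguments are placeholders for your own settings.
 */
function connect_utf8(string $host, string $db, string $user, string $pass): PDO
{
    $pdo = new PDO("mysql:host=$host;dbname=$db", $user, $pass);
    $pdo->setAttribute(PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION);
    // The statement from the text, sent straight after connecting.
    $pdo->exec('SET NAMES utf8');
    return $pdo;
}

/**
 * The same idea with mysqli, which has a dedicated method for it.
 */
function connect_utf8_mysqli(string $host, string $db, string $user, string $pass): mysqli
{
    $mysqli = new mysqli($host, $user, $pass, $db);
    $mysqli->set_charset('utf8');
    return $mysqli;
}
```

You would call one of these once, for example `connect_utf8('localhost', 'mydb', 'user', 'secret')`, and reuse the returned connection for the rest of the request.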
A quick note on forms
Web browsers will (or ought to) send you form data using the same encoding as the page is set to, so if your HTML is in UTF-8, form data will be too.
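Since “ought to” is not a guarantee, you may want to check what actually arrives. A minimal sketch, assuming the mbstring extension is available; `utf8_field()` is a hypothetical helper name, and `$form` would normally be `$_POST`.

```php
<?php
/**
 * Return the named form field if it is a valid UTF-8 string, or null otherwise.
 */
function utf8_field(array $form, string $name): ?string
{
    if (!isset($form[$name]) || !is_string($form[$name])) {
        return null;
    }
    // mb_check_encoding() reports whether the bytes are well-formed UTF-8.
    return mb_check_encoding($form[$name], 'UTF-8') ? $form[$name] : null;
}
```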
Turning other stuff into UTF-8
This is actually quite tricky because if you’re reading a file from a hard disk, the software pretty much has to guess what character set it’s in. Which is a bit crappy. Fortunately as a PHP coder, you hardly ever have to do this
bit yourself (and if you do, you can get
If you know what it is, and it’s not UTF-8, you can use PHP’s
mb_convert_encoding(). One of the (vanishingly few) nice things about XML is that when you write an XML file, the first line has the encoding in it, and
if it doesn’t, it defaults to UTF-8 so you never have to guess.
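For the case where you do know the source encoding, a conversion looks roughly like this. The example assumes the text came from an ISO-8859-1 file and that the mbstring extension is loaded; the sample bytes are made up for illustration.

```php
<?php
// "£5" as stored in an ISO-8859-1 file: the pound sign is the single byte 0xA3.
$latin1 = "\xA3" . "5";

// Arguments are: the string, the encoding you want, the encoding it is in now.
$utf8 = mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');

// In UTF-8 the pound sign becomes the two bytes 0xC2 0xA3.
echo bin2hex($utf8); // c2a335
```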