Thou shalt always(ish) use UTF-8 in PHP

by | Jan 29, 2021

Character sets are a mystery to many native-English coders, and if you’re not aware of them then you’ll occasionally find that things break, or go weird. Unicode was designed to replace all existing character sets with a single universal one. This article explains how to use the UTF-8 encoding of Unicode throughout your project.

 

One “charming quirk” of PHP is that while it’s pretty good about handling character sets, it’ll let you blithely carry on without even knowing what they are, which can mean you end up spitting out some pretty crazy weirdies that leave you scratching your head, or just getting annoyed. Increasingly, web editing tools are producing UTF-8, while PHP’s functions defaulted, before version 5.4, to the catchily named ISO-8859-1. So what happens is that after writing PHP for years, you finally find out that you’re supposed to use htmlspecialchars() on everything you send to the browser, and when you do, you find it’s mangled your € and £ symbols. You have to tell it to behave. Here’s how.

Why UTF-8?

UTF-8 is one way of writing (encoding) the Unicode character set. Unicode aims to solve the problem of character sets by containing every possible character and symbol from every language. If you write your pages in Unicode, you can display any language and character without using HTML entities (like `&copy;`). It’s one huge thing less to think about. The UTF-8 encoding is very well-supported by both development tools and browsers, and it’s quite space-efficient: plain ASCII characters still take one byte each, so mostly-English files won’t get bigger.
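To make the space-efficiency point concrete, here is a small sketch that compares byte counts with character counts. It assumes PHP 7 or later (for the `\u{...}` escapes) and the mbstring extension; the sample characters are just illustrations.

```php
<?php
// Characters are written as \u{...} escapes so the result does not depend
// on the encoding this file happens to be saved in.
$samples = [
    'A'     => 'A',         // U+0041, plain ASCII
    'pound' => "\u{A3}",    // U+00A3
    'euro'  => "\u{20AC}",  // U+20AC
];

foreach ($samples as $name => $char) {
    // strlen() counts bytes; mb_strlen() counts characters.
    printf("%-5s bytes=%d chars=%d\n", $name, strlen($char), mb_strlen($char, 'UTF-8'));
}
// prints:
// A     bytes=1 chars=1
// pound bytes=2 chars=1
// euro  bytes=3 chars=1
```

ASCII stays at one byte per character, and only the less common characters cost more.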

Make sure your web page is UTF-8

There are two parts to this. First, your HTML editing tool (Dreamweaver, Emacs, TextMate, whatever) needs to save the file with the UTF-8 encoding. I mean, obviously, you don’t want to be telling everyone you’re using UTF-8 if you aren’t. If you’ve got a Mac, this is usually the default because that’s what the operating system uses. Notepad’s a bit of a bitch about UTF-8 because it squirts in a bit of invisible code (aptly named a BOM, or byte order mark) at the start of the file that sometimes doesn’t turn out to be invisible at all, so go out and get yourself a slightly better text editor. I’ll wait.

Part two is making sure that the web browser knows that the document is UTF-8. This is done by adding a `<meta charset="utf-8">` tag to your document’s HEAD section.
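As a rough sketch of what that looks like from PHP, the snippet below builds a minimal HEAD section containing the charset tag. The helper name `utf8_head()` is made up for this example, and the `header()` call is an optional extra that the text above doesn’t require: it states the same thing in the HTTP response.

```php
<?php
// Optional extra (an assumption, not part of the tag approach): declare the
// encoding in the HTTP header too, provided nothing has been output yet.
if (!headers_sent()) {
    header('Content-Type: text/html; charset=utf-8');
}

// Hypothetical helper that builds a minimal HEAD section.
function utf8_head(string $title): string
{
    return "<head>\n"
        . "  <meta charset=\"utf-8\">\n"
        . "  <title>" . htmlspecialchars($title, ENT_COMPAT, 'UTF-8') . "</title>\n"
        . "</head>\n";
}

echo utf8_head('My UTF-8 page');
```

Put the charset tag as early in the HEAD as you can, so the browser sees it before any text it has to decode.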

Making PHP output UTF-8 too

This part is pretty straightforward. You just print/echo. Except that you don’t, because you use htmlspecialchars() so that no one can take over your script and use it to steal stuff from your visitors, right? So all you have to do is fill in the encoding argument of that function: `htmlspecialchars($string, ENT_COMPAT, 'UTF-8')`.
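Here is a minimal sketch of that call in action. The `$comment` value is an invented bit of untrusted input, and the `\u{...}` escapes assume PHP 7 or later.

```php
<?php
// Untrusted input (hypothetical example value), written with \u{...} escapes
// so the snippet behaves the same whatever encoding this file is saved in.
$comment = "<b>\u{A3}5 & \u{20AC}10</b>";

// Second argument: quote handling. Third argument: the encoding.
$safe = htmlspecialchars($comment, ENT_COMPAT, 'UTF-8');

echo $safe, "\n";
// prints: &lt;b&gt;£5 &amp; €10&lt;/b&gt;
```

The reserved HTML characters are escaped, and the £ and € come through untouched.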

If you get tired of writing that every time instead of echo or print, you could make a couple of functions that do it for you. Go wild. Check the PHP Manual on what ENT_COMPAT means and why it’s probably what you mean. You want htmlspecialchars(), not htmlentities(), because you don’t need to encode any characters other than the reserved HTML characters when you’re using Unicode. This makes your source code easier to read, and the output is more compact.
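One possible pair of wrapper functions might look like this. The names `h()` and `e()` are invented for the example, and the `void` return type assumes PHP 7.1 or later.

```php
<?php
// Hypothetical helper names; pick whatever suits your project.

// Return a string escaped for HTML, treating it as UTF-8.
function h(string $text): string
{
    return htmlspecialchars($text, ENT_COMPAT, 'UTF-8');
}

// Escape and print in one go, as a stand-in for echo.
function e(string $text): void
{
    echo h($text);
}

e("Fish & chips: \u{A3}5");  // prints: Fish &amp; chips: £5
echo "\n";

// For comparison, htmlentities() also rewrites characters that UTF-8 can
// carry perfectly well as they are.
echo htmlentities("\u{A3}5", ENT_COMPAT, 'UTF-8'), "\n";  // prints: &pound;5
```

Both outputs display the same in a browser; the htmlspecialchars() version is just shorter and easier to read in the page source.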

Talking UTF-8 to your database

Well, I say database, but really I mean MySQL, which is more or less the same thing except that people who manage real databases for a living will look down their noses at you. If you aren’t using MySQL, the method varies and you’ll have to look it up for yourself, sorry.

Gotcha

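Everyone in the world writes UTF-8 like `UTF-8`. Except for MySQL. They write it `utf8`. No hyphen. Tsk. Worse, MySQL’s `utf8` only stores characters of up to three bytes; if you need the whole of Unicode (emoji included), the character set to ask for is `utf8mb4`, available since MySQL 5.5.3.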

If you want to be really good about the whole thing, you make the database use UTF-8 internally too (start with `CREATE DATABASE mydb CHARACTER SET utf8;`), but really it doesn’t matter as long as the database converts to UTF-8 when it’s talking to PHP.

All you have to do is send an SQL statement when you connect to your database: `SET NAMES utf8`. How you do this depends on how you connect to your database.
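As one example, here is a sketch for PDO, assuming the pdo_mysql driver is installed. The function names, host, database name and credentials are all placeholders. The `charset` part of the DSN (honoured since PHP 5.3.6) asks the driver for the same thing, so the explicit `SET NAMES` is belt and braces.

```php
<?php
// Hypothetical helper: build a MySQL DSN that requests UTF-8 up front.
function utf8_dsn(string $host, string $dbname): string
{
    return "mysql:host={$host};dbname={$dbname};charset=utf8";
}

// Hypothetical helper: connect, then send the statement named above.
function connect_utf8(string $host, string $dbname, string $user, string $password): PDO
{
    $pdo = new PDO(utf8_dsn($host, $dbname), $user, $password, [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,  // fail loudly on errors
    ]);
    $pdo->exec('SET NAMES utf8');

    return $pdo;
}

// Usage (placeholder credentials, so left commented out):
// $pdo = connect_utf8('localhost', 'mydb', 'user', 'secret');
```

If you connect with mysqli instead, `$mysqli->set_charset('utf8')` does the equivalent job.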

A quick note on forms

Web browsers will (or ought to) send you form data using the same encoding as the page is set to, so if your HTML is in UTF-8, form data will be too.
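Because of that “ought to”, it can be worth checking what actually arrived. A minimal sketch, assuming the mbstring extension is available; the helper `is_valid_utf8()` and the form field `name` are made up for the example.

```php
<?php
// Hypothetical helper: true when the value is well-formed UTF-8.
function is_valid_utf8(string $value): bool
{
    return mb_check_encoding($value, 'UTF-8');
}

// 'name' is a hypothetical form field.
$name = $_POST['name'] ?? '';

if (!is_string($name) || !is_valid_utf8($name)) {
    // Reject or repair the input here rather than storing broken bytes.
    $name = '';
}
```

What to do with bad input (reject it, or try to convert it) is a judgment call for your application.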

Turning other stuff into UTF-8

This is actually quite tricky because if you’re reading a file from a hard disk, the software pretty much has to guess what character set it’s in. Which is a bit crappy. Fortunately as a PHP coder, you hardly ever have to do this bit yourself (and if you do, you can get mb_detect_encoding() to help).

If you know what it is, and it’s not UTF-8, you can use PHP’s iconv() or mb_convert_encoding(). One of the (vanishingly few) nice things about XML is that when you write an XML file, the first line has the encoding in it, and if it doesn’t, it defaults to UTF-8 so you never have to guess.
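Here is a short sketch of both routes, assuming the mbstring and iconv extensions are loaded. The helper names and the candidate list are illustrative, and the guessing route is only as good as the candidates you give it.

```php
<?php
// Hypothetical helper: convert text from a known encoding to UTF-8.
function to_utf8(string $text, string $fromEncoding): string
{
    return mb_convert_encoding($text, 'UTF-8', $fromEncoding);
}

// Hypothetical helper: guess among a few candidates, then convert.
function guess_to_utf8(string $text, array $candidates = ['UTF-8', 'ISO-8859-1']): string
{
    // Strict mode (third argument) rules out encodings the bytes are invalid in.
    $guess = mb_detect_encoding($text, $candidates, true);
    if ($guess === false) {
        throw new InvalidArgumentException('Could not guess the encoding');
    }

    return $guess === 'UTF-8' ? $text : to_utf8($text, $guess);
}

$latin1 = "caf\xE9";  // "café" as ISO-8859-1 bytes
$utf8   = to_utf8($latin1, 'ISO-8859-1');

echo bin2hex($utf8), "\n";  // prints: 636166c3a9

// iconv() takes its arguments in the other order: from, to, text.
var_dump(iconv('ISO-8859-1', 'UTF-8', $latin1) === $utf8);  // bool(true)
```

When you do know the source encoding, pass it explicitly and skip the guessing.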
