Validate Unicode/UTF-8 Form Input (Language Specific Characters)

Validating language-specific characters in user input can be a bit of a pain, especially if you are using regular expressions to filter user-submitted data.

Testing your methods can also be a pain. First, you will need some shortcuts for entering UTF-8/Unicode characters into form inputs if you are on an English keyboard. Second, if you are inserting the UTF-8 characters into a database, you will probably want to drop web-based MySQL browsers such as phpMyAdmin for this task. Viewing your MySQL data from the command line helps you see exactly which encoding is being stored in the db.

Once you have these two tools in place you can get onto validating the data.

Inserting Unicode/UTF-8 Encoded Data to Form Input
On Ubuntu there are a couple of options. You can insert language-specific characters directly from the keyboard by holding the AltGr key and combining it with other alphabetic keys to see what results you get, such as øþeßæ.

If you want to insert other language/locale-specific characters such as äáñ on Ubuntu 7.10, you can do this with the compose key, which is set in System->Keyboard Preferences->Layout Options. I chose the left Windows key.

To compose a character with an umlaut (ä), hold down the compose key, hit the letter, then hit the character you want to place above the letter. In my case that is left Windows key+a+shift+2, which produces ä.

Now you can type locale-specific Unicode characters into form inputs to your heart's content.

Accessing MySQL from the Command Line
On Linux it's pretty easy. A command such as mysql -h localhost -u root -p will open up a command line to run MySQL queries from. In this case -h is the host (localhost), -u is the user (root) and -p asks for a password. If you just hit enter after -p you will be prompted for the password.

Now you can run queries to display exactly the data that is being stored in the DB when you insert it from your PHP backend.
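If you are not sure what "exactly the data" looks like at the byte level, a small PHP sketch can help. It is only an illustration: hexBytes() is a hypothetical helper name of mine, not part of any library. MySQL's HEX() function gives a comparable byte-level view of a column from the command line.

```php
<?php
// Hypothetical helper (not from any library): show the raw bytes of a string as hex pairs.
function hexBytes(string $value): string
{
    // bin2hex() works on bytes, so a multi-byte UTF-8 character shows up as several pairs.
    return strtoupper(implode(' ', str_split(bin2hex($value), 2)));
}

$utf8   = "\xC3\xA4"; // "ä" encoded as UTF-8 (two bytes)
$latin1 = "\xE4";     // "ä" encoded as ISO-8859-1 (one byte)
$entity = '&auml;';   // "ä" written as an HTML entity (six ASCII bytes)

echo hexBytes($utf8), "\n";   // C3 A4
echo hexBytes($latin1), "\n"; // E4
echo hexBytes($entity), "\n"; // 26 61 75 6D 6C 3B
```

If the stored value for ä comes back as C3 A4, it went in as UTF-8. A run of bytes starting with 26 means an HTML entity was stored instead.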

Filtering the Input
Client Side Approach
There are two approaches with a regex: either specify the characters you want to allow, or specify the characters you want to deny. Which approach you use depends on which of the two sets is bigger. Unfortunately you cannot use \w in JavaScript to match Unicode letters, because by definition this class breaks down to [A-Za-z0-9_].
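For comparison, the "allow" approach is easier on the server, because PHP's PCRE functions understand Unicode property classes when the /u modifier is set. The following is a minimal sketch, not what we shipped: isAllowedName() and the exact set of permitted characters are assumptions made up for the example.

```php
<?php
// Hypothetical whitelist: letters, combining marks, digits, spaces, apostrophes and hyphens.
function isAllowedName(string $value): bool
{
    // \p{L} = any letter, \p{M} = combining marks, \p{N} = any number character.
    // The /u modifier makes PCRE treat both pattern and subject as UTF-8.
    // preg_match() returns false for invalid UTF-8, so such input is rejected as well.
    return preg_match('/^[\p{L}\p{M}\p{N} \'-]+$/u', $value) === 1;
}

var_dump(isAllowedName("J\xC3\xB6rg M\xC3\xBCller")); // "Jörg Müller" in UTF-8: true
var_dump(isAllowedName('<script>'));                  // angle brackets are not allowed: false
var_dump(isAllowedName("\xE4"));                      // a lone ISO-8859-1 byte is invalid UTF-8: false
```

A deny list would use the same function shape with a negated character class, but the allow list fails closed when someone submits a character you did not think of.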

In our case we loosened the client side JavaScript validation right up, and are relying on Zend_Filter_Input combined with PDO to filter the data correctly before it enters the DB. To take care of escaping in the views, we used the eprint() function of the Savant 3 templating engine.

Server Side Approach
Using Zend_Filter_Input to filter and validate Unicode/UTF-8 encoded characters is fairly straightforward. There is one gotcha, however, which is worth a mention as I managed to scan past it in the Zend Framework reference.

By default Zend_Filter_Input is going to escape your data into HTML entities, which is not the raw Unicode/UTF-8 we want to store in the db in this case. You need to change the default filter for escaping values. There are two ways of achieving this, however, and the first method described in the Zend reference guide doesn't seem to get the job done.

This will not work:

$options = array('escapeFilter' => 'StringTrim');
$input = new Zend_Filter_Input($filters, $validators, $data, $options);

As far as I can tell, this fails because the filter rules (such as our new escape filter above) are run BEFORE the data is processed through validation, so the data is escaped before validation. But then the reference goes on to say:

Other filters [this method] you declare in the array of filter rules are applied to input data before data are validated. If escaping filters were run before validation, the process of validation would be more complex, and it would be harder to provide both escaped and unescaped versions of the data.

Confusing, isn't it?

This method will work nonetheless:

$input = new Zend_Filter_Input($filters, $validators, $data);
$input->setDefaultEscapeFilter(new Zend_Filter_StringTrim());

You should see the Unicode characters appearing in your database just as they were entered on the form.
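To see why the choice of escape filter matters, here is a rough sketch in plain PHP with no Zend classes involved. The two helper functions are hypothetical stand-ins of mine: they only approximate what an HTML-entities escape filter and a string-trim escape filter do to a submitted value, and they are not the framework's own code.

```php
<?php
// Hypothetical stand-in for an HTML-entities escape filter.
function escapeAsEntities(string $value): string
{
    // Converts every character that has a named entity, including accented letters.
    return htmlentities($value, ENT_QUOTES, 'UTF-8');
}

// Hypothetical stand-in for a string-trim escape filter.
function escapeAsTrim(string $value): string
{
    // Only strips surrounding whitespace; the UTF-8 bytes are left untouched.
    return trim($value);
}

$submitted = "  J\xC3\xB6rg  "; // "  Jörg  " as typed into the form, UTF-8 encoded

echo '[', escapeAsEntities($submitted), "]\n"; // [  J&ouml;rg  ]
echo '[', escapeAsTrim($submitted), "]\n";     // [Jörg]
```

With the first variant the db ends up holding the entity &ouml; rather than the character itself, which is the behaviour described above.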

Notes:
I have had a number of issues with Zend_Filter_Input. Half the problem, I believe, is the reference manual for the framework, which does not present the information I need in an accessible manner. Shortly you will see another post venting some of the frustrations I have had with Zend Filter; in the meantime I hope these tips help with your development.
