phputf8 - Tools for working with UTF-8 in PHP
require_once '/path/to/utf8/utf8.php';
require_once UTF8 . '/utils/validation.php';
require_once UTF8 . '/utils/ascii.php';
# Check the UTF-8 is well formed
if ( !utf8_is_valid($_POST['somecontent']) ) {
    require_once UTF8 . '/utils/bad.php';
    trigger_error('Bad UTF-8 detected. Cleaning', E_USER_NOTICE);
    # Strip out bad sequences - replace with ? character
    $_POST['somecontent'] = utf8_bad_replace($_POST['somecontent']);
}
# This works fine with UTF-8
$_POST['somecontent'] = ltrim($_POST['somecontent']);
# If it contains only ascii chars, use native str fns for speed...
if ( utf8_is_ascii($_POST['somecontent']) ) {
    $endfirstword = strpos($_POST['somecontent'],' ');
    $firstword = substr($_POST['somecontent'],0,$endfirstword);
    $firstword = strtoupper($firstword);
    $therest = substr($_POST['somecontent'],$endfirstword);
} else {
    # It contains multibyte sequences - use the slower but safe
    # utf8_* functions
    $endfirstword = utf8_strpos($_POST['somecontent'],' ');
    $firstword = utf8_substr($_POST['somecontent'],0,$endfirstword);
    $firstword = utf8_strtoupper($firstword);
    $therest = utf8_substr($_POST['somecontent'],$endfirstword);
}
# htmlspecialchars is also safe for use with UTF-8
header("Content-Type: text/html; charset=utf-8");
echo "<pre>";
echo "<strong>".htmlspecialchars($firstword)."</strong>";
echo htmlspecialchars($therest);
echo "</pre>";
phputf8 does a few things for you;

- UTF-8 aware versions of PHP's native string functions, prefixed
  with utf8_. Six of these functions
  are loaded "on the fly", depending on whether you have the mbstring
  extension available. The rest build on top of those six.

- UTF8 . '/utils/validation.php' contains functions for testing
  strings for bad UTF-8 sequences. Note that other functions in the
  library assume valid UTF-8.

- UTF8 . '/utils/bad.php' contains functions for locating and
  cleaning bad UTF-8 sequences.

- UTF8 . '/utils/ascii.php' contains utilities to detect
  whether a UTF-8 string contains just ASCII characters (allowing
  you to use PHP's faster, native string functions) and to strip
  everything non-ASCII from a string.

- UTF8 . '/utils/specials.php' contains basic transliteration
  functionality (http://en.wikipedia.org/wiki/Transliteration) - not
  much, but enough to convert common European, non-ASCII characters to
  a reasonable ASCII equivalent. You might use these when preparing a
  string for use as a filename, after which you strip all other
  non-ASCII characters using the ASCII utilities.

- Further transliteration is available in the separate utf8_to_ascii
  package at http://sourceforge.net/projects/phputf8. Much more powerful
  functionality is provided by the pecl transliteration extension -
  http://derickrethans.nl/translit.php and
  http://pecl.php.net/package/translit.

There are seven essential functions provided by phputf8, which are
required by many of the other functions. These are all loaded
when you include the main utf8.php script e.g.
require_once '/path/to/utf8/utf8.php';
Six of these functions depend on whether the mbstring extension is installed (see http://www.php.net/mbstring) - if it is available, the following functions will be wrappers around the equivalent mbstring functions;

- utf8_strlen
- utf8_strpos
- utf8_strrpos
- utf8_substr
- utf8_strtolower
- utf8_strtoupper

Note: phputf8 cannot support mbstring function overloading; it relies in some cases on PHP's native string functions counting characters as bytes.

The seventh function is utf8_substr_replace, which is
implemented independent of mbstring (mbstring doesn't
provide it).
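
The "wrapper when mbstring is available, native fallback otherwise" idea can be sketched as follows. This is only an illustration, not the library's code: sketch_utf8_strlen is a hypothetical name, and the fallback (counting every byte that is not a continuation byte) is one possible approach that assumes well formed UTF-8.

```php
<?php
// Hypothetical helper, not part of phputf8: count the characters in a
// string that is assumed to be well formed UTF-8.
function sketch_utf8_strlen(string $str): int
{
    if (function_exists('mb_strlen')) {
        // mbstring is available: pass the encoding explicitly
        return mb_strlen($str, 'UTF-8');
    }
    // Fallback: count every byte that starts a character, i.e.
    // everything except continuation bytes (0x80-0xBF).
    return (int) preg_match_all('/[\x00-\x7F\xC0-\xFF]/', $str);
}

echo sketch_utf8_strlen('Iñtërnâtiônàlizætiøn'), "\n"; // 20 characters
echo strlen('Iñtërnâtiônàlizætiøn'), "\n";             // 27 bytes
```

Both branches give the same answer for valid input, which is why the choice can be made once when the script is loaded.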
Important Note - if you do not load utf8.php and you wish
to use the mbstring implementations, you need to set the mbstring
encoding to UTF-8 yourself - see http://www.php.net/mb_internal_encoding.
All other string functions must be included on demand. They are
available directly under the UTF8 directory with filenames
corresponding to the equivalent PHP string functions, but still
with the function prefix utf8_.
For example, to load the strrev implementation;
# Load the main script
require_once '/path/to/utf8/utf8.php';
# Load the UTF-8 aware strrev implementation
require_once UTF8 . '/strrev.php';
print utf8_strrev('Iñtërnâtiônàlizætiøn')."\n";
All string implementations are found in the UTF8 directory.
For documentation for each function, see the phpdocs
http://phputf8.sourceforge.net/api.
TODO Some of the functions, such as utf8_strcspn take
arguments like 'start' and 'length', requiring values in terms
of characters not bytes - i.e. return values from functions
like utf8_strlen and utf8_strpos. Additional implementations
would be useful which take byte indexes instead of character
positions - this would allow further advantage to be taken of
UTF-8's design and more use of PHP's native functions for performance.
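
To make the difference between byte indexes and character positions concrete, here is a small sketch. The helper sketch_byte_to_char_offset is hypothetical (it is not part of phputf8) and assumes well formed UTF-8.

```php
<?php
// Hypothetical helper, not part of phputf8: how many characters come
// before a given byte offset in a well formed UTF-8 string.
function sketch_byte_to_char_offset(string $str, int $byteOffset): int
{
    // Count the bytes before the offset that start a character,
    // i.e. everything except continuation bytes (0x80-0xBF).
    $head = substr($str, 0, $byteOffset);
    return (int) preg_match_all('/[\x00-\x7F\xC0-\xFF]/', $head);
}

$haystack = 'Iñtërnâtiônàlizætiøn';
$bytePos  = strpos($haystack, 'ô'); // native function, byte based
$charPos  = sketch_byte_to_char_offset($haystack, $bytePos);

echo "byte offset: $bytePos, character offset: $charPos\n"; // 12 and 9
```

The byte offset can be passed straight to substr, while the character offset is what a function taking character positions would expect.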
It's important to understand that multi-byte UTF-8 characters can be badly formed. UTF-8 has rules regarding multi-byte characters and those rules can be broken. Some possible reasons why a sequence of bytes might be badly formed UTF-8;
With that in mind, the functions provided in ./utils/validation.php
and ./utils/bad.php are intended to help guard against such problems.
There are two functions in ./utils/validation.php, one "strict"
and the other slightly more relaxed.
The strict version is utf8_is_valid - as well as checking each
sequence, byte-by-byte, it also regards sequences which are not
part of the Unicode standard as being invalid (UTF-8 allows for
5 and 6 byte sequences, but these have no meaning in Unicode and will
result in browsers displaying "junk" characters, e.g. the ? character).
The second function, utf8_compliant, relies on the behaviour of
PHP's PCRE extension to spot invalid UTF-8 sequences. This
function will pass 5 and 6 byte sequences but also performs
much better than utf8_is_valid.
Both are simple to use;
require_once UTF8 . '/utils/validation.php';
if ( utf8_is_valid($str) ) {
    print "It's valid\n";
}
if ( utf8_compliant($str) ) {
    print "It's compliant\n";
}
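
For illustration, a PCRE based check in the spirit of the relaxed function might look like the sketch below. sketch_utf8_compliant is a hypothetical name and this is not the library's implementation; note too that whether 5 and 6 byte sequences pass depends on the PCRE version PHP was built against, and newer releases may reject them.

```php
<?php
// Hypothetical helper, not part of phputf8.
function sketch_utf8_compliant(string $str): bool
{
    // With the /u modifier PCRE checks the subject string first;
    // preg_match() returns false when it is not valid UTF-8.
    return preg_match('//u', $str) === 1;
}

var_dump(sketch_utf8_compliant('Iñtërnâtiônàlizætiøn')); // bool(true)
var_dump(sketch_utf8_compliant("abc\xFFdef"));           // bool(false)
```

The byte 0xFF can never appear in UTF-8, which is why the second string fails.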
If you detect a UTF-8 encoded string contains badly formed
sequences, functions in ./utils/bad.php can help. Be warned
that performance on large strings will be an issue.
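It provides the following functions;

- utf8_bad_find: returns the byte index of the first bad byte found
  in a string, or FALSE if there is none. For example;

require_once UTF8 . '/utils/validation.php';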
require_once UTF8 . '/utils/bad.php';
$clean = '';
while ( FALSE !== ( $badIndex = utf8_bad_find($str) ) ) {
print "Bad byte found at $badIndex\n";
$clean .= substr($str,0,$badIndex);
$str = substr($str,$badIndex+1);
}
$clean .= $str;
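
- utf8_bad_findall: works like utf8_bad_find but searches the complete
  string and returns the indexes of all bad bytes found in an array

- utf8_bad_strip: removes bad bytes from a string

- utf8_bad_replace: replaces bad bytes with a ? character

- utf8_bad_identify and utf8_bad_explain: for identifying and
  explaining what is wrong with a bad byte

The above functions for validating and cleaning UTF-8 strings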
all regard ASCII control characters as being valid and
acceptable. But ASCII control chars are not acceptable in XML
documents - use the utf8_strip_ascii_ctrl function in
./utils/ascii.php (available v0.3+), which will remove
all ASCII control characters that are illegal in XML.
See http://hsivonen.iki.fi/producing-xml/#controlchar.
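
As a rough sketch of what removing XML-illegal control characters involves (this is not the library's code, and sketch_strip_xml_illegal_ctrl is a hypothetical name), the characters in question are the C0 controls other than tab, line feed and carriage return, which XML 1.0 permits:

```php
<?php
// Hypothetical helper, not part of phputf8.
function sketch_strip_xml_illegal_ctrl(string $str): string
{
    // Keeps tab (0x09), line feed (0x0A) and carriage return (0x0D).
    // These byte values never occur inside a multibyte UTF-8 sequence,
    // so a byte based pattern is safe here.
    return (string) preg_replace('/[\x00-\x08\x0B\x0C\x0E-\x1F]+/', '', $str);
}

echo sketch_strip_xml_illegal_ctrl("line one\n\x00\x07line two"), "\n";
```

The NUL and BEL bytes are dropped while the line feed between the two lines is kept.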
Because validating and cleaning UTF-8 strings comes at a high cost in terms of performance, you should aim to do it only once, at the point where you receive some input (e.g. a submitted form), before going on to use the rest of the string functions in this library.
You should also be aware that validation and cleaning is your job -
the utf8_* string functions assume they are being given well formed
UTF-8 to process, because the performance overhead of checking, every
time you call utf8_strlen, for example, would be very high.
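
A minimal sketch of "validate and clean once, at the point of input" is shown below. Both function names are hypothetical and neither is part of phputf8; the pattern is one common way of describing well formed UTF-8 byte sequences, used here only to illustrate the idea of a cheap check followed by a slower repair.

```php
<?php
// Hypothetical helper, not part of phputf8: replace every byte that is
// not part of a well formed UTF-8 sequence with "?".
function sketch_bad_replace(string $str): string
{
    $pattern = '/(
          [\x00-\x7F]                        # ASCII
        | [\xC2-\xDF][\x80-\xBF]             # 2 byte sequences
        | \xE0[\xA0-\xBF][\x80-\xBF]         # 3 byte, no overlong forms
        | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # 3 byte
        | \xED[\x80-\x9F][\x80-\xBF]         # 3 byte, no surrogates
        | \xF0[\x90-\xBF][\x80-\xBF]{2}      # 4 byte, no overlong forms
        | [\xF1-\xF3][\x80-\xBF]{3}          # 4 byte
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # 4 byte, up to U+10FFFF
        ) | ./xs';
    // Group 1 holds a well formed sequence; anything else is one bad byte.
    return (string) preg_replace_callback($pattern, function (array $m): string {
        return isset($m[1]) && $m[1] !== '' ? $m[1] : '?';
    }, $str);
}

// Hypothetical entry point: call once when the input arrives.
function sketch_clean_input(string $input): string
{
    // Cheap check first; only pay for the slow repair when it is needed.
    if (preg_match('//u', $input) === 1) {
        return $input;
    }
    return sketch_bad_replace($input);
}

echo sketch_clean_input("abc\xFFdef"), "\n"; // the bad byte becomes "?"
```

After this single pass, the rest of the code can treat the string as well formed.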
The first thing you shouldn't be attempting to do is replace all use of PHP's native string functions with functions from this library. Doing so will have a dramatic (and bad) effect on your code's performance. It also misses opportunities you may have to continue using PHP's native string functions.
There are two main areas to consider, when working out how to support UTF-8 with this library and achieve optimal performance.
First, if the majority of the data your application will be processing is written in English, most of the time you will be able to use PHP's native string functions, only using the utf8_* string functions when you encounter multibyte characters. This has already been implied above in the example in the "SYNOPSIS". Most characters used in English fall within the ASCII-7 range and ASCII characters in UTF-8 are no different to normal ASCII characters.
So check whether a string is 100% ASCII first, and if so, use PHP's native string functions on it.
require_once '/path/to/utf8/utf8.php';
require_once UTF8 . '/utils/ascii.php';
if ( utf8_is_ascii($string) ) {
# use native PHP string functions
} else {
# use utf8_* string functions
}
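
The ASCII check itself is cheap. A minimal sketch of such a test (sketch_is_ascii is a hypothetical name, not the library's function) only has to look for a byte with the high bit set:

```php
<?php
// Hypothetical helper, not part of phputf8.
function sketch_is_ascii(string $str): bool
{
    // True when no byte falls outside the 7 bit ASCII range.
    return preg_match('/[^\x00-\x7F]/', $str) !== 1;
}

foreach (['internationalization', 'Iñtërnâtiônàlizætiøn'] as $s) {
    $advice = sketch_is_ascii($s)
        ? 'native string functions are safe'
        : 'use UTF-8 aware functions';
    echo "$s: $advice\n";
}
```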
Second, you may be able to exploit UTF-8's design to your advantage, depending on what exactly you are doing to a string. This road requires more effort and a good understanding of UTF-8's design.
As a starting point, you really need to examine the range table shown on Wikipedia's page on UTF-8 http://en.wikipedia.org/wiki/UTF-8.
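Some key points about UTF-8's design;

- ASCII characters are encoded as themselves, and the bytes used for them never appear inside a multibyte sequence. That means you can search for an ASCII character with PHP's native functions. For example;

$new_utf8_str = strstr('Iñtërnâtiônàlizætiøn','l');

will result in $new_utf8_str
being assigned the value 'lizætiøn', even though the haystack
string contains multibyte characters.

- A complete multibyte character can only match at a character boundary, so a multibyte needle is also safe to use with the native functions;

# Pop off a piece of a string using multi-byte character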
$new_utf8_str = strstr('Iñtërnâtiônàlizætiøn','ô');
# Explode string using multibyte character
$array = explode('ô','Iñtërnâtiônàlizætiøn');
# Using byte index instead of character index...
$haystack = 'Iñtërnâtiônàlizætiøn';
$needle = 'ô';
$pos = strpos($haystack, $needle);
print "Position in bytes is $pos<br>";
$substr = substr($haystack, 0, $pos);
print "Substr: $substr<br>";
Put those together and often you will be able to use existing code with little or no modification.
Often you will be able to continue working in bytes instead of logical characters (as the last example above shows).
There are some functions which you will always need to replace,
for example strtoupper. You should be able to get some idea of
which these functions are by looking at
http://www.phpwact.org/php/i18n/utf-8.
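
A quick way to see why strtoupper has to be replaced is to compare it with a UTF-8 aware equivalent. The sketch below uses mbstring's mb_strtoupper for the comparison (guarded, since the extension may not be installed); the note about native behaviour applies to PHP 8, and older versions depend on the current locale.

```php
<?php
$word = 'iñtërnâtiônàl';

// The native function only upper-cases the ASCII letters and leaves
// the multibyte characters as they are.
echo strtoupper($word), "\n";

// A UTF-8 aware function upper-cases the accented letters as well.
if (function_exists('mb_strtoupper')) {
    echo mb_strtoupper($word, 'UTF-8'), "\n";
}
```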
Sometimes you will need to be able to remove all multi-byte characters from a UTF-8 string and use only ASCII. Some possible reasons why;
To simply remove all multibyte characters, the ./utils/ascii.php
collection of functions can help e.g.;
require_once '/path/to/utf8/utf8.php';
require_once UTF8 . '/utils/ascii.php';
$str = "admın";
print utf8_strip_non_ascii($str); // prints "admn"
Note also the utf8_strip_non_ascii_ctrl function, which also
strips out ASCII control codes - see
"Warning on ASCII Control Characters" for information on that
topic.
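
As an illustration of what stripping non-ASCII characters amounts to (a sketch only, with the hypothetical name sketch_strip_non_ascii, not the library's implementation), every byte outside the 7 bit range can simply be dropped:

```php
<?php
// Hypothetical helper, not part of phputf8.
function sketch_strip_non_ascii(string $str): string
{
    // Every byte of a multibyte sequence is 0x80 or above, so removing
    // those bytes removes whole characters and leaves ASCII untouched.
    return (string) preg_replace('/[^\x00-\x7F]+/', '', $str);
}

echo sketch_strip_non_ascii('admın'), "\n"; // admn
```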
Now simply throwing out characters is not kind to users. An alternative is transliteration, where you try to replace multi byte characters with equivalent ASCII characters that a human would understand. For example "Zürich" could be converted to "Zuerich", the multi byte "ü" character being replaced by "ue".
See http://en.wikipedia.org/wiki/Transliteration for a general introduction to transliteration.
The main phputf8 package contains a single function in
the ./utils/ascii.php script that does some (basic)
replacements of accented characters common in languages
like French. After using this function, you should still
strip out all remaining multi-byte characters. For
example;
require_once '/path/to/utf8/utf8.php';
require_once UTF8 . '/utils/ascii.php';
$filename = utf8_accents_to_ascii($filename);
$filename = utf8_strip_non_ascii($filename);
This will at least preserve some characters in an ASCII form that will be understandable by users.
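
The two-step approach (replace what you can, then strip the rest) can be sketched with a small lookup table. Everything here is hypothetical: the function name, and in particular the mapping table, which covers only a handful of characters chosen for the example and is not the table the library uses.

```php
<?php
// Hypothetical helper and mapping table, not part of phputf8.
function sketch_to_ascii(string $str): string
{
    $map = [
        'ä' => 'ae', 'ö' => 'oe', 'ü' => 'ue', 'ß' => 'ss',
        'é' => 'e',  'è' => 'e',  'à' => 'a',  'ç' => 'c',
    ];
    // Step 1: replace the characters we have an ASCII equivalent for.
    $str = strtr($str, $map);
    // Step 2: strip whatever multibyte characters remain.
    return (string) preg_replace('/[^\x00-\x7F]+/', '', $str);
}

echo sketch_to_ascii('Zürich'), "\n"; // Zuerich
```

strtr works on bytes, but because a complete multibyte character can only match at a character boundary, the replacements cannot corrupt neighbouring characters.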
Further and much more powerful transliteration capabilities are provided in the separate utf8_to_ascii package distributed at http://sourceforge.net/projects/phputf8. Because it is a port of Perl's Text::Unidecode package to PHP, it is distributed under the same license.
A quick intro to utf8_to_ascii can be found at http://www.sitepoint.com/blogs/2006/03/03/us-ascii-transliterations-of-unicode-text/
Be warned that utf8_to_ascii does have limitations and a better choice, if you have rights to install it in your environment, is Derick Rethans' transliteration extension: http://pecl.php.net/package/translit.
- http://www.phpwact.org/php/i18n/charsets
- http://www.phpwact.org/php/i18n/utf-8
- http://wiki.silverorange.com/UTF-8_Notes
- http://svn.wikimedia.org/viewvc/mediawiki/trunk/phase3/includes/normal/ - Unicode normalization in PHP
- http://www.webtuesday.ch/_media/meetings/utf-8_survival.pdf