
Don’t use strlen()
Each time I see someone use strlen(), I cringe. It will break. Despite its name, strlen() doesn’t count characters. It counts bytes. In UTF-8 a character may be up to four bytes long.
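To make the byte-versus-character difference concrete, here is a small sketch that compares both counts for four single characters. It assumes the mbstring extension is loaded; the sample characters and the $samples and $byte_counts names are only illustrative.

```php
<?php
// Four single characters, written as raw UTF-8 byte sequences so the
// example does not depend on the encoding this file is saved in.
$samples = array(
    'U+0061 (a)'             => "a",
    'U+00E4 (a with umlaut)' => "\xC3\xA4",
    'U+20AC (euro sign)'     => "\xE2\x82\xAC",
    'U+1F600 (emoji)'        => "\xF0\x9F\x98\x80",
);

$byte_counts = array();

foreach ( $samples as $label => $char ) {
    // strlen() reports bytes, mb_strlen() reports characters.
    $byte_counts[ $label ] = strlen( $char );
    printf(
        "%s: strlen() = %d, mb_strlen() = %d\n",
        $label,
        strlen( $char ),
        mb_strlen( $char, 'UTF-8' )
    );
}
```

Every sample is one character for mb_strlen(), while strlen() reports one to four bytes.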
So what happens if we use strlen() and its companion substr() to shorten the title of a post?
<?php # -*- coding: utf-8 -*-
declare( encoding = 'UTF-8' );
header('Content-Type: text/plain;charset=utf-8');

$string = 'Doppelgänger';

print 'strlen(): ' . strlen( $string ) . "\n";
print 'mb_strlen(): ' . mb_strlen( $string, 'utf8' ) . "\n\n";

print 'substr(): ' . substr( $string, 0, 8 ) . "\n";
print 'mb_substr(): ' . mb_substr( $string, 0, 8, 'utf8' );
Output:
I have to use an image here. If I had used the plain text output, our newsfeed would break. And that’s what happens each time you use strlen() and substr() on strings encoded in UTF-8: you end up with partial characters and invalid UTF-8.
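The same result can be inspected without emitting broken bytes. This is a minimal sketch, assuming the mbstring extension is loaded: it prints the truncated strings as hex with bin2hex() and validates them with mb_check_encoding(). The variable names $cut and $mb_cut are mine.

```php
<?php
// "Doppelgänger", with the ä written as its two UTF-8 bytes (0xC3 0xA4).
$string = "Doppelg\xC3\xA4nger";

$cut    = substr( $string, 0, 8 );             // first 8 bytes
$mb_cut = mb_substr( $string, 0, 8, 'UTF-8' ); // first 8 characters

print 'strlen():    ' . strlen( $string ) . "\n";             // 13
print 'mb_strlen(): ' . mb_strlen( $string, 'UTF-8' ) . "\n"; // 12

// Hex dumps are safe to print: substr() ends on a lone 0xC3,
// mb_substr() keeps the complete pair 0xC3 0xA4.
print 'substr():    ' . bin2hex( $cut ) . "\n";    // 446f7070656c67c3
print 'mb_substr(): ' . bin2hex( $mb_cut ) . "\n"; // 446f7070656c67c3a4

var_dump( mb_check_encoding( $cut, 'UTF-8' ) );    // bool(false)
var_dump( mb_check_encoding( $mb_cut, 'UTF-8' ) ); // bool(true)
```

The trailing c3 in the substr() result is the first half of the ä. On its own it is not a valid character.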
Alternatives to mb_strlen()
You can use different methods to get the real string length.
$length = preg_match_all( '(.)su', $string, $matches );
See also Hakre: PHP UTF-8 string Length.
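One thing to keep in mind with the regex approach: the u modifier makes PCRE validate the subject as UTF-8. The sketch below, with illustrative variable names, shows the count for a valid string and what I would expect for a string that was cut in the middle of a character.

```php
<?php
$string = "Doppelg\xC3\xA4nger"; // "Doppelgänger" as UTF-8 bytes

// Valid UTF-8: every character matches once, so the match count is the length.
$length = preg_match_all( '(.)su', $string, $matches );
print $length . "\n"; // 12

// Cut after 8 bytes, i.e. in the middle of the ä: no longer valid UTF-8.
$broken = substr( $string, 0, 8 );
$result = preg_match_all( '(.)su', $broken, $matches );

if ( false === $result && PREG_BAD_UTF8_ERROR === preg_last_error() ) {
    print "Invalid UTF-8, no length available.\n";
}
```

So the return value should be compared with false before it is used as a length.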
Or just use …
$length = strlen( utf8_decode( $string ) );
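This works because utf8_decode() converts the string to ISO-8859-1, where every character is exactly one byte (characters outside that range become a question mark). Note that utf8_decode() is deprecated as of PHP 8.2, so newer code may want to avoid it. A possible sketch of a helper that does so is below; the name utf8_length() is hypothetical and not taken from any library.

```php
<?php
/**
 * Hypothetical helper: counts the characters of a UTF-8 string
 * without calling utf8_decode().
 *
 * @param  string $string Text assumed to be encoded as UTF-8.
 * @return int|false      Character count; the regex fallback returns
 *                        false if the input is not valid UTF-8.
 */
function utf8_length( $string ) {
    // Prefer mbstring when the extension is available.
    if ( function_exists( 'mb_strlen' ) ) {
        return mb_strlen( $string, 'UTF-8' );
    }

    // Fallback: one regex match per character.
    return preg_match_all( '(.)su', $string, $matches );
}

print utf8_length( "Doppelg\xC3\xA4nger" ) . "\n"; // 12
print utf8_length( '' ) . "\n";                    // 0
```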
There is also a nice php-utf8 library on GitHub from Frank Smit.