Each time I see someone use strlen() I cringe. It will break.
Despite its name, strlen() doesn’t count characters. It counts bytes. In UTF-8 a character may be up to four bytes long.
So what happens if we use strlen() and its companion substr() to shorten the title of a post?
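<?php # -*- coding: utf-8 -*-
// Only has an effect when zend.multibyte is enabled.
declare( encoding = 'UTF-8' );
header( 'Content-Type: text/plain;charset=utf-8' );

// 12 characters, but the ä takes two bytes in UTF-8.
$string = 'Doppelgänger';

print 'strlen(): ' . strlen( $string ) . "\n";                 // 13 (bytes)
print 'mb_strlen(): ' . mb_strlen( $string, 'UTF-8' ) . "\n\n"; // 12 (characters)

// Cuts after 8 bytes: 'Doppelg' plus the first half of the ä.
print 'substr(): ' . substr( $string, 0, 8 ) . "\n";
// Cuts after 8 characters: 'Doppelgä'.
print 'mb_substr(): ' . mb_substr( $string, 0, 8, 'UTF-8' );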
I have to use an image here. If I had used the plain text output, our newsfeed would break. And that’s what happens each time you use strlen() and substr() on strings encoded in UTF-8: you end up with partial characters and invalid UTF-8.
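To make the title case concrete, here is a minimal sketch of a character-safe shortener. The function name shorten_title() and its parameters are made up for illustration, and it assumes PHP 7 or later with the mbstring extension loaded and a source file saved as UTF-8.

```php
<?php
// Hypothetical helper, not a built-in function: shorten a UTF-8 title
// to at most $max characters without cutting a multibyte character in half.
function shorten_title( string $title, int $max, string $more = '…' ): string
{
    // Count characters, not bytes.
    if ( mb_strlen( $title, 'UTF-8' ) <= $max ) {
        return $title;
    }

    // mb_substr() works on character offsets, so the result stays valid UTF-8.
    return mb_substr( $title, 0, $max, 'UTF-8' ) . $more;
}

print shorten_title( 'Doppelgänger', 8 ); // Doppelgä…
```

Passing the encoding explicitly keeps the result independent of whatever mb_internal_encoding() happens to be set to on the server.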
You can use different methods to get the real string length.
$length = preg_match_all( '(.)su', $string, $matches );
See also Hakre: PHP UTF-8 string Length.
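Or just use …
$length = strlen( utf8_decode( $string ) );
This works because utf8_decode() converts the string to ISO-8859-1, where every character is a single byte; characters outside that range become a single ?. Note that utf8_decode() is deprecated as of PHP 8.2.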
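If you cannot be sure which extensions are available, the methods can be combined. The following is only a sketch: utf8_length() is a hypothetical name, the exception on invalid input is my own choice, and it assumes PHP 7 or later. It prefers mb_strlen() and falls back to the preg_match_all() approach.

```php
<?php
// Hypothetical helper: pick whichever character-counting method is available.
function utf8_length( string $string ): int
{
    if ( function_exists( 'mb_strlen' ) ) {
        return mb_strlen( $string, 'UTF-8' );
    }

    // Fallback: count matches of "any character" in UTF-8 mode.
    // preg_match_all() returns false when $string is not valid UTF-8.
    $length = preg_match_all( '(.)su', $string );

    if ( false === $length ) {
        throw new InvalidArgumentException( 'String is not valid UTF-8.' );
    }

    return $length;
}

print utf8_length( 'Doppelgänger' ); // 12
```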