
Don’t use strlen()

Each time I see someone use strlen() I cringe. It will break.

Despite its name, strlen() doesn’t count characters. It counts bytes. In UTF-8 a character may be up to four bytes long.
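To make the byte counts concrete, here is a small sketch that runs strlen() on one character from each UTF-8 length class. The sample characters and the variable names are my own choices for illustration; the characters are written as explicit byte escapes so the result does not depend on how the file itself is saved.

```php
<?php
// One character per UTF-8 length class, written as raw byte escapes.
$samples = array(
    'U+0061 (a)'       => "a",                // 1 byte
    'U+00E4 (a-umlaut)' => "\xC3\xA4",        // 2 bytes
    'U+20AC (euro)'    => "\xE2\x82\xAC",     // 3 bytes
    'U+1F600 (emoji)'  => "\xF0\x9F\x98\x80", // 4 bytes
);

$byte_counts = array();
foreach ( $samples as $label => $char ) {
    // strlen() reports bytes, although each sample is a single character.
    $byte_counts[] = strlen( $char );
    print $label . ': ' . strlen( $char ) . " byte(s)\n";
}
```

Every entry is one character, yet strlen() returns a different number for each.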

So what happens if we use strlen() and its companion substr() to shorten the title of a post?

<?php # -*- coding: utf-8 -*-
declare( encoding = 'UTF-8' );
header('Content-Type: text/plain;charset=utf-8');

$string = 'Doppelgänger';
print 'strlen():    ' . strlen( $string ) . "\n";
print 'mb_strlen(): ' . mb_strlen( $string, 'utf8' ) . "\n\n";

print 'substr():    ' . substr( $string, 0, 8 ) . "\n";
print 'mb_substr(): ' . mb_substr( $string, 0, 8, 'utf8' );

Output:

I have to use an image here: if I had used the plain text output, our news feed would break. And that’s what happens each time you use strlen() and substr() on strings encoded in UTF-8: you end up with partial characters and invalid UTF-8.
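The broken result can still be inspected safely as hex. The following is a minimal sketch, not part of the original script: it writes the example string as explicit bytes and uses preg_match() with the u modifier as a validity check, which is one possible way to detect the problem.

```php
<?php
// 'Doppelgänger' as explicit UTF-8 bytes; the umlaut is 0xC3 0xA4.
$string = "Doppelg\xC3\xA4nger";

// substr() cuts after 8 bytes, in the middle of the two-byte umlaut.
$cut = substr( $string, 0, 8 );

// Hex view of the result: it ends in a lone lead byte, 0xC3.
$hex = bin2hex( $cut );

// preg_match() with the u modifier fails on subjects that are not valid UTF-8.
$is_valid = ( 1 === preg_match( '//u', $cut ) );

print 'bytes:       ' . $hex . "\n";
print 'valid UTF-8: ' . ( $is_valid ? 'yes' : 'no' ) . "\n";
```

The trailing c3 in the hex dump is the first half of the umlaut; its second byte was cut off, which is why the string is no longer valid UTF-8.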

Alternatives to mb_strlen()

You can use different methods to get the real string length.

$length = preg_match_all( '(.)su', $string, $matches );

See also Hakre: PHP UTF-8 string Length.
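The same regular expression idea can also replace substr() when shortening a string. This is a sketch under assumptions: example_utf8_truncate() is a hypothetical helper name, it expects valid UTF-8 input, and it counts code points, so a letter followed by a combining mark counts as two. PCRE caps the repeat count at 65535, which is plenty for titles.

```php
<?php
/**
 * Hypothetical helper: return at most $max_chars characters of a UTF-8 string.
 * If the input is not valid UTF-8, preg_match() fails and an empty string
 * is returned (an arbitrary choice for this sketch).
 */
function example_utf8_truncate( $string, $max_chars ) {
    // Clamp to zero so a negative limit cannot produce an invalid pattern.
    $limit   = max( 0, (int) $max_chars );
    $pattern = '/^.{0,' . $limit . '}/su';

    if ( 1 === preg_match( $pattern, $string, $matches ) ) {
        return $matches[0];
    }

    return '';
}

// 'Doppelgänger' as explicit UTF-8 bytes.
$short = example_utf8_truncate( "Doppelg\xC3\xA4nger", 8 );
print $short . ' (' . strlen( $short ) . " bytes)\n";
```

Here the result keeps the complete umlaut: eight characters, nine bytes.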

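Or just use utf8_decode(), which converts every character to a single byte (note that the function is deprecated as of PHP 8.2):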

$length = strlen( utf8_decode( $string ) );

There is also a nice php-utf8 library on GitHub from Frank Smit.
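Since the mbstring extension is not installed everywhere, the approaches above can be combined into one wrapper. The following is only a sketch: example_utf8_length() is a hypothetical name, and falling back to the byte count for invalid input is my own assumption, not something the functions above prescribe.

```php
<?php
/**
 * Hypothetical helper: count the characters of a UTF-8 string.
 * Prefers mb_strlen() and falls back to the regular expression approach.
 */
function example_utf8_length( $string ) {
    if ( function_exists( 'mb_strlen' ) ) {
        return mb_strlen( $string, 'UTF-8' );
    }

    // Each match of (.) with the u modifier is one UTF-8 character.
    $count = preg_match_all( '(.)su', $string, $matches );

    // preg_match_all() returns false on invalid UTF-8; use the byte count then.
    return false === $count ? strlen( $string ) : $count;
}

// 'Doppelgänger' as explicit UTF-8 bytes.
$length = example_utf8_length( "Doppelg\xC3\xA4nger" );
print 'characters: ' . $length . "\n";
```

Both branches return 12 for the example string, one less than the 13 bytes strlen() reports.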


16 comments

  1. Matthias

    Thanks for the tip. It helped me here.

  2. Sergey Vlasov

    So, how do you determine the character count in PHP?

  3. Thomas

    @Sergey Use mb_strlen().

  4. Nicolas

    WordPress core sometimes uses strlen(). Should they use only mb_strlen()?

  5. Thomas

    @Nicolas: It depends. If you really know you have only single byte characters it is okay. Unfortunately WordPress uses strlen() sometimes on data where this is not the case (plugin description length in WP_Plugin_Install_List_Table or image captions in wp_read_image_metadata() for example).

    There are rarely critical side effects unless substr() is used to write something into the database or into the output.

    substr($_SERVER['HTTP_USER_AGENT'], 0, 254); for example is written to the database and may be invalid UTF-8.

  6. Marcel

    mb_strlen is not always available ...

  7. Thomas

    @Marcel wp-includes/compat.php defines the function if it is missing.

  8. Patrick

    Wow, thanks a lot Thomas. Didn't know this yet.

  9. Andy W

    To me, this sounds more like a problem with the PHP function not doing what its name suggests it does.

    There is no mention in the documentation for either strlen() or mb_strlen() that this is the case... it's just shoddy work on behalf of the PHP development team.

    I think strlen() should give you the number of characters in a string and there should be a dedicated function for the number of bytes, perhaps strbytes()?

  10. Mohamed Tair

    Wow, thanks a lot.
    Didn't know this yet.

  11. adumpaul

    Nice work. Really great stuff. Keep it up. Thank you.

  12. GaryJ

    wp-includes/compat.php defines mb_substr(), but not mb_strlen().

  13. Thomas

    @GaryJ, you are right, I stand corrected. :)

    I have added some alternatives and links to show other ways.

  14. adumpaul

    Nice article. It's really nice work. Thank you.

  15. Guillaume

    You saved my sunday :)

One pingback

  1. Les acteurs du Web en ont parlé [#28] | Le blog des nouvelles technologies : Web, Technologies, Développement, Interopérabilité