Don’t use strlen()

Each time I see someone use strlen() I cringe. It will break.

Despite its name, strlen() doesn’t count characters. It counts bytes. In UTF-8 a character may be up to four bytes long.
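To make the byte/character difference concrete, here is a minimal sketch that compares both counts for characters of one, two, three and four bytes. It assumes the mbstring extension is loaded; the `$samples` array and its labels are only illustrative. The characters are written as byte escapes, so the result does not depend on how the file itself is saved.

```php
<?php
// Illustrative samples: label => UTF-8 byte sequence of a single character.
$samples = array(
    'U+0061 (a)'      => "a",                // 1 byte
    'U+00E4 (a-uml)'  => "\xC3\xA4",         // 2 bytes
    'U+20AC (euro)'   => "\xE2\x82\xAC",     // 3 bytes
    'U+1F600 (emoji)' => "\xF0\x9F\x98\x80", // 4 bytes
);

foreach ( $samples as $label => $char ) {
    // strlen() reports bytes, mb_strlen() reports characters.
    printf(
        "%-16s bytes: %d  characters: %d\n",
        $label,
        strlen( $char ),
        mb_strlen( $char, 'UTF-8' )
    );
}
```

Every line reports one character, but the byte count grows from 1 to 4.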

So what happens if we use strlen() and its companion substr() to shorten the title of a post?

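<?php # -*- coding: utf-8 -*-
// declare( encoding ) only has an effect when zend.multibyte is enabled.
declare( encoding = 'UTF-8' );
header('Content-Type: text/plain;charset=utf-8');

// 12 characters, but 13 bytes: the "ä" takes two bytes in UTF-8.
$string = 'Doppelgänger';
print 'strlen():    ' . strlen( $string ) . "\n";
print 'mb_strlen(): ' . mb_strlen( $string, 'UTF-8' ) . "\n\n";

// substr() cuts after 8 bytes, in the middle of the "ä";
// mb_substr() cuts after 8 characters.
print 'substr():    ' . substr( $string, 0, 8 ) . "\n";
print 'mb_substr(): ' . mb_substr( $string, 0, 8, 'UTF-8' );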

Output:

I have to use an image here. If I had used the plain-text output, our newsfeed would break. And that’s what happens each time you use strlen() and substr() on strings encoded in UTF-8: you end up with partial characters and invalid UTF-8.
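A way to look at the damage without printing broken bytes is a hex dump. The following is a minimal sketch, assuming the mbstring extension is available; the variable names are only illustrative, and the string is written with byte escapes so the example does not depend on the file encoding.

```php
<?php
// "Doppelgänger"; \xC3\xA4 is the UTF-8 encoding of "ä".
$string = "Doppelg\xC3\xA4nger";

$cut_bytes = substr( $string, 0, 8 );             // first 8 bytes
$cut_chars = mb_substr( $string, 0, 8, 'UTF-8' ); // first 8 characters

// bin2hex() output is plain ASCII, so it is safe to print anywhere.
print 'substr():    ' . bin2hex( $cut_bytes ) . "\n"; // 446f7070656c67c3
print 'mb_substr(): ' . bin2hex( $cut_chars ) . "\n"; // 446f7070656c67c3a4

// The byte-based cut ends with a lone lead byte (c3) and is no longer valid UTF-8.
var_dump( mb_check_encoding( $cut_bytes, 'UTF-8' ) ); // bool(false)
var_dump( mb_check_encoding( $cut_chars, 'UTF-8' ) ); // bool(true)
```

The trailing `c3` in the first dump is the first half of the "ä"; the second half, `a4`, was cut off.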

Alternatives to mb_strlen()

You can use different methods to get the real string length.

$length = preg_match_all( '(.)su', $string, $matches );

See also Hakre: PHP UTF-8 string Length.

Or just use …

$length = strlen( utf8_decode( $string ) );

There is also a nice php-utf8 library on GitHub by Frank Smit.
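The alternatives above can be combined into one fallback helper. This is only a sketch: `example_utf8_length()` is a hypothetical name, not part of PHP, WordPress or the php-utf8 library. It prefers mb_strlen() and falls back to the preg_match_all() approach. It leaves out utf8_decode() on purpose, because that function is deprecated as of PHP 8.2. Returning the byte count for invalid UTF-8 is an assumption of this sketch; you may prefer to return false instead.

```php
<?php
/**
 * Hypothetical helper: counts the characters in a UTF-8 string.
 *
 * @param  string $string UTF-8 encoded input.
 * @return int    Number of characters, or the byte count if the
 *                PCRE fallback is used and the input is not valid UTF-8.
 */
function example_utf8_length( $string ) {
    // Preferred: the mbstring extension, if it is loaded.
    if ( function_exists( 'mb_strlen' ) ) {
        return mb_strlen( $string, 'UTF-8' );
    }

    // Fallback: with the "u" modifier, "." matches one whole character.
    $count = preg_match_all( '(.)su', $string, $matches );

    // preg_match_all() returns false when the input is not valid UTF-8.
    return false === $count ? strlen( $string ) : $count;
}

print example_utf8_length( "Doppelg\xC3\xA4nger" ); // 12
```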

15 Comments
  1. Matthias says:

    Thanks for the tip. It helped me here.

  2. Sergey says:

    So, how do you determine the character count in PHP?

  3. Thomas says:

    @Sergey Use mb_strlen().

  4. Nicolas says:

    WordPress core sometimes uses strlen(). Should they use only mb_strlen()?

  5. Thomas says:

    @Nicolas: It depends. If you really know you have only single-byte characters, it is okay. Unfortunately, WordPress sometimes uses strlen() on data where this is not the case (plugin description length in WP_Plugin_Install_List_Table or image captions in wp_read_image_metadata(), for example).

    There are rarely critical side effects unless substr() is used to write something into the database or into the output.

    For example, the result of substr($_SERVER['HTTP_USER_AGENT'], 0, 254); is written to the database and may be invalid UTF-8.

  6. Marcel says:

    mb_strlen is not always available ...

  7. Thomas says:

    @Marcel wp-includes/compat.php defines the function if it is missing.

  8. Patrick says:

    Wow, thanks a lot Thomas. Didn't know this yet.

  9. Andy W says:

    To me, this sounds more like a problem with the PHP function not doing what its name suggests it does.

    There is no mention in the documentation for either strlen() or mb_strlen() that this is the case. It's just shoddy work on behalf of the PHP development team.

    I think strlen() should give you the number of characters in a string, and there should be a dedicated function for the number of bytes, perhaps strbytes()?

  10. Mohamed Tair says:

    Wow, thanks a lot. Didn't know this yet.

  11. adumpaul says:

    Nice work. Really great stuff. Keep it up. Thank you.

  12. GaryJ says:

    wp-includes/compat.php defines mb_substr(), but not mb_strlen().

  13. Thomas says:

    @GaryJ, you are right, I stand corrected. :)

    I have added some alternatives and links to show other ways.

  14. adumpaul says:

    Nice article. It's really nice work. Thank you.

  15. Guillaume says:

    You saved my Sunday :)
