Don’t use strlen()

Each time I see someone use strlen() I cringe. It will break as soon as the string contains a multibyte character.

Despite its name, strlen() doesn’t count characters. It counts bytes. In UTF-8 a character may be up to four bytes long.
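
To make the byte counts concrete, here is a minimal sketch that runs strlen() over four single characters, one for each possible UTF-8 length. It assumes PHP 7.0 or later, because the characters are written as "\u{…}" escapes so that the example does not depend on how the file itself is saved. The sample characters and variable names are my own choice for illustration.

```php
<?php
// One character each, keyed by code point.
$characters = [
    'U+0061'  => "a",          // Latin small letter a
    'U+00E4'  => "\u{E4}",     // a with diaeresis
    'U+20AC'  => "\u{20AC}",   // euro sign
    'U+1F600' => "\u{1F600}",  // grinning face emoji
];

$byteCounts = [];
foreach ( $characters as $codePoint => $character ) {
    // strlen() reports the number of bytes, not the number of characters.
    $byteCounts[ $codePoint ] = strlen( $character );
    printf( "%-8s %d byte(s)\n", $codePoint, $byteCounts[ $codePoint ] );
}
```

Every entry is exactly one character, yet strlen() returns 1, 2, 3 and 4.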

So what happens if we use strlen() and its companion substr() to shorten the title of a post?

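<?php # -*- coding: utf-8 -*-
// Save this file as UTF-8; the string literal below depends on it.
header( 'Content-Type: text/plain;charset=utf-8' );

$string = 'Doppelgänger'; // 12 characters, 13 bytes: "ä" takes two bytes

// Length in bytes versus length in characters.
print 'strlen():    ' . strlen( $string ) . "\n";
print 'mb_strlen(): ' . mb_strlen( $string, 'UTF-8' ) . "\n\n";

// substr() cuts after 8 bytes, in the middle of the "ä".
print 'substr():    ' . substr( $string, 0, 8 ) . "\n";
// mb_substr() cuts after 8 characters.
print 'mb_substr(): ' . mb_substr( $string, 0, 8, 'UTF-8' );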

Output:

I have to use an image here: if I had used the plain-text output, our newsfeed would break. strlen() reports 13 while mb_strlen() reports 12, and substr() cuts the two-byte ä in half. That’s what happens each time you use strlen() and substr() on strings encoded in UTF-8: you end up with partial characters and invalid UTF-8.
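
For the title-shortening case, something along these lines should be safer. This is only a sketch: shorten_title() is a hypothetical helper, not a WordPress or PHP function, and it assumes the mbstring extension is loaded and PHP 7.0 or later. mb_check_encoding() is used to show that the substr() result is no longer valid UTF-8.

```php
<?php
/**
 * Hypothetical helper: shorten a UTF-8 title to at most $max characters
 * and append an ellipsis if something was cut off.
 */
function shorten_title( string $title, int $max, string $ellipsis = '...' ): string {
    if ( mb_strlen( $title, 'UTF-8' ) <= $max ) {
        return $title;
    }

    // mb_substr() counts characters, so it never cuts inside a multibyte sequence.
    return mb_substr( $title, 0, $max, 'UTF-8' ) . $ellipsis;
}

$title = "Doppelg\u{E4}nger";

$broken = substr( $title, 0, 8 );      // 8 bytes: ends with the first half of the "ä"
$intact = shorten_title( $title, 8 );  // 8 characters plus the ellipsis

var_dump( mb_check_encoding( $broken, 'UTF-8' ) ); // bool(false)
var_dump( mb_check_encoding( $intact, 'UTF-8' ) ); // bool(true)
```

Note that mb_substr() counts code points, not what a reader perceives as one character, so a letter followed by a separate combining accent can still be split.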

Alternatives to mb_strlen()

If mb_strlen() is not available, there are other ways to get the real string length, counted in characters.

$length = preg_match_all( '(.)su', $string, $matches );

See also Hakre: PHP UTF-8 string Length.

Or just use …

$length = strlen( utf8_decode( $string ) );

There is also a nice php-utf8 library on GitHub by Frank Smit.
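
The alternatives above can be wrapped in one function that prefers mb_strlen() and falls back to the regular expression. This is a rough sketch under my own assumptions: utf8_length() is a hypothetical name, and I left out the utf8_decode() variant because that function has been deprecated since PHP 8.2.

```php
<?php
/**
 * Hypothetical helper: count the characters (code points) in a UTF-8 string.
 *
 * Returns an int, or false when only the regex fallback is available
 * and the input is not valid UTF-8.
 */
function utf8_length( string $string ) {
    if ( function_exists( 'mb_strlen' ) ) {
        return mb_strlen( $string, 'UTF-8' );
    }

    // Fallback: with the "u" modifier the dot matches one whole character,
    // and the "s" modifier lets it match line breaks too.
    // preg_match_all() returns the number of matches, or false on failure.
    return preg_match_all( '(.)su', $string, $matches );
}

$length = utf8_length( "Doppelg\u{E4}nger" );
print $length . "\n"; // 12
```

The two branches differ on invalid input: mb_strlen() still returns a number, while preg_match_all() returns false, so callers that may receive broken strings should check for false.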


Comments

16 responses to “Don’t use strlen()”

  1. Matthias

    Thanks for the tip. It helped me here.

  2. Sergey Vlasov

    So, how do you determine the character count in PHP?

  3. Thomas

    @Sergey Use mb_strlen().

  4. NIcolas

    WordPress core sometimes uses strlen(). Should they use only mb_strlen()?

  5. Thomas

    @Nicolas: It depends. If you really know you have only single-byte characters, it is okay. Unfortunately, WordPress sometimes uses strlen() on data where this is not the case (plugin description length in WP_Plugin_Install_List_Table or image captions in wp_read_image_metadata(), for example).

    There are rarely critical side effects unless substr() is used to write something into the database or into the output.

    For example, the result of substr($_SERVER['HTTP_USER_AGENT'], 0, 254); is written to the database and may be invalid UTF-8.

  6. Marcel

    mb_strlen is not always available …

  7. Thomas

    @Marcel wp-includes/compat.php defines the function if it is missing.

  8. Patrick

    Wow, thanks a lot Thomas. Didn’t know this yet.

  9. Andy W

    To me, this sounds more like a problem with the PHP function not doing what its name suggests it does.

    There is no mention in the documentation for either strlen() or mb_strlen() that this is the case. It’s just shoddy work on the part of the PHP development team.

    I think strlen() should give you the number of characters in a string, and there should be a dedicated function for the number of bytes, perhaps strbytes()?

  10. Mohamed Tair

    Wow, thanks a lot.
    Didn’t know this yet.

  11. adumpaul

    Nice work. Really great stuff. Keep it up. Thank you.

  12. GaryJ

    wp-includes/compat.php defines mb_substr(), but not mb_strlen().

  13. Thomas

    @GaryJ, you are right, I stand corrected. 🙂

    I have added some alternatives and links to show other ways.

  14. adumpaul

    Nice article. It’s really nice work. Thank you.

  15. Guillaume

    You saved my Sunday 🙂