PHP Cross Reference of WordPress Trunk (Updated Daily)

/wp-includes/ -> compat-utf8.php (summary)

(no description)

File Size: 565 lines (20 kb)
Included or required: 0 times
Referenced: 0 times
Includes or requires: 0 files

Defines 8 functions

  _wp_scan_utf8()
  _wp_is_valid_utf8_fallback()
  _wp_scrub_utf8_fallback()
  _wp_utf8_codepoint_count()
  _wp_utf8_codepoint_span()
  _wp_has_noncharacters_fallback()
  _wp_utf8_encode_fallback()
  _wp_utf8_decode_fallback()

Functions
Functions that are not part of a class:

_wp_scan_utf8( string $bytes, int &$at, int &$invalid_length, ?int $max_bytes = null, ?int $max_code_points = null, ?bool &$has_noncharacters = null )
Finds spans of valid and invalid UTF-8 bytes in a given string.

This is a low-level tool to power various UTF-8 functionality.
It scans through a string until it finds an invalid span of bytes.
When it finds one, it does three things:

- Assigns `$at` to the position after the last successful code point.
- Assigns `$invalid_length` to the length of the maximal subpart of
the invalid bytes starting at `$at`.
- Returns how many code points were successfully scanned.

This information is enough to build a number of useful UTF-8 functions.

Example:

// ñ is U+F1, which in `ISO-8859-1`/`latin1`/`Windows-1252`/`cp1252` is 0xF1.
"Pi\xF1a" === $pineapple = mb_convert_encoding( "Piña", 'Windows-1252', 'UTF-8' );
$at = $invalid_length = 0;

// The first step finds the invalid 0xF1 byte.
2 === _wp_scan_utf8( $pineapple, $at, $invalid_length );
$at === 2; $invalid_length === 1;

// The second step continues to the end of the string.
1 === _wp_scan_utf8( $pineapple, $at, $invalid_length );
$at === 4; $invalid_length === 0;

Note! While passing an options array here might be convenient from a calling-code standpoint,
this function is intended to serve as a very low-level foundation upon which to build
higher level functionality. For the sake of keeping costs explicit all arguments are
passed directly.

param: string    $bytes             UTF-8 encoded string which might include invalid spans of bytes.
param: int       $at                Where to start scanning.
param: int       $invalid_length    Will be set to how many bytes are to be ignored after `$at`.
param: int|null  $max_bytes         Stop scanning after this many bytes have been seen.
param: int|null  $max_code_points   Stop scanning after this many code points have been seen.
param: bool|null $has_noncharacters Set to indicate if scanned string contained noncharacters.
return: int How many code points were successfully scanned.
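The scan-and-skip pattern described above can be sketched without WordPress loaded. Everything named `demo_*` below is a hypothetical stand-in written for illustration, not the WordPress implementation: `demo_scan_utf8()` mimics the documented contract using byte-level regular expressions built from the well-formed sequences in the Unicode Standard (Table 3-7), and it omits the `$max_bytes`, `$max_code_points`, and `$has_noncharacters` arguments. The sketch also assumes the caller is responsible for advancing `$at` past the invalid span between calls.

```php
<?php
// Hypothetical stand-in for a scanner with the contract documented above.
function demo_scan_utf8( string $bytes, int &$at, int &$invalid_length ): int {
    // A run of well-formed UTF-8 sequences (Unicode Standard, Table 3-7).
    $valid = '/\G(?:[\x00-\x7F]|[\xC2-\xDF][\x80-\xBF]|\xE0[\xA0-\xBF][\x80-\xBF]'
        . '|[\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}|\xED[\x80-\x9F][\x80-\xBF]'
        . '|\xF0[\x90-\xBF][\x80-\xBF]{2}|[\xF1-\xF3][\x80-\xBF]{3}'
        . '|\xF4[\x80-\x8F][\x80-\xBF]{2})*+/';
    // Truncated prefixes of well-formed sequences: the "maximal subparts".
    $prefix = '/\G(?:[\xC2-\xDF]|\xE0[\xA0-\xBF]?|[\xE1-\xEC\xEE\xEF][\x80-\xBF]?'
        . '|\xED[\x80-\x9F]?|\xF0(?:[\x90-\xBF][\x80-\xBF]?)?'
        . '|[\xF1-\xF3][\x80-\xBF]{0,2}|\xF4(?:[\x80-\x8F][\x80-\xBF]?)?)/';

    preg_match( $valid, $bytes, $run, 0, $at );
    $run = $run[0] ?? '';

    // Every byte that is not a continuation byte starts one code point.
    $count = (int) preg_match_all( '/[^\x80-\xBF]/', $run );
    $at   += strlen( $run );

    if ( $at >= strlen( $bytes ) ) {
        $invalid_length = 0;
        return $count;
    }

    // A byte that cannot start any sequence is an invalid span of length one.
    $invalid_length = preg_match( $prefix, $bytes, $bad, 0, $at ) ? strlen( $bad[0] ) : 1;
    return $count;
}

// Replaces each invalid span with U+FFFD, in the spirit of a scrubber.
function demo_scrub_utf8( string $bytes ): string {
    $at             = 0;
    $invalid_length = 0;
    $scrubbed       = '';
    $end            = strlen( $bytes );

    while ( $at < $end ) {
        $was_at = $at;
        demo_scan_utf8( $bytes, $at, $invalid_length );
        $scrubbed .= substr( $bytes, $was_at, $at - $was_at );

        if ( $invalid_length > 0 ) {
            $scrubbed .= "\u{FFFD}";
            $at       += $invalid_length; // The caller skips the invalid span.
        }
    }

    return $scrubbed;
}

// Counts code points, treating each invalid span as a single code point.
function demo_codepoint_count( string $bytes ): int {
    $at             = 0;
    $invalid_length = 0;
    $count          = 0;
    $end            = strlen( $bytes );

    while ( $at < $end ) {
        $count += demo_scan_utf8( $bytes, $at, $invalid_length );

        if ( $invalid_length > 0 ) {
            ++$count;
            $at += $invalid_length;
        }
    }

    return $count;
}
```

Both helpers share the same loop shape: scan, consume the valid run, then decide what an invalid span means for the task at hand. A regex-based scanner like this one is only meant to make the contract concrete; it makes no claim about performance on large inputs.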

_wp_is_valid_utf8_fallback( string $bytes )
Fallback mechanism for safely validating UTF-8 bytes.

param: string $bytes String which might contain text encoded as UTF-8.
return: bool Whether the provided bytes can decode as valid UTF-8.
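For comparison, one common way to validate UTF-8 in plain PHP without `mbstring` relies on the PCRE `u` modifier, which checks the subject string before matching. This is a minimal sketch of that general technique; `demo_is_valid_utf8()` is a hypothetical name, and nothing here implies that the fallback above works this way.

```php
<?php
// Hypothetical helper: validates UTF-8 by letting PCRE check the subject.
function demo_is_valid_utf8( string $bytes ): bool {
    // With the `u` modifier, preg_match() returns false (not 0) when the
    // subject is not valid UTF-8; the empty pattern otherwise matches once.
    return 1 === preg_match( '//u', $bytes );
}
```

The strict comparison with `1` matters here, because `false` signals malformed input rather than "no match".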

_wp_scrub_utf8_fallback( string $bytes )
Fallback mechanism for replacing invalid spans of UTF-8 bytes.

Example:

'Pi�a' === _wp_scrub_utf8_fallback( "Pi\xF1a" ); // “ñ” is 0xF1 in Windows-1252.

param: string $bytes UTF-8 encoded string which might contain spans of invalid bytes.
return: string Input string with spans of invalid bytes swapped with the replacement character.

_wp_utf8_codepoint_count( string $text, ?int $byte_offset = 0, ?int $max_byte_length = PHP_INT_MAX )
Returns how many code points are found in the given UTF-8 string.

Invalid spans of bytes count as a single code point according
to the maximal subpart rule. This function is a fallback method
for calling `mb_strlen( $text, 'UTF-8' )`.

When negative values are provided for the byte offsets or length,
this will always report zero code points.

Example:

4  === _wp_utf8_codepoint_count( 'text' );

// Groups are 'test', "\x90" as '�', 'wp', "\xE2\x80" as '�', "\xC0" as '�', and 'test'.
13 === _wp_utf8_codepoint_count( "test\x90wp\xE2\x80\xC0test" );

param: string $text            Count code points in this string.
param: ?int   $byte_offset     Start counting after this many bytes in `$text`. Must not be negative.
param: ?int   $max_byte_length Optional. Stop counting after having scanned past this many bytes.
return: int How many code points were found.

_wp_utf8_codepoint_span( string $text, int $byte_offset, int $max_code_points, ?int &$found_code_points = 0 )
Given a starting offset within a string and a maximum number of code points,
return how many bytes are occupied by the span of characters.

Invalid spans of bytes count as a single code point according to the maximal
subpart rule. This function is a fallback method for calling
`strlen( mb_substr( substr( $text, $byte_offset ), 0, $max_code_points ) )`.

param: string $text              Count bytes of span in this text.
param: int    $byte_offset       Start counting at this byte offset.
param: int    $max_code_points   Stop counting after this many code points have been seen.
param: ?int   $found_code_points Optional. Will be set to the number of code points found in the span.
return: int Number of bytes spanned by the code points.
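The idea of measuring a span in bytes for a given number of code points can be sketched for the simple case. The hypothetical `demo_codepoint_span()` below assumes well-formed UTF-8 and a `$byte_offset` that sits on a code point boundary; it does not apply the maximal subpart rule to invalid bytes, so it illustrates the return values only and is not a substitute for the function above.

```php
<?php
// Hypothetical helper: byte length of up to $max_code_points code points,
// assuming well-formed UTF-8 starting at a code point boundary.
function demo_codepoint_span( string $text, int $byte_offset, int $max_code_points, ?int &$found_code_points = 0 ): int {
    $end               = strlen( $text );
    $at                = $byte_offset;
    $found_code_points = 0;

    while ( $at < $end && $found_code_points < $max_code_points ) {
        // Consume the leading byte of the next code point.
        ++$at;

        // Consume its continuation bytes, which all look like 10xxxxxx.
        while ( $at < $end && 0x80 === ( ord( $text[ $at ] ) & 0xC0 ) ) {
            ++$at;
        }

        ++$found_code_points;
    }

    return $at - $byte_offset;
}
```

For example, with the UTF-8 text `"Pi\xC3\xB1a!"`, the first three code points occupy four bytes because "ñ" takes two. When the string ends early, `$found_code_points` reports how many code points were actually available.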

_wp_has_noncharacters_fallback( string $text )
Fallback support for determining if a string contains Unicode noncharacters.

param: string $text Are there noncharacters in this string?
return: bool Whether noncharacters were found in the string.
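As background, Unicode defines 66 noncharacters: the range U+FDD0 to U+FDEF, plus the last two code points of each of the 17 planes (U+FFFE, U+FFFF, U+1FFFE, U+1FFFF, and so on up to U+10FFFF). The hypothetical `demo_is_noncharacter()` below tests a single, already decoded code point; decoding a string into code points is left out of this sketch.

```php
<?php
// Hypothetical helper: is this code point one of the 66 Unicode noncharacters?
function demo_is_noncharacter( int $code_point ): bool {
    if ( $code_point < 0 || $code_point > 0x10FFFF ) {
        return false; // Outside the Unicode code space entirely.
    }

    // The contiguous block U+FDD0..U+FDEF.
    if ( $code_point >= 0xFDD0 && $code_point <= 0xFDEF ) {
        return true;
    }

    // The last two code points of every plane end in FFFE or FFFF.
    return 0xFFFE === ( $code_point & 0xFFFE );
}
```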

_wp_utf8_encode_fallback( $iso_8859_1_text )
Converts a string from ISO-8859-1 to UTF-8, maintaining backwards compatibility
with the deprecated function from the PHP standard library.

param: string $iso_8859_1_text Text treated as ISO-8859-1 (latin1) bytes.
return: string Text converted into UTF-8.
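The conversion itself is mechanical, because every ISO-8859-1 byte maps to the Unicode code point with the same value. A minimal sketch, using the hypothetical name `demo_latin1_to_utf8()`: bytes below 0x80 pass through unchanged, and every other byte becomes a two-byte UTF-8 sequence.

```php
<?php
// Hypothetical helper: converts ISO-8859-1 bytes to UTF-8 one byte at a time.
function demo_latin1_to_utf8( string $iso_8859_1_text ): string {
    $utf8 = '';
    $end  = strlen( $iso_8859_1_text );

    for ( $i = 0; $i < $end; $i++ ) {
        $byte = ord( $iso_8859_1_text[ $i ] );

        if ( $byte < 0x80 ) {
            // ASCII is identical in both encodings.
            $utf8 .= chr( $byte );
        } else {
            // U+0080..U+00FF encode as 110000xx 10xxxxxx.
            $utf8 .= chr( 0xC0 | ( $byte >> 6 ) ) . chr( 0x80 | ( $byte & 0x3F ) );
        }
    }

    return $utf8;
}
```

For instance, the byte 0xF1 ("ñ" in ISO-8859-1) becomes the two bytes 0xC3 0xB1.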

_wp_utf8_decode_fallback( $utf8_text )
Converts a string from UTF-8 to ISO-8859-1, maintaining backwards compatibility
with the deprecated function from the PHP standard library.

param: string $utf8_text Text treated as UTF-8 bytes.
return: string Text converted into ISO-8859-1.
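The reverse direction is lossy, since ISO-8859-1 can only represent U+0000 to U+00FF. The hypothetical `demo_utf8_to_latin1()` below is a simplified sketch for well-formed input: it maps representable code points to one byte and substitutes `?` for anything else. How it groups malformed bytes is an assumption of this sketch and may not match the deprecated PHP function or the fallback above.

```php
<?php
// Hypothetical helper: converts well-formed UTF-8 to ISO-8859-1, using "?"
// for code points above U+00FF.
function demo_utf8_to_latin1( string $utf8_text ): string {
    $latin1 = '';
    $end    = strlen( $utf8_text );
    $i      = 0;

    while ( $i < $end ) {
        $byte = ord( $utf8_text[ $i ] );

        if ( $byte < 0x80 ) {
            // ASCII is identical in both encodings.
            $latin1 .= $utf8_text[ $i ];
            ++$i;
            continue;
        }

        $next = $i + 1 < $end ? ord( $utf8_text[ $i + 1 ] ) : 0;

        // Lead bytes 0xC2 and 0xC3 cover exactly U+0080..U+00FF.
        if ( ( 0xC2 === $byte || 0xC3 === $byte ) && 0x80 === ( $next & 0xC0 ) ) {
            $latin1 .= chr( ( ( $byte & 0x03 ) << 6 ) | ( $next & 0x3F ) );
            $i      += 2;
            continue;
        }

        // Anything else: emit one "?" and skip the rest of the sequence.
        $latin1 .= '?';
        ++$i;
        while ( $i < $end && 0x80 === ( ord( $utf8_text[ $i ] ) & 0xC0 ) ) {
            ++$i;
        }
    }

    return $latin1;
}
```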



Generated : Thu Oct 30 08:20:06 2025 Cross-referenced by PHPXref