PHP Cross Reference of WordPress Trunk (Updated Daily)

/wp-includes/html-api/ -> class-wp-html-tag-processor.php (summary)

HTML API: WP_HTML_Tag_Processor class

Scans through an HTML document to find specific tags, then transforms those tags by adding, removing, or updating the values of the HTML attributes within that tag (opener).

File Size: 5032 lines (167 kb)
Included or required: 0 times
Referenced: 0 times
Includes or requires: 0 files

Defines 1 class

WP_HTML_Tag_Processor:: (49 methods):
  __construct()
  change_parsing_namespace()
  next_tag()
  next_token()
  base_class_next_token()
  paused_at_incomplete_token()
  class_list()
  has_class()
  set_bookmark()
  release_bookmark()
  skip_rawtext()
  skip_rcdata()
  skip_script_data()
  parse_next_tag()
  parse_next_attribute()
  skip_whitespace()
  after_tag()
  class_name_updates_to_attributes_updates()
  apply_attributes_updates()
  has_bookmark()
  seek()
  sort_start_ascending()
  get_enqueued_attribute_value()
  get_attribute()
  get_attribute_names_with_prefix()
  get_namespace()
  get_tag()
  get_qualified_tag_name()
  get_qualified_attribute_name()
  has_self_closing_flag()
  is_tag_closer()
  get_token_type()
  get_token_name()
  get_comment_type()
  get_full_comment_text()
  subdivide_text_appropriately()
  get_modifiable_text()
  set_modifiable_text()
  get_script_content_type()
  escape_javascript_script_contents()
  set_attribute()
  remove_attribute()
  add_class()
  remove_class()
  __toString()
  get_updated_html()
  parse_query()
  matches()
  get_doctype_info()


Class: WP_HTML_Tag_Processor  - X-Ref

Core class used to modify attributes in an HTML document for tags matching a query.

## Usage

Use of this class requires three steps:

1. Create a new class instance with your input HTML document.
2. Find the tag(s) you are looking for.
3. Request changes to the attributes in those tag(s).

Example:

$tags = new WP_HTML_Tag_Processor( $html );
if ( $tags->next_tag( 'option' ) ) {
    $tags->set_attribute( 'selected', true );
}

### Finding tags

The `next_tag()` function moves the internal cursor through
your input HTML document until it finds a tag meeting all of
the supplied restrictions in the optional query argument. If
no argument is provided then it will find the next HTML tag,
regardless of what kind it is.

If you want to _find whatever the next tag is_:

$tags->next_tag();

| Goal                                                      | Query                                                                           |
|-----------------------------------------------------------|---------------------------------------------------------------------------------|
| Find any tag.                                             | `$tags->next_tag();`                                                            |
| Find next image tag.                                      | `$tags->next_tag( array( 'tag_name' => 'img' ) );`                              |
| Find next image tag (without passing the array).          | `$tags->next_tag( 'img' );`                                                     |
| Find next tag containing the `fullwidth` CSS class.       | `$tags->next_tag( array( 'class_name' => 'fullwidth' ) );`                      |
| Find next image tag containing the `fullwidth` CSS class. | `$tags->next_tag( array( 'tag_name' => 'img', 'class_name' => 'fullwidth' ) );` |

If a tag was found meeting your criteria then `next_tag()`
will return `true` and you can proceed to modify it. If it
returns `false`, however, it failed to find the tag and
moved the cursor to the end of the file.

Once the cursor reaches the end of the file the processor
is done and if you want to reach an earlier tag you will
need to recreate the processor and start over, as it's
unable to back up or move in reverse.

See the section on bookmarks for an exception to this
no-backing-up rule.

#### Custom queries

Sometimes it's necessary to inspect an HTML tag more closely than
the query syntax here permits. In these cases one may further
inspect the search results using the read-only functions
provided by the processor, or using external state or variables.

Example:

// Paint up to the first five DIV or SPAN tags marked with the "jazzy" style.
$remaining_count = 5;
while ( $remaining_count > 0 && $tags->next_tag() ) {
    if (
        ( 'DIV' === $tags->get_tag() || 'SPAN' === $tags->get_tag() ) &&
        'jazzy' === $tags->get_attribute( 'data-style' )
    ) {
        $tags->add_class( 'theme-style-everest-jazz' );
        $remaining_count--;
    }
}

`get_attribute()` will return `null` if the attribute wasn't present
on the tag when it was called. It may return `""` (the empty string)
in cases where the attribute was present but its value was empty.
For boolean attributes, those whose name is present but no value is
given, it will return `true` (the only way to set `false` for an
attribute is to remove it).
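
The three return shapes can be told apart with strict comparisons. What follows is a minimal sketch, assuming WordPress is loaded so the class is available (for example when run through WP-CLI's `wp eval-file`). The markup and the helper name `describe_attribute()` are made up for illustration and are not part of the HTML API.

```php
<?php
// Assumes WordPress is loaded. describe_attribute() is a hypothetical helper.
function describe_attribute( WP_HTML_Tag_Processor $tags, string $name ): string {
    $value = $tags->get_attribute( $name );

    if ( null === $value ) {
        return 'absent';   // The attribute is not on the matched tag.
    }
    if ( true === $value ) {
        return 'boolean';  // The name is present without a value.
    }
    return '' === $value ? 'empty' : "value: {$value}";
}

$tags = new WP_HTML_Tag_Processor( '<input disabled value="" type="text">' );
$tags->next_tag( 'input' );

$summary = array();
foreach ( array( 'disabled', 'value', 'type', 'required' ) as $name ) {
    $summary[ $name ] = describe_attribute( $tags, $name );
}
```

Loose comparisons such as `if ( $value )` would conflate the empty string with a missing attribute, which is why the sketch sticks to `===`.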

#### When matching fails

When `next_tag()` returns `false` it could mean different things:

- The requested tag wasn't found in the input document.
- The input document ended in the middle of an HTML syntax element.

When a document ends in the middle of a syntax element it will pause
the processor. This is to make it possible in the future to extend the
input document and proceed - an important requirement for chunked
streaming parsing of a document.

Example:

$processor = new WP_HTML_Tag_Processor( 'This <div is="a" partial="token' );
false === $processor->next_tag();

If a special element (see next section) is encountered but no closing tag
is found it will count as an incomplete tag. The parser will pause as if
the opening tag were incomplete.

Example:

$processor = new WP_HTML_Tag_Processor( '<style>// there could be more styling to come' );
false === $processor->next_tag();

$processor = new WP_HTML_Tag_Processor( '<style>// this is everything</style><div>' );
true === $processor->next_tag( 'DIV' );
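
The pause-and-extend idea can be approximated by rescanning once more input arrives. This is a minimal sketch, assuming WordPress is loaded; the chunk contents are made up, and re-creating the processor on the concatenated input is a workaround chosen here, not a streaming feature of the class.

```php
<?php
// Assumes WordPress is loaded. The two chunks are illustrative input.
$first_chunk  = '<p>Intro</p><img src="pho';
$second_chunk = 'to.png" alt="">';

$processor = new WP_HTML_Tag_Processor( $first_chunk );
$found     = $processor->next_tag( 'img' );
$paused    = $processor->paused_at_incomplete_token();

if ( ! $found && $paused ) {
    // The input stopped mid-tag, so "not found" is not yet conclusive.
    // Wait for more bytes, then scan a fresh processor over the longer input.
    $processor = new WP_HTML_Tag_Processor( $first_chunk . $second_chunk );
    $found     = $processor->next_tag( 'img' );
}

$src = $found ? $processor->get_attribute( 'src' ) : null;
```

Checking `paused_at_incomplete_token()` is what separates "the tag is not in the document" from "the document is not finished yet".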

#### Special self-contained elements

Some HTML elements are handled in a special way; their start and end tags
act like a void tag. These are special because their contents can't contain
HTML markup. Everything inside these elements is handled in a special way
and content that _appears_ like HTML tags inside of them isn't. There can
be no nesting in these elements.

In the following list, "raw text" means that all of the content in the HTML
until the matching closing tag is treated verbatim without any replacements
and without any parsing.

- IFRAME allows no content but requires a closing tag.
- NOEMBED (deprecated) content is raw text.
- NOFRAMES (deprecated) content is raw text.
- SCRIPT content is plaintext apart from legacy rules allowing `</script>` inside an HTML comment.
- STYLE content is raw text.
- TITLE content is plain text but character references are decoded.
- TEXTAREA content is plain text but character references are decoded.
- XMP (deprecated) content is raw text.
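
To illustrate, text that looks like a tag inside one of these elements is never matched. A minimal sketch, assuming WordPress is loaded; the markup is made up for the example.

```php
<?php
// Assumes WordPress is loaded. The markup is illustrative.
$html = '<title>Using the <img> tag</title><img src="real.png">';

$processor   = new WP_HTML_Tag_Processor( $html );
$image_count = 0;
$sources     = array();

// The "<img>" inside TITLE is text, so only the real image is visited.
while ( $processor->next_tag( 'img' ) ) {
    ++$image_count;
    $sources[] = $processor->get_attribute( 'src' );
}
```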

### Modifying HTML attributes for a found tag

Once you've found the start of an opening tag you can modify
any number of the attributes on that tag. You can set a new
value for an attribute, remove the entire attribute, or do
nothing and move on to the next opening tag.

Example:

if ( $tags->next_tag( array( 'class_name' => 'wp-group-block' ) ) ) {
    $tags->set_attribute( 'title', 'This groups the contained content.' );
    $tags->remove_attribute( 'data-test-id' );
}

If `set_attribute()` is called for an existing attribute it will
overwrite the existing value. Similarly, calling `remove_attribute()`
for a non-existing attribute has no effect on the document. Both
of these methods are safe to call without knowing if a given attribute
exists beforehand.

### Modifying CSS classes for a found tag

The tag processor treats the `class` attribute as a special case.
Because it's a common operation to add or remove CSS classes, this
interface adds helper methods to make that easier.

As with attribute values, adding or removing CSS classes is a safe
operation that doesn't require checking if the attribute or class
exists before making changes. If removing the only class then the
entire `class` attribute will be removed.

Example:

// from `<span>Yippee!</span>`
//   to `<span class="is-active">Yippee!</span>`
$tags->add_class( 'is-active' );

// from `<span class="excited">Yippee!</span>`
//   to `<span class="excited is-active">Yippee!</span>`
$tags->add_class( 'is-active' );

// from `<span class="is-active heavy-accent">Yippee!</span>`
//   to `<span class="is-active heavy-accent">Yippee!</span>`
$tags->add_class( 'is-active' );

// from `<input type="text" class="is-active rugby not-disabled" length="24">`
//   to `<input type="text" class="is-active not-disabled" length="24">`
$tags->remove_class( 'rugby' );

// from `<input type="text" class="rugby" length="24">`
//   to `<input type="text" length="24">`
$tags->remove_class( 'rugby' );

// from `<input type="text" length="24">`
//   to `<input type="text" length="24">`
$tags->remove_class( 'rugby' );

When class changes are enqueued but a direct change to `class` is made via
`set_attribute` (or `remove_attribute`), then the direct change
takes precedence over those made through `add_class` and `remove_class`.

### Bookmarks

While scanning through the input HTML document it's possible to set
a named bookmark when a particular tag is found. Later on, after
continuing to scan other tags, it's possible to `seek` to one of
the set bookmarks and then proceed again from that point forward.

Because bookmarks create processing overhead one should avoid
creating too many of them. As a rule, create only bookmarks
of known string literal names; avoid creating "mark_{$index}"
and so on. It's fine from a performance standpoint to create a
bookmark and update it frequently, such as within a loop.

$total_todos = 0;
while ( $p->next_tag( array( 'tag_name' => 'UL', 'class_name' => 'todo' ) ) ) {
    $p->set_bookmark( 'list-start' );
    while ( $p->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
        if ( 'UL' === $p->get_tag() && $p->is_tag_closer() ) {
            $p->set_bookmark( 'list-end' );
            $p->seek( 'list-start' );
            $p->set_attribute( 'data-contained-todos', (string) $total_todos );
            $total_todos = 0;
            $p->seek( 'list-end' );
            break;
        }

        if ( 'LI' === $p->get_tag() && ! $p->is_tag_closer() ) {
            $total_todos++;
        }
    }
}

## Tokens and finer-grained processing.

It's possible to scan through every lexical token in the
HTML document using the `next_token()` function. This
alternative form takes no argument and provides no built-in
query syntax.

Example:

$title = '(untitled)';
$text  = '';
while ( $processor->next_token() ) {
    switch ( $processor->get_token_name() ) {
        case '#text':
            $text .= $processor->get_modifiable_text();
            break;

        case 'BR':
            $text .= "\n";
            break;

        case 'TITLE':
            $title = $processor->get_modifiable_text();
            break;
    }
}
return trim( "# {$title}\n\n{$text}" );

### Tokens and _modifiable text_.

#### Special "atomic" HTML elements.

Not all HTML elements are able to contain other elements inside of them.
For instance, the contents inside a TITLE element are plaintext (except
that character references like &amp; will be decoded). This means that
if the string `<img>` appears inside a TITLE element, then it's not an
image tag, but rather it's text describing an image tag. Likewise, the
contents of a SCRIPT or STYLE element are handled entirely separately in
a browser than the contents of other elements because they represent a
different language than HTML.

For these elements the Tag Processor treats the entire sequence as one,
from the opening tag, including its contents, through its closing tag.
This means that it's not possible to match the closing tag for a
SCRIPT element unless it's unexpected; the Tag Processor already matched
it when it found the opening tag.

The inner contents of these elements are that element's _modifiable text_.

The special elements are:
- `SCRIPT` whose contents are treated as raw plaintext but supports a legacy
style of including JavaScript inside of HTML comments to avoid accidentally
closing the SCRIPT from inside a JavaScript string. E.g. `console.log( '</script>' )`.
- `TITLE` and `TEXTAREA` whose contents are treated as plaintext and then any
character references are decoded. E.g. `1 &lt; 2 < 3` becomes `1 < 2 < 3`.
- `IFRAME`, `NOEMBED`, `NOFRAMES`, `STYLE`, `XMP` whose contents are treated as
raw plaintext and left as-is. E.g. `1 &lt; 2 < 3` remains `1 &lt; 2 < 3`.

#### Other tokens with modifiable text.

There are also non-elements which are void/self-closing in nature and contain
modifiable text that is part of that individual syntax token itself.

- `#text` nodes, whose entire token _is_ the modifiable text.
- HTML comments and tokens that become comments due to some syntax error. The
text for these tokens is the portion of the comment inside of the syntax.
E.g. for `<!-- comment -->` the text is `" comment "` (note the spaces are included).
- `CDATA` sections, whose text is the content inside of the section itself. E.g. for
`<![CDATA[some content]]>` the text is `"some content"` (with restrictions [1]).
- "Funky comments," which are a special case of invalid closing tags whose name is
invalid. The text for these nodes is the text that a browser would transform into
an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`.
- `DOCTYPE` declarations like `<!DOCTYPE html>` which have no closing tag.
- XML Processing instruction nodes like `<?wp __( "Like" ); ?>` (with restrictions [2]).
- The empty end tag `</>` which is ignored in the browser and DOM.

[1]: There are no CDATA sections in HTML. When encountering `<![CDATA[`, everything
until the next `>` becomes a bogus HTML comment, meaning there can be no CDATA
section in an HTML document containing `>`. The Tag Processor will first find
all valid and bogus HTML comments, and then if the comment _would_ have been a
CDATA section _were they to exist_, it will indicate this as the type of comment.

[2]: XML allows a broader range of characters in a processing instruction's target name
and disallows "xml" as a name, since it's special. The Tag Processor only recognizes
target names with an ASCII-representable subset of characters. It also exhibits the
same constraint as with CDATA sections, in that `>` cannot exist within the token
since Processing Instructions do not exist within HTML and their syntax transforms
into a bogus comment in the DOM.
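
The token kinds listed above can be observed by walking a small document. This is a minimal sketch, assuming WordPress is loaded; the markup is made up, and pairing each token type with either its tag name or its modifiable text is a choice made for this example only.

```php
<?php
// Assumes WordPress is loaded. The markup is illustrative.
$html = '<!-- note --><p>Hi</p></%post_author>';

$processor = new WP_HTML_Tag_Processor( $html );
$tokens    = array();

while ( $processor->next_token() ) {
    $type = $processor->get_token_type();

    // Tags are described by name; every other token by its modifiable text.
    $tokens[] = '#tag' === $type
        ? array( $type, $processor->get_tag() )
        : array( $type, $processor->get_modifiable_text() );
}
```

Note that the closing `</p>` is visited as its own `#tag` token, and that the comment text keeps its surrounding spaces.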

## Design and limitations

The Tag Processor is designed to linearly scan HTML documents and tokenize
HTML tags and their attributes. It's designed to do this as efficiently as
possible without compromising parsing integrity. Therefore it will be
slower than some methods of modifying HTML, such as those incorporating
over-simplified PCRE patterns, but will not introduce the defects and
failures that those methods bring in, which lead to broken page renders
and often to security vulnerabilities. On the other hand, it will be faster
than full-blown HTML parsers such as DOMDocument and use considerably
less memory. It requires a negligible memory overhead, enough to consider
it a zero-overhead system.

The performance characteristics are maintained by avoiding tree construction
and semantic cleanups which are specified in HTML5. Because of this, for
example, it's not possible for the Tag Processor to associate any given
opening tag with its corresponding closing tag, or to return the inner markup
inside an element. Systems may be built on top of the Tag Processor to do
this, but the Tag Processor is and should be constrained so it can remain an
efficient, low-level, and reliable HTML scanner.

The Tag Processor's design incorporates a "garbage-in-garbage-out" philosophy.
HTML5 specifies that certain invalid content be transformed into different forms
for display, such as removing null bytes from an input document and replacing
invalid characters with the Unicode replacement character `U+FFFD` (visually "�").
Where errors or transformations exist within the HTML5 specification, the Tag Processor
leaves those invalid inputs untouched, passing them through to the final browser
to handle. While this implies that certain operations will be non-spec-compliant,
such as reading the value of an attribute with invalid content, it also preserves a
simplicity and efficiency for handling those error cases.

Most operations within the Tag Processor are designed to minimize the difference
between an input and output document for any given change. For example, the
`add_class` and `remove_class` methods preserve whitespace and the class ordering
within the `class` attribute; and when encountering tags with duplicated attributes,
the Tag Processor will leave those invalid duplicate attributes where they are but
update the proper attribute which the browser will read for parsing its value. An
exception to this rule is that all attribute updates store their values as
double-quoted strings, meaning that attributes on input with single-quoted or
unquoted values will appear in the output with double-quotes.
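
The quoting exception is easy to see on a tag that mixes quoting styles. A minimal sketch, assuming WordPress is loaded; the markup is made up for the example.

```php
<?php
// Assumes WordPress is loaded. The markup is illustrative.
$processor = new WP_HTML_Tag_Processor( "<div id=main class='wide'>Content</div>" );

if ( $processor->next_tag( 'div' ) ) {
    // Only `id` is updated; `class` is never touched.
    $processor->set_attribute( 'id', 'content' );
}

$updated = $processor->get_updated_html();
```

Here the rewritten `id` attribute comes back double-quoted, while the untouched `class` attribute keeps its original single quotes.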

### Scripting Flag

The Tag Processor parses HTML with the "scripting flag" disabled. This means
that it doesn't run any scripts while parsing the page. In a browser with
JavaScript enabled, for example, the script can change the parse of the
document as it loads. On the server, however, evaluating JavaScript is not
only impractical, but also unwanted.

Practically this means that the Tag Processor will descend into NOSCRIPT
elements and process their child tags. Were the scripting flag enabled, such
as in a typical browser, the contents of NOSCRIPT would be skipped entirely.

This allows the HTML API to process the content that will be presented in
a browser when scripting is disabled, but it offers a different view of a
page than most browser sessions will experience: in a browser with scripting
enabled, the tags inside the NOSCRIPT disappear.
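
As a small illustration of the disabled scripting flag, a tag nested in NOSCRIPT is reachable like any other. A minimal sketch, assuming WordPress is loaded; the markup is made up for the example.

```php
<?php
// Assumes WordPress is loaded. The markup is illustrative.
$html = '<noscript><img src="fallback.png"></noscript>';

$processor = new WP_HTML_Tag_Processor( $html );

// With the scripting flag disabled, the IMG inside NOSCRIPT is a real tag.
$found = $processor->next_tag( 'img' );
$src   = $found ? $processor->get_attribute( 'src' ) : null;
```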

### Text Encoding

The Tag Processor assumes that the input HTML document is encoded with a
text encoding compatible with 7-bit ASCII's '<', '>', '&', ';', '/', '=',
"'", '"', 'a' - 'z', 'A' - 'Z', and the whitespace characters ' ', tab,
carriage-return, newline, and form-feed.

In practice, this includes almost every single-byte encoding as well as
UTF-8. Notably, however, it does not include UTF-16. If providing input
that's incompatible, then convert the encoding beforehand.
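
One way to do that conversion is PHP's `mb_convert_encoding()`. This is a minimal sketch, assuming WordPress is loaded and the mbstring extension is available; `to_processor_input()` is a hypothetical helper, and the caller is assumed to already know the source encoding.

```php
<?php
// Assumes WordPress is loaded and mbstring is available.
// to_processor_input() is a hypothetical helper, not part of the HTML API.
function to_processor_input( string $bytes, string $source_encoding ): string {
    if ( 'UTF-8' === strtoupper( $source_encoding ) ) {
        return $bytes; // Already ASCII-compatible; nothing to convert.
    }
    return mb_convert_encoding( $bytes, 'UTF-8', $source_encoding );
}

// Build a UTF-16LE sample, as if it had been read from an external source.
$utf16_html = mb_convert_encoding( '<p class="intro">Hi</p>', 'UTF-16LE', 'UTF-8' );

$processor = new WP_HTML_Tag_Processor( to_processor_input( $utf16_html, 'UTF-16LE' ) );
$matched   = $processor->next_tag( 'p' );
$class     = $matched ? $processor->get_attribute( 'class' ) : null;
```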

__construct( $html )   X-Ref
Constructor.

param: string $html HTML to process.

change_parsing_namespace( string $new_namespace )   X-Ref
Switches parsing mode into a new namespace, such as when
encountering an SVG tag and entering foreign content.

param: string $new_namespace One of 'html', 'svg', or 'math', indicating the namespace into which parsing switches.
return: bool Whether the namespace was valid and changed.

next_tag( $query = null )   X-Ref
Finds the next tag matching the $query.

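param: array|string|null $query Optional. A tag name, or a query array as described under "Finding tags" above. Omit to find any tag.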
return: bool Whether a tag was matched.

next_token()   X-Ref
Finds the next token in the HTML document.

An HTML document can be viewed as a stream of tokens,
where tokens are things like HTML tags, HTML comments,
text nodes, etc. This method finds the next token in
the HTML document and returns whether it found one.

If it starts parsing a token and reaches the end of the
document then it will seek to the start of the last
token and pause, returning `false` to indicate that it
failed to find a complete token.

Possible token types, based on the HTML specification:

- an HTML tag, whether opening, closing, or void.
- a text node - the plaintext inside tags.
- an HTML comment.
- a DOCTYPE declaration.
- a processing instruction, e.g. `<?xml version="1.0" ?>`.

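Besides tags, the Tag Processor also reports the token types described in the
"Tokens and finer-grained processing" section above, such as text nodes and comments.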

return: bool Whether a token was parsed.

base_class_next_token()   X-Ref
Internal method which finds the next token in the HTML document.

This method is a protected internal function which implements the logic for
finding the next token in a document. It exists so that the parser can update
its state without affecting the location of the cursor in the document and
without triggering subclass methods for things like `next_token()`, e.g. when
applying patches before searching for the next token.

return: bool Whether a token was parsed.

paused_at_incomplete_token()   X-Ref
Whether the processor paused because the input HTML document ended
in the middle of a syntax element, such as in the middle of a tag.

Example:

$processor = new WP_HTML_Tag_Processor( '<input type="text" value="Th' );
false      === $processor->next_tag();
true       === $processor->paused_at_incomplete_token();

return: bool Whether the parse paused at the start of an incomplete token.

class_list()   X-Ref
Generator for a foreach loop to step through each class name for the matched tag.

This generator function is designed to be used inside a "foreach" loop.

Example:

$p = new WP_HTML_Tag_Processor( "<div class='free &lt;egg&gt;\tlang-en'>" );
$p->next_tag();
foreach ( $p->class_list() as $class_name ) {
    echo "{$class_name} ";
}
// Outputs: "free <egg> lang-en "


has_class( $wanted_class )   X-Ref
Returns whether a matched tag contains the given ASCII case-insensitive class name.

param: string $wanted_class Look for this CSS class name, ASCII case-insensitive.
return: bool|null Whether the matched tag contains the given class name, or null if not matched.

set_bookmark( $name )   X-Ref
Sets a bookmark in the HTML document.

Bookmarks represent specific places or tokens in the HTML
document, such as a tag opener or closer. When applying
edits to a document, such as setting an attribute, the
text offsets of that token may shift; the bookmark is
kept updated with those shifts and remains stable unless
the entire span of text in which the token sits is removed.

Release bookmarks when they are no longer needed.

Example:

<main><h2>Surprising fact you may not know!</h2></main>
      ^  ^
       \-|-- this `H2` opener bookmark tracks the token

<main class="clickbait"><h2>Surprising fact you may no…
                        ^  ^
                         \-|-- it shifts with edits

Bookmarks provide the ability to seek to a previously-scanned
place in the HTML document. This avoids the need to re-scan
the entire document.

Example:

<ul><li>One</li><li>Two</li><li>Three</li></ul>
                            ^^^^
                            want to note this last item

$p = new WP_HTML_Tag_Processor( $html );
$in_list = false;
while ( $p->next_tag( array( 'tag_closers' => $in_list ? 'visit' : 'skip' ) ) ) {
    if ( 'UL' === $p->get_tag() ) {
        if ( $p->is_tag_closer() ) {
            $in_list = false;
            $p->set_bookmark( 'resume' );
            if ( $p->seek( 'last-li' ) ) {
                $p->add_class( 'last-li' );
            }
            $p->seek( 'resume' );
            $p->release_bookmark( 'last-li' );
            $p->release_bookmark( 'resume' );
        } else {
            $in_list = true;
        }
    }

    // Bookmark only LI openers; closers are also visited while inside the list.
    if ( 'LI' === $p->get_tag() && ! $p->is_tag_closer() ) {
        $p->set_bookmark( 'last-li' );
    }
}

Bookmarks intentionally hide the internal string offsets
to which they refer. They are maintained internally as
updates are applied to the HTML document and therefore
retain their "position" - the location to which they
originally pointed. The inability to use bookmarks with
functions like `substr` is therefore intentional to guard
against accidentally breaking the HTML.

Because bookmarks allocate memory and require processing
for every applied update, they are limited and require
a name. They should not be created with programmatically-made
names, such as "li_{$index}" with some loop. As a general
rule they should only be created with string-literal names
like "start-of-section" or "last-paragraph".

Bookmarks are a powerful tool to enable complicated behavior.
Consider double-checking that you need this tool if you are
reaching for it, as inappropriate use could lead to broken
HTML structure or unwanted processing overhead.

param: string $name Identifies this particular bookmark.
return: bool Whether the bookmark was successfully created.

release_bookmark( $name )   X-Ref
Removes a bookmark that is no longer needed.

Releasing a bookmark frees up the small
performance overhead it requires.

param: string $name Name of the bookmark to remove.
return: bool Whether the bookmark already existed before removal.

skip_rawtext( string $tag_name )   X-Ref
Skips contents of generic rawtext elements.

param: string $tag_name The uppercase tag name which will close the RAWTEXT region.
return: bool Whether an end to the RAWTEXT region was found before the end of the document.

skip_rcdata( string $tag_name )   X-Ref
Skips contents of RCDATA elements, namely title and textarea tags.

param: string $tag_name The uppercase tag name which will close the RCDATA region.
return: bool Whether an end to the RCDATA region was found before the end of the document.

skip_script_data()   X-Ref
Skips contents of script tags.

return: bool Whether the script tag was closed before the end of the document.

parse_next_tag()   X-Ref
Parses the next tag.

This will find and start parsing the next tag, including
the opening `<`, the potential closer `/`, and the tag
name. It does not parse the attributes or scan to the
closing `>`; these are left for other methods.

return: bool Whether a tag was found before the end of the document.

parse_next_attribute()   X-Ref
Parses the next attribute.

return: bool Whether an attribute was found before the end of the document.

skip_whitespace()   X-Ref
Moves the internal cursor past any immediately following whitespace.


after_tag()   X-Ref
Applies attribute updates and cleans up once a tag is fully parsed.


class_name_updates_to_attributes_updates()   X-Ref
Converts class name updates into tag attributes updates
(they are accumulated in different data formats for performance).


apply_attributes_updates( int $shift_this_point )   X-Ref
Applies attribute updates to HTML document.

param: int $shift_this_point Accumulate and return shift for this position.
return: int How many bytes the given pointer moved in response to the updates.

has_bookmark( $bookmark_name )   X-Ref
Checks whether a bookmark with the given name exists.

param: string $bookmark_name Name to identify a bookmark that potentially exists.
return: bool Whether that bookmark exists.

seek( $bookmark_name )   X-Ref
Move the internal cursor in the Tag Processor to a given bookmark's location.

In order to prevent accidental infinite loops, there's a
maximum limit on the number of times seek() can be called.

param: string $bookmark_name Jump to the place in the document identified by this bookmark name.
return: bool Whether the internal cursor was successfully moved to the bookmark's location.

sort_start_ascending( WP_HTML_Text_Replacement $a, WP_HTML_Text_Replacement $b )   X-Ref
Compare two WP_HTML_Text_Replacement objects.

param: WP_HTML_Text_Replacement $a First attribute update.
param: WP_HTML_Text_Replacement $b Second attribute update.
return: int Comparison value for string order.

get_enqueued_attribute_value( string $comparable_name )   X-Ref
Return the enqueued value for a given attribute, if one exists.

Enqueued updates can take different data types:
- If an update is enqueued and is boolean, the return will be `true`.
- If an update is otherwise enqueued, the return will be the string value of that update.
- If an attribute is enqueued to be removed, the return will be `null` to indicate that.
- If no updates are enqueued, the return will be `false` to differentiate from "removed."

param: string $comparable_name The attribute name in its comparable form.
return: string|boolean|null Value of enqueued update if present, otherwise false.

get_attribute( $name )   X-Ref
Returns the value of a requested attribute from a matched tag opener if that attribute exists.

Example:

$p = new WP_HTML_Tag_Processor( '<div enabled class="test" data-test-id="14">Test</div>' );
$p->next_tag( array( 'class_name' => 'test' ) ) === true;
$p->get_attribute( 'data-test-id' ) === '14';
$p->get_attribute( 'enabled' ) === true;
$p->get_attribute( 'aria-label' ) === null;

$p->next_tag() === false;
$p->get_attribute( 'class' ) === null;

param: string $name Name of attribute whose value is requested.
return: string|true|null Value of attribute or `null` if not available. Boolean attributes return `true`.

get_attribute_names_with_prefix( $prefix )   X-Ref
Gets lowercase names of all attributes matching a given prefix in the current tag.

Note that matching is case-insensitive. This is in accordance with the spec:

> There must never be two or more attributes on
> the same start tag whose names are an ASCII
> case-insensitive match for each other.
- HTML 5 spec

Example:

$p = new WP_HTML_Tag_Processor( '<div data-ENABLED class="test" DATA-test-id="14">Test</div>' );
$p->next_tag( array( 'class_name' => 'test' ) ) === true;
$p->get_attribute_names_with_prefix( 'data-' ) === array( 'data-enabled', 'data-test-id' );

$p->next_tag() === false;
$p->get_attribute_names_with_prefix( 'data-' ) === null;

param: string $prefix Prefix of requested attribute names.
return: array|null List of attribute names, or `null` when no tag opener is matched.

get_namespace()   X-Ref
Returns the namespace of the matched token.

return: string One of 'html', 'math', or 'svg'.

get_tag()   X-Ref
Returns the uppercase name of the matched tag.

Example:

$p = new WP_HTML_Tag_Processor( '<div class="test">Test</div>' );
$p->next_tag() === true;
$p->get_tag() === 'DIV';

$p->next_tag() === false;
$p->get_tag() === null;

return: string|null Name of currently matched tag in input HTML, or `null` if none found.

get_qualified_tag_name()   X-Ref
Returns the adjusted tag name for a given token, taking into
account the current parsing context, whether HTML, SVG, or MathML.

return: string|null Name of current tag name.

get_qualified_attribute_name( $attribute_name )   X-Ref
Returns the adjusted attribute name for a given attribute, taking into
account the current parsing context, whether HTML, SVG, or MathML.

param: string $attribute_name Which attribute to adjust.
return: string|null

has_self_closing_flag()   X-Ref
Indicates if the currently matched tag contains the self-closing flag.

No HTML element ought to have the self-closing flag, and on HTML elements the flag
is ignored. For void elements this is benign because they "self close"
automatically. For non-void HTML elements, though, problems will appear if someone
intends to use a self-closing element in place of that element with an empty body.
For HTML foreign elements and custom elements the self-closing flag determines if
they self-close or not.

This function does not determine if a tag is self-closing,
but only if the self-closing flag is present in the syntax.

return: bool Whether the currently matched tag contains the self-closing flag.
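
The distinction between "has the flag" and "is self-closing" can be sketched as follows, assuming WordPress is loaded; the markup is made up for the example.

```php
<?php
// Assumes WordPress is loaded. The markup is illustrative.
$processor = new WP_HTML_Tag_Processor( '<br><br/><div />' );
$flags     = array();

while ( $processor->next_tag() ) {
    // Records only whether "/" appears in the tag syntax, not how a
    // browser would treat the element.
    $flags[] = array( $processor->get_tag(), $processor->has_self_closing_flag() );
}
```

Both `<br>` and `<br/>` are void elements and behave the same in a browser, yet only the second reports the flag. The DIV reports the flag too, even though a browser ignores it there and leaves the element open.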

is_tag_closer()   X-Ref
Indicates if the current tag token is a tag closer.

Example:

$p = new WP_HTML_Tag_Processor( '<div></div>' );
$p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) );
$p->is_tag_closer() === false;

$p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) );
$p->is_tag_closer() === true;

return: bool Whether the current tag is a tag closer.

get_token_type()   X-Ref
Indicates the kind of matched token, if any.

This differs from `get_token_name()` in that it always
returns a static string indicating the type, whereas
`get_token_name()` may return values derived from the
token itself, such as a tag name or processing
instruction tag.

Possible values:
- `#tag` when matched on a tag.
- `#text` when matched on a text node.
- `#cdata-section` when matched on a CDATA node.
- `#comment` when matched on a comment.
- `#doctype` when matched on a DOCTYPE declaration.
- `#presumptuous-tag` when matched on an empty tag closer.
- `#funky-comment` when matched on a funky comment.

return: string|null What kind of token is matched, or null.

get_token_name()   X-Ref
Returns the node name represented by the token.

This matches the DOM API value `nodeName`. Some values
are static, such as `#text` for a text node, while others
are dynamically generated from the token itself.

Dynamic names:
- Uppercase tag name for tag matches.
- `html` for DOCTYPE declarations.

Note that if the Tag Processor is not matched on a token
then this function will return `null`, either because it
hasn't yet found a token or because it reached the end
of the document without matching a token.
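
To compare the static type with the possibly dynamic name, the two getters can be read side by side while walking a document. The following is a sketch assuming WordPress is loaded; `sketch_token_outline()` is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: lists "type name" for every token in a document.
 * Assumes WordPress (with the HTML API) is loaded before it is called.
 */
function sketch_token_outline( string $html ): array {
	$processor = new WP_HTML_Tag_Processor( $html );
	$outline   = array();

	// next_token() visits tags (openers and closers), text, comments, and more.
	while ( $processor->next_token() ) {
		$outline[] = $processor->get_token_type() . ' ' . $processor->get_token_name();
	}

	return $outline;
}

// Only runs where WordPress has loaded the HTML API.
if ( class_exists( 'WP_HTML_Tag_Processor' ) ) {
	// Tags report the type "#tag" with an uppercase tag name, while text
	// and comment nodes report the same static string for both values.
	print_r( sketch_token_outline( '<p>Hi<!-- note --></p>' ) );
}
```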

return: string|null Name of the matched token.

get_comment_type()   X-Ref
Indicates what kind of comment produced the comment node.

Because there are different kinds of HTML syntax which produce
comments, the Tag Processor tracks and exposes this as a type
for the comment. Nominally only regular HTML comments exist as
they are commonly known, but a number of unrelated syntax errors
also produce comments.

return: string|null

get_full_comment_text()   X-Ref
Returns the text of a matched comment or null if not on a comment type node.

This method returns the entire text content of a comment node as it
would appear in the browser.

This differs from {@see ::get_modifiable_text()} in that certain comment
types in the HTML API cannot allow their entire comment text content to
be modified. Namely, "bogus comments" of the form `<?not allowed in html>`
will create a comment whose text content starts with `?`. Note that if
that character were modified, it would be possible to change the node
type.

return: string|null The comment text as it would appear in the browser or null

subdivide_text_appropriately()   X-Ref
Subdivides a matched text node, splitting leading NULL byte sequences and decoded
whitespace into distinct prefix nodes.

Note that once anything that's neither a NULL byte nor decoded whitespace is
encountered, then the remainder of the text node is left intact as generic text.

- The HTML Processor uses this to apply distinct rules for different kinds of text.
- Inter-element whitespace can be detected and skipped with this method.

Text nodes aren't eagerly subdivided because there's no need to split them unless
decisions are being made on NULL byte sequences or whitespace-only text.

Example:

$processor = new WP_HTML_Tag_Processor( "\x00Apples & Oranges" );
true  === $processor->next_token();                   // Text is "Apples & Oranges".
true  === $processor->subdivide_text_appropriately(); // Text is "".
true  === $processor->next_token();                   // Text is "Apples & Oranges".
false === $processor->subdivide_text_appropriately();

$processor = new WP_HTML_Tag_Processor( "&#x13; \r\n\tMore" );
true  === $processor->next_token();                   // Text is "␤ ␤␉More".
true  === $processor->subdivide_text_appropriately(); // Text is "␤ ␤␉".
true  === $processor->next_token();                   // Text is "More".
false === $processor->subdivide_text_appropriately();

return: bool Whether the text node was subdivided.

get_modifiable_text()   X-Ref
Returns the modifiable text for a matched token, or an empty string.

Modifiable text is text content that may be read and changed without
changing the HTML structure of the document around it. This includes
the contents of `#text` nodes in the HTML as well as the inner
contents of HTML comments, Processing Instructions, and others, even
though these nodes aren't part of a parsed DOM tree. It also includes
the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any
other section in an HTML document which cannot contain HTML markup (DATA).

If a token has no modifiable text then an empty string is returned to
avoid needless crashing or type errors. An empty string does not mean
that a token has modifiable text, and a token with modifiable text may
have an empty string (e.g. a comment with no contents).

Limitations:

- This function will not strip the leading newline appropriately
after seeking into a LISTING or PRE element. To ensure that the
newline is treated properly, seek to the LISTING or PRE opening
tag instead of to the first text node inside the element.

return: string

set_modifiable_text( string $plaintext_content )   X-Ref
Sets the modifiable text for the matched token, if matched.

Modifiable text is text content that may be read and changed without
changing the HTML structure of the document around it. This includes
the contents of `#text` nodes in the HTML as well as the inner
contents of HTML comments, Processing Instructions, and others, even
though these nodes aren't part of a parsed DOM tree. It also includes
the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any
other section in an HTML document which cannot contain HTML markup (DATA).

Not all modifiable text may be set by this method, and not all content
may be set as modifiable text. In the case that this fails it will return
`false` indicating as much. For instance, if the contents of a SCRIPT
element are neither JavaScript nor JSON, it’s not possible to guarantee
that escaping strings like `</script>` won’t break the script; in these
cases, updates will be rejected and it’s up to calling code to perform
language-specific escaping or workarounds. Similarly, it will not allow
setting content into a comment which would prematurely terminate the comment.

Example:

// Add a preface to all STYLE contents (as a CSS comment; `//` is not valid CSS).
while ( $processor->next_tag( 'STYLE' ) ) {
    $style = $processor->get_modifiable_text();
    $processor->set_modifiable_text( "/* Made with love on the World Wide Web */\n{$style}" );
}

// Replace smiley text with Emoji smilies.
while ( $processor->next_token() ) {
    if ( '#text' !== $processor->get_token_name() ) {
        continue;
    }

    $chunk = $processor->get_modifiable_text();
    if ( ! str_contains( $chunk, ':)' ) ) {
        continue;
    }

    $processor->set_modifiable_text( str_replace( ':)', '🙂', $chunk ) );
}

This function handles all necessary HTML encoding. Provide normal, unescaped string values.
The HTML API will encode the strings appropriately so that the browser will interpret them
as the intended value.

Example:

// Renders as “Eggs & Milk” in a browser, encoded as `<p>Eggs &amp; Milk</p>`.
$processor->set_modifiable_text( 'Eggs & Milk' );

// Renders as “Eggs &amp; Milk” in a browser, encoded as `<p>Eggs &amp;amp; Milk</p>`.
$processor->set_modifiable_text( 'Eggs &amp; Milk' );
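
Because updates can be rejected, calling code may want to check the return value before trusting the output. The following sketch assumes WordPress is loaded; `sketch_replace_first_comment()` is a hypothetical helper name used only for illustration.

```php
<?php
/**
 * Hypothetical helper: replaces the text of the first comment in a document.
 * Returns the updated HTML, or null if no comment was found or the update
 * was rejected. Assumes WordPress (with the HTML API) is loaded when called.
 */
function sketch_replace_first_comment( string $html, string $new_text ): ?string {
	$processor = new WP_HTML_Tag_Processor( $html );

	while ( $processor->next_token() ) {
		if ( '#comment' !== $processor->get_token_type() ) {
			continue;
		}

		// A false return means the text could not be set, for example
		// because it would prematurely terminate the comment.
		if ( ! $processor->set_modifiable_text( $new_text ) ) {
			return null;
		}

		return $processor->get_updated_html();
	}

	return null;
}

// Only runs where WordPress has loaded the HTML API.
if ( class_exists( 'WP_HTML_Tag_Processor' ) ) {
	var_dump( sketch_replace_first_comment( '<!-- old -->', ' new ' ) );
	var_dump( sketch_replace_first_comment( '<!-- old -->', 'a --> b' ) );
}
```

Returning `null` on rejection keeps the caller from mistaking an unchanged document for a successful update.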

param: string $plaintext_content New text content to represent in the matched token.
return: bool Whether the text was able to update.

get_script_content_type()   X-Ref
No description

escape_javascript_script_contents( string $sourcecode )   X-Ref
Escape JavaScript and JSON script tag contents.

Ensure that the script contents cannot modify the HTML structure or break out
of its containing SCRIPT element. JavaScript and JSON may both be escaped with
the same rules, even though there are additional escaping measures available
to JavaScript source code which aren’t applicable to serialized JSON data.

A simple method safely escapes all content except for a few extremely rare and
unlikely exceptions: prevent the appearance of `<script` and `</script` within
the contents by replacing the first letter of the tag name with a Unicode escape.

Example:

$plaintext = '<script>document.write( "A </script> closes a script." );</script>';
$escaped   = '<script>document.write( "A </\u0073cript> closes a script." );</script>';

This works because of how parsing changes after encountering an opening SCRIPT
tag. The actual parsing comprises a complicated state machine, the result of
legacy behaviors and diverse browser support. However, without these two strings
in the script contents, two key things are ensured: `</script>` cannot appear to
prematurely close the tag, and the problematic double-escaped state becomes
unreachable. A JavaScript engine or JSON decoder will then decode the Unicode
escape (`\u0073`) back into its original plaintext value, but only after having
been safely extracted from the HTML.

While it may seem tempting to replace the `<` character instead, doing so would
break valid JavaScript, because `<` is used in comparison operators and other
JavaScript syntax. Replacing only the `s` in `<script` and `</script` avoids
modifying JavaScript syntax.
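
The replacement rule described above can be sketched in plain PHP. This is a simplified illustration and not the method's actual implementation: `sketch_escape_script_contents()` is a hypothetical name, it escapes every `<script` and `</script` regardless of the character that follows, and preserving the letter case of the escaped `s` is an assumption.

```php
<?php
/**
 * Simplified sketch, not WordPress's implementation: replaces the "s" in
 * every case-insensitive "<script" or "</script" with a Unicode escape.
 */
function sketch_escape_script_contents( string $sourcecode ): string {
	$escaped = preg_replace_callback(
		'~(</?)(s)(?=cript)~i',
		static function ( array $m ): string {
			// Assumption: keep the letter case, "s" is U+0073 and "S" is U+0053.
			return $m[1] . ( 's' === $m[2] ? '\u0073' : '\u0053' );
		},
		$sourcecode
	);

	// preg_replace_callback() returns null on error; fall back to the input.
	return $escaped ?? $sourcecode;
}

// The comparison operator is untouched; only the tag-like text changes.
echo sketch_escape_script_contents( 'if ( a < b ) { document.write( "</script>" ); }' ), "\n";
```

A lone `<` never matches the pattern, so comparison operators pass through unchanged.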

### Exceptions

This _should_ work everywhere, but there are some extreme exceptions.

- Comments.
- Tagged templates, such as `String.raw()`, which provide access to “raw” strings.
- The `source` property of a RegExp object.

Each of these exceptions appears at the source code level, not at the semantic or
evaluation level. Normal JavaScript will remain semantically equivalent after escaping,
but any JavaScript which analyzes the raw source code will see potentially-different
values.

#### Comments

Comments are never unescaped because they aren’t parsed by the JavaScript engine.
When viewing the source in a browser’s developer tools, the comments will retain
their escaped text.

Example:

// A comment: "</script>"
…becomes…
// A comment: "</\u0073cript>"

#### Tagged templates

Tagged templates “enable the embedding of arbitrary string content, where escape
sequences may follow a different syntax.” For example, they can aid representing
a RegExp pattern or LaTeX snippet within a JavaScript string, where the string
escape characters might get noisy and distracting.

Example:

console.log( 'A \notin B' );           // Prints a newline because of the "\n".
console.log( 'A \\notin B' );          // Prints "A \notin B".
console.log( String.raw`A \notin B` ); // Prints "A \notin B".

This means that if `<script` transforms into `<\u0073cript` _inside_ a raw string
or tagged template literal which relies on its `.raw` property, the output of the
code will be different after escaping.

Example:

console.log( String.raw`</script>` );      // Prematurely closes the SCRIPT element.
console.log( String.raw`</\u0073cript>` ); // Prints "</\u0073cript>".

#### RegExp sources

The RegExp object exposes its raw source in a similar way to how tagged templates and raw
strings do. Thankfully, because escape sequences are decoded when compiling the pattern,
escaped RegExp patterns will match the same way as the plaintext sequences would.

Example:

true === /<script>/.test( '<script>' );
true === /<\u0073cript>/.test( '<script>' );

However, as with raw strings, any code which reads the source will see the escaped value
instead of the decoded one.

Example:

console.log( /<script>/.source );      // Prints "<script>".
console.log( /<\u0073cript>/.source ); // Prints "<\u0073cript>".

#### Unsupported escaping

It is not possible to properly represent every possible JavaScript source file
inside a SCRIPT element. As with CSS stylesheets, SVG images, and MathML, the
only 100% reliable way to represent all possible inputs is to link to external
files of the given content-type.

In some cases it’s possible to manually prevent escaping issues. These are not
automatically handled by this function because doing so would require a full
JavaScript tokenizer. Consider the following example listing various ways to
manually escape a closing script tag.

Example:

console.log( String.raw`</script>` );                // !!UNSAFE!! Will be escaped.
console.log( String.raw`</\u0073cript>` );           // "</\u0073cript>"
console.log( String.raw`</scr` + String.raw`ipt>` ); // "</script>"
console.log( String.raw`</${"script"}>` );           // "</script>"
console.log( '</scr' + 'ipt>' );                     // "</script>"
console.log( "\x3C/script>" );                       // "</script>"
console.log( "<\/script>" );                         // "</script>"

The following graph is a simplified interpretation of how HTML interprets the contents
of a SCRIPT tag and identifies the closing tag. It is useful to understand what text
is dangerous inside of a SCRIPT tag and why different approaches to escaping work.

                            Open script

            ╔═════════════════════════════════════════╗   <!--(…)>
            ║                                         ║   (all dashes)
            ║                 script                  ╟────────────────╮
            ║                  data                   ║                │
╭───────────╢                                         ║ ◀──────────────╯
│           ╚═╤═══════════════════════════════════════╝
│             │               ▲                    ▲
│             │ <!--          │ -->                ╰─────╮
│             ▼               │                          │
│           ┌─────────────────┴───────────────────────┐  │
│ </script¹ │                 escaped                 │  │
│           └─┬─────────────────────────────┬─────────┘  │
│             │               ▲             │            │ -->
│             │ </script¹     │ </script¹   │ <script¹   │
│             ▼               │             ▼            │
│           ╔══════════════╗  │           ┌───────────┐  │
│           ║ Close script ║  │           │  double   │  │
╰──────────▶║              ║  ╰───────────┤  escaped  ├──╯
            ╚══════════════╝              └───────────┘

¹ = Case insensitive 'script' followed by one of ' \t\f\r\n/>', known
as “tag-name-terminating characters.” This sequence forms the start
of what could be a SCRIPT opening or closing tag.

param: string $sourcecode Raw contents intended to be serialized into an HTML SCRIPT element.
return: string Escaped form of input contents which will not lead to premature closing of the containing SCRIPT element.

set_attribute( $name, $value )   X-Ref
Updates or creates a new attribute on the currently matched tag with the passed value.

This function handles all necessary HTML encoding. Provide normal, unescaped string values.
The HTML API will encode the strings appropriately so that the browser will interpret them
as the intended value.

Example:

// Renders “Eggs & Milk” in a browser, encoded as `<abbr title="Eggs &amp; Milk">`.
$processor->set_attribute( 'title', 'Eggs & Milk' );

// Renders “Eggs &amp; Milk” in a browser, encoded as `<abbr title="Eggs &amp;amp; Milk">`.
$processor->set_attribute( 'title', 'Eggs &amp; Milk' );

// Renders `true` as `<abbr title>`.
$processor->set_attribute( 'title', true );

// Renders without the attribute for `false` as `<abbr>`.
$processor->set_attribute( 'title', false );

Special handling is provided for boolean attribute values:
- When `true` is passed as the value, then only the attribute name is added to the tag.
- When `false` is passed, the attribute gets removed if it existed before.
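
The string and boolean cases can be combined in one pass over a tag. The following sketch assumes WordPress is loaded; `sketch_update_checkbox()` is a hypothetical helper name, and the exact order and spacing of attributes in the output is not asserted here.

```php
<?php
/**
 * Hypothetical helper: applies a string, a `true`, and a `false` attribute
 * update to the first INPUT tag. Returns the updated HTML, or null when no
 * INPUT tag exists. Assumes WordPress (with the HTML API) is loaded.
 */
function sketch_update_checkbox( string $html ): ?string {
	$processor = new WP_HTML_Tag_Processor( $html );

	if ( ! $processor->next_tag( 'input' ) ) {
		return null;
	}

	// Plain, unescaped value: the HTML API encodes the ampersand.
	$processor->set_attribute( 'title', 'Eggs & Milk' );

	// `true` adds only the attribute name.
	$processor->set_attribute( 'checked', true );

	// `false` removes the attribute if it was present.
	$processor->set_attribute( 'disabled', false );

	return $processor->get_updated_html();
}

// Only runs where WordPress has loaded the HTML API.
if ( class_exists( 'WP_HTML_Tag_Processor' ) ) {
	echo sketch_update_checkbox( '<input type="checkbox" disabled>' ), "\n";
}
```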

param: string      $name  The attribute name to target.
param: string|bool $value The new attribute value.
return: bool Whether an attribute value was set.

remove_attribute( $name )   X-Ref
Remove an attribute from the currently-matched tag.

param: string $name The attribute name to remove.
return: bool Whether an attribute was removed.

add_class( $class_name )   X-Ref
Adds a new class name to the currently matched tag.

param: string $class_name The class name to add.
return: bool Whether the class was set to be added.

remove_class( $class_name )   X-Ref
Removes a class name from the currently matched tag.

param: string $class_name The class name to remove.
return: bool Whether the class was set to be removed.

__toString()   X-Ref
Returns the string representation of the HTML Tag Processor.

return: string The processed HTML.

get_updated_html()   X-Ref
Returns the string representation of the HTML Tag Processor.

return: string The processed HTML.

parse_query( $query )   X-Ref
Parses tag query input into internal search criteria.

param: array|string|null $query {

matches()   X-Ref
Checks whether a given tag and its attributes match the search criteria.

return: bool Whether the given tag and its attributes match the search criteria.

get_doctype_info()   X-Ref
Gets DOCTYPE declaration info from a DOCTYPE token.

DOCTYPE tokens may appear in many places in an HTML document. In most places, they are
simply ignored. The main parsing functions find the basic shape of DOCTYPE tokens but
do not perform detailed parsing.

This method can be called to perform a full parse of the DOCTYPE token and retrieve
its information.
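
A full parse might be requested only once a DOCTYPE token is reached, as in the sketch below. It assumes WordPress is loaded and that the returned `WP_HTML_Doctype_Info` object exposes an `indicated_compatibility_mode` property; `sketch_doctype_compat_mode()` is a hypothetical helper name.

```php
<?php
/**
 * Hypothetical helper: reports the compatibility mode indicated by the first
 * DOCTYPE token, or null when none is found. Assumes WordPress (with the
 * HTML API) is loaded, and that WP_HTML_Doctype_Info exposes the
 * `indicated_compatibility_mode` property.
 */
function sketch_doctype_compat_mode( string $html ): ?string {
	$processor = new WP_HTML_Tag_Processor( $html );

	while ( $processor->next_token() ) {
		if ( '#doctype' !== $processor->get_token_type() ) {
			continue;
		}

		// The detailed parse happens only here, on request.
		$doctype = $processor->get_doctype_info();

		return null === $doctype ? null : $doctype->indicated_compatibility_mode;
	}

	return null;
}

// Only runs where WordPress has loaded the HTML API.
if ( class_exists( 'WP_HTML_Tag_Processor' ) ) {
	var_dump( sketch_doctype_compat_mode( '<!DOCTYPE html><p>Hello</p>' ) );
}
```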

return: WP_HTML_Doctype_Info|null The DOCTYPE declaration information or `null` if not



Generated : Tue Apr 21 08:20:12 2026 Cross-referenced by PHPXref