PHP Cross Reference of WordPress Trunk (Updated Daily)

/wp-includes/html-api/ -> class-wp-html-tag-processor.php (summary)

HTML API: WP_HTML_Tag_Processor class

Scans through an HTML document to find specific tags, then transforms those tags by adding, removing, or updating the values of the HTML attributes within that tag (opener).

File Size: 4503 lines (145 kb)
Included or required: 0 times
Referenced: 0 times
Includes or requires: 0 files

Defines 1 class

WP_HTML_Tag_Processor:: (46 methods):
  __construct()
  change_parsing_namespace()
  next_tag()
  next_token()
  base_class_next_token()
  paused_at_incomplete_token()
  class_list()
  has_class()
  set_bookmark()
  release_bookmark()
  skip_rawtext()
  skip_rcdata()
  skip_script_data()
  parse_next_tag()
  parse_next_attribute()
  skip_whitespace()
  after_tag()
  class_name_updates_to_attributes_updates()
  apply_attributes_updates()
  has_bookmark()
  seek()
  sort_start_ascending()
  get_enqueued_attribute_value()
  get_attribute()
  get_attribute_names_with_prefix()
  get_namespace()
  get_tag()
  get_qualified_tag_name()
  get_qualified_attribute_name()
  has_self_closing_flag()
  is_tag_closer()
  get_token_type()
  get_token_name()
  get_comment_type()
  subdivide_text_appropriately()
  get_modifiable_text()
  set_modifiable_text()
  set_attribute()
  remove_attribute()
  add_class()
  remove_class()
  __toString()
  get_updated_html()
  parse_query()
  matches()
  get_doctype_info()


Class: WP_HTML_Tag_Processor  - X-Ref

Core class used to modify attributes in an HTML document for tags matching a query.

## Usage

Use of this class requires three steps:

1. Create a new class instance with your input HTML document.
2. Find the tag(s) you are looking for.
3. Request changes to the attributes in those tag(s).

Example:

$tags = new WP_HTML_Tag_Processor( $html );
if ( $tags->next_tag( 'option' ) ) {
    $tags->set_attribute( 'selected', true );
}
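
To see the effect of the three steps, the modified document can be read back with `get_updated_html()`, which is listed among this class's methods. The following is a minimal sketch, assuming WordPress is already loaded so that `WP_HTML_Tag_Processor` is available; the `$html` input is made up for illustration.

```php
<?php
// Assumes WordPress is loaded (e.g. inside a plugin). The markup is hypothetical.
$html = '<select><option value="a">A</option><option value="b">B</option></select>';

$tags = new WP_HTML_Tag_Processor( $html );
if ( $tags->next_tag( 'option' ) ) {
    // Only the first OPTION is matched, so only it is modified.
    $tags->set_attribute( 'selected', true );
}

// Serialize the document with the enqueued change applied.
$updated = $tags->get_updated_html();
```

Because `next_tag()` is called once, the second OPTION is left untouched.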

### Finding tags

The `next_tag()` function moves the internal cursor through
your input HTML document until it finds a tag meeting all of
the supplied restrictions in the optional query argument. If
no argument is provided then it will find the next HTML tag,
regardless of what kind it is.

If you want to _find whatever the next tag is_:

$tags->next_tag();

| Goal                                                      | Query                                                                           |
|-----------------------------------------------------------|---------------------------------------------------------------------------------|
| Find any tag.                                             | `$tags->next_tag();`                                                            |
| Find next image tag.                                      | `$tags->next_tag( array( 'tag_name' => 'img' ) );`                              |
| Find next image tag (without passing the array).          | `$tags->next_tag( 'img' );`                                                     |
| Find next tag containing the `fullwidth` CSS class.       | `$tags->next_tag( array( 'class_name' => 'fullwidth' ) );`                      |
| Find next image tag containing the `fullwidth` CSS class. | `$tags->next_tag( array( 'tag_name' => 'img', 'class_name' => 'fullwidth' ) );` |

If a tag was found meeting your criteria then `next_tag()`
will return `true` and you can proceed to modify it. If it
returns `false`, however, it failed to find the tag and
moved the cursor to the end of the file.

Once the cursor reaches the end of the file the processor
is done and if you want to reach an earlier tag you will
need to recreate the processor and start over, as it's
unable to back up or move in reverse.

See the section on bookmarks for an exception to this
no-backing-up rule.

#### Custom queries

Sometimes it's necessary to inspect an HTML tag more closely than
the query syntax here permits. In these cases one may further
inspect the search results using the read-only functions
provided by the processor, or with external state or variables.

Example:

// Paint up to the first five DIV or SPAN tags marked with the "jazzy" style.
$remaining_count = 5;
while ( $remaining_count > 0 && $tags->next_tag() ) {
    if (
        ( 'DIV' === $tags->get_tag() || 'SPAN' === $tags->get_tag() ) &&
        'jazzy' === $tags->get_attribute( 'data-style' )
    ) {
        $tags->add_class( 'theme-style-everest-jazz' );
        $remaining_count--;
    }
}

`get_attribute()` will return `null` if the attribute wasn't present
on the tag when it was called. It may return `""` (the empty string)
in cases where the attribute was present but its value was empty.
For boolean attributes, those whose name is present but no value is
given, it will return `true` (the only way to set `false` for an
attribute is to remove it).
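
The return values described above can be told apart with strict comparisons. This is a small sketch, assuming WordPress is loaded; the INPUT markup is invented for the example.

```php
<?php
// Assumes WordPress is loaded. The markup is a made-up example.
$tags = new WP_HTML_Tag_Processor( '<input type="text" value="" disabled>' );
$tags->next_tag( 'input' );

$type     = $tags->get_attribute( 'type' );     // 'text': present with a value.
$value    = $tags->get_attribute( 'value' );    // '': present but empty.
$disabled = $tags->get_attribute( 'disabled' ); // true: boolean attribute.
$name     = $tags->get_attribute( 'name' );     // null: not present on the tag.
```

Since `''` and `null` are both falsy in PHP, prefer `null === $value` over `! $value` when checking for a missing attribute.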

#### When matching fails

When `next_tag()` returns `false` it could mean different things:

- The requested tag wasn't found in the input document.
- The input document ended in the middle of an HTML syntax element.

When a document ends in the middle of a syntax element it will pause
the processor. This is to make it possible in the future to extend the
input document and proceed - an important requirement for chunked
streaming parsing of a document.

Example:

$processor = new WP_HTML_Tag_Processor( 'This <div is="a" partial="token' );
false === $processor->next_tag();

If a special element (see next section) is encountered but no closing tag
is found it will count as an incomplete tag. The parser will pause as if
the opening tag were incomplete.

Example:

$processor = new WP_HTML_Tag_Processor( '<style>// there could be more styling to come' );
false === $processor->next_tag();

$processor = new WP_HTML_Tag_Processor( '<style>// this is everything</style><div>' );
true === $processor->next_tag( 'DIV' );
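
The two meanings of `false` can be separated with `paused_at_incomplete_token()`, documented further down this page. The sketch below assumes a WordPress version that provides that method; `find_tag_status()` is a hypothetical helper written for this example, not part of the HTML API.

```php
<?php
// Assumes WordPress is loaded. `find_tag_status()` is a hypothetical helper.
function find_tag_status( string $html, string $tag_name ): string {
    $processor = new WP_HTML_Tag_Processor( $html );
    if ( $processor->next_tag( $tag_name ) ) {
        return 'found';
    }

    // No match: either the document ended cleanly or it was cut off mid-token.
    return $processor->paused_at_incomplete_token() ? 'incomplete' : 'not-found';
}

$complete = find_tag_status( '<p>No images here.</p>', 'img' );
$partial  = find_tag_status( 'This <div is="a" partial="token', 'img' );
```

Here `$complete` reports a document that was fully scanned without a match, while `$partial` reports input that stopped in the middle of a tag.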

#### Special self-contained elements

Some HTML elements are handled in a special way; their start and end tags
act like a void tag. These are special because their contents can't contain
HTML markup. Everything inside these elements is handled in a special way
and content that _appears_ like HTML tags inside of them isn't. There can
be no nesting in these elements.

In the following list, "raw text" means that all of the content in the HTML
until the matching closing tag is treated verbatim without any replacements
and without any parsing.

- IFRAME allows no content but requires a closing tag.
- NOEMBED (deprecated) content is raw text.
- NOFRAMES (deprecated) content is raw text.
- SCRIPT content is plaintext apart from legacy rules allowing `</script>` inside an HTML comment.
- STYLE content is raw text.
- TITLE content is plain text but character references are decoded.
- TEXTAREA content is plain text but character references are decoded.
- XMP (deprecated) content is raw text.

### Modifying HTML attributes for a found tag

Once you've found the start of an opening tag you can modify
any number of the attributes on that tag. You can set a new
value for an attribute, remove the entire attribute, or do
nothing and move on to the next opening tag.

Example:

if ( $tags->next_tag( array( 'class_name' => 'wp-group-block' ) ) ) {
    $tags->set_attribute( 'title', 'This groups the contained content.' );
    $tags->remove_attribute( 'data-test-id' );
}

If `set_attribute()` is called for an existing attribute it will
overwrite the existing value. Similarly, calling `remove_attribute()`
for a non-existing attribute has no effect on the document. Both
of these methods are safe to call without knowing if a given attribute
exists beforehand.
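
As an illustration of both rules, the sketch below overwrites an existing attribute and removes one that isn't there. It assumes WordPress is loaded; the markup is invented for the example.

```php
<?php
// Assumes WordPress is loaded. The markup is a made-up example.
$tags = new WP_HTML_Tag_Processor( '<div title="old">Text</div>' );
if ( $tags->next_tag( 'div' ) ) {
    // `title` already exists, so its value is overwritten.
    $tags->set_attribute( 'title', 'new' );

    // `data-test-id` doesn't exist, so this call changes nothing.
    $tags->remove_attribute( 'data-test-id' );
}

$updated = $tags->get_updated_html(); // '<div title="new">Text</div>'
```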

### Modifying CSS classes for a found tag

The tag processor treats the `class` attribute as a special case.
Because it's a common operation to add or remove CSS classes, this
interface adds helper methods to make that easier.

As with attribute values, adding or removing CSS classes is a safe
operation that doesn't require checking if the attribute or class
exists before making changes. If removing the only class then the
entire `class` attribute will be removed.

Example:

// from `<span>Yippee!</span>`
//   to `<span class="is-active">Yippee!</span>`
$tags->add_class( 'is-active' );

// from `<span class="excited">Yippee!</span>`
//   to `<span class="excited is-active">Yippee!</span>`
$tags->add_class( 'is-active' );

// from `<span class="is-active heavy-accent">Yippee!</span>`
//   to `<span class="is-active heavy-accent">Yippee!</span>`
$tags->add_class( 'is-active' );

// from `<input type="text" class="is-active rugby not-disabled" length="24">`
//   to `<input type="text" class="is-active not-disabled" length="24">`
$tags->remove_class( 'rugby' );

// from `<input type="text" class="rugby" length="24">`
//   to `<input type="text" length="24">`
$tags->remove_class( 'rugby' );

// from `<input type="text" length="24">`
//   to `<input type="text" length="24">`
$tags->remove_class( 'rugby' );

When class changes are enqueued but a direct change to `class` is made via
`set_attribute` (or `remove_attribute`), the direct change will take
precedence over those made through `add_class` and `remove_class`.
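
A short sketch of that precedence rule, assuming WordPress is loaded; the markup and class names are invented for the example.

```php
<?php
// Assumes WordPress is loaded. The markup and class names are hypothetical.
$tags = new WP_HTML_Tag_Processor( '<span class="excited">Yippee!</span>' );
$tags->next_tag( 'span' );

// Enqueue a class change first...
$tags->add_class( 'is-active' );

// ...then set `class` directly. The direct change wins.
$tags->set_attribute( 'class', 'reset' );

$updated = $tags->get_updated_html();
```

After these calls the enqueued `is-active` class is not expected to appear in `$updated`.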

### Bookmarks

While scanning through the input HTML document it's possible to set
a named bookmark when a particular tag is found. Later on, after
continuing to scan other tags, it's possible to `seek` to one of
the set bookmarks and then proceed again from that point forward.

Because bookmarks create processing overhead one should avoid
creating too many of them. As a rule, create only bookmarks
of known string literal names; avoid creating "mark_{$index}"
and so on. It's fine from a performance standpoint to create a
bookmark and update it frequently, such as within a loop.

$total_todos = 0;
while ( $p->next_tag( array( 'tag_name' => 'UL', 'class_name' => 'todo' ) ) ) {
    $p->set_bookmark( 'list-start' );
    while ( $p->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
        if ( 'UL' === $p->get_tag() && $p->is_tag_closer() ) {
            $p->set_bookmark( 'list-end' );
            $p->seek( 'list-start' );
            $p->set_attribute( 'data-contained-todos', (string) $total_todos );
            $total_todos = 0;
            $p->seek( 'list-end' );
            break;
        }

        if ( 'LI' === $p->get_tag() && ! $p->is_tag_closer() ) {
            $total_todos++;
        }
    }
}

## Tokens and finer-grained processing.

It's possible to scan through every lexical token in the
HTML document using the `next_token()` function. This
alternative form takes no argument and provides no built-in
query syntax.

Example:

$title = '(untitled)';
$text  = '';
while ( $processor->next_token() ) {
    switch ( $processor->get_token_name() ) {
        case '#text':
            $text .= $processor->get_modifiable_text();
            break;

        case 'BR':
            $text .= "\n";
            break;

        case 'TITLE':
            $title = $processor->get_modifiable_text();
            break;
    }
}
return trim( "# {$title}\n\n{$text}" );

### Tokens and _modifiable text_.

#### Special "atomic" HTML elements.

Not all HTML elements are able to contain other elements inside of them.
For instance, the contents inside a TITLE element are plaintext (except
that character references like &amp; will be decoded). This means that
if the string `<img>` appears inside a TITLE element, then it's not an
image tag, but rather it's text describing an image tag. Likewise, the
contents of a SCRIPT or STYLE element are handled entirely separately in
a browser than the contents of other elements because they represent a
different language than HTML.

For these elements the Tag Processor treats the entire sequence as one,
from the opening tag, including its contents, through its closing tag.
This means that it's not possible to match the closing tag for a
SCRIPT element unless it's unexpected; the Tag Processor already matched
it when it found the opening tag.

The inner contents of these elements are that element's _modifiable text_.

The special elements are:
- `SCRIPT` whose contents are treated as raw plaintext but supports a legacy
style of including JavaScript inside of HTML comments to avoid accidentally
closing the SCRIPT from inside a JavaScript string. E.g. `console.log( '</script>' )`.
- `TITLE` and `TEXTAREA` whose contents are treated as plaintext and then any
character references are decoded. E.g. `1 &lt; 2 < 3` becomes `1 < 2 < 3`.
- `IFRAME`, `NOEMBED`, `NOFRAMES`, `STYLE`, `XMP` whose contents are treated as
raw plaintext and left as-is. E.g. `1 &lt; 2 < 3` remains `1 &lt; 2 < 3`.

#### Other tokens with modifiable text.

There are also non-elements which are void/self-closing in nature and contain
modifiable text that is part of that individual syntax token itself.

- `#text` nodes, whose entire token _is_ the modifiable text.
- HTML comments and tokens that become comments due to some syntax error. The
text for these tokens is the portion of the comment inside of the syntax.
E.g. for `<!-- comment -->` the text is `" comment "` (note the spaces are included).
- `CDATA` sections, whose text is the content inside of the section itself. E.g. for
`<![CDATA[some content]]>` the text is `"some content"` (with restrictions [1]).
- "Funky comments," which are a special case of invalid closing tags whose name is
invalid. The text for these nodes is the text that a browser would transform into
an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`.
- `DOCTYPE` declarations like `<!DOCTYPE html>` which have no closing tag.
- XML Processing instruction nodes like `<?wp __( "Like" ); ?>` (with restrictions [2]).
- The empty end tag `</>` which is ignored in the browser and DOM.

[1]: There are no CDATA sections in HTML. When encountering `<![CDATA[`, everything
until the next `>` becomes a bogus HTML comment, meaning there can be no CDATA
section in an HTML document containing `>`. The Tag Processor will first find
all valid and bogus HTML comments, and then if the comment _would_ have been a
CDATA section _were they to exist_, it will indicate this as the type of comment.

[2]: XML allows a broader range of characters in a processing instruction's target name
and disallows "xml" as a name, since it's special. The Tag Processor only recognizes
target names with an ASCII-representable subset of characters. It also exhibits the
same constraint as with CDATA sections, in that `>` cannot exist within the token
since Processing Instructions do not exist within HTML and their syntax transforms
into a bogus comment in the DOM.
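
The comment and funky-comment cases from the list above can be observed with `next_token()`, `get_token_type()`, and `get_modifiable_text()`. This is a minimal sketch, assuming WordPress is loaded; the input reuses the two snippets quoted in the list.

```php
<?php
// Assumes WordPress is loaded. Input combines the two snippets quoted above.
$processor = new WP_HTML_Tag_Processor( '<!-- comment --></%post_author>' );

$seen = array();
while ( $processor->next_token() ) {
    // Record the token type next to its modifiable text.
    $seen[] = array(
        $processor->get_token_type(),
        $processor->get_modifiable_text(),
    );
}
```

Following the descriptions above, `$seen` should hold a `#comment` entry with the text `" comment "` and a `#funky-comment` entry with the text `%post_author`.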

## Design and limitations

The Tag Processor is designed to linearly scan HTML documents and tokenize
HTML tags and their attributes. It's designed to do this as efficiently as
possible without compromising parsing integrity. Therefore it will be
slower than some methods of modifying HTML, such as those incorporating
over-simplified PCRE patterns, but will not introduce the defects and
failures that those methods bring in, which lead to broken page renders
and often to security vulnerabilities. On the other hand, it will be faster
than full-blown HTML parsers such as DOMDocument and use considerably
less memory. It requires a negligible memory overhead, enough to consider
it a zero-overhead system.

The performance characteristics are maintained by avoiding tree construction
and semantic cleanups which are specified in HTML5. Because of this, for
example, it's not possible for the Tag Processor to associate any given
opening tag with its corresponding closing tag, or to return the inner markup
inside an element. Systems may be built on top of the Tag Processor to do
this, but the Tag Processor is and should be constrained so it can remain an
efficient, low-level, and reliable HTML scanner.

The Tag Processor's design incorporates a "garbage-in-garbage-out" philosophy.
HTML5 specifies that certain invalid content be transformed into different forms
for display, such as removing null bytes from an input document and replacing
invalid characters with the Unicode replacement character `U+FFFD` (visually "�").
Where errors or transformations exist within the HTML5 specification, the Tag Processor
leaves those invalid inputs untouched, passing them through to the final browser
to handle. While this implies that certain operations will be non-spec-compliant,
such as reading the value of an attribute with invalid content, it also preserves a
simplicity and efficiency for handling those error cases.

Most operations within the Tag Processor are designed to minimize the difference
between an input and output document for any given change. For example, the
`add_class` and `remove_class` methods preserve whitespace and the class ordering
within the `class` attribute; and when encountering tags with duplicated attributes,
the Tag Processor will leave those invalid duplicate attributes where they are but
update the proper attribute which the browser will read for parsing its value. An
exception to this rule is that all attribute updates store their values as
double-quoted strings, meaning that attributes on input with single-quoted or
unquoted values will appear in the output with double-quotes.

### Scripting Flag

The Tag Processor parses HTML with the "scripting flag" disabled. This means
that it doesn't run any scripts while parsing the page. In a browser with
JavaScript enabled, for example, the script can change the parse of the
document as it loads. On the server, however, evaluating JavaScript is not
only impractical, but also unwanted.

Practically this means that the Tag Processor will descend into NOSCRIPT
elements and process their child tags. Were the scripting flag enabled, such
as in a typical browser, the contents of NOSCRIPT would be skipped entirely.

This allows the HTML API to process the content that will be presented in
a browser when scripting is disabled, but it offers a different view of a
page than most browser sessions will experience. E.g. in a browser with
scripting enabled, the tags inside the NOSCRIPT disappear.
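
A brief sketch of what descending into NOSCRIPT means in practice, assuming WordPress is loaded; the markup and file name are invented for the example.

```php
<?php
// Assumes WordPress is loaded. The markup and file name are hypothetical.
$processor = new WP_HTML_Tag_Processor(
    '<noscript><img src="fallback.png"></noscript>'
);

// The IMG inside NOSCRIPT is visited like any other tag.
$found = $processor->next_tag( 'img' );
$src   = $found ? $processor->get_attribute( 'src' ) : null;
```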

### Text Encoding

The Tag Processor assumes that the input HTML document is encoded with a
text encoding compatible with 7-bit ASCII's '<', '>', '&', ';', '/', '=',
"'", '"', 'a' - 'z', 'A' - 'Z', and the whitespace characters ' ', tab,
carriage-return, newline, and form-feed.

In practice, this includes almost every single-byte encoding as well as
UTF-8. Notably, however, it does not include UTF-16. If providing input
that's incompatible, then convert the encoding beforehand.
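
One possible way to convert beforehand is PHP's `mb_convert_encoding()`. The sketch below assumes the `mbstring` extension is installed and that the input is known to be UTF-16LE; the sample markup is invented, and the final Tag Processor call is left commented out because it needs WordPress.

```php
<?php
// Build a UTF-16LE sample; real input would come from a file or a request.
$utf16 = mb_convert_encoding( '<p class="intro">Hi</p>', 'UTF-16LE', 'UTF-8' );

// Convert to UTF-8 before handing the document to the Tag Processor.
$utf8 = mb_convert_encoding( $utf16, 'UTF-8', 'UTF-16LE' );

// With WordPress loaded, the converted string can then be processed:
// $tags = new WP_HTML_Tag_Processor( $utf8 );
```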

__construct( $html )   X-Ref
Constructor.

param: string $html HTML to process.

change_parsing_namespace( string $new_namespace )   X-Ref
Switches parsing mode into a new namespace, such as when
encountering an SVG tag and entering foreign content.

return: bool Whether the namespace was valid and changed.
param: string $new_namespace One of 'html', 'svg', or 'math' indicating into what

next_tag( $query = null )   X-Ref
Finds the next tag matching the $query.

return: bool Whether a tag was matched.
param: array|string|null $query {

next_token()   X-Ref
Finds the next token in the HTML document.

An HTML document can be viewed as a stream of tokens,
where tokens are things like HTML tags, HTML comments,
text nodes, etc. This method finds the next token in
the HTML document and returns whether it found one.

If it starts parsing a token and reaches the end of the
document then it will seek to the start of the last
token and pause, returning `false` to indicate that it
failed to find a complete token.

Possible token types, based on the HTML specification:

- an HTML tag, whether opening, closing, or void.
- a text node - the plaintext inside tags.
- an HTML comment.
- a DOCTYPE declaration.
- a processing instruction, e.g. `<?xml version="1.0" ?>`.

The Tag Processor currently only supports the tag token.

return: bool Whether a token was parsed.

base_class_next_token()   X-Ref
Internal method which finds the next token in the HTML document.

This method is a protected internal function which implements the logic for
finding the next token in a document. It exists so that the parser can update
its state without affecting the location of the cursor in the document and
without triggering subclass methods for things like `next_token()`, e.g. when
applying patches before searching for the next token.

return: bool Whether a token was parsed.

paused_at_incomplete_token()   X-Ref
Whether the processor paused because the input HTML document ended
in the middle of a syntax element, such as in the middle of a tag.

Example:

$processor = new WP_HTML_Tag_Processor( '<input type="text" value="Th' );
false      === $processor->next_tag();
true       === $processor->paused_at_incomplete_token();

return: bool Whether the parse paused at the start of an incomplete token.

class_list()   X-Ref
Generator for a foreach loop to step through each class name for the matched tag.

This generator function is designed to be used inside a "foreach" loop.

Example:

$p = new WP_HTML_Tag_Processor( "<div class='free &lt;egg&gt;\tlang-en'>" );
$p->next_tag();
foreach ( $p->class_list() as $class_name ) {
    echo "{$class_name} ";
}
// Outputs: "free <egg> lang-en "


has_class( $wanted_class )   X-Ref
Returns whether a matched tag contains the given ASCII case-insensitive class name.

return: bool|null Whether the matched tag contains the given class name, or null if not matched.
param: string $wanted_class Look for this CSS class name, ASCII case-insensitive.

set_bookmark( $name )   X-Ref
Sets a bookmark in the HTML document.

Bookmarks represent specific places or tokens in the HTML
document, such as a tag opener or closer. When applying
edits to a document, such as setting an attribute, the
text offsets of that token may shift; the bookmark is
kept updated with those shifts and remains stable unless
the entire span of text in which the token sits is removed.

Release bookmarks when they are no longer needed.

Example:

<main><h2>Surprising fact you may not know!</h2></main>
      ^  ^
       \-|-- this `H2` opener bookmark tracks the token

<main class="clickbait"><h2>Surprising fact you may no…
                        ^  ^
                         \-|-- it shifts with edits

Bookmarks provide the ability to seek to a previously-scanned
place in the HTML document. This avoids the need to re-scan
the entire document.

Example:

<ul><li>One</li><li>Two</li><li>Three</li></ul>
                            ^^^^
                            want to note this last item

$p = new WP_HTML_Tag_Processor( $html );
$in_list = false;
while ( $p->next_tag( array( 'tag_closers' => $in_list ? 'visit' : 'skip' ) ) ) {
    if ( 'UL' === $p->get_tag() ) {
        if ( $p->is_tag_closer() ) {
            $in_list = false;
            $p->set_bookmark( 'resume' );
            if ( $p->seek( 'last-li' ) ) {
                $p->add_class( 'last-li' );
            }
            $p->seek( 'resume' );
            $p->release_bookmark( 'last-li' );
            $p->release_bookmark( 'resume' );
        } else {
            $in_list = true;
        }
    }

    if ( 'LI' === $p->get_tag() ) {
        $p->set_bookmark( 'last-li' );
    }
}

Bookmarks intentionally hide the internal string offsets
to which they refer. They are maintained internally as
updates are applied to the HTML document and therefore
retain their "position" - the location to which they
originally pointed. The inability to use bookmarks with
functions like `substr` is therefore intentional to guard
against accidentally breaking the HTML.

Because bookmarks allocate memory and require processing
for every applied update, they are limited and require
a name. They should not be created with programmatically-made
names, such as "li_{$index}" with some loop. As a general
rule they should only be created with string-literal names
like "start-of-section" or "last-paragraph".

Bookmarks are a powerful tool to enable complicated behavior.
Consider double-checking that you need this tool if you are
reaching for it, as inappropriate use could lead to broken
HTML structure or unwanted processing overhead.

return: bool Whether the bookmark was successfully created.
param: string $name Identifies this particular bookmark.

release_bookmark( $name )   X-Ref
Removes a bookmark that is no longer needed.

Releasing a bookmark frees up the small
performance overhead it requires.

return: bool Whether the bookmark already existed before removal.
param: string $name Name of the bookmark to remove.

skip_rawtext( string $tag_name )   X-Ref
Skips contents of generic rawtext elements.

return: bool Whether an end to the RAWTEXT region was found before the end of the document.
param: string $tag_name The uppercase tag name which will close the RAWTEXT region.

skip_rcdata( string $tag_name )   X-Ref
Skips contents of RCDATA elements, namely title and textarea tags.

return: bool Whether an end to the RCDATA region was found before the end of the document.
param: string $tag_name The uppercase tag name which will close the RCDATA region.

skip_script_data()   X-Ref
Skips contents of script tags.

return: bool Whether the script tag was closed before the end of the document.

parse_next_tag()   X-Ref
Parses the next tag.

This will find and start parsing the next tag, including
the opening `<`, the potential closer `/`, and the tag
name. It does not parse the attributes or scan to the
closing `>`; these are left for other methods.

return: bool Whether a tag was found before the end of the document.

parse_next_attribute()   X-Ref
Parses the next attribute.

return: bool Whether an attribute was found before the end of the document.

skip_whitespace()   X-Ref
Move the internal cursor past any immediate successive whitespace.


after_tag()   X-Ref
Applies attribute updates and cleans up once a tag is fully parsed.


class_name_updates_to_attributes_updates()   X-Ref
Converts class name updates into tag attributes updates
(they are accumulated in different data formats for performance).


apply_attributes_updates( int $shift_this_point )   X-Ref
Applies attribute updates to HTML document.

return: int How many bytes the given pointer moved in response to the updates.
param: int $shift_this_point Accumulate and return shift for this position.

has_bookmark( $bookmark_name )   X-Ref
Checks whether a bookmark with the given name exists.

return: bool Whether that bookmark exists.
param: string $bookmark_name Name to identify a bookmark that potentially exists.

seek( $bookmark_name )   X-Ref
Move the internal cursor in the Tag Processor to a given bookmark's location.

In order to prevent accidental infinite loops, there's a
maximum limit on the number of times seek() can be called.

return: bool Whether the internal cursor was successfully moved to the bookmark's location.
param: string $bookmark_name Jump to the place in the document identified by this bookmark name.

sort_start_ascending( WP_HTML_Text_Replacement $a, WP_HTML_Text_Replacement $b )   X-Ref
Compare two WP_HTML_Text_Replacement objects.

return: int Comparison value for string order.
param: WP_HTML_Text_Replacement $a First attribute update.
param: WP_HTML_Text_Replacement $b Second attribute update.

get_enqueued_attribute_value( string $comparable_name )   X-Ref
Return the enqueued value for a given attribute, if one exists.

Enqueued updates can take different data types:
- If an update is enqueued and is boolean, the return will be `true`
- If an update is otherwise enqueued, the return will be the string value of that update.
- If an attribute is enqueued to be removed, the return will be `null` to indicate that.
- If no updates are enqueued, the return will be `false` to differentiate from "removed."

return: string|boolean|null Value of enqueued update if present, otherwise false.
param: string $comparable_name The attribute name in its comparable form.

get_attribute( $name )   X-Ref
Returns the value of a requested attribute from a matched tag opener if that attribute exists.

Example:

$p = new WP_HTML_Tag_Processor( '<div enabled class="test" data-test-id="14">Test</div>' );
$p->next_tag( array( 'class_name' => 'test' ) ) === true;
$p->get_attribute( 'data-test-id' ) === '14';
$p->get_attribute( 'enabled' ) === true;
$p->get_attribute( 'aria-label' ) === null;

$p->next_tag() === false;
$p->get_attribute( 'class' ) === null;

return: string|true|null Value of attribute or `null` if not available. Boolean attributes return `true`.
param: string $name Name of attribute whose value is requested.

get_attribute_names_with_prefix( $prefix )   X-Ref
Gets lowercase names of all attributes matching a given prefix in the current tag.

Note that matching is case-insensitive. This is in accordance with the spec:

> There must never be two or more attributes on
> the same start tag whose names are an ASCII
> case-insensitive match for each other.
- HTML 5 spec

Example:

$p = new WP_HTML_Tag_Processor( '<div data-ENABLED class="test" DATA-test-id="14">Test</div>' );
$p->next_tag( array( 'class_name' => 'test' ) ) === true;
$p->get_attribute_names_with_prefix( 'data-' ) === array( 'data-enabled', 'data-test-id' );

$p->next_tag() === false;
$p->get_attribute_names_with_prefix( 'data-' ) === null;

return: array|null List of attribute names, or `null` when no tag opener is matched.
param: string $prefix Prefix of requested attribute names.
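
One use for the returned names is bulk removal of a family of attributes. The following is a sketch only, assuming WordPress is loaded; the markup is invented for the example.

```php
<?php
// Assumes WordPress is loaded. The markup is a made-up example.
$p = new WP_HTML_Tag_Processor( '<div data-a="1" class="x" DATA-b="2">Test</div>' );

$removed = array();
if ( $p->next_tag() ) {
    // Names come back lowercased, whatever their case in the source.
    foreach ( $p->get_attribute_names_with_prefix( 'data-' ) as $name ) {
        $p->remove_attribute( $name );
        $removed[] = $name;
    }
}

$updated = $p->get_updated_html();
```

The `class` attribute is left in place because it doesn't match the prefix.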

get_namespace()   X-Ref
Returns the namespace of the matched token.

return: string One of 'html', 'math', or 'svg'.

get_tag()   X-Ref
Returns the uppercase name of the matched tag.

Example:

$p = new WP_HTML_Tag_Processor( '<div class="test">Test</div>' );
$p->next_tag() === true;
$p->get_tag() === 'DIV';

$p->next_tag() === false;
$p->get_tag() === null;

return: string|null Name of currently matched tag in input HTML, or `null` if none found.

get_qualified_tag_name()   X-Ref
Returns the adjusted tag name for a given token, taking into
account the current parsing context, whether HTML, SVG, or MathML.

return: string|null Name of current tag name.

get_qualified_attribute_name( $attribute_name )   X-Ref
Returns the adjusted attribute name for a given attribute, taking into
account the current parsing context, whether HTML, SVG, or MathML.

return: string|null
param: string $attribute_name Which attribute to adjust.

has_self_closing_flag()   X-Ref
Indicates if the currently matched tag contains the self-closing flag.

No HTML elements ought to have the self-closing flag and for those, the self-closing
flag will be ignored. For void elements this is benign because they "self close"
automatically. For non-void HTML elements though problems will appear if someone
intends to use a self-closing element in place of that element with an empty body.
For HTML foreign elements and custom elements the self-closing flag determines if
they self-close or not.

This function does not determine if a tag is self-closing,
but only if the self-closing flag is present in the syntax.

return: bool Whether the currently matched tag contains the self-closing flag.
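
The distinction between "has the flag" and "is self-closing" can be seen in a small sketch, assuming WordPress is loaded; the markup is invented for the example.

```php
<?php
// Assumes WordPress is loaded. The markup is a made-up example.
$p = new WP_HTML_Tag_Processor( '<br><div />' );

$flags = array();
while ( $p->next_tag() ) {
    // Record only whether the trailing "/" appears in the tag's syntax.
    $flags[ $p->get_tag() ] = $p->has_self_closing_flag();
}
```

BR is a void element written without the flag, so it reports `false`; the DIV carries the flag and reports `true`, even though an HTML parser ignores the flag on a DIV and leaves the element open.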

is_tag_closer()   X-Ref
Indicates if the current tag token is a tag closer.

Example:

$p = new WP_HTML_Tag_Processor( '<div></div>' );
$p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) );
$p->is_tag_closer() === false;

$p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) );
$p->is_tag_closer() === true;

return: bool Whether the current tag is a tag closer.

get_token_type()   X-Ref
Indicates the kind of matched token, if any.

This differs from `get_token_name()` in that it always
returns a static string indicating the type, whereas
`get_token_name()` may return values derived from the
token itself, such as a tag name or processing
instruction tag.

Possible values:
- `#tag` when matched on a tag.
- `#text` when matched on a text node.
- `#cdata-section` when matched on a CDATA node.
- `#comment` when matched on a comment.
- `#doctype` when matched on a DOCTYPE declaration.
- `#presumptuous-tag` when matched on an empty tag closer.
- `#funky-comment` when matched on a funky comment.

return: string|null What kind of token is matched, or null.
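
To illustrate the static type strings, the sketch below collects the type of every token in a short document. It assumes WordPress is loaded; the markup is invented for the example.

```php
<?php
// Assumes WordPress is loaded. The markup is a made-up example.
$processor = new WP_HTML_Tag_Processor( '<p>Hello<!-- note --></p>' );

$types = array();
while ( $processor->next_token() ) {
    $types[] = $processor->get_token_type();
}
```

Both the P opener and the P closer are reported as `#tag`; `is_tag_closer()` tells them apart.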

get_token_name()   X-Ref
Returns the node name represented by the token.

This matches the DOM API value `nodeName`. Some values
are static, such as `#text` for a text node, while others
are dynamically generated from the token itself.

Dynamic names:
- Uppercase tag name for tag matches.
- `html` for DOCTYPE declarations.

Note that if the Tag Processor is not matched on a token
then this function will return `null`, either because it
hasn't yet found a token or because it reached the end
of the document without matching a token.

return: string|null Name of the matched token.
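
The difference between `get_token_type()` and `get_token_name()` can be sketched by walking every token in a small document. This is a minimal example that assumes WordPress 6.5 or newer is loaded; `$seen` is a variable name chosen for illustration.

```php
// Assumes WordPress 6.5 or newer is loaded; $seen is an illustrative variable name.
$processor = new WP_HTML_Tag_Processor( '<!DOCTYPE html><p>Hi<!-- note --></p>' );

$seen = array();
while ( $processor->next_token() ) {
	// The type is a fixed label; the name may be derived from the token itself.
	$seen[] = array( $processor->get_token_type(), $processor->get_token_name() );
}
```

For the two P tokens the type stays `#tag` while the name is the uppercase tag name; for the DOCTYPE the type is `#doctype` while the name is `html`.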

get_comment_type()   X-Ref
Indicates what kind of comment produced the comment node.

Because there are different kinds of HTML syntax which produce
comments, the Tag Processor tracks and exposes this as a type
for the comment. Nominally only regular HTML comments exist as
they are commonly known, but a number of unrelated syntax errors
also produce comments.

return: string|null
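
As a hedged sketch of how the comment type might be inspected, the example below compares the return value against two `COMMENT_AS_*` class constants. Those constants come from WordPress core and are not listed in this section, so treat their use here as an assumption; WordPress 6.5 or newer is assumed to be loaded.

```php
// Assumes WordPress 6.5 or newer is loaded. The COMMENT_AS_* constants are
// defined in WordPress core and are not documented in this section.
$processor = new WP_HTML_Tag_Processor( '<!-- regular --><!-->' );

// First token: an ordinary HTML comment.
$processor->next_token();
$is_regular = WP_HTML_Tag_Processor::COMMENT_AS_HTML_COMMENT === $processor->get_comment_type();

// Second token: "<!-->" is a syntax error that still produces a comment node.
$processor->next_token();
$is_abrupt = WP_HTML_Tag_Processor::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT === $processor->get_comment_type();
```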

subdivide_text_appropriately()   X-Ref
Subdivides a matched text node, splitting off leading NULL byte sequences and decoded
whitespace as distinct prefix nodes.

Note that once anything that's neither a NULL byte nor decoded whitespace is
encountered, then the remainder of the text node is left intact as generic text.

- The HTML Processor uses this to apply distinct rules for different kinds of text.
- Inter-element whitespace can be detected and skipped with this method.

Text nodes aren't eagerly subdivided because there's no need to split them unless
decisions are being made on NULL byte sequences or whitespace-only text.

Example:

$processor = new WP_HTML_Tag_Processor( "\x00Apples & Oranges" );
true  === $processor->next_token();                   // Text is "Apples & Oranges".
true  === $processor->subdivide_text_appropriately(); // Text is "".
true  === $processor->next_token();                   // Text is "Apples & Oranges".
false === $processor->subdivide_text_appropriately();

$processor = new WP_HTML_Tag_Processor( "&#xA; \r\n\tMore" );
true  === $processor->next_token();                   // Text is "␤ ␤␉More".
true  === $processor->subdivide_text_appropriately(); // Text is "␤ ␤␉".
true  === $processor->next_token();                   // Text is "More".
false === $processor->subdivide_text_appropriately();

return: bool Whether the text node was subdivided.

get_modifiable_text()   X-Ref
Returns the modifiable text for a matched token, or an empty string.

Modifiable text is text content that may be read and changed without
changing the HTML structure of the document around it. This includes
the contents of `#text` nodes in the HTML as well as the inner
contents of HTML comments, Processing Instructions, and others, even
though these nodes aren't part of a parsed DOM tree. It also includes
the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any
other section in an HTML document which cannot contain HTML markup (DATA).

If a token has no modifiable text then an empty string is returned to
avoid needless crashing or type errors. An empty string therefore does not
indicate whether a token has modifiable text: a token with modifiable
text may itself contain an empty string (e.g. a comment with no contents).

Limitations:

- This function will not strip the leading newline appropriately
after seeking into a LISTING or PRE element. To ensure that the
newline is treated properly, seek to the LISTING or PRE opening
tag instead of to the first text node inside the element.

return: string
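
The cases described above can be sketched as follows. This is a minimal example that assumes WordPress 6.5 or newer is loaded; the variable names are chosen for illustration only.

```php
// Assumes WordPress 6.5 or newer is loaded; variable names are illustrative.
$processor = new WP_HTML_Tag_Processor( '<title>A &amp; B</title><!-- draft --><p>' );

// TITLE cannot contain markup, so its contents are modifiable text.
$processor->next_tag( 'TITLE' );
$title_text = $processor->get_modifiable_text();

// The inner contents of a comment are also modifiable text.
$processor->next_token();
$comment_text = $processor->get_modifiable_text();

// A P tag has no modifiable text, so an empty string is returned.
$processor->next_token();
$tag_text = $processor->get_modifiable_text();
```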

set_modifiable_text( string $plaintext_content )   X-Ref
Sets the modifiable text for the matched token, if matched.

Modifiable text is text content that may be read and changed without
changing the HTML structure of the document around it. This includes
the contents of `#text` nodes in the HTML as well as the inner
contents of HTML comments, Processing Instructions, and others, even
though these nodes aren't part of a parsed DOM tree. It also includes
the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any
other section in an HTML document which cannot contain HTML markup (DATA).

Not all modifiable text may be set by this method, and not all content
may be set as modifiable text. In the case that this fails it will return
`false` indicating as much. For instance, it will not allow inserting the
string `</script` into a SCRIPT element, because the rules for escaping
that safely are complicated. Similarly, it will not allow setting content
into a comment which would prematurely terminate the comment.

Example:

// Add a preface to all STYLE contents.
while ( $processor->next_tag( 'STYLE' ) ) {
	$style = $processor->get_modifiable_text();
	$processor->set_modifiable_text( "/* Made with love on the World Wide Web */\n{$style}" );
}

// Replace smiley text with Emoji smilies.
while ( $processor->next_token() ) {
	if ( '#text' !== $processor->get_token_name() ) {
		continue;
	}

	$chunk = $processor->get_modifiable_text();
	if ( ! str_contains( $chunk, ':)' ) ) {
		continue;
	}

	$processor->set_modifiable_text( str_replace( ':)', '🙂', $chunk ) );
}

return: bool Whether the text was able to update.
param: string $plaintext_content New text content to represent in the matched token.

set_attribute( $name, $value )   X-Ref
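Updates or creates a new attribute on the currently matched tag with the passed value.

For boolean attributes special handling is provided:
- When `true` is passed as the value, then only the attribute name is added to the tag.
- When `false` is passed, the attribute gets removed if it existed before.

return: bool Whether an attribute value was set.
param: string $name The attribute name to target.
param: string|bool $value The new attribute value.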

remove_attribute( $name )   X-Ref
Removes an attribute from the currently matched tag.

return: bool Whether an attribute was removed.
param: string $name The attribute name to remove.

add_class( $class_name )   X-Ref
Adds a new class name to the currently matched tag.

return: bool Whether the class was set to be added.
param: string $class_name The class name to add.

remove_class( $class_name )   X-Ref
Removes a class name from the currently matched tag.

return: bool Whether the class was set to be removed.
param: string $class_name The class name to remove.

__toString()   X-Ref
Returns the string representation of the HTML Tag Processor.

return: string The processed HTML.

get_updated_html()   X-Ref
Returns the string representation of the HTML Tag Processor.

return: string The processed HTML.
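
The attribute and class methods above are typically combined before reading the result back. The following is a minimal sketch, assuming WordPress is loaded; the input markup and the `$updated` variable are made up for illustration.

```php
// Assumes WordPress is loaded; the markup and $updated are illustrative.
$processor = new WP_HTML_Tag_Processor( '<div class="old" hidden><img src="a.png"></div>' );

if ( $processor->next_tag( 'div' ) ) {
	$processor->add_class( 'new' );           // Queue a class addition.
	$processor->remove_class( 'old' );        // Queue a class removal.
	$processor->remove_attribute( 'hidden' ); // Drop the boolean attribute.
}

if ( $processor->next_tag( 'img' ) ) {
	$processor->set_attribute( 'loading', 'lazy' ); // Create a new attribute.
}

// Both calls return the document with all queued changes applied.
$updated   = $processor->get_updated_html();
$as_string = (string) $processor;
```

Each setter returns a bool, so callers that need to know whether a change was accepted can check the return value instead of ignoring it as this sketch does.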

parse_query( $query )   X-Ref
Parses tag query input into internal search criteria.

param: array|string|null $query {

matches()   X-Ref
Checks whether a given tag and its attributes match the search criteria.

return: bool Whether the given tag and its attribute match the search criteria.

get_doctype_info()   X-Ref
Gets DOCTYPE declaration info from a DOCTYPE token.

DOCTYPE tokens may appear in many places in an HTML document. In most places, they are
simply ignored. The main parsing functions find the basic shape of DOCTYPE tokens but
do not perform detailed parsing.

This method can be called to perform a full parse of the DOCTYPE token and retrieve
its information.

return: WP_HTML_Doctype_Info|null The DOCTYPE declaration information or `null` if not
currently at a DOCTYPE node.
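
A hedged sketch of calling this method once a DOCTYPE token is matched follows. It assumes WordPress 6.7 or newer is loaded; the `name` and `indicated_compatibility_mode` properties are taken from the `WP_HTML_Doctype_Info` class in WordPress core rather than from this section, so treat them as assumptions.

```php
// Assumes WordPress 6.7 or newer is loaded. The property names on
// WP_HTML_Doctype_Info come from WordPress core, not from this section.
$processor = new WP_HTML_Tag_Processor( '<!DOCTYPE html><p>' );

$doctype = null;
while ( $processor->next_token() ) {
	if ( '#doctype' === $processor->get_token_type() ) {
		// Perform the detailed parse only once a DOCTYPE token is matched.
		$doctype = $processor->get_doctype_info();
		break;
	}
}

$name = null;
$mode = null;
if ( null !== $doctype ) {
	$name = $doctype->name;                         // Name given in the declaration.
	$mode = $doctype->indicated_compatibility_mode; // Mode implied by the declaration.
}
```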



Generated : Sat Sep 14 08:20:02 2024 Cross-referenced by PHPXref