PHP Cross Reference of WordPress Trunk (Updated Daily)

/wp-includes/html-api/ -> class-wp-html-processor.php (summary)

HTML API: WP_HTML_Processor class

File Size: 1929 lines (62 kb)
Included or required: 0 times
Referenced: 0 times
Includes or requires: 0 files

Defines 1 class

WP_HTML_Processor:: (23 methods):
  create_fragment()
  __construct()
  get_last_error()
  next_tag()
  next_token()
  matches_breadcrumbs()
  step()
  get_breadcrumbs()
  step_in_body()
  bookmark_token()
  get_tag()
  release_bookmark()
  seek()
  set_bookmark()
  has_bookmark()
  close_a_p_element()
  generate_implied_end_tags()
  generate_implied_end_tags_thoroughly()
  reconstruct_active_formatting_elements()
  run_adoption_agency_algorithm()
  insert_html_element()
  is_special()
  is_void()


Class: WP_HTML_Processor  - X-Ref

Core class used to safely parse and modify an HTML document.

The HTML Processor class properly parses and modifies HTML5 documents.

It supports a subset of the HTML5 specification, and when it encounters
unsupported markup, it aborts early to avoid unintentionally breaking
the document. The HTML Processor should never break an HTML document.

While the `WP_HTML_Tag_Processor` is a valuable tool for modifying
attributes on individual HTML tags, the HTML Processor is more capable
and useful for the following operations:

- Querying based on nested HTML structure.

Eventually the HTML Processor will also support:
- Wrapping a tag in surrounding HTML.
- Unwrapping a tag by removing its parent.
- Inserting and removing nodes.
- Reading and changing inner content.
- Navigating up or around HTML structure.

## Usage

Use of this class requires three steps:

1. Call a static creator method with your input HTML document.
2. Find the location in the document you are looking for.
3. Request changes to the document at that location.

Example:

$processor = WP_HTML_Processor::create_fragment( $html );
if ( $processor->next_tag( array( 'breadcrumbs' => array( 'DIV', 'FIGURE', 'IMG' ) ) ) ) {
    $processor->add_class( 'responsive-image' );
}
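
The three steps can also be wrapped in a small helper. The following is a minimal sketch, not part of the class: the function name `sketch_add_responsive_class()` is hypothetical, and the `class_exists()` guard is an assumption added so the snippet also runs outside WordPress, where it returns the input unchanged. It relies on `create_fragment()` returning null on failure, as documented below, and on `get_updated_html()` inherited from `WP_HTML_Tag_Processor`.

```php
<?php
/**
 * Hypothetical helper: adds a class to IMG tags found at DIV > FIGURE > IMG.
 */
function sketch_add_responsive_class( string $html ): string {
    // Assumption: outside WordPress the class is missing, so leave the input alone.
    if ( ! class_exists( 'WP_HTML_Processor' ) ) {
        return $html;
    }

    // Step 1: call the static creator method; null means no processor was created.
    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return $html;
    }

    // Step 2: find the location in the document.
    if ( $processor->next_tag( array( 'breadcrumbs' => array( 'DIV', 'FIGURE', 'IMG' ) ) ) ) {
        // Step 3: request the change at that location.
        $processor->add_class( 'responsive-image' );
    }

    // Serialize the document, including any requested change.
    return $processor->get_updated_html();
}
```

When no tag matches the breadcrumbs, no change is requested and the helper hands back the HTML as it was given.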

#### Breadcrumbs

Breadcrumbs represent the stack of open elements from the root
of the document or fragment down to the currently-matched node,
if one is currently selected. Call WP_HTML_Processor::get_breadcrumbs()
to inspect the breadcrumbs for a matched tag.

Breadcrumbs can specify nested HTML structure and are equivalent
to a CSS selector comprising tag names separated by the child
combinator, such as "DIV > FIGURE > IMG".

Since all elements find themselves inside a full HTML document
when parsed, the return value from `get_breadcrumbs()` will always
contain any implicit outermost elements. For example, when parsing
with `create_fragment()` in the `BODY` context (the default), any
tag in the given HTML document will contain `array( 'HTML', 'BODY', … )`
in its breadcrumbs.

Despite containing the implied outermost elements in their breadcrumbs,
tags may be found with the shortest-matching breadcrumb query. That is,
`array( 'IMG' )` matches all IMG elements and `array( 'P', 'IMG' )`
matches all IMG elements directly inside a P element. To rule out
unintended partial matches, a query may specify the full breadcrumb
path all the way down from the root HTML element.

Example:

$html = '<figure><img><figcaption>A <em>lovely</em> day outside</figcaption></figure>';
//               ----- Matches here.
$processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) );

$html = '<figure><img><figcaption>A <em>lovely</em> day outside</figcaption></figure>';
//                                  ---- Matches here.
$processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'FIGCAPTION', 'EM' ) ) );

$html = '<div><img></div><img>';
//                       ----- Matches here, because IMG must be a direct child of the implicit BODY.
$processor->next_tag( array( 'breadcrumbs' => array( 'BODY', 'IMG' ) ) );

## HTML Support

This class implements a small part of the HTML5 specification.
It's designed to operate within its support and abort early whenever
encountering circumstances it can't properly handle. This is
the principal way in which this class remains as simple as possible
without cutting corners and breaking compliance.

### Supported elements

If any unsupported element appears in the HTML input the HTML Processor
will abort early and stop all processing. This draconian measure ensures
that the HTML Processor won't break any HTML it doesn't fully understand.

The following list specifies the HTML tags that _are_ supported:

- Containers: ADDRESS, BLOCKQUOTE, DETAILS, DIALOG, DIV, FOOTER, HEADER, MAIN, MENU, SPAN, SUMMARY.
- Custom elements: All custom elements are supported. :)
- Form elements: BUTTON, DATALIST, FIELDSET, INPUT, LABEL, LEGEND, METER, PROGRESS, SEARCH.
- Formatting elements: B, BIG, CODE, EM, FONT, I, PRE, SMALL, STRIKE, STRONG, TT, U, WBR.
- Heading elements: H1, H2, H3, H4, H5, H6, HGROUP.
- Links: A.
- Lists: DD, DL, DT, LI, OL, UL.
- Media elements: AUDIO, CANVAS, EMBED, FIGCAPTION, FIGURE, IMG, MAP, PICTURE, SOURCE, TRACK, VIDEO.
- Paragraph: BR, P.
- Phrasing elements: ABBR, AREA, BDI, BDO, CITE, DATA, DEL, DFN, INS, MARK, OUTPUT, Q, SAMP, SUB, SUP, TIME, VAR.
- Sectioning elements: ARTICLE, ASIDE, HR, NAV, SECTION.
- Templating elements: SLOT.
- Text decoration: RUBY.
- Deprecated elements: ACRONYM, BLINK, CENTER, DIR, ISINDEX, KEYGEN, LISTING, MULTICOL, NEXTID, PARAM, SPACER.

### Supported markup

Some kinds of non-normative HTML involve reconstruction of formatting elements and
re-parenting of mis-nested elements. For example, a DIV tag found inside a TABLE
may in fact belong _before_ the table in the DOM. If the HTML Processor encounters
such a case it will stop processing.

The following list specifies HTML markup that _is_ supported:

- Markup involving only those tags listed above.
- Fully-balanced and non-overlapping tags.
- HTML with unexpected tag closers.
- Some unbalanced or overlapping tags.
- P tags after unclosed P tags.
- BUTTON tags after unclosed BUTTON tags.
- A tags after unclosed A tags that don't involve any active formatting elements.

create_fragment( $html, $context = '<body>', $encoding = 'UTF-8' )   X-Ref
Creates an HTML processor in the fragment parsing mode.

Use this for cases where you are processing chunks of HTML that
will be found within a bigger HTML document, such as rendered
block output that exists within a post, or `the_content` inside a
rendered site layout.

Fragment parsing occurs within a context, which is an HTML element
that the document will eventually be placed in. It becomes important
when special elements have different rules than others, such as inside
a TEXTAREA or a TITLE tag where things that look like tags are text,
or inside a SCRIPT tag where things that look like HTML syntax are JS.

The context value should be a representation of the tag in which the
HTML is found. For most cases this will be the body element. The HTML
form is provided because a context element may have attributes that
impact the parse, such as with a SCRIPT tag and its `type` attribute.

## Current HTML Support

- The only supported context is `<body>`, which is the default value.
- The only supported document encoding is `UTF-8`, which is the default value.

return: WP_HTML_Processor|null The created processor if successful, otherwise null.
param: string $html     Input HTML fragment to process.
param: string $context  Context element for the fragment, must be default of `<body>`.
param: string $encoding Text encoding of the document; must be default of 'UTF-8'.

__construct( $html, $use_the_static_create_methods_instead = null )   X-Ref
Constructor.

Do not use this method. Use the static creator methods instead.

param: string      $html                                  HTML to process.
param: string|null $use_the_static_create_methods_instead This constructor should not be called manually.

get_last_error()   X-Ref
Returns the last error, if any.

Various situations lead to parsing failure, but this class returns
`false` in all of those cases. To determine why something failed,
request the last error. This helps distinguish whether a given tag
couldn't be found or whether content in the document caused the
processor to give up and abort processing.

Example:

$processor = WP_HTML_Processor::create_fragment( '<template><strong><button><em><p><em>' );
false === $processor->next_tag();
WP_HTML_Processor::ERROR_UNSUPPORTED === $processor->get_last_error();

return: string|null The last error, if one exists, otherwise null.
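
As a rough illustration of the distinction described above, the sketch below reports whether a tag was found, was simply absent, or could not be reached because the processor aborted. The function name and the status strings are hypothetical, and the `class_exists()` guard is an assumption so that the snippet runs outside WordPress.

```php
<?php
/**
 * Hypothetical helper: classifies the outcome of searching for a tag.
 *
 * Returns one of 'unavailable', 'found', 'aborted', or 'not-found'.
 */
function sketch_find_tag_status( string $html, string $tag_name ): string {
    // Assumption: without WordPress loaded there is nothing to query.
    if ( ! class_exists( 'WP_HTML_Processor' ) ) {
        return 'unavailable';
    }

    $processor = WP_HTML_Processor::create_fragment( $html );
    if ( null === $processor ) {
        return 'unavailable';
    }

    // A single-entry breadcrumb query matches the tag at any depth.
    if ( $processor->next_tag( array( 'breadcrumbs' => array( $tag_name ) ) ) ) {
        return 'found';
    }

    // next_tag() returned false: check whether the processor gave up
    // on unsupported markup or merely ran out of tags.
    if ( WP_HTML_Processor::ERROR_UNSUPPORTED === $processor->get_last_error() ) {
        return 'aborted';
    }

    return 'not-found';
}
```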

next_tag( $query = null )   X-Ref
Finds the next tag matching the $query.

return: bool Whether a tag was matched.
param: array|string|null $query {

next_token()   X-Ref
Ensures internal accounting is maintained for HTML semantic rules while
the underlying Tag Processor class is seeking to a bookmark.

This doesn't currently have a way to represent non-tags and doesn't process
semantic rules for text nodes. For access to the raw tokens consider using
WP_HTML_Tag_Processor instead.

return: bool

matches_breadcrumbs( $breadcrumbs )   X-Ref
Indicates if the currently-matched tag matches the given breadcrumbs.

A "*" is a single-tag wildcard: it matches exactly one tag of any name, and never zero tags.

At some point this function _may_ support a `**` syntax for matching any number
of unspecified tags in the breadcrumb stack. This has been intentionally left
out, however, to keep this function simple and to avoid introducing backtracking,
which could open up surprising performance breakdowns.

Example:

$processor = WP_HTML_Processor::create_fragment( '<div><span><figure><img></figure></span></div>' );
$processor->next_tag( 'img' );
true  === $processor->matches_breadcrumbs( array( 'figure', 'img' ) );
true  === $processor->matches_breadcrumbs( array( 'span', 'figure', 'img' ) );
false === $processor->matches_breadcrumbs( array( 'span', 'img' ) );
true  === $processor->matches_breadcrumbs( array( 'span', '*', 'img' ) );

return: bool Whether the currently-matched tag is found at the given nested structure.
param: string[] $breadcrumbs DOM sub-path at which element is found, e.g. `array( 'FIGURE', 'IMG' )`.
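
The matching rule can be modelled without the class. The sketch below is a standalone approximation, not the class's implementation: the function `sketch_matches_breadcrumbs()` is hypothetical and takes the full breadcrumb path (as `get_breadcrumbs()` would report it) as a plain array. It compares the query against the end of that path, treating `*` as exactly one tag.

```php
<?php
/**
 * Hypothetical model of breadcrumb matching.
 *
 * @param string[] $path  Full path from the root down to the matched tag.
 * @param string[] $query Breadcrumb query; '*' matches any single tag.
 */
function sketch_matches_breadcrumbs( array $path, array $query ): bool {
    $path      = array_values( $path );
    $query     = array_values( $query );
    $path_len  = count( $path );
    $query_len = count( $query );

    // A query longer than the path can never match, because every
    // crumb (including each '*') has to line up with one tag.
    if ( $query_len > $path_len ) {
        return false;
    }

    // Walk both arrays from the matched tag upward toward the root.
    for ( $i = 1; $i <= $query_len; $i++ ) {
        $crumb = strtoupper( $query[ $query_len - $i ] );
        $node  = strtoupper( $path[ $path_len - $i ] );

        if ( '*' !== $crumb && $crumb !== $node ) {
            return false;
        }
    }

    return true;
}

// Path for the IMG in '<div><span><figure><img></figure></span></div>'.
$path = array( 'HTML', 'BODY', 'DIV', 'SPAN', 'FIGURE', 'IMG' );

$direct_parent = sketch_matches_breadcrumbs( $path, array( 'figure', 'img' ) );
$skips_a_level = sketch_matches_breadcrumbs( $path, array( 'span', 'img' ) );
$with_wildcard = sketch_matches_breadcrumbs( $path, array( 'span', '*', 'img' ) );
```

Because the comparison only ever moves upward one step per crumb, there is no backtracking, which is the property the note about `**` is meant to preserve.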

step( $node_to_process = self::PROCESS_NEXT_NODE )   X-Ref
Steps through the HTML document and stops at the next tag, if any.

return: bool Whether a tag was matched.
param: string $node_to_process Whether to parse the next node or reprocess the current node.

get_breadcrumbs()   X-Ref
Computes the HTML breadcrumbs for the currently-matched node, if matched.

Breadcrumbs start at the outermost parent and descend toward the matched element.
They always include the entire path from the root HTML node to the matched element.

return: string[]|null Array of tag names representing path to matched node, if matched, otherwise NULL.

step_in_body()   X-Ref
Parses next element in the 'in body' insertion mode.

This internal function performs the 'in body' insertion mode
logic for the generalized WP_HTML_Processor::step() function.

return: bool Whether an element was found.

bookmark_token()   X-Ref
Creates a new bookmark for the currently-matched token and returns the generated name.

return: string|false Name of created bookmark, or false if unable to create.

get_tag()   X-Ref
Returns the uppercase name of the matched tag.

The semantic rules for HTML specify that certain tags be reprocessed
with a different tag name. Because of this, the tag name presented
by the HTML Processor may differ from the one reported by the HTML
Tag Processor, which doesn't apply these semantic rules.

Example:

$processor = new WP_HTML_Tag_Processor( '<div class="test">Test</div>' );
$processor->next_tag() === true;
$processor->get_tag() === 'DIV';

$processor->next_tag() === false;
$processor->get_tag() === null;

return: string|null Name of currently matched tag in input HTML, or `null` if none found.

release_bookmark( $bookmark_name )   X-Ref
Removes a bookmark that is no longer needed.

Releasing a bookmark frees up the small
performance overhead it requires.

return: bool Whether the bookmark already existed before removal.
param: string $bookmark_name Name of the bookmark to remove.

seek( $bookmark_name )   X-Ref
Moves the internal cursor in the HTML Processor to a given bookmark's location.

Be careful! Seeking backwards to a previous location resets the parser to the
start of the document and reparses the entire contents up until it finds the
sought-after bookmarked location.

In order to prevent accidental infinite loops, there's a
maximum limit on the number of times seek() can be called.

return: bool Whether the internal cursor was successfully moved to the bookmark's location.
param: string $bookmark_name Jump to the place in the document identified by this bookmark name.

set_bookmark( $bookmark_name )   X-Ref
Sets a bookmark in the HTML document.

Bookmarks represent specific places or tokens in the HTML
document, such as a tag opener or closer. When applying
edits to a document, such as setting an attribute, the
text offsets of that token may shift; the bookmark is
kept updated with those shifts and remains stable unless
the entire span of text in which the token sits is removed.

Release bookmarks when they are no longer needed.

Example:

<main><h2>Surprising fact you may not know!</h2></main>
      ^  ^
       \-|-- this `H2` opener bookmark tracks the token

<main class="clickbait"><h2>Surprising fact you may no…
                        ^  ^
                         \-|-- it shifts with edits
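
The shifting shown in the diagram can be modelled in a few lines. This is a simplified standalone sketch, not how the class stores bookmarks: the function `sketch_insert_text()` and the `start`/`length` array layout are assumptions made for illustration, and only insertions are handled.

```php
<?php
/**
 * Hypothetical model: inserts text and shifts every bookmark that
 * starts at or after the insertion point.
 *
 * @return array Two items: the updated HTML and the updated bookmarks.
 */
function sketch_insert_text( string $html, array $bookmarks, int $at, string $text ): array {
    $delta = strlen( $text );

    foreach ( $bookmarks as $name => $bookmark ) {
        // Bookmarks before the edit keep their offsets.
        if ( $bookmark['start'] >= $at ) {
            $bookmarks[ $name ]['start'] += $delta;
        }
    }

    $updated = substr( $html, 0, $at ) . $text . substr( $html, $at );

    return array( $updated, $bookmarks );
}

$html      = '<main><h2>Surprising fact you may not know!</h2></main>';
$bookmarks = array(
    // The H2 opener sits at byte offset 6 and is 4 bytes long.
    'heading' => array( 'start' => 6, 'length' => 4 ),
);

// Add an attribute inside the MAIN opener, just before its closing '>'.
list( $updated_html, $updated_bookmarks ) = sketch_insert_text( $html, $bookmarks, 5, ' class="clickbait"' );

// The bookmark still points at the H2 opener after the edit.
$tracked = substr(
    $updated_html,
    $updated_bookmarks['heading']['start'],
    $updated_bookmarks['heading']['length']
);
```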

Bookmarks provide the ability to seek to a previously-scanned
place in the HTML document. This avoids the need to re-scan
the entire document.

Example:

<ul><li>One</li><li>Two</li><li>Three</li></ul>
                            ^^^^
                            want to note this last item

$p = new WP_HTML_Tag_Processor( $html );
$in_list = false;
while ( $p->next_tag( array( 'tag_closers' => $in_list ? 'visit' : 'skip' ) ) ) {
    if ( 'UL' === $p->get_tag() ) {
        if ( $p->is_tag_closer() ) {
            $in_list = false;
            $p->set_bookmark( 'resume' );
            if ( $p->seek( 'last-li' ) ) {
                $p->add_class( 'last-li' );
            }
            $p->seek( 'resume' );
            $p->release_bookmark( 'last-li' );
            $p->release_bookmark( 'resume' );
        } else {
            $in_list = true;
        }
    }

    if ( 'LI' === $p->get_tag() ) {
        $p->set_bookmark( 'last-li' );
    }
}

Bookmarks intentionally hide the internal string offsets
to which they refer. They are maintained internally as
updates are applied to the HTML document and therefore
retain their "position" - the location to which they
originally pointed. The inability to use bookmarks with
functions like `substr` is therefore intentional to guard
against accidentally breaking the HTML.

Because bookmarks allocate memory and require processing
for every applied update, they are limited and require
a name. They should not be created with programmatically-made
names, such as "li_{$index}" with some loop. As a general
rule they should only be created with string-literal names
like "start-of-section" or "last-paragraph".

Bookmarks are a powerful tool to enable complicated behavior.
Consider double-checking that you need this tool if you are
reaching for it, as inappropriate use could lead to broken
HTML structure or unwanted processing overhead.

return: bool Whether the bookmark was successfully created.
param: string $bookmark_name Identifies this particular bookmark.

has_bookmark( $bookmark_name )   X-Ref
Checks whether a bookmark with the given name exists.

return: bool Whether that bookmark exists.
param: string $bookmark_name Name to identify a bookmark that potentially exists.

close_a_p_element()   X-Ref
Closes a P element.


generate_implied_end_tags( $except_for_this_element = null )   X-Ref
Closes elements that have implied end tags.

param: string|null $except_for_this_element Perform as if this element doesn't exist in the stack of open elements.

generate_implied_end_tags_thoroughly()   X-Ref
Closes elements that have implied end tags, thoroughly.

See the HTML specification for an explanation why this is
different from generating end tags in the normal sense.


reconstruct_active_formatting_elements()   X-Ref
Reconstructs the active formatting elements.

> This has the effect of reopening all the formatting elements that were opened
> in the current body, cell, or caption (whichever is youngest) that haven't
> been explicitly closed.

return: bool Whether any formatting elements needed to be reconstructed.

run_adoption_agency_algorithm()   X-Ref
Runs the adoption agency algorithm.


insert_html_element( $token )   X-Ref
Inserts an HTML element on the stack of open elements.

param: WP_HTML_Token $token Name of bookmark pointing to element in original input HTML.

is_special( $tag_name )   X-Ref
Returns whether an element of a given name is in the HTML special category.

return: bool Whether the element of the given name is in the special category.
param: string $tag_name Name of element to check.

is_void( $tag_name )   X-Ref
Returns whether a given element is an HTML Void Element.

> area, base, br, col, embed, hr, img, input, link, meta, source, track, wbr

return: bool Whether the given tag is an HTML Void Element.
param: string $tag_name Name of HTML tag to check.
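
A lookup of this kind might be sketched as follows. The function `sketch_is_void()` is hypothetical and covers only the names quoted above; the class's own method may recognise additional names, and the case-insensitive comparison here is an assumption.

```php
<?php
/**
 * Hypothetical check against the void element names quoted above.
 */
function sketch_is_void( string $tag_name ): bool {
    static $void_elements = array(
        'AREA', 'BASE', 'BR', 'COL', 'EMBED', 'HR', 'IMG',
        'INPUT', 'LINK', 'META', 'SOURCE', 'TRACK', 'WBR',
    );

    // Assumption: compare in uppercase so 'img' and 'IMG' are treated alike.
    return in_array( strtoupper( $tag_name ), $void_elements, true );
}
```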



Generated : Sun Apr 28 08:20:02 2024 Cross-referenced by PHPXref