PHP Cross Reference of WordPress Trunk (Updated Daily)

/wp-includes/html-api/ -> class-wp-html-tag-processor.php (source)

   1  <?php
   2  /**
   3   * HTML API: WP_HTML_Tag_Processor class
   4   *
   5   * Scans through an HTML document to find specific tags, then
   6   * transforms those tags by adding, removing, or updating the
   7   * values of the HTML attributes within that tag (opener).
   8   *
   9   * Does not fully parse HTML or _recurse_ into the HTML structure.
  10   * Instead this scans linearly through a document and only parses
  11   * the HTML tag openers.
  12   *
  13   * ### Possible future direction for this module
  14   *
  15   *  - Prune the whitespace when removing classes/attributes: e.g. "a b c" -> "c" not " c".
  16   *    This would increase the size of the changes for some operations but leave more
  17   *    natural-looking output HTML.
  18   *  - Properly decode HTML character references in `get_attribute()`. PHP's
  19   *    `html_entity_decode()` is wrong in a couple ways: it doesn't account for the
  20   *    no-ambiguous-ampersand rule, and it improperly handles the way semicolons may
  21   *    or may not terminate a character reference.
  22   *
  23   * @package WordPress
  24   * @subpackage HTML-API
  25   * @since 6.2.0
  26   */
  27  
  28  /**
  29   * Core class used to modify attributes in an HTML document for tags matching a query.
  30   *
  31   * ## Usage
  32   *
  33   * Use of this class requires three steps:
  34   *
  35   *  1. Create a new class instance with your input HTML document.
  36   *  2. Find the tag(s) you are looking for.
  37   *  3. Request changes to the attributes in those tag(s).
  38   *
  39   * Example:
  40   *
  41   *     $tags = new WP_HTML_Tag_Processor( $html );
  42   *     if ( $tags->next_tag( 'option' ) ) {
  43   *         $tags->set_attribute( 'selected', true );
  44   *     }
  45   *
  46   * ### Finding tags
  47   *
  48   * The `next_tag()` function moves the internal cursor through
  49   * your input HTML document until it finds a tag meeting all of
  50   * the supplied restrictions in the optional query argument. If
  51   * no argument is provided then it will find the next HTML tag,
  52   * regardless of what kind it is.
  53   *
  54   * If you want to _find whatever the next tag is_:
  55   *
  56   *     $tags->next_tag();
  57   *
  58   * | Goal                                                      | Query                                                                           |
  59   * |-----------------------------------------------------------|---------------------------------------------------------------------------------|
  60   * | Find any tag.                                             | `$tags->next_tag();`                                                            |
  61   * | Find next image tag.                                      | `$tags->next_tag( array( 'tag_name' => 'img' ) );`                              |
  62   * | Find next image tag (without passing the array).          | `$tags->next_tag( 'img' );`                                                     |
  63   * | Find next tag containing the `fullwidth` CSS class.       | `$tags->next_tag( array( 'class_name' => 'fullwidth' ) );`                      |
  64   * | Find next image tag containing the `fullwidth` CSS class. | `$tags->next_tag( array( 'tag_name' => 'img', 'class_name' => 'fullwidth' ) );` |
  65   *
  66   * If a tag was found meeting your criteria then `next_tag()`
  67   * will return `true` and you can proceed to modify it. If it
  68   * returns `false`, however, it failed to find the tag and
  69   * moved the cursor to the end of the file.
  70   *
  71   * Once the cursor reaches the end of the file the processor
  72   * is done and if you want to reach an earlier tag you will
  73   * need to recreate the processor and start over, as it's
  74   * unable to back up or move in reverse.
  75   *
  76   * See the section on bookmarks for an exception to this
  77   * no-backing-up rule.
  78   *
  79   * #### Custom queries
  80   *
  81   * Sometimes it's necessary to inspect an HTML tag more closely than
  82   * the query syntax here permits. In these cases one may further
  83   * inspect the search results using the read-only functions
  84   * provided by the processor or external state or variables.
  85   *
  86   * Example:
  87   *
  88   *     // Paint up to the first five DIV or SPAN tags marked with the "jazzy" style.
  89   *     $remaining_count = 5;
  90   *     while ( $remaining_count > 0 && $tags->next_tag() ) {
  91   *         if (
  92   *              ( 'DIV' === $tags->get_tag() || 'SPAN' === $tags->get_tag() ) &&
  93   *              'jazzy' === $tags->get_attribute( 'data-style' )
  94   *         ) {
  95   *             $tags->add_class( 'theme-style-everest-jazz' );
  96   *             $remaining_count--;
  97   *         }
  98   *     }
  99   *
 100   * `get_attribute()` will return `null` if the attribute wasn't present
 101   * on the tag when it was called. It may return `""` (the empty string)
 102   * in cases where the attribute was present but its value was empty.
 103   * For boolean attributes, those whose name is present but no value is
 104   * given, it will return `true` (the only way to set `false` for an
 105   * attribute is to remove it).
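       *
       * As a brief illustration of these return values (the input below is
       * an assumed example, not taken from a real document):
       *
       *     $tags = new WP_HTML_Tag_Processor( '<input type="text" value="" disabled>' );
       *     if ( $tags->next_tag( 'input' ) ) {
       *         $tags->get_attribute( 'type' );     // 'text'
       *         $tags->get_attribute( 'value' );    // '' since the attribute is present but empty.
       *         $tags->get_attribute( 'disabled' ); // true since it's a boolean attribute.
       *         $tags->get_attribute( 'name' );     // null since the attribute is missing.
       *     }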
 106   *
 107   * #### When matching fails
 108   *
 109   * When `next_tag()` returns `false` it could mean different things:
 110   *
 111   *  - The requested tag wasn't found in the input document.
 112   *  - The input document ended in the middle of an HTML syntax element.
 113   *
 114   * When a document ends in the middle of a syntax element it will pause
 115   * the processor. This is to make it possible in the future to extend the
 116   * input document and proceed - an important requirement for chunked
 117   * streaming parsing of a document.
 118   *
 119   * Example:
 120   *
 121   *     $processor = new WP_HTML_Tag_Processor( 'This <div is="a" partial="token' );
 122   *     false === $processor->next_tag();
 123   *
 124   * If a special element (see next section) is encountered but no closing tag
 125   * is found it will count as an incomplete tag. The parser will pause as if
 126   * the opening tag were incomplete.
 127   *
 128   * Example:
 129   *
 130   *     $processor = new WP_HTML_Tag_Processor( '<style>// there could be more styling to come' );
 131   *     false === $processor->next_tag();
 132   *
 133   *     $processor = new WP_HTML_Tag_Processor( '<style>// this is everything</style><div>' );
 134   *     true === $processor->next_tag( 'DIV' );
 135   *
 136   * #### Special elements
 137   *
 138   * Some HTML elements are handled in a special way; their start and end tags
 139   * act like a void tag. These are special because their contents can't contain
 140   * HTML markup. Everything inside these elements is handled in a special way
 141   * and content that _appears_ like HTML tags inside of them isn't. There can
 142   * be no nesting in these elements.
 143   *
 144   * In the following list, "raw text" means that all of the content in the HTML
 145   * until the matching closing tag is treated verbatim without any replacements
 146   * and without any parsing.
 147   *
 148   *  - IFRAME allows no content but requires a closing tag.
 149   *  - NOEMBED (deprecated) content is raw text.
 150   *  - NOFRAMES (deprecated) content is raw text.
 151   *  - SCRIPT content is plaintext apart from legacy rules allowing `</script>` inside an HTML comment.
 152   *  - STYLE content is raw text.
 153   *  - TITLE content is plain text but character references are decoded.
 154   *  - TEXTAREA content is plain text but character references are decoded.
 155   *  - XMP (deprecated) content is raw text.
 156   *
 157   * ### Modifying HTML attributes for a found tag
 158   *
 159   * Once you've found the start of an opening tag you can modify
 160   * any number of the attributes on that tag. You can set a new
 161   * value for an attribute, remove the entire attribute, or do
 162   * nothing and move on to the next opening tag.
 163   *
 164   * Example:
 165   *
 166   *     if ( $tags->next_tag( array( 'class_name' => 'wp-group-block' ) ) ) {
 167   *         $tags->set_attribute( 'title', 'This groups the contained content.' );
 168   *         $tags->remove_attribute( 'data-test-id' );
 169   *     }
 170   *
 171   * If `set_attribute()` is called for an existing attribute it will
 172   * overwrite the existing value. Similarly, calling `remove_attribute()`
 173   * for a non-existing attribute has no effect on the document. Both
 174   * of these methods are safe to call without knowing if a given attribute
 175   * exists beforehand.
 176   *
 177   * ### Modifying CSS classes for a found tag
 178   *
 179   * The tag processor treats the `class` attribute as a special case.
 180   * Because it's a common operation to add or remove CSS classes, this
 181   * interface adds helper methods to make that easier.
 182   *
 183   * As with attribute values, adding or removing CSS classes is a safe
 184   * operation that doesn't require checking if the attribute or class
 185   * exists before making changes. If removing the only class then the
 186   * entire `class` attribute will be removed.
 187   *
 188   * Example:
 189   *
 190   *     // from `<span>Yippee!</span>`
 191   *     //   to `<span class="is-active">Yippee!</span>`
 192   *     $tags->add_class( 'is-active' );
 193   *
 194   *     // from `<span class="excited">Yippee!</span>`
 195   *     //   to `<span class="excited is-active">Yippee!</span>`
 196   *     $tags->add_class( 'is-active' );
 197   *
 198   *     // from `<span class="is-active heavy-accent">Yippee!</span>`
 199   *     //   to `<span class="is-active heavy-accent">Yippee!</span>`
 200   *     $tags->add_class( 'is-active' );
 201   *
 202   *     // from `<input type="text" class="is-active rugby not-disabled" length="24">`
  203   *     //   to `<input type="text" class="is-active not-disabled" length="24">`
 204   *     $tags->remove_class( 'rugby' );
 205   *
 206   *     // from `<input type="text" class="rugby" length="24">`
  207   *     //   to `<input type="text" length="24">`
 208   *     $tags->remove_class( 'rugby' );
 209   *
 210   *     // from `<input type="text" length="24">`
  211   *     //   to `<input type="text" length="24">`
 212   *     $tags->remove_class( 'rugby' );
 213   *
 214   * When class changes are enqueued but a direct change to `class` is made via
  215   * `set_attribute` then the changes made through `set_attribute` (or `remove_attribute`)
 216   * will take precedence over those made through `add_class` and `remove_class`.
 217   *
 218   * ### Bookmarks
 219   *
  220   * While scanning through the input HTML document it's possible to set
 221   * a named bookmark when a particular tag is found. Later on, after
 222   * continuing to scan other tags, it's possible to `seek` to one of
 223   * the set bookmarks and then proceed again from that point forward.
 224   *
 225   * Because bookmarks create processing overhead one should avoid
 226   * creating too many of them. As a rule, create only bookmarks
 227   * of known string literal names; avoid creating "mark_{$index}"
 228   * and so on. It's fine from a performance standpoint to create a
 229   * bookmark and update it frequently, such as within a loop.
 230   *
 231   *     $total_todos = 0;
 232   *     while ( $p->next_tag( array( 'tag_name' => 'UL', 'class_name' => 'todo' ) ) ) {
 233   *         $p->set_bookmark( 'list-start' );
 234   *         while ( $p->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
 235   *             if ( 'UL' === $p->get_tag() && $p->is_tag_closer() ) {
 236   *                 $p->set_bookmark( 'list-end' );
 237   *                 $p->seek( 'list-start' );
 238   *                 $p->set_attribute( 'data-contained-todos', (string) $total_todos );
 239   *                 $total_todos = 0;
 240   *                 $p->seek( 'list-end' );
 241   *                 break;
 242   *             }
 243   *
 244   *             if ( 'LI' === $p->get_tag() && ! $p->is_tag_closer() ) {
 245   *                 $total_todos++;
 246   *             }
 247   *         }
 248   *     }
 249   *
 250   * ## Tokens and finer-grained processing.
 251   *
 252   * It's possible to scan through every lexical token in the
 253   * HTML document using the `next_token()` function. This
 254   * alternative form takes no argument and provides no built-in
 255   * query syntax.
 256   *
 257   * Example:
 258   *
 259   *      $title = '(untitled)';
 260   *      $text  = '';
 261   *      while ( $processor->next_token() ) {
 262   *          switch ( $processor->get_token_name() ) {
 263   *              case '#text':
 264   *                  $text .= $processor->get_modifiable_text();
 265   *                  break;
 266   *
 267   *              case 'BR':
 268   *                  $text .= "\n";
 269   *                  break;
 270   *
 271   *              case 'TITLE':
 272   *                  $title = $processor->get_modifiable_text();
 273   *                  break;
 274   *          }
 275   *      }
 276   *      return trim( "# {$title}\n\n{$text}" );
 277   *
 278   * ### Tokens and _modifiable text_.
 279   *
 280   * #### Special "atomic" HTML elements.
 281   *
 282   * Not all HTML elements are able to contain other elements inside of them.
 283   * For instance, the contents inside a TITLE element are plaintext (except
 284   * that character references like &amp; will be decoded). This means that
 285   * if the string `<img>` appears inside a TITLE element, then it's not an
 286   * image tag, but rather it's text describing an image tag. Likewise, the
  287   * contents of a SCRIPT or STYLE element are handled entirely differently in
 288   * a browser than the contents of other elements because they represent a
 289   * different language than HTML.
 290   *
 291   * For these elements the Tag Processor treats the entire sequence as one,
 292   * from the opening tag, including its contents, through its closing tag.
  293   * This means that it's not possible to match the closing tag for a
 294   * SCRIPT element unless it's unexpected; the Tag Processor already matched
 295   * it when it found the opening tag.
 296   *
 297   * The inner contents of these elements are that element's _modifiable text_.
 298   *
 299   * The special elements are:
 300   *  - `SCRIPT` whose contents are treated as raw plaintext but supports a legacy
  301   *    style of including JavaScript inside of HTML comments to avoid accidentally
  302   *    closing the SCRIPT from inside a JavaScript string. E.g. `console.log( '</script>' )`.
 303   *  - `TITLE` and `TEXTAREA` whose contents are treated as plaintext and then any
 304   *    character references are decoded. E.g. `1 &lt; 2 < 3` becomes `1 < 2 < 3`.
  305   *  - `IFRAME`, `NOEMBED`, `NOFRAMES`, `STYLE`, `XMP` whose contents are treated as
 306   *    raw plaintext and left as-is. E.g. `1 &lt; 2 < 3` remains `1 &lt; 2 < 3`.
 307   *
 308   * #### Other tokens with modifiable text.
 309   *
 310   * There are also non-elements which are void/self-closing in nature and contain
 311   * modifiable text that is part of that individual syntax token itself.
 312   *
 313   *  - `#text` nodes, whose entire token _is_ the modifiable text.
 314   *  - HTML comments and tokens that become comments due to some syntax error. The
 315   *    text for these tokens is the portion of the comment inside of the syntax.
 316   *    E.g. for `<!-- comment -->` the text is `" comment "` (note the spaces are included).
 317   *  - `CDATA` sections, whose text is the content inside of the section itself. E.g. for
 318   *    `<![CDATA[some content]]>` the text is `"some content"` (with restrictions [1]).
 319   *  - "Funky comments," which are a special case of invalid closing tags whose name is
 320   *    invalid. The text for these nodes is the text that a browser would transform into
 321   *    an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`.
  322   *  - `DOCTYPE` declarations like `<!DOCTYPE html>` which have no closing tag.
 323   *  - XML Processing instruction nodes like `<?wp __( "Like" ); ?>` (with restrictions [2]).
 324   *  - The empty end tag `</>` which is ignored in the browser and DOM.
 325   *
 326   * [1]: There are no CDATA sections in HTML. When encountering `<![CDATA[`, everything
 327   *      until the next `>` becomes a bogus HTML comment, meaning there can be no CDATA
 328   *      section in an HTML document containing `>`. The Tag Processor will first find
 329   *      all valid and bogus HTML comments, and then if the comment _would_ have been a
 330   *      CDATA section _were they to exist_, it will indicate this as the type of comment.
 331   *
 332   * [2]: XML allows a broader range of characters in a processing instruction's target name
 333   *      and disallows "xml" as a name, since it's special. The Tag Processor only recognizes
 334   *      target names with an ASCII-representable subset of characters. It also exhibits the
 335   *      same constraint as with CDATA sections, in that `>` cannot exist within the token
  336   *      since Processing Instructions do not exist within HTML and their syntax transforms
 337   *      into a bogus comment in the DOM.
 338   *
 339   * ## Design and limitations
 340   *
 341   * The Tag Processor is designed to linearly scan HTML documents and tokenize
 342   * HTML tags and their attributes. It's designed to do this as efficiently as
 343   * possible without compromising parsing integrity. Therefore it will be
 344   * slower than some methods of modifying HTML, such as those incorporating
 345   * over-simplified PCRE patterns, but will not introduce the defects and
 346   * failures that those methods bring in, which lead to broken page renders
 347   * and often to security vulnerabilities. On the other hand, it will be faster
 348   * than full-blown HTML parsers such as DOMDocument and use considerably
 349   * less memory. It requires a negligible memory overhead, enough to consider
 350   * it a zero-overhead system.
 351   *
 352   * The performance characteristics are maintained by avoiding tree construction
 353   * and semantic cleanups which are specified in HTML5. Because of this, for
 354   * example, it's not possible for the Tag Processor to associate any given
 355   * opening tag with its corresponding closing tag, or to return the inner markup
 356   * inside an element. Systems may be built on top of the Tag Processor to do
 357   * this, but the Tag Processor is and should be constrained so it can remain an
 358   * efficient, low-level, and reliable HTML scanner.
 359   *
 360   * The Tag Processor's design incorporates a "garbage-in-garbage-out" philosophy.
 361   * HTML5 specifies that certain invalid content be transformed into different forms
 362   * for display, such as removing null bytes from an input document and replacing
 363   * invalid characters with the Unicode replacement character `U+FFFD` (visually "�").
 364   * Where errors or transformations exist within the HTML5 specification, the Tag Processor
 365   * leaves those invalid inputs untouched, passing them through to the final browser
 366   * to handle. While this implies that certain operations will be non-spec-compliant,
 367   * such as reading the value of an attribute with invalid content, it also preserves a
 368   * simplicity and efficiency for handling those error cases.
 369   *
 370   * Most operations within the Tag Processor are designed to minimize the difference
 371   * between an input and output document for any given change. For example, the
 372   * `add_class` and `remove_class` methods preserve whitespace and the class ordering
 373   * within the `class` attribute; and when encountering tags with duplicated attributes,
 374   * the Tag Processor will leave those invalid duplicate attributes where they are but
 375   * update the proper attribute which the browser will read for parsing its value. An
 376   * exception to this rule is that all attribute updates store their values as
 377   * double-quoted strings, meaning that attributes on input with single-quoted or
 378   * unquoted values will appear in the output with double-quotes.
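       *
       * A small sketch of that exception, assuming the processor has already
       * matched the tag shown in the comment (the markup is illustrative only):
       *
       *     // from `<div id=main class='wide'>`
       *     //   to `<div id="primary" class='wide'>`
       *     $tags->set_attribute( 'id', 'primary' );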
 379   *
 380   * ### Scripting Flag
 381   *
 382   * The Tag Processor parses HTML with the "scripting flag" disabled. This means
 383   * that it doesn't run any scripts while parsing the page. In a browser with
 384   * JavaScript enabled, for example, the script can change the parse of the
 385   * document as it loads. On the server, however, evaluating JavaScript is not
 386   * only impractical, but also unwanted.
 387   *
 388   * Practically this means that the Tag Processor will descend into NOSCRIPT
  389   * elements and process their child tags. Were the scripting flag enabled, such
  390   * as in a typical browser, the contents of NOSCRIPT would be skipped entirely.
 391   *
 392   * This allows the HTML API to process the content that will be presented in
 393   * a browser when scripting is disabled, but it offers a different view of a
 394   * page than most browser sessions will experience. E.g. the tags inside the
 395   * NOSCRIPT disappear.
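       *
       * For instance, under the behavior described above one would expect the
       * following to match (the markup is an assumed example):
       *
       *     $processor = new WP_HTML_Tag_Processor( '<noscript><img src="fallback.png"></noscript>' );
       *     true === $processor->next_tag( 'IMG' );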
 396   *
 397   * ### Text Encoding
 398   *
 399   * The Tag Processor assumes that the input HTML document is encoded with a
 400   * text encoding compatible with 7-bit ASCII's '<', '>', '&', ';', '/', '=',
 401   * "'", '"', 'a' - 'z', 'A' - 'Z', and the whitespace characters ' ', tab,
 402   * carriage-return, newline, and form-feed.
 403   *
 404   * In practice, this includes almost every single-byte encoding as well as
 405   * UTF-8. Notably, however, it does not include UTF-16. If providing input
 406   * that's incompatible, then convert the encoding beforehand.
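       *
       * A minimal sketch of such a conversion, assuming the `mbstring` extension
       * is available and that `$utf16_html` (a hypothetical variable) holds
       * UTF-16LE encoded input:
       *
       *     $html = mb_convert_encoding( $utf16_html, 'UTF-8', 'UTF-16LE' );
       *     $tags = new WP_HTML_Tag_Processor( $html );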
 407   *
 408   * @since 6.2.0
 409   * @since 6.2.1 Fix: Support for various invalid comments; attribute updates are case-insensitive.
 410   * @since 6.3.2 Fix: Skip HTML-like content inside rawtext elements such as STYLE.
 411   * @since 6.5.0 Pauses processor when input ends in an incomplete syntax token.
 412   *              Introduces "special" elements which act like void elements, e.g. TITLE, STYLE.
 413   *              Allows scanning through all tokens and processing modifiable text, where applicable.
 414   */
 415  class WP_HTML_Tag_Processor {
 416      /**
 417       * The maximum number of bookmarks allowed to exist at
 418       * any given time.
 419       *
 420       * @since 6.2.0
 421       * @var int
 422       *
 423       * @see WP_HTML_Tag_Processor::set_bookmark()
 424       */
 425      const MAX_BOOKMARKS = 10;
 426  
 427      /**
 428       * Maximum number of times seek() can be called.
 429       * Prevents accidental infinite loops.
 430       *
 431       * @since 6.2.0
 432       * @var int
 433       *
 434       * @see WP_HTML_Tag_Processor::seek()
 435       */
 436      const MAX_SEEK_OPS = 1000;
 437  
 438      /**
 439       * The HTML document to parse.
 440       *
 441       * @since 6.2.0
 442       * @var string
 443       */
 444      protected $html;
 445  
 446      /**
 447       * The last query passed to next_tag().
 448       *
 449       * @since 6.2.0
 450       * @var array|null
 451       */
 452      private $last_query;
 453  
 454      /**
 455       * The tag name this processor currently scans for.
 456       *
 457       * @since 6.2.0
 458       * @var string|null
 459       */
 460      private $sought_tag_name;
 461  
 462      /**
 463       * The CSS class name this processor currently scans for.
 464       *
 465       * @since 6.2.0
 466       * @var string|null
 467       */
 468      private $sought_class_name;
 469  
 470      /**
 471       * The match offset this processor currently scans for.
 472       *
 473       * @since 6.2.0
 474       * @var int|null
 475       */
 476      private $sought_match_offset;
 477  
 478      /**
 479       * Whether to visit tag closers, e.g. </div>, when walking an input document.
 480       *
 481       * @since 6.2.0
 482       * @var bool
 483       */
 484      private $stop_on_tag_closers;
 485  
 486      /**
 487       * Specifies mode of operation of the parser at any given time.
 488       *
 489       * | State           | Meaning                                                              |
 490       * | ----------------|----------------------------------------------------------------------|
 491       * | *Ready*         | The parser is ready to run.                                          |
 492       * | *Complete*      | There is nothing left to parse.                                      |
 493       * | *Incomplete*    | The HTML ended in the middle of a token; nothing more can be parsed. |
 494       * | *Matched tag*   | Found an HTML tag; it's possible to modify its attributes.           |
 495       * | *Text node*     | Found a #text node; this is plaintext and modifiable.                |
 496       * | *CDATA node*    | Found a CDATA section; this is modifiable.                           |
 497       * | *Comment*       | Found a comment or bogus comment; this is modifiable.                |
 498       * | *Presumptuous*  | Found an empty tag closer: `</>`.                                    |
 499       * | *Funky comment* | Found a tag closer with an invalid tag name; this is modifiable.     |
 500       *
 501       * @since 6.5.0
 502       *
 503       * @see WP_HTML_Tag_Processor::STATE_READY
 504       * @see WP_HTML_Tag_Processor::STATE_COMPLETE
 505       * @see WP_HTML_Tag_Processor::STATE_INCOMPLETE_INPUT
 506       * @see WP_HTML_Tag_Processor::STATE_MATCHED_TAG
 507       * @see WP_HTML_Tag_Processor::STATE_TEXT_NODE
 508       * @see WP_HTML_Tag_Processor::STATE_CDATA_NODE
 509       * @see WP_HTML_Tag_Processor::STATE_COMMENT
 510       * @see WP_HTML_Tag_Processor::STATE_DOCTYPE
 511       * @see WP_HTML_Tag_Processor::STATE_PRESUMPTUOUS_TAG
 512       * @see WP_HTML_Tag_Processor::STATE_FUNKY_COMMENT
 513       *
 514       * @var string
 515       */
 516      protected $parser_state = self::STATE_READY;
 517  
 518      /**
 519       * What kind of syntax token became an HTML comment.
 520       *
 521       * Since there are many ways in which HTML syntax can create an HTML comment,
 522       * this indicates which of those caused it. This allows the Tag Processor to
 523       * represent more from the original input document than would appear in the DOM.
 524       *
 525       * @since 6.5.0
 526       *
 527       * @var string|null
 528       */
 529      protected $comment_type = null;
 530  
 531      /**
 532       * How many bytes from the original HTML document have been read and parsed.
 533       *
 534       * This value points to the latest byte offset in the input document which
 535       * has been already parsed. It is the internal cursor for the Tag Processor
 536       * and updates while scanning through the HTML tokens.
 537       *
 538       * @since 6.2.0
 539       * @var int
 540       */
 541      private $bytes_already_parsed = 0;
 542  
 543      /**
 544       * Byte offset in input document where current token starts.
 545       *
 546       * Example:
 547       *
 548       *     <div id="test">...
 549       *     01234
 550       *     - token starts at 0
 551       *
 552       * @since 6.5.0
 553       *
 554       * @var int|null
 555       */
 556      private $token_starts_at;
 557  
 558      /**
 559       * Byte length of current token.
 560       *
 561       * Example:
 562       *
 563       *     <div id="test">...
 564       *     012345678901234
  565       *     - token length is 14 - 0 + 1 = 15
 566       *
 567       *     a <!-- comment --> is a token.
 568       *     0123456789 123456789 123456789
  569       *     - token length is 17 - 2 + 1 = 16
 570       *
 571       * @since 6.5.0
 572       *
 573       * @var int|null
 574       */
 575      private $token_length;
 576  
 577      /**
 578       * Byte offset in input document where current tag name starts.
 579       *
 580       * Example:
 581       *
 582       *     <div id="test">...
 583       *     01234
 584       *      - tag name starts at 1
 585       *
 586       * @since 6.2.0
 587       *
 588       * @var int|null
 589       */
 590      private $tag_name_starts_at;
 591  
 592      /**
 593       * Byte length of current tag name.
 594       *
 595       * Example:
 596       *
 597       *     <div id="test">...
 598       *     01234
 599       *      --- tag name length is 3
 600       *
 601       * @since 6.2.0
 602       *
 603       * @var int|null
 604       */
 605      private $tag_name_length;
 606  
 607      /**
 608       * Byte offset into input document where current modifiable text starts.
 609       *
 610       * @since 6.5.0
 611       *
 612       * @var int
 613       */
 614      private $text_starts_at;
 615  
 616      /**
 617       * Byte length of modifiable text.
 618       *
 619       * @since 6.5.0
 620       *
  621       * @var int
 622       */
 623      private $text_length;
 624  
 625      /**
 626       * Whether the current tag is an opening tag, e.g. <div>, or a closing tag, e.g. </div>.
 627       *
 628       * @var bool
 629       */
 630      private $is_closing_tag;
 631  
 632      /**
 633       * Lazily-built index of attributes found within an HTML tag, keyed by the attribute name.
 634       *
 635       * Example:
 636       *
 637       *     // Supposing the parser is working through this content
 638       *     // and stops after recognizing the `id` attribute.
 639       *     // <div id="test-4" class=outline title="data:text/plain;base64=asdk3nk1j3fo8">
 640       *     //                 ^ parsing will continue from this point.
 641       *     $this->attributes = array(
 642       *         'id' => new WP_HTML_Attribute_Token( 'id', 9, 6, 5, 11, false )
 643       *     );
 644       *
 645       *     // When picking up parsing again, or when asking to find the
 646       *     // `class` attribute we will continue and add to this array.
 647       *     $this->attributes = array(
 648       *         'id'    => new WP_HTML_Attribute_Token( 'id', 9, 6, 5, 11, false ),
 649       *         'class' => new WP_HTML_Attribute_Token( 'class', 23, 7, 17, 13, false )
 650       *     );
 651       *
 652       *     // Note that only the `class` attribute value is stored in the index.
 653       *     // That's because it is the only value used by this class at the moment.
 654       *
 655       * @since 6.2.0
 656       * @var WP_HTML_Attribute_Token[]
 657       */
 658      private $attributes = array();
 659  
 660      /**
 661       * Tracks spans of duplicate attributes on a given tag, used for removing
 662       * all copies of an attribute when calling `remove_attribute()`.
 663       *
 664       * @since 6.3.2
 665       *
 666       * @var (WP_HTML_Span[])[]|null
 667       */
 668      private $duplicate_attributes = null;
 669  
 670      /**
 671       * Which class names to add or remove from a tag.
 672       *
 673       * These are tracked separately from attribute updates because they are
 674       * semantically distinct, whereas this interface exists for the common
 675       * case of adding and removing class names while other attributes are
 676       * generally modified as with DOM `setAttribute` calls.
 677       *
 678       * When modifying an HTML document these will eventually be collapsed
 679       * into a single `set_attribute( 'class', $changes )` call.
 680       *
 681       * Example:
 682       *
 683       *     // Add the `wp-block-group` class, remove the `wp-group` class.
 684       *     $classname_updates = array(
 685       *         // Indexed by a comparable class name.
 686       *         'wp-block-group' => WP_HTML_Tag_Processor::ADD_CLASS,
 687       *         'wp-group'       => WP_HTML_Tag_Processor::REMOVE_CLASS
 688       *     );
 689       *
 690       * @since 6.2.0
 691       * @var bool[]
 692       */
 693      private $classname_updates = array();
 694  
 695      /**
 696       * Tracks a semantic location in the original HTML which
 697       * shifts with updates as they are applied to the document.
 698       *
 699       * @since 6.2.0
 700       * @var WP_HTML_Span[]
 701       */
 702      protected $bookmarks = array();
 703  
 704      const ADD_CLASS    = true;
 705      const REMOVE_CLASS = false;
 706      const SKIP_CLASS   = null;
 707  
 708      /**
 709       * Lexical replacements to apply to input HTML document.
 710       *
 711       * "Lexical" in this class refers to the part of this class which
 712       * operates on pure text _as text_ and not as HTML. There's a line
 713       * between the public interface, with HTML-semantic methods like
 714       * `set_attribute` and `add_class`, and an internal state that tracks
 715       * text offsets in the input document.
 716       *
 717       * When higher-level HTML methods are called, those have to transform their
 718       * operations (such as setting an attribute's value) into text diffing
 719       * operations (such as replacing the sub-string from indices A to B with
 720       * some given new string). These text-diffing operations are the lexical
 721       * updates.
 722       *
 723       * As new higher-level methods are added they need to collapse their
 724       * operations into these lower-level lexical updates since that's the
 725       * Tag Processor's internal language of change. Any code which creates
 726       * these lexical updates must ensure that they do not cross HTML syntax
 727       * boundaries, however, so these should never be exposed outside of this
 728       * class or any classes which intentionally expand its functionality.
 729       *
 730       * These are enqueued while editing the document instead of being immediately
 731       * applied to avoid processing overhead, string allocations, and string
 732       * copies when applying many updates to a single document.
 733       *
 734       * Example:
 735       *
 736       *     // Replace an attribute stored with a new value, indices
 737       *     // sourced from the lazily-parsed HTML recognizer.
 738       *     $start  = $attributes['src']->start;
 739       *     $length = $attributes['src']->length;
 740       *     $modifications[] = new WP_HTML_Text_Replacement( $start, $length, $new_value );
 741       *
 742       *     // Correspondingly, something like this will appear in this array.
 743       *     $lexical_updates = array(
 744       *         new WP_HTML_Text_Replacement( 14, 28, 'https://my-site.my-domain/wp-content/uploads/2014/08/kittens.jpg' )
 745       *     );
 746       *
 747       * @since 6.2.0
 748       * @var WP_HTML_Text_Replacement[]
 749       */
 750      protected $lexical_updates = array();
 751  
 752      /**
 753       * Tracks and limits `seek()` calls to prevent accidental infinite loops.
 754       *
 755       * @since 6.2.0
 756       * @var int
 757       *
 758       * @see WP_HTML_Tag_Processor::seek()
 759       */
 760      protected $seek_count = 0;
 761  
 762      /**
 763       * Constructor.
 764       *
 765       * @since 6.2.0
 766       *
 767       * @param string $html HTML to process.
 768       */
 769  	public function __construct( $html ) {
 770          $this->html = $html;
 771      }
 772  
 773      /**
 774       * Finds the next tag matching the $query.
 775       *
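           * Example (a minimal sketch; the input HTML, the `data-second` attribute,
           * and the `$p` variable are illustrative assumptions, not taken from core):
           *
           *     $p = new WP_HTML_Tag_Processor( '<p class="a">1</p><p class="b">2</p><p class="a">3</p>' );
           *
           *     // Ask for the second P tag which contains the `a` class name.
           *     if ( $p->next_tag( array( 'tag_name' => 'p', 'class_name' => 'a', 'match_offset' => 2 ) ) ) {
           *         $p->set_attribute( 'data-second', true );
           *     }
           *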
 776       * @since 6.2.0
 777       * @since 6.5.0 No longer processes incomplete tokens at end of document; pauses the processor at start of token.
 778       *
 779       * @param array|string|null $query {
 780       *     Optional. Which tag name to find, having which class, etc. Default is to find any tag.
 781       *
 782       *     @type string|null $tag_name     Which tag to find, or `null` for "any tag."
 783       *     @type int|null    $match_offset Find the Nth tag matching all search criteria.
 784       *                                     1 for "first" tag, 3 for "third," etc.
 785       *                                     Defaults to first tag.
 786       *     @type string|null $class_name   Tag must contain this whole class name to match.
 787       *     @type string|null $tag_closers  "visit" or "skip": whether to stop on tag closers, e.g. </div>.
 788       * }
 789       * @return bool Whether a tag was matched.
 790       */
 791  	public function next_tag( $query = null ) {
 792          $this->parse_query( $query );
 793          $already_found = 0;
 794  
 795          do {
 796              if ( false === $this->next_token() ) {
 797                  return false;
 798              }
 799  
 800              if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
 801                  continue;
 802              }
 803  
 804              if ( $this->matches() ) {
 805                  ++$already_found;
 806              }
 807          } while ( $already_found < $this->sought_match_offset );
 808  
 809          return true;
 810      }
 811  
 812      /**
 813       * Finds the next token in the HTML document.
 814       *
 815       * An HTML document can be viewed as a stream of tokens,
 816       * where tokens are things like HTML tags, HTML comments,
 817       * text nodes, etc. This method finds the next token in
 818       * the HTML document and returns whether it found one.
 819       *
 820       * If it starts parsing a token and reaches the end of the
 821       * document then it will seek to the start of the last
 822       * token and pause, returning `false` to indicate that it
 823       * failed to find a complete token.
 824       *
 825       * Possible token types, based on the HTML specification:
 826       *
 827       *  - an HTML tag, whether opening, closing, or void.
 828       *  - a text node - the plaintext inside tags.
 829       *  - an HTML comment.
 830       *  - a DOCTYPE declaration.
 831       *  - a processing instruction, e.g. `<?xml version="1.0" ?>`.
 832       *
 833       * The Tag Processor currently only supports the tag token.
 834       *
 835       * @since 6.5.0
 836       *
 837       * @return bool Whether a token was parsed.
 838       */
 839  	public function next_token() {
 840          return $this->base_class_next_token();
 841      }
 842  
 843      /**
 844       * Internal method which finds the next token in the HTML document.
 845       *
 846       * This method is a private internal function which implements the logic for
 847       * finding the next token in a document. It exists so that the parser can update
 848       * its state without affecting the location of the cursor in the document and
 849       * without triggering subclass methods for things like `next_token()`, e.g. when
 850       * applying patches before searching for the next token.
 851       *
 852       * @since 6.5.0
 853       *
 854       * @access private
 855       *
 856       * @return bool Whether a token was parsed.
 857       */
 858  	private function base_class_next_token() {
 859          $was_at = $this->bytes_already_parsed;
 860          $this->after_tag();
 861  
 862          // Don't proceed if there's nothing more to scan.
 863          if (
 864              self::STATE_COMPLETE === $this->parser_state ||
 865              self::STATE_INCOMPLETE_INPUT === $this->parser_state
 866          ) {
 867              return false;
 868          }
 869  
 870          /*
 871           * The next step in the parsing loop determines the parsing state;
 872           * clear it so that state doesn't linger from the previous step.
 873           */
 874          $this->parser_state = self::STATE_READY;
 875  
 876          if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
 877              $this->parser_state = self::STATE_COMPLETE;
 878              return false;
 879          }
 880  
 881          // Find the next tag if it exists.
 882          if ( false === $this->parse_next_tag() ) {
 883              if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
 884                  $this->bytes_already_parsed = $was_at;
 885              }
 886  
 887              return false;
 888          }
 889  
 890          /*
 891           * For legacy reasons the rest of this function handles tags and their
 892           * attributes. If the processor has reached the end of the document
 893           * or if it matched any other token then it should return here to avoid
 894           * attempting to process tag-specific syntax.
 895           */
 896          if (
 897              self::STATE_INCOMPLETE_INPUT !== $this->parser_state &&
 898              self::STATE_COMPLETE !== $this->parser_state &&
 899              self::STATE_MATCHED_TAG !== $this->parser_state
 900          ) {
 901              return true;
 902          }
 903  
 904          // Parse all of its attributes.
 905          while ( $this->parse_next_attribute() ) {
 906              continue;
 907          }
 908  
 909          // Ensure that the tag closes before the end of the document.
 910          if (
 911              self::STATE_INCOMPLETE_INPUT === $this->parser_state ||
 912              $this->bytes_already_parsed >= strlen( $this->html )
 913          ) {
 914              // Does this appropriately clear state (parsed attributes)?
 915              $this->parser_state         = self::STATE_INCOMPLETE_INPUT;
 916              $this->bytes_already_parsed = $was_at;
 917  
 918              return false;
 919          }
 920  
 921          $tag_ends_at = strpos( $this->html, '>', $this->bytes_already_parsed );
 922          if ( false === $tag_ends_at ) {
 923              $this->parser_state         = self::STATE_INCOMPLETE_INPUT;
 924              $this->bytes_already_parsed = $was_at;
 925  
 926              return false;
 927          }
 928          $this->parser_state         = self::STATE_MATCHED_TAG;
 929          $this->token_length         = $tag_ends_at - $this->token_starts_at;
 930          $this->bytes_already_parsed = $tag_ends_at + 1;
 931  
 932          /*
 933           * For non-DATA sections which might contain text that looks like HTML tags but
 934           * isn't, scan with the appropriate alternative mode. Looking at the first letter
 935           * of the tag name as a pre-check avoids a string allocation when it's not needed.
 936           */
 937          $t = $this->html[ $this->tag_name_starts_at ];
 938          if (
 939              $this->is_closing_tag ||
 940              ! (
 941                  'i' === $t || 'I' === $t ||
 942                  'n' === $t || 'N' === $t ||
 943                  's' === $t || 'S' === $t ||
 944                  't' === $t || 'T' === $t ||
 945                  'x' === $t || 'X' === $t
 946              )
 947          ) {
 948              return true;
 949          }
 950  
 951          $tag_name = $this->get_tag();
 952  
 953          /*
 954           * Preserve the opening tag pointers, as these will be overwritten
 955           * when finding the closing tag. They will be reset after finding
 956           * the closing tag to point to the opening of the special atomic
 957           * tag sequence.
 958           */
 959          $tag_name_starts_at   = $this->tag_name_starts_at;
 960          $tag_name_length      = $this->tag_name_length;
 961          $tag_ends_at          = $this->token_starts_at + $this->token_length;
 962          $attributes           = $this->attributes;
 963          $duplicate_attributes = $this->duplicate_attributes;
 964  
 965          // Find the closing tag if necessary.
 966          $found_closer = false;
 967          switch ( $tag_name ) {
 968              case 'SCRIPT':
 969                  $found_closer = $this->skip_script_data();
 970                  break;
 971  
 972              case 'TEXTAREA':
 973              case 'TITLE':
 974                  $found_closer = $this->skip_rcdata( $tag_name );
 975                  break;
 976  
 977              /*
 978               * In the browser this list would include the NOSCRIPT element,
 979               * but the Tag Processor is an environment with the scripting
 980               * flag disabled, meaning that it needs to descend into the
 981               * NOSCRIPT element to be able to properly process what will be
 982               * sent to a browser.
 983               *
 984               * Note that this rule makes HTML5 syntax incompatible with XML,
 985               * because the parsing of this token depends on client application.
 986               * The NOSCRIPT element cannot be represented in the XHTML syntax.
 987               */
 988              case 'IFRAME':
 989              case 'NOEMBED':
 990              case 'NOFRAMES':
 991              case 'STYLE':
 992              case 'XMP':
 993                  $found_closer = $this->skip_rawtext( $tag_name );
 994                  break;
 995  
 996              // No other tags should be treated in their entirety here.
 997              default:
 998                  return true;
 999          }
1000  
1001          if ( ! $found_closer ) {
1002              $this->parser_state         = self::STATE_INCOMPLETE_INPUT;
1003              $this->bytes_already_parsed = $was_at;
1004              return false;
1005          }
1006  
1007          /*
1008           * The values here look like they reference the opening tag but they reference
1009           * the closing tag instead. This is why the opening tag values were stored
1010           * above in a variable. It reads confusingly here, but that's because the
1011           * functions that skip the contents have moved all the internal cursors past
1012           * the inner content of the tag.
1013           */
1014          $this->token_starts_at      = $was_at;
1015          $this->token_length         = $this->bytes_already_parsed - $this->token_starts_at;
1016          $this->text_starts_at       = $tag_ends_at + 1;
1017          $this->text_length          = $this->tag_name_starts_at - $this->text_starts_at;
1018          $this->tag_name_starts_at   = $tag_name_starts_at;
1019          $this->tag_name_length      = $tag_name_length;
1020          $this->attributes           = $attributes;
1021          $this->duplicate_attributes = $duplicate_attributes;
1022  
1023          return true;
1024      }
1025  
1026      /**
1027       * Whether the processor paused because the input HTML document ended
1028       * in the middle of a syntax element, such as in the middle of a tag.
1029       *
1030       * Example:
1031       *
1032       *     $processor = new WP_HTML_Tag_Processor( '<input type="text" value="Th' );
1033       *     false      === $processor->next_tag();
1034       *     true       === $processor->paused_at_incomplete_token();
1035       *
1036       * @since 6.5.0
1037       *
1038       * @return bool Whether the parse paused at the start of an incomplete token.
1039       */
1040  	public function paused_at_incomplete_token() {
1041          return self::STATE_INCOMPLETE_INPUT === $this->parser_state;
1042      }
1043  
1044      /**
1045       * Generator for a foreach loop to step through each class name for the matched tag.
1046       *
1047       * This generator function is designed to be used inside a "foreach" loop.
1048       *
1049       * Example:
1050       *
1051       *     $p = new WP_HTML_Tag_Processor( "<div class='free &lt;egg&gt;\tlang-en'>" );
1052       *     $p->next_tag();
1053       *     foreach ( $p->class_list() as $class_name ) {
1054       *         echo "{$class_name} ";
1055       *     }
1056       *     // Outputs: "free <egg> lang-en "
1057       *
1058       * @since 6.4.0
1059       */
1060  	public function class_list() {
1061          if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
1062              return;
1063          }
1064  
1065          /** @var string $class contains the string value of the class attribute, with character references decoded. */
1066          $class = $this->get_attribute( 'class' );
1067  
1068          if ( ! is_string( $class ) ) {
1069              return;
1070          }
1071  
1072          $seen = array();
1073  
1074          $at = 0;
1075          while ( $at < strlen( $class ) ) {
1076              // Skip past any initial boundary characters.
1077              $at += strspn( $class, " \t\f\r\n", $at );
1078              if ( $at >= strlen( $class ) ) {
1079                  return;
1080              }
1081  
1082              // Find the byte length until the next boundary.
1083              $length = strcspn( $class, " \t\f\r\n", $at );
1084              if ( 0 === $length ) {
1085                  return;
1086              }
1087  
1088              /*
1089               * CSS class names are case-insensitive in the ASCII range.
1090               *
1091               * @see https://www.w3.org/TR/CSS2/syndata.html#x1
1092               */
1093              $name = strtolower( substr( $class, $at, $length ) );
1094              $at  += $length;
1095  
1096              /*
1097               * It's expected that the number of class names for a given tag is relatively small.
1098               * Given this, it is probably faster overall to scan an array for a value rather
1099               * than to use the class name as a key and check if it's a key of $seen.
1100               */
1101              if ( in_array( $name, $seen, true ) ) {
1102                  continue;
1103              }
1104  
1105              $seen[] = $name;
1106              yield $name;
1107          }
1108      }
1109  
1110  
1111      /**
1112       * Returns whether the matched tag contains the given ASCII case-insensitive class name.
1113       *
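           * Example (a minimal sketch; the input HTML and the `$p` variable are
           * illustrative assumptions):
           *
           *     $p = new WP_HTML_Tag_Processor( '<div class="Wide wp-block">' );
           *
           *     // Before a tag is matched there is nothing to inspect.
           *     null  === $p->has_class( 'wide' );
           *
           *     $p->next_tag();
           *     true  === $p->has_class( 'WIDE' );   // Both sides are compared in lowercase.
           *     false === $p->has_class( 'narrow' );
           *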
1114       * @since 6.4.0
1115       *
1116       * @param string $wanted_class Look for this CSS class name, ASCII case-insensitive.
1117       * @return bool|null Whether the matched tag contains the given class name, or null if not matched.
1118       */
1119  	public function has_class( $wanted_class ) {
1120          if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
1121              return null;
1122          }
1123  
1124          $wanted_class = strtolower( $wanted_class );
1125  
1126          foreach ( $this->class_list() as $class_name ) {
1127              if ( $class_name === $wanted_class ) {
1128                  return true;
1129              }
1130          }
1131  
1132          return false;
1133      }
1134  
1135  
1136      /**
1137       * Sets a bookmark in the HTML document.
1138       *
1139       * Bookmarks represent specific places or tokens in the HTML
1140       * document, such as a tag opener or closer. When applying
1141       * edits to a document, such as setting an attribute, the
1142       * text offsets of that token may shift; the bookmark is
1143       * kept updated with those shifts and remains stable unless
1144       * the entire span of text in which the token sits is removed.
1145       *
1146       * Release bookmarks when they are no longer needed.
1147       *
1148       * Example:
1149       *
1150       *     <main><h2>Surprising fact you may not know!</h2></main>
1151       *           ^  ^
1152       *            \-|-- this `H2` opener bookmark tracks the token
1153       *
1154       *     <main class="clickbait"><h2>Surprising fact you may no…
1155       *                             ^  ^
1156       *                              \-|-- it shifts with edits
1157       *
1158       * Bookmarks provide the ability to seek to a previously-scanned
1159       * place in the HTML document. This avoids the need to re-scan
1160       * the entire document.
1161       *
1162       * Example:
1163       *
1164       *     <ul><li>One</li><li>Two</li><li>Three</li></ul>
1165       *                                 ^^^^
1166       *                                 want to note this last item
1167       *
1168       *     $p = new WP_HTML_Tag_Processor( $html );
1169       *     $in_list = false;
1170       *     while ( $p->next_tag( array( 'tag_closers' => $in_list ? 'visit' : 'skip' ) ) ) {
1171       *         if ( 'UL' === $p->get_tag() ) {
1172       *             if ( $p->is_tag_closer() ) {
1173       *                 $in_list = false;
1174       *                 $p->set_bookmark( 'resume' );
1175       *                 if ( $p->seek( 'last-li' ) ) {
1176       *                     $p->add_class( 'last-li' );
1177       *                 }
1178       *                 $p->seek( 'resume' );
1179       *                 $p->release_bookmark( 'last-li' );
1180       *                 $p->release_bookmark( 'resume' );
1181       *             } else {
1182       *                 $in_list = true;
1183       *             }
1184       *         }
1185       *
1186       *         if ( 'LI' === $p->get_tag() ) {
1187       *             $p->set_bookmark( 'last-li' );
1188       *         }
1189       *     }
1190       *
1191       * Bookmarks intentionally hide the internal string offsets
1192       * to which they refer. They are maintained internally as
1193       * updates are applied to the HTML document and therefore
1194       * retain their "position" - the location to which they
1195       * originally pointed. The inability to use bookmarks with
1196       * functions like `substr` is therefore intentional to guard
1197       * against accidentally breaking the HTML.
1198       *
1199       * Because bookmarks allocate memory and require processing
1200       * for every applied update, they are limited and require
1201       * a name. They should not be created with programmatically-made
1202       * names, such as "li_{$index}" with some loop. As a general
1203       * rule they should only be created with string-literal names
1204       * like "start-of-section" or "last-paragraph".
1205       *
1206       * Bookmarks are a powerful tool to enable complicated behavior.
1207       * Consider double-checking that you need this tool if you are
1208       * reaching for it, as inappropriate use could lead to broken
1209       * HTML structure or unwanted processing overhead.
1210       *
1211       * @since 6.2.0
1212       *
1213       * @param string $name Identifies this particular bookmark.
1214       * @return bool Whether the bookmark was successfully created.
1215       */
1216  	public function set_bookmark( $name ) {
1217          // It only makes sense to set a bookmark if the parser has paused on a concrete token.
1218          if (
1219              self::STATE_COMPLETE === $this->parser_state ||
1220              self::STATE_INCOMPLETE_INPUT === $this->parser_state
1221          ) {
1222              return false;
1223          }
1224  
1225          if ( ! array_key_exists( $name, $this->bookmarks ) && count( $this->bookmarks ) >= static::MAX_BOOKMARKS ) {
1226              _doing_it_wrong(
1227                  __METHOD__,
1228                  __( 'Too many bookmarks: cannot create any more.' ),
1229                  '6.2.0'
1230              );
1231              return false;
1232          }
1233  
1234          $this->bookmarks[ $name ] = new WP_HTML_Span( $this->token_starts_at, $this->token_length );
1235  
1236          return true;
1237      }
1238  
1239  
1240      /**
1241       * Removes a bookmark that is no longer needed.
1242       *
1243       * Releasing a bookmark frees up the small
1244       * performance overhead it requires.
1245       *
1246       * @param string $name Name of the bookmark to remove.
1247       * @return bool Whether the bookmark already existed before removal.
1248       */
1249  	public function release_bookmark( $name ) {
1250          if ( ! array_key_exists( $name, $this->bookmarks ) ) {
1251              return false;
1252          }
1253  
1254          unset( $this->bookmarks[ $name ] );
1255  
1256          return true;
1257      }
1258  
1259      /**
1260       * Skips contents of generic rawtext elements.
1261       *
1262       * @since 6.3.2
1263       *
1264       * @see https://html.spec.whatwg.org/#generic-raw-text-element-parsing-algorithm
1265       *
1266       * @param string $tag_name The uppercase tag name which will close the RAWTEXT region.
1267       * @return bool Whether an end to the RAWTEXT region was found before the end of the document.
1268       */
1269  	private function skip_rawtext( $tag_name ) {
1270          /*
1271           * These two functions distinguish themselves on whether character references are
1272           * decoded, and since functionality to read the inner markup isn't supported, it's
1273           * not necessary to implement these two functions separately.
1274           */
1275          return $this->skip_rcdata( $tag_name );
1276      }
1277  
1278      /**
1279       * Skips contents of RCDATA elements, namely title and textarea tags.
1280       *
1281       * @since 6.2.0
1282       *
1283       * @see https://html.spec.whatwg.org/multipage/parsing.html#rcdata-state
1284       *
1285       * @param string $tag_name The uppercase tag name which will close the RCDATA region.
1286       * @return bool Whether an end to the RCDATA region was found before the end of the document.
1287       */
1288  	private function skip_rcdata( $tag_name ) {
1289          $html       = $this->html;
1290          $doc_length = strlen( $html );
1291          $tag_length = strlen( $tag_name );
1292  
1293          $at = $this->bytes_already_parsed;
1294  
1295          while ( false !== $at && $at < $doc_length ) {
1296              $at                       = strpos( $this->html, '</', $at );
1297              $this->tag_name_starts_at = $at;
1298  
1299              // Fail if there is no possible tag closer.
1300              if ( false === $at || ( $at + $tag_length ) >= $doc_length ) {
1301                  return false;
1302              }
1303  
1304              $at += 2;
1305  
1306              /*
1307               * Find a case-insensitive match to the tag name.
1308               *
1309               * Because tag names are limited to US-ASCII there is no
1310               * need to perform any kind of Unicode normalization when
1311               * comparing; any character which could be impacted by such
1312               * normalization could not be part of a tag name.
1313               */
1314              for ( $i = 0; $i < $tag_length; $i++ ) {
1315                  $tag_char  = $tag_name[ $i ];
1316                  $html_char = $html[ $at + $i ];
1317  
1318                  if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) {
1319                      $at += $i;
1320                      continue 2;
1321                  }
1322              }
1323  
1324              $at                        += $tag_length;
1325              $this->bytes_already_parsed = $at;
1326  
1327              if ( $at >= strlen( $html ) ) {
1328                  return false;
1329              }
1330  
1331              /*
1332               * Ensure that the tag name terminates to avoid matching on
1333               * substrings of a longer tag name. For example, the sequence
1334               * "</textarearug" should not match for "</textarea" even
1335               * though "textarea" is found within the text.
1336               */
1337              $c = $html[ $at ];
1338              if ( ' ' !== $c && "\t" !== $c && "\r" !== $c && "\n" !== $c && '/' !== $c && '>' !== $c ) {
1339                  continue;
1340              }
1341  
1342              while ( $this->parse_next_attribute() ) {
1343                  continue;
1344              }
1345  
1346              $at = $this->bytes_already_parsed;
1347              if ( $at >= strlen( $this->html ) ) {
1348                  return false;
1349              }
1350  
1351              if ( '>' === $html[ $at ] ) {
1352                  $this->bytes_already_parsed = $at + 1;
1353                  return true;
1354              }
1355  
1356              if ( $at + 1 >= strlen( $this->html ) ) {
1357                  return false;
1358              }
1359  
1360              if ( '/' === $html[ $at ] && '>' === $html[ $at + 1 ] ) {
1361                  $this->bytes_already_parsed = $at + 2;
1362                  return true;
1363              }
1364          }
1365  
1366          return false;
1367      }
1368  
1369      /**
1370       * Skips contents of script tags.
1371       *
1372       * @since 6.2.0
1373       *
1374       * @return bool Whether the script tag was closed before the end of the document.
1375       */
1376  	private function skip_script_data() {
1377          $state      = 'unescaped';
1378          $html       = $this->html;
1379          $doc_length = strlen( $html );
1380          $at         = $this->bytes_already_parsed;
1381  
1382          while ( false !== $at && $at < $doc_length ) {
1383              $at += strcspn( $html, '-<', $at );
1384  
1385              /*
1386               * For all script states a "-->" transitions
1387               * back into the normal unescaped script mode,
1388               * even if that's the current state.
1389               */
1390              if (
1391                  $at + 2 < $doc_length &&
1392                  '-' === $html[ $at ] &&
1393                  '-' === $html[ $at + 1 ] &&
1394                  '>' === $html[ $at + 2 ]
1395              ) {
1396                  $at   += 3;
1397                  $state = 'unescaped';
1398                  continue;
1399              }
1400  
1401              // Everything of interest past here starts with "<".
1402              if ( $at + 1 >= $doc_length || '<' !== $html[ $at++ ] ) {
1403                  continue;
1404              }
1405  
1406              /*
1407               * Unlike with "-->", the "<!--" only transitions
1408               * into the escaped mode if not already there.
1409               *
1410               * Inside the escaped modes it will be ignored; and
1411               * should never break out of the double-escaped
1412               * mode and back into the escaped mode.
1413               *
1414               * While this requires a mode change, it does not
1415               * impact the parsing otherwise, so continue
1416               * parsing after updating the state.
1417               */
1418              if (
1419                  $at + 2 < $doc_length &&
1420                  '!' === $html[ $at ] &&
1421                  '-' === $html[ $at + 1 ] &&
1422                  '-' === $html[ $at + 2 ]
1423              ) {
1424                  $at   += 3;
1425                  $state = 'unescaped' === $state ? 'escaped' : $state;
1426                  continue;
1427              }
1428  
1429              if ( '/' === $html[ $at ] ) {
1430                  $closer_potentially_starts_at = $at - 1;
1431                  $is_closing                   = true;
1432                  ++$at;
1433              } else {
1434                  $is_closing = false;
1435              }
1436  
1437              /*
1438               * At this point the only remaining state-changes occur with the
1439               * <script> and </script> tags; unless one of these appears next,
1440               * proceed scanning to the next potential token in the text.
1441               */
1442              if ( ! (
1443                  $at + 6 < $doc_length &&
1444                  ( 's' === $html[ $at ] || 'S' === $html[ $at ] ) &&
1445                  ( 'c' === $html[ $at + 1 ] || 'C' === $html[ $at + 1 ] ) &&
1446                  ( 'r' === $html[ $at + 2 ] || 'R' === $html[ $at + 2 ] ) &&
1447                  ( 'i' === $html[ $at + 3 ] || 'I' === $html[ $at + 3 ] ) &&
1448                  ( 'p' === $html[ $at + 4 ] || 'P' === $html[ $at + 4 ] ) &&
1449                  ( 't' === $html[ $at + 5 ] || 'T' === $html[ $at + 5 ] )
1450              ) ) {
1451                  ++$at;
1452                  continue;
1453              }
1454  
1455              /*
1456               * Ensure that the script tag terminates to avoid matching on
1457               * substrings of a non-match. For example, the sequence
1458               * "<script123" should not end a script region even though
1459               * "<script" is found within the text.
1460               */
1461              if ( $at + 6 >= $doc_length ) {
1462                  continue;
1463              }
1464              $at += 6;
1465              $c   = $html[ $at ];
1466              if ( ' ' !== $c && "\t" !== $c && "\r" !== $c && "\n" !== $c && '/' !== $c && '>' !== $c ) {
1467                  ++$at;
1468                  continue;
1469              }
1470  
1471              if ( 'escaped' === $state && ! $is_closing ) {
1472                  $state = 'double-escaped';
1473                  continue;
1474              }
1475  
1476              if ( 'double-escaped' === $state && $is_closing ) {
1477                  $state = 'escaped';
1478                  continue;
1479              }
1480  
1481              if ( $is_closing ) {
1482                  $this->bytes_already_parsed = $closer_potentially_starts_at;
1483                  $this->tag_name_starts_at   = $closer_potentially_starts_at;
1484                  if ( $this->bytes_already_parsed >= $doc_length ) {
1485                      return false;
1486                  }
1487  
1488                  while ( $this->parse_next_attribute() ) {
1489                      continue;
1490                  }
1491  
1492                  if ( $this->bytes_already_parsed >= $doc_length ) {
1493                      $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1494  
1495                      return false;
1496                  }
1497  
1498                  if ( '>' === $html[ $this->bytes_already_parsed ] ) {
1499                      ++$this->bytes_already_parsed;
1500                      return true;
1501                  }
1502              }
1503  
1504              ++$at;
1505          }
1506  
1507          return false;
1508      }
1509  
1510      /**
1511       * Parses the next tag.
1512       *
1513       * This will find and start parsing the next tag, including
1514       * the opening `<`, the potential closer `/`, and the tag
1515       * name. It does not parse the attributes or scan to the
1516       * closing `>`; these are left for other methods.
1517       *
1518       * @since 6.2.0
1519       * @since 6.2.1 Support abruptly-closed comments, invalid-tag-closer-comments, and empty elements.
1520       *
1521       * @return bool Whether a token (tag, text node, comment, or similar) was found before the end of the document.
1522       */
1523      private function parse_next_tag() {
1524          $this->after_tag();
1525  
1526          $html       = $this->html;
1527          $doc_length = strlen( $html );
1528          $was_at     = $this->bytes_already_parsed;
1529          $at         = $was_at;
1530  
1531          while ( false !== $at && $at < $doc_length ) {
1532              $at = strpos( $html, '<', $at );
1533  
1534              /*
1535               * This does not imply an incomplete parse; it indicates that there
1536               * can be nothing left in the document other than a #text node.
1537               */
1538              if ( false === $at ) {
1539                  $this->parser_state         = self::STATE_TEXT_NODE;
1540                  $this->token_starts_at      = $was_at;
1541                  $this->token_length         = strlen( $html ) - $was_at;
1542                  $this->text_starts_at       = $was_at;
1543                  $this->text_length          = $this->token_length;
1544                  $this->bytes_already_parsed = strlen( $html );
1545                  return true;
1546              }
1547  
1548              if ( $at > $was_at ) {
1549                  /*
1550                   * A "<" normally starts a new HTML tag or syntax token, but in cases where the
1551                   * following character can't produce a valid token, the "<" is instead treated
1552                   * as plaintext and the parser should skip over it. This avoids a problem when
1553                   * following earlier practices of typing emoji with text, e.g. "<3". This
1554                   * should be a heart, not a tag. It's supposed to be rendered, not hidden.
1555                   *
1556                   * At this point the parser checks whether this is one of those cases and, if
1557                   * so, continues searching for the next "<" as a potential token boundary.
1558                   *
1559                   * @see https://html.spec.whatwg.org/#tag-open-state
1560                   */
1561                  if ( strlen( $html ) > $at + 1 ) {
1562                      $next_character  = $html[ $at + 1 ];
1563                      $at_another_node = (
1564                          '!' === $next_character ||
1565                          '/' === $next_character ||
1566                          '?' === $next_character ||
1567                          ( 'A' <= $next_character && $next_character <= 'Z' ) ||
1568                          ( 'a' <= $next_character && $next_character <= 'z' )
1569                      );
1570                      if ( ! $at_another_node ) {
1571                          ++$at;
1572                          continue;
1573                      }
1574                  }
1575  
1576                  $this->parser_state         = self::STATE_TEXT_NODE;
1577                  $this->token_starts_at      = $was_at;
1578                  $this->token_length         = $at - $was_at;
1579                  $this->text_starts_at       = $was_at;
1580                  $this->text_length          = $this->token_length;
1581                  $this->bytes_already_parsed = $at;
1582                  return true;
1583              }
1584  
1585              $this->token_starts_at = $at;
1586  
1587              if ( $at + 1 < $doc_length && '/' === $this->html[ $at + 1 ] ) {
1588                  $this->is_closing_tag = true;
1589                  ++$at;
1590              } else {
1591                  $this->is_closing_tag = false;
1592              }
1593  
1594              /*
1595               * HTML tag names must start with [a-zA-Z] otherwise they are not tags.
1596               * For example, "<3" is rendered as text, not a tag opener. If at least
1597               * one letter follows the "<" then _it is_ a tag, but if the following
1598               * character is anything else it _is not a tag_.
1599               *
1600               * It's not uncommon to find non-tags starting with `<` in an HTML
1601               * document, so it's good for performance to make this pre-check before
1602               * continuing to attempt to parse a tag name.
1603               *
1604               * Reference:
1605               * * https://html.spec.whatwg.org/multipage/parsing.html#data-state
1606               * * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
1607               */
1608              $tag_name_prefix_length = strspn( $html, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', $at + 1 );
1609              if ( $tag_name_prefix_length > 0 ) {
1610                  ++$at;
1611                  $this->parser_state         = self::STATE_MATCHED_TAG;
1612                  $this->tag_name_starts_at   = $at;
1613                  $this->tag_name_length      = $tag_name_prefix_length + strcspn( $html, " \t\f\r\n/>", $at + $tag_name_prefix_length );
1614                  $this->bytes_already_parsed = $at + $this->tag_name_length;
1615                  return true;
1616              }
1617  
1618              /*
1619               * Abort if no tag is found before the end of
1620               * the document. There is nothing left to parse.
1621               */
1622              if ( $at + 1 >= $doc_length ) {
1623                  $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1624  
1625                  return false;
1626              }
1627  
1628              /*
1629               * `<!` transitions to markup declaration open state
1630               * https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state
1631               */
1632              if ( ! $this->is_closing_tag && '!' === $html[ $at + 1 ] ) {
1633                  /*
1634                   * `<!--` transitions to a comment state – apply further comment rules.
1635                   * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
1636                   */
1637                  if (
1638                      $doc_length > $at + 3 &&
1639                      '-' === $html[ $at + 2 ] &&
1640                      '-' === $html[ $at + 3 ]
1641                  ) {
1642                      $closer_at = $at + 4;
1643                      // If it's not possible to close the comment then there is nothing more to scan.
1644                      if ( $doc_length <= $closer_at ) {
1645                          $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1646  
1647                          return false;
1648                      }
1649  
1650                      // Abruptly-closed empty comments are a sequence of dashes followed by `>`.
1651                      $span_of_dashes = strspn( $html, '-', $closer_at );
1652                      if ( $closer_at + $span_of_dashes < $doc_length && '>' === $html[ $closer_at + $span_of_dashes ] ) {
1653                          /*
1654                           * @todo When implementing `set_modifiable_text()` ensure that updates to this token
1655                           *       don't break the syntax for short comments, e.g. `<!--->`. Unlike other comment
1656                           *       and bogus comment syntax, these leave no clear insertion point for text and
1657                           *       they need to be modified specially in order to contain text. E.g. to store
1658                           *       `?` as the modifiable text, the `<!--->` needs to become `<!--?-->`, which
1659                           *       involves inserting an additional `-` into the token after the modifiable text.
1660                           */
1661                          $this->parser_state = self::STATE_COMMENT;
1662                          $this->comment_type = self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT;
1663                          $this->token_length = $closer_at + $span_of_dashes + 1 - $this->token_starts_at;
1664  
1665                          // Only provide modifiable text if the token is long enough to contain it.
1666                          if ( $span_of_dashes >= 2 ) {
1667                              $this->comment_type   = self::COMMENT_AS_HTML_COMMENT;
1668                              $this->text_starts_at = $this->token_starts_at + 4;
1669                              $this->text_length    = $span_of_dashes - 2;
1670                          }
1671  
1672                          $this->bytes_already_parsed = $closer_at + $span_of_dashes + 1;
1673                          return true;
1674                      }
1675  
1676                      /*
1677                       * Comments may be closed by either a --> or an invalid --!>.
1678                       * The first occurrence closes the comment.
1679                       *
1680                       * See https://html.spec.whatwg.org/#parse-error-incorrectly-closed-comment
1681                       */
1682                      --$closer_at; // Pre-increment inside condition below reduces risk of accidental infinite looping.
1683                      while ( ++$closer_at < $doc_length ) {
1684                          $closer_at = strpos( $html, '--', $closer_at );
1685                          if ( false === $closer_at ) {
1686                              $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1687  
1688                              return false;
1689                          }
1690  
1691                          if ( $closer_at + 2 < $doc_length && '>' === $html[ $closer_at + 2 ] ) {
1692                              $this->parser_state         = self::STATE_COMMENT;
1693                              $this->comment_type         = self::COMMENT_AS_HTML_COMMENT;
1694                              $this->token_length         = $closer_at + 3 - $this->token_starts_at;
1695                              $this->text_starts_at       = $this->token_starts_at + 4;
1696                              $this->text_length          = $closer_at - $this->text_starts_at;
1697                              $this->bytes_already_parsed = $closer_at + 3;
1698                              return true;
1699                          }
1700  
1701                          if (
1702                              $closer_at + 3 < $doc_length &&
1703                              '!' === $html[ $closer_at + 2 ] &&
1704                              '>' === $html[ $closer_at + 3 ]
1705                          ) {
1706                              $this->parser_state         = self::STATE_COMMENT;
1707                              $this->comment_type         = self::COMMENT_AS_HTML_COMMENT;
1708                              $this->token_length         = $closer_at + 4 - $this->token_starts_at;
1709                              $this->text_starts_at       = $this->token_starts_at + 4;
1710                              $this->text_length          = $closer_at - $this->text_starts_at;
1711                              $this->bytes_already_parsed = $closer_at + 4;
1712                              return true;
1713                          }
1714                      }
1715                  }
1716  
1717                  /*
1718                   * `<!DOCTYPE` transitions to DOCTYPE state – skip to the nearest >
1719                   * These are ASCII-case-insensitive.
1720                   * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
1721                   */
1722                  if (
1723                      $doc_length > $at + 8 &&
1724                      ( 'D' === $html[ $at + 2 ] || 'd' === $html[ $at + 2 ] ) &&
1725                      ( 'O' === $html[ $at + 3 ] || 'o' === $html[ $at + 3 ] ) &&
1726                      ( 'C' === $html[ $at + 4 ] || 'c' === $html[ $at + 4 ] ) &&
1727                      ( 'T' === $html[ $at + 5 ] || 't' === $html[ $at + 5 ] ) &&
1728                      ( 'Y' === $html[ $at + 6 ] || 'y' === $html[ $at + 6 ] ) &&
1729                      ( 'P' === $html[ $at + 7 ] || 'p' === $html[ $at + 7 ] ) &&
1730                      ( 'E' === $html[ $at + 8 ] || 'e' === $html[ $at + 8 ] )
1731                  ) {
1732                      $closer_at = strpos( $html, '>', $at + 9 );
1733                      if ( false === $closer_at ) {
1734                          $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1735  
1736                          return false;
1737                      }
1738  
1739                      $this->parser_state         = self::STATE_DOCTYPE;
1740                      $this->token_length         = $closer_at + 1 - $this->token_starts_at;
1741                      $this->text_starts_at       = $this->token_starts_at + 9;
1742                      $this->text_length          = $closer_at - $this->text_starts_at;
1743                      $this->bytes_already_parsed = $closer_at + 1;
1744                      return true;
1745                  }
1746  
1747                  /*
1748                   * Anything else here is an incorrectly-opened comment and transitions
1749                   * to the bogus comment state - skip to the nearest >. If no closer is
1750                   * found then the HTML was truncated inside the markup declaration.
1751                   */
1752                  $closer_at = strpos( $html, '>', $at + 1 );
1753                  if ( false === $closer_at ) {
1754                      $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1755  
1756                      return false;
1757                  }
1758  
1759                  $this->parser_state         = self::STATE_COMMENT;
1760                  $this->comment_type         = self::COMMENT_AS_INVALID_HTML;
1761                  $this->token_length         = $closer_at + 1 - $this->token_starts_at;
1762                  $this->text_starts_at       = $this->token_starts_at + 2;
1763                  $this->text_length          = $closer_at - $this->text_starts_at;
1764                  $this->bytes_already_parsed = $closer_at + 1;
1765  
1766                  /*
1767                   * Identify nodes that would be CDATA if HTML had CDATA sections.
1768                   *
1769                   * This section must occur after identifying the bogus comment end
1770                   * because in an HTML parser it will span to the nearest `>`, even
1771                   * if there's no `]]>` as would be required in an XML document. It
1772                   * is therefore not possible to parse a CDATA section containing
1773                   * a `>` in the HTML syntax.
1774                   *
1775                   * Inside foreign elements there is a discrepancy between browsers
1776                   * and the specification on this.
1777                   *
1778                   * @todo Track whether the Tag Processor is inside a foreign element
1779                   *       and require the proper closing `]]>` in those cases.
1780                   */
1781                  if (
1782                      $this->token_length >= 10 &&
1783                      '[' === $html[ $this->token_starts_at + 2 ] &&
1784                      'C' === $html[ $this->token_starts_at + 3 ] &&
1785                      'D' === $html[ $this->token_starts_at + 4 ] &&
1786                      'A' === $html[ $this->token_starts_at + 5 ] &&
1787                      'T' === $html[ $this->token_starts_at + 6 ] &&
1788                      'A' === $html[ $this->token_starts_at + 7 ] &&
1789                      '[' === $html[ $this->token_starts_at + 8 ] &&
1790                      ']' === $html[ $closer_at - 1 ] &&
1791                      ']' === $html[ $closer_at - 2 ]
1792                  ) {
1793                      $this->parser_state    = self::STATE_COMMENT;
1794                      $this->comment_type    = self::COMMENT_AS_CDATA_LOOKALIKE;
1795                      $this->text_starts_at += 7;
1796                      $this->text_length    -= 9;
1797                  }
1798  
1799                  return true;
1800              }
1801  
1802              /*
1803               * </> is a missing end tag name, which is ignored.
1804               *
1805               * This was also known as the "presumptuous empty tag"
1806               * in early discussions as it was proposed to close
1807               * the nearest previous opening tag.
1808               *
1809               * See https://html.spec.whatwg.org/#parse-error-missing-end-tag-name
1810               */
1811              if ( '>' === $html[ $at + 1 ] ) {
1812                  // `<>` is interpreted as plaintext.
1813                  if ( ! $this->is_closing_tag ) {
1814                      ++$at;
1815                      continue;
1816                  }
1817  
1818                  $this->parser_state         = self::STATE_PRESUMPTUOUS_TAG;
1819                  $this->token_length         = $at + 2 - $this->token_starts_at;
1820                  $this->bytes_already_parsed = $at + 2;
1821                  return true;
1822              }
1823  
1824              /*
1825               * `<?` transitions to a bogus comment state – skip to the nearest >
1826               * See https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
1827               */
1828              if ( ! $this->is_closing_tag && '?' === $html[ $at + 1 ] ) {
1829                  $closer_at = strpos( $html, '>', $at + 2 );
1830                  if ( false === $closer_at ) {
1831                      $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1832  
1833                      return false;
1834                  }
1835  
1836                  $this->parser_state         = self::STATE_COMMENT;
1837                  $this->comment_type         = self::COMMENT_AS_INVALID_HTML;
1838                  $this->token_length         = $closer_at + 1 - $this->token_starts_at;
1839                  $this->text_starts_at       = $this->token_starts_at + 2;
1840                  $this->text_length          = $closer_at - $this->text_starts_at;
1841                  $this->bytes_already_parsed = $closer_at + 1;
1842  
1843                  /*
1844                   * Identify a Processing Instruction node were HTML to have them.
1845                   *
1846                   * This section must occur after identifying the bogus comment end
1847                   * because in an HTML parser it will span to the nearest `>`, even
1848                   * if there's no `?>` as would be required in an XML document. It
1849                   * is therefore not possible to parse a Processing Instruction node
1850                   * containing a `>` in the HTML syntax.
1851                   *
1852                   * XML allows for more target names, but this code only identifies
1853                   * those with ASCII-representable target names. This means that it
1854                   * may identify some Processing Instruction nodes as bogus comments,
1855                   * but it will not misinterpret the HTML structure. By limiting the
1856                   * identification to these target names the Tag Processor can avoid
1857                   * the need to start parsing UTF-8 sequences.
1858                   *
1859                   * > NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
1860                   *                     [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
1861                   *                     [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
1862                   *                     [#x10000-#xEFFFF]
1863                   * > NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
1864                   *
1865                   * @see https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget
1866                   */
1867                  if ( $this->token_length >= 5 && '?' === $html[ $closer_at - 1 ] ) {
1868                      $comment_text     = substr( $html, $this->token_starts_at + 2, $this->token_length - 4 );
1869                      $pi_target_length = strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_' );
1870  
1871                      if ( 0 < $pi_target_length ) {
1872                          $pi_target_length += strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.', $pi_target_length );
1873  
1874                          $this->comment_type       = self::COMMENT_AS_PI_NODE_LOOKALIKE;
1875                          $this->tag_name_starts_at = $this->token_starts_at + 2;
1876                          $this->tag_name_length    = $pi_target_length;
1877                          $this->text_starts_at    += $pi_target_length;
1878                          $this->text_length       -= $pi_target_length + 1;
1879                      }
1880                  }
1881  
1882                  return true;
1883              }
1884  
1885              /*
1886               * If a non-alpha starts the tag name in a tag closer it's a comment.
1887               * Find the first `>`, which closes the comment.
1888               *
1889               * This parser classifies these particular comments as special "funky comments"
1890               * which are made available for further processing.
1891               *
1892               * See https://html.spec.whatwg.org/#parse-error-invalid-first-character-of-tag-name
1893               */
1894              if ( $this->is_closing_tag ) {
1895                  // No chance of finding a closer.
1896                  if ( $at + 3 > $doc_length ) {
1897                      return false;
1898                  }
1899  
1900                  $closer_at = strpos( $html, '>', $at + 2 );
1901                  if ( false === $closer_at ) {
1902                      $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1903  
1904                      return false;
1905                  }
1906  
1907                  $this->parser_state         = self::STATE_FUNKY_COMMENT;
1908                  $this->token_length         = $closer_at + 1 - $this->token_starts_at;
1909                  $this->text_starts_at       = $this->token_starts_at + 2;
1910                  $this->text_length          = $closer_at - $this->text_starts_at;
1911                  $this->bytes_already_parsed = $closer_at + 1;
1912                  return true;
1913              }
1914  
1915              ++$at;
1916          }
1917  
1918          return false;
1919      }
1920  
1921      /**
1922       * Parses the next attribute.
1923       *
1924       * @since 6.2.0
1925       *
1926       * @return bool Whether an attribute was found before the end of the document.
1927       */
1928      private function parse_next_attribute() {
1929          // Skip whitespace and slashes.
1930          $this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n/", $this->bytes_already_parsed );
1931          if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
1932              $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1933  
1934              return false;
1935          }
1936  
1937          /*
1938           * Treat the equal sign as a part of the attribute
1939           * name if it is the first encountered byte.
1940           *
1941           * @see https://html.spec.whatwg.org/multipage/parsing.html#before-attribute-name-state
1942           */
1943          $name_length = '=' === $this->html[ $this->bytes_already_parsed ]
1944              ? 1 + strcspn( $this->html, "=/> \t\f\r\n", $this->bytes_already_parsed + 1 )
1945              : strcspn( $this->html, "=/> \t\f\r\n", $this->bytes_already_parsed );
1946  
1947          // No attribute, just tag closer.
1948          if ( 0 === $name_length || $this->bytes_already_parsed + $name_length >= strlen( $this->html ) ) {
1949              return false;
1950          }
1951  
1952          $attribute_start             = $this->bytes_already_parsed;
1953          $attribute_name              = substr( $this->html, $attribute_start, $name_length );
1954          $this->bytes_already_parsed += $name_length;
1955          if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
1956              $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1957  
1958              return false;
1959          }
1960  
1961          $this->skip_whitespace();
1962          if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
1963              $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1964  
1965              return false;
1966          }
1967  
1968          $has_value = '=' === $this->html[ $this->bytes_already_parsed ];
1969          if ( $has_value ) {
1970              ++$this->bytes_already_parsed;
1971              $this->skip_whitespace();
1972              if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
1973                  $this->parser_state = self::STATE_INCOMPLETE_INPUT;
1974  
1975                  return false;
1976              }
1977  
1978              switch ( $this->html[ $this->bytes_already_parsed ] ) {
1979                  case "'":
1980                  case '"':
1981                      $quote                      = $this->html[ $this->bytes_already_parsed ];
1982                      $value_start                = $this->bytes_already_parsed + 1;
1983                      $value_length               = strcspn( $this->html, $quote, $value_start );
1984                      $attribute_end              = $value_start + $value_length + 1;
1985                      $this->bytes_already_parsed = $attribute_end;
1986                      break;
1987  
1988                  default:
1989                      $value_start                = $this->bytes_already_parsed;
1990                      $value_length               = strcspn( $this->html, "> \t\f\r\n", $value_start );
1991                      $attribute_end              = $value_start + $value_length;
1992                      $this->bytes_already_parsed = $attribute_end;
1993              }
1994          } else {
1995              $value_start   = $this->bytes_already_parsed;
1996              $value_length  = 0;
1997              $attribute_end = $attribute_start + $name_length;
1998          }
1999  
2000          if ( $attribute_end >= strlen( $this->html ) ) {
2001              $this->parser_state = self::STATE_INCOMPLETE_INPUT;
2002  
2003              return false;
2004          }
2005  
2006          if ( $this->is_closing_tag ) {
2007              return true;
2008          }
2009  
2010          /*
2011           * > There must never be two or more attributes on
2012           * > the same start tag whose names are an ASCII
2013           * > case-insensitive match for each other.
2014           *     - HTML 5 spec
2015           *
2016           * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
2017           */
2018          $comparable_name = strtolower( $attribute_name );
2019  
2020          // If an attribute is listed many times, only use the first declaration and ignore the rest.
2021          if ( ! array_key_exists( $comparable_name, $this->attributes ) ) {
2022              $this->attributes[ $comparable_name ] = new WP_HTML_Attribute_Token(
2023                  $attribute_name,
2024                  $value_start,
2025                  $value_length,
2026                  $attribute_start,
2027                  $attribute_end - $attribute_start,
2028                  ! $has_value
2029              );
2030  
2031              return true;
2032          }
2033  
2034          /*
2035           * Track the duplicate attributes so that removing one removes them all together.
2036           *
2037           * While `$this->duplicate_attributes` could always be stored as an `array()`,
2038           * which would simplify the logic here, storing a `null` and only allocating
2039           * an array when encountering duplicates avoids needless allocations in the
2040           * common case of parsing tags with no duplicate attributes.
2041           */
2042          $duplicate_span = new WP_HTML_Span( $attribute_start, $attribute_end - $attribute_start );
2043          if ( null === $this->duplicate_attributes ) {
2044              $this->duplicate_attributes = array( $comparable_name => array( $duplicate_span ) );
2045          } elseif ( ! array_key_exists( $comparable_name, $this->duplicate_attributes ) ) {
2046              $this->duplicate_attributes[ $comparable_name ] = array( $duplicate_span );
2047          } else {
2048              $this->duplicate_attributes[ $comparable_name ][] = $duplicate_span;
2049          }
2050  
2051          return true;
2052      }
2053  
2054      /**
2055       * Moves the internal cursor past any immediately following whitespace.
2056       *
2057       * @since 6.2.0
2058       */
2059  	private function skip_whitespace() {
2060          $this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n", $this->bytes_already_parsed );
2061      }
2062  
2063      /**
2064       * Applies attribute updates and cleans up once a tag is fully parsed.
2065       *
2066       * @since 6.2.0
2067       */
2068  	private function after_tag() {
2069          /*
2070           * There could be lexical updates enqueued for an attribute that
2071           * also exists on the next tag. In order to avoid conflating the
2072           * attributes across the two tags, lexical updates with names
2073           * need to be flushed to raw lexical updates.
2074           */
2075          $this->class_name_updates_to_attributes_updates();
2076  
2077          /*
2078           * Purge updates if there are too many. The actual count isn't
2079           * scientific, but a few values from 100 to a few thousand were
2080           * tested to find a practically-useful limit.
2081           *
2082           * If the update queue grows too big, then the Tag Processor
2083           * will spend more time iterating through it and lose the
2084           * efficiency gains of deferring the updates.
2085           */
2086          if ( 1000 < count( $this->lexical_updates ) ) {
2087              $this->get_updated_html();
2088          }
2089  
2090          foreach ( $this->lexical_updates as $name => $update ) {
2091              /*
2092               * Any updates appearing after the cursor should be applied
2093               * before proceeding, otherwise they may be overlooked.
2094               */
2095              if ( $update->start >= $this->bytes_already_parsed ) {
2096                  $this->get_updated_html();
2097                  break;
2098              }
2099  
2100              if ( is_int( $name ) ) {
2101                  continue;
2102              }
2103  
2104              $this->lexical_updates[] = $update;
2105              unset( $this->lexical_updates[ $name ] );
2106          }
2107  
2108          $this->token_starts_at      = null;
2109          $this->token_length         = null;
2110          $this->tag_name_starts_at   = null;
2111          $this->tag_name_length      = null;
2112          $this->text_starts_at       = 0;
2113          $this->text_length          = 0;
2114          $this->is_closing_tag       = null;
2115          $this->attributes           = array();
2116          $this->comment_type         = null;
2117          $this->duplicate_attributes = null;
2118      }
2119  
2120      /**
2121       * Converts class name updates into tag attributes updates
2122       * (they are accumulated in different data formats for performance).
2123       *
2124       * @since 6.2.0
2125       *
2126       * @see WP_HTML_Tag_Processor::$lexical_updates
2127       * @see WP_HTML_Tag_Processor::$classname_updates
2128       */
2129  	private function class_name_updates_to_attributes_updates() {
2130          if ( count( $this->classname_updates ) === 0 ) {
2131              return;
2132          }
2133  
2134          $existing_class = $this->get_enqueued_attribute_value( 'class' );
2135          if ( null === $existing_class || true === $existing_class ) {
2136              $existing_class = '';
2137          }
2138  
2139          if ( false === $existing_class && isset( $this->attributes['class'] ) ) {
2140              $existing_class = substr(
2141                  $this->html,
2142                  $this->attributes['class']->value_starts_at,
2143                  $this->attributes['class']->value_length
2144              );
2145          }
2146  
2147          if ( false === $existing_class ) {
2148              $existing_class = '';
2149          }
2150  
2151          /**
2152           * Updated "class" attribute value.
2153           *
2154           * This is incrementally built while scanning through the existing class
2155           * attribute, skipping removed classes on the way, and then appending
2156           * added classes at the end. Only when finished processing will the
2157           * value contain the final new value.
2158           *
2159           * @var string $class
2160           */
2161          $class = '';
2162  
2163          /**
2164           * Tracks the cursor position in the existing
2165           * class attribute value while parsing.
2166           *
2167           * @var int $at
2168           */
2169          $at = 0;
2170  
2171          /**
2172           * Indicates if there's any need to modify the existing class attribute.
2173           *
2174           * If calls to `add_class()` and `remove_class()` wouldn't impact
2175           * the `class` attribute value then there's no need to rebuild it.
2176           * For example, when adding a class that's already present or
2177           * removing one that isn't.
2178           *
2179           * This flag enables a performance optimization when none of the enqueued
2180           * class updates would impact the `class` attribute; namely, that the
2181           * processor can continue without modifying the input document, as if
2182           * none of the `add_class()` or `remove_class()` calls had been made.
2183           *
2184           * This flag is set upon the first change that requires a string update.
2185           *
2186           * @var bool $modified
2187           */
2188          $modified = false;
2189  
2190          // Remove unwanted classes by copying only the ones that should remain.
2191          $existing_class_length = strlen( $existing_class );
2192          while ( $at < $existing_class_length ) {
2193              // Skip to the first non-whitespace character.
2194              $ws_at     = $at;
2195              $ws_length = strspn( $existing_class, " \t\f\r\n", $ws_at );
2196              $at       += $ws_length;
2197  
2198              // Capture the class name – it's everything until the next whitespace.
2199              $name_length = strcspn( $existing_class, " \t\f\r\n", $at );
2200              if ( 0 === $name_length ) {
2201                  // If no more class names are found then that's the end.
2202                  break;
2203              }
2204  
2205              $name = substr( $existing_class, $at, $name_length );
2206              $at  += $name_length;
2207  
2208              // If this class is marked for removal, start processing the next one.
2209              $remove_class = (
2210                  isset( $this->classname_updates[ $name ] ) &&
2211                  self::REMOVE_CLASS === $this->classname_updates[ $name ]
2212              );
2213  
2214              // If a class has already been seen then skip it; it should not be added twice.
2215              if ( ! $remove_class ) {
2216                  $this->classname_updates[ $name ] = self::SKIP_CLASS;
2217              }
2218  
2219              if ( $remove_class ) {
2220                  $modified = true;
2221                  continue;
2222              }
2223  
2224              /*
2225               * Otherwise, append it to the new "class" attribute value.
2226               *
2227               * There are options for handling whitespace between class names.
2228               * Preserving the existing whitespace produces fewer changes
2229               * to the HTML content and should clarify the before/after
2230               * content when debugging the modified output.
2231               *
2232               * This approach contrasts with normalizing the inter-class
2233               * whitespace to a single space, which might appear cleaner
2234               * in the output HTML but produces a noisier change.
2235               */
2236              $class .= substr( $existing_class, $ws_at, $ws_length );
2237              $class .= $name;
2238          }
2239  
2240          // Add new classes by appending those which haven't already been seen.
2241          foreach ( $this->classname_updates as $name => $operation ) {
2242              if ( self::ADD_CLASS === $operation ) {
2243                  $modified = true;
2244  
2245                  $class .= strlen( $class ) > 0 ? ' ' : '';
2246                  $class .= $name;
2247              }
2248          }
2249  
2250          $this->classname_updates = array();
2251          if ( ! $modified ) {
2252              return;
2253          }
2254  
2255          if ( strlen( $class ) > 0 ) {
2256              $this->set_attribute( 'class', $class );
2257          } else {
2258              $this->remove_attribute( 'class' );
2259          }
2260      }
2261  
2262      /**
2263       * Applies attribute updates to HTML document.
2264       *
2265       * @since 6.2.0
2266       * @since 6.2.1 Accumulates shift for internal cursor and passed pointer.
2267       * @since 6.3.0 Invalidate any bookmarks whose targets are overwritten.
2268       *
2269       * @param int $shift_this_point Accumulate and return shift for this position.
2270       * @return int How many bytes the given pointer moved in response to the updates.
2271       */
2272  	private function apply_attributes_updates( $shift_this_point = 0 ) {
2273          if ( ! count( $this->lexical_updates ) ) {
2274              return 0;
2275          }
2276  
2277          $accumulated_shift_for_given_point = 0;
2278  
2279          /*
2280           * Attribute updates can be enqueued in any order but updates
2281           * to the document must occur in lexical order; that is, each
2282           * replacement must be made before all others which follow it
2283           * at later string indices in the input document.
2284           *
2285           * Sorting avoids making out-of-order replacements, which
2286           * can lead to mangled output, partially-duplicated
2287           * attributes, and overwritten attributes.
2288           */
2289          usort( $this->lexical_updates, array( self::class, 'sort_start_ascending' ) );
2290  
2291          $bytes_already_copied = 0;
2292          $output_buffer        = '';
2293          foreach ( $this->lexical_updates as $diff ) {
2294              $shift = strlen( $diff->text ) - $diff->length;
2295  
2296              // Adjust the cursor position by however much an update affects it.
2297              if ( $diff->start < $this->bytes_already_parsed ) {
2298                  $this->bytes_already_parsed += $shift;
2299              }
2300  
2301              // Accumulate shift of the given pointer within this function call.
2302              if ( $diff->start <= $shift_this_point ) {
2303                  $accumulated_shift_for_given_point += $shift;
2304              }
2305  
2306              $output_buffer       .= substr( $this->html, $bytes_already_copied, $diff->start - $bytes_already_copied );
2307              $output_buffer       .= $diff->text;
2308              $bytes_already_copied = $diff->start + $diff->length;
2309          }
2310  
2311          $this->html = $output_buffer . substr( $this->html, $bytes_already_copied );
2312  
2313          /*
2314           * Adjust bookmark locations to account for how the text
2315           * replacements adjust offsets in the input document.
2316           */
2317          foreach ( $this->bookmarks as $bookmark_name => $bookmark ) {
2318              $bookmark_end = $bookmark->start + $bookmark->length;
2319  
2320              /*
2321               * Each lexical update which appears before the bookmark's endpoints
2322               * might shift the offsets for those endpoints. Loop through each change
2323               * and accumulate the total shift for each bookmark, then apply that
2324               * shift after tallying the full delta.
2325               */
2326              $head_delta = 0;
2327              $tail_delta = 0;
2328  
2329              foreach ( $this->lexical_updates as $diff ) {
2330                  $diff_end = $diff->start + $diff->length;
2331  
2332                  if ( $bookmark->start < $diff->start && $bookmark_end < $diff->start ) {
2333                      break;
2334                  }
2335  
2336                  if ( $bookmark->start >= $diff->start && $bookmark_end < $diff_end ) {
2337                      $this->release_bookmark( $bookmark_name );
2338                      continue 2;
2339                  }
2340  
2341                  $delta = strlen( $diff->text ) - $diff->length;
2342  
2343                  if ( $bookmark->start >= $diff->start ) {
2344                      $head_delta += $delta;
2345                  }
2346  
2347                  if ( $bookmark_end >= $diff_end ) {
2348                      $tail_delta += $delta;
2349                  }
2350              }
2351  
2352              $bookmark->start  += $head_delta;
2353              $bookmark->length += $tail_delta - $head_delta;
2354          }
2355  
2356          $this->lexical_updates = array();
2357  
2358          return $accumulated_shift_for_given_point;
2359      }
2360  
2361      /**
2362       * Checks whether a bookmark with the given name exists.
2363       *
2364       * @since 6.3.0
2365       *
2366       * @param string $bookmark_name Name to identify a bookmark that potentially exists.
2367       * @return bool Whether that bookmark exists.
2368       */
2369  	public function has_bookmark( $bookmark_name ) {
2370          return array_key_exists( $bookmark_name, $this->bookmarks );
2371      }
2372  
2373      /**
2374       * Move the internal cursor in the Tag Processor to a given bookmark's location.
2375       *
2376       * In order to prevent accidental infinite loops, there's a
2377       * maximum limit on the number of times seek() can be called.
2378       *
2379       * @since 6.2.0
2380       *
2381       * @param string $bookmark_name Jump to the place in the document identified by this bookmark name.
2382       * @return bool Whether the internal cursor was successfully moved to the bookmark's location.
2383       */
2384  	public function seek( $bookmark_name ) {
2385          if ( ! array_key_exists( $bookmark_name, $this->bookmarks ) ) {
2386              _doing_it_wrong(
2387                  __METHOD__,
2388                  __( 'Unknown bookmark name.' ),
2389                  '6.2.0'
2390              );
2391              return false;
2392          }
2393  
2394          if ( ++$this->seek_count > static::MAX_SEEK_OPS ) {
2395              _doing_it_wrong(
2396                  __METHOD__,
2397                  __( 'Too many calls to seek() - this can lead to performance issues.' ),
2398                  '6.2.0'
2399              );
2400              return false;
2401          }
2402  
2403          // Flush out any pending updates to the document.
2404          $this->get_updated_html();
2405  
2406          // Point this tag processor before the sought tag opener and consume it.
2407          $this->bytes_already_parsed = $this->bookmarks[ $bookmark_name ]->start;
2408          $this->parser_state         = self::STATE_READY;
2409          return $this->next_token();
2410      }
2411  
2412      /**
2413       * Compare two WP_HTML_Text_Replacement objects.
2414       *
2415       * @since 6.2.0
2416       *
2417       * @param WP_HTML_Text_Replacement $a First attribute update.
2418       * @param WP_HTML_Text_Replacement $b Second attribute update.
2419       * @return int Comparison value for string order.
2420       */
2421  	private static function sort_start_ascending( $a, $b ) {
2422          $by_start = $a->start - $b->start;
2423          if ( 0 !== $by_start ) {
2424              return $by_start;
2425          }
2426  
2427          $by_text = isset( $a->text, $b->text ) ? strcmp( $a->text, $b->text ) : 0;
2428          if ( 0 !== $by_text ) {
2429              return $by_text;
2430          }
2431  
2432          /*
2433           * This code should be unreachable, because it implies the two replacements
2434           * start at the same location and contain the same text.
2435           */
2436          return $a->length - $b->length;
2437      }
2438  
2439      /**
2440       * Return the enqueued value for a given attribute, if one exists.
2441       *
2442       * Enqueued updates can take different data types:
2443       *  - If an update is enqueued and is boolean, the return will be `true`
2444       *  - If an update is otherwise enqueued, the return will be the string value of that update.
2445       *  - If an attribute is enqueued to be removed, the return will be `null` to indicate that.
2446       *  - If no updates are enqueued, the return will be `false` to differentiate from "removed."
2447       *
2448       * @since 6.2.0
2449       *
2450       * @param string $comparable_name The attribute name in its comparable form.
2451       * @return string|bool|null Value of enqueued update if present, otherwise false.
2452       */
2453  	private function get_enqueued_attribute_value( $comparable_name ) {
2454          if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
2455              return false;
2456          }
2457  
2458          if ( ! isset( $this->lexical_updates[ $comparable_name ] ) ) {
2459              return false;
2460          }
2461  
2462          $enqueued_text = $this->lexical_updates[ $comparable_name ]->text;
2463  
2464          // Removed attributes erase the entire span.
2465          if ( '' === $enqueued_text ) {
2466              return null;
2467          }
2468  
2469          /*
2470           * Boolean attribute updates are just the attribute name without a corresponding value.
2471           *
2472           * This value might differ from the given comparable name in that there could be leading
2473           * or trailing whitespace, and that the casing follows the name given in `set_attribute`.
2474           *
2475           * Example:
2476           *
2477           *     $p->set_attribute( 'data-TEST-id', 'update' );
2478           *     'update' === $p->get_enqueued_attribute_value( 'data-test-id' );
2479           *
2480           * Detect this difference based on the absence of the `=`, which _must_ exist in any
2481           * attribute containing a value, e.g. `<input type="text" enabled />`.
2482           *                                            ¹           ²
2483           *                                       1. Attribute with a string value.
2484           *                                       2. Boolean attribute whose value is `true`.
2485           */
2486          $equals_at = strpos( $enqueued_text, '=' );
2487          if ( false === $equals_at ) {
2488              return true;
2489          }
2490  
2491          /*
2492           * Finally, a normal update's value will appear after the `=` and
2493           * be double-quoted, as performed incidentally by `set_attribute`.
2494           *
2495           * e.g. `type="text"`
2496           *           ¹²    ³
2497           *        1. Equals is here.
2498           *        2. Double-quoting starts one after the equals sign.
2499           *        3. Double-quoting ends at the last character in the update.
2500           */
2501          $enqueued_value = substr( $enqueued_text, $equals_at + 2, -1 );
2502          return html_entity_decode( $enqueued_value );
2503      }
2504  
2505      /**
2506       * Returns the value of a requested attribute from a matched tag opener if that attribute exists.
2507       *
2508       * Example:
2509       *
2510       *     $p = new WP_HTML_Tag_Processor( '<div enabled class="test" data-test-id="14">Test</div>' );
2511       *     $p->next_tag( array( 'class_name' => 'test' ) ) === true;
2512       *     $p->get_attribute( 'data-test-id' ) === '14';
2513       *     $p->get_attribute( 'enabled' ) === true;
2514       *     $p->get_attribute( 'aria-label' ) === null;
2515       *
2516       *     $p->next_tag() === false;
2517       *     $p->get_attribute( 'class' ) === null;
2518       *
2519       * @since 6.2.0
2520       *
2521       * @param string $name Name of attribute whose value is requested.
2522       * @return string|true|null Value of attribute or `null` if not available. Boolean attributes return `true`.
2523       */
2524  	public function get_attribute( $name ) {
2525          if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
2526              return null;
2527          }
2528  
2529          $comparable = strtolower( $name );
2530  
2531          /*
2532           * For every attribute other than `class` it's possible to perform a quick check if
2533           * there's an enqueued lexical update whose value takes priority over what's found in
2534           * the input document.
2535           *
2536           * The `class` attribute is special though because of the exposed helpers `add_class`
2537           * and `remove_class`. These form a builder for the `class` attribute, so an additional
2538           * check for enqueued class changes is required in addition to the check for any enqueued
2539           * attribute values. If any exist, those enqueued class changes must first be flushed out
2540           * into an attribute value update.
2541           */
2542          if ( 'class' === $comparable ) {
2543              $this->class_name_updates_to_attributes_updates();
2544          }
2545  
2546          // Return any enqueued attribute value updates if they exist.
2547          $enqueued_value = $this->get_enqueued_attribute_value( $comparable );
2548          if ( false !== $enqueued_value ) {
2549              return $enqueued_value;
2550          }
2551  
2552          if ( ! isset( $this->attributes[ $comparable ] ) ) {
2553              return null;
2554          }
2555  
2556          $attribute = $this->attributes[ $comparable ];
2557  
2558          /*
2559           * This flag distinguishes an attribute with no value
2560           * from an attribute with an empty string value. For
2561           * unquoted attributes this could look very similar.
2562           * It refers to whether an `=` follows the name.
2563           *
2564           * e.g. <div boolean-attribute empty-attribute=></div>
2565           *           ¹                 ²
2566           *        1. Attribute `boolean-attribute` is `true`.
2567           *        2. Attribute `empty-attribute` is `""`.
2568           */
2569          if ( true === $attribute->is_true ) {
2570              return true;
2571          }
2572  
2573          $raw_value = substr( $this->html, $attribute->value_starts_at, $attribute->value_length );
2574  
2575          return html_entity_decode( $raw_value );
2576      }
2577  
2578      /**
2579       * Gets lowercase names of all attributes matching a given prefix in the current tag.
2580       *
2581       * Note that matching is case-insensitive. This is in accordance with the spec:
2582       *
2583       * > There must never be two or more attributes on
2584       * > the same start tag whose names are an ASCII
2585       * > case-insensitive match for each other.
2586       *     - HTML 5 spec
2587       *
2588       * Example:
2589       *
2590       *     $p = new WP_HTML_Tag_Processor( '<div data-ENABLED class="test" DATA-test-id="14">Test</div>' );
2591       *     $p->next_tag( array( 'class_name' => 'test' ) ) === true;
2592       *     $p->get_attribute_names_with_prefix( 'data-' ) === array( 'data-enabled', 'data-test-id' );
2593       *
2594       *     $p->next_tag() === false;
2595       *     $p->get_attribute_names_with_prefix( 'data-' ) === null;
2596       *
2597       * @since 6.2.0
2598       *
2599       * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
2600       *
2601       * @param string $prefix Prefix of requested attribute names.
2602       * @return array|null List of attribute names, or `null` when no tag opener is matched.
2603       */
2604  	public function get_attribute_names_with_prefix( $prefix ) {
2605          if (
2606              self::STATE_MATCHED_TAG !== $this->parser_state ||
2607              $this->is_closing_tag
2608          ) {
2609              return null;
2610          }
2611  
2612          $comparable = strtolower( $prefix );
2613  
2614          $matches = array();
2615          foreach ( array_keys( $this->attributes ) as $attr_name ) {
2616              if ( str_starts_with( $attr_name, $comparable ) ) {
2617                  $matches[] = $attr_name;
2618              }
2619          }
2620          return $matches;
2621      }
2622  
2623      /**
2624       * Returns the uppercase name of the matched tag.
2625       *
2626       * Example:
2627       *
2628       *     $p = new WP_HTML_Tag_Processor( '<div class="test">Test</div>' );
2629       *     $p->next_tag() === true;
2630       *     $p->get_tag() === 'DIV';
2631       *
2632       *     $p->next_tag() === false;
2633       *     $p->get_tag() === null;
2634       *
2635       * @since 6.2.0
2636       *
2637       * @return string|null Name of currently matched tag in input HTML, or `null` if none found.
2638       */
2639  	public function get_tag() {
2640          if ( null === $this->tag_name_starts_at ) {
2641              return null;
2642          }
2643  
2644          $tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length );
2645  
2646          if ( self::STATE_MATCHED_TAG === $this->parser_state ) {
2647              return strtoupper( $tag_name );
2648          }
2649  
2650          if (
2651              self::STATE_COMMENT === $this->parser_state &&
2652              self::COMMENT_AS_PI_NODE_LOOKALIKE === $this->get_comment_type()
2653          ) {
2654              return $tag_name;
2655          }
2656  
2657          return null;
2658      }
2659  
2660      /**
2661       * Indicates if the currently matched tag contains the self-closing flag.
2662       *
2663       * No HTML elements ought to have the self-closing flag and for those, the self-closing
2664       * flag will be ignored. For void elements this is benign because they "self close"
2665       * automatically. For non-void HTML elements though problems will appear if someone
2666       * intends to use a self-closing element in place of that element with an empty body.
2667       * For HTML foreign elements and custom elements the self-closing flag determines if
2668       * they self-close or not.
2669       *
2670       * This function does not determine if a tag is self-closing,
2671       * but only if the self-closing flag is present in the syntax.
2672       *
2673       * @since 6.3.0
2674       *
2675       * @return bool Whether the currently matched tag contains the self-closing flag.
2676       */
2677  	public function has_self_closing_flag() {
2678          if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
2679              return false;
2680          }
2681  
2682          /*
2683           * The self-closing flag is the solidus at the _end_ of the tag, not the beginning.
2684           *
2685           * Example:
2686           *
2687           *     <figure />
2688           *             ^ this appears one character before the end of the closing ">".
2689           */
2690          return '/' === $this->html[ $this->token_starts_at + $this->token_length - 1 ];
2691      }
2692  
2693      /**
2694       * Indicates if the current tag token is a tag closer.
2695       *
2696       * Example:
2697       *
2698       *     $p = new WP_HTML_Tag_Processor( '<div></div>' );
2699       *     $p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) );
2700       *     $p->is_tag_closer() === false;
2701       *
2702       *     $p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) );
2703       *     $p->is_tag_closer() === true;
2704       *
2705       * @since 6.2.0
2706       *
2707       * @return bool Whether the current tag is a tag closer.
2708       */
2709  	public function is_tag_closer() {
2710          return (
2711              self::STATE_MATCHED_TAG === $this->parser_state &&
2712              $this->is_closing_tag
2713          );
2714      }
2715  
2716      /**
2717       * Indicates the kind of matched token, if any.
2718       *
2719       * This differs from `get_token_name()` in that it always
2720       * returns a static string indicating the type, whereas
2721       * `get_token_name()` may return values derived from the
2722       * token itself, such as a tag name or processing
2723       * instruction tag.
2724       *
2725       * Possible values:
2726       *  - `#tag` when matched on a tag.
2727       *  - `#text` when matched on a text node.
2728       *  - `#cdata-section` when matched on a CDATA node.
2729       *  - `#comment` when matched on a comment.
2730       *  - `#doctype` when matched on a DOCTYPE declaration.
2731       *  - `#presumptuous-tag` when matched on an empty tag closer.
2732       *  - `#funky-comment` when matched on a funky comment.
2733       *
2734       * @since 6.5.0
2735       *
2736       * @return string|null What kind of token is matched, or null.
2737       */
2738  	public function get_token_type() {
2739          switch ( $this->parser_state ) {
2740              case self::STATE_MATCHED_TAG:
2741                  return '#tag';
2742  
2743              case self::STATE_DOCTYPE:
2744                  return '#doctype';
2745  
2746              default:
2747                  return $this->get_token_name();
2748          }
2749      }
2750  
2751      /**
2752       * Returns the node name represented by the token.
2753       *
2754       * This matches the DOM API value `nodeName`. Some values
2755       * are static, such as `#text` for a text node, while others
2756       * are dynamically generated from the token itself.
2757       *
2758       * Dynamic names:
2759       *  - Uppercase tag name for tag matches.
2760       *  - `html` for DOCTYPE declarations.
2761       *
2762       * Note that if the Tag Processor is not matched on a token
2763       * then this function will return `null`, either because it
2764       * hasn't yet found a token or because it reached the end
2765       * of the document without matching a token.
2766       *
2767       * @since 6.5.0
2768       *
2769       * @return string|null Name of the matched token.
2770       */
2771  	public function get_token_name() {
2772          switch ( $this->parser_state ) {
2773              case self::STATE_MATCHED_TAG:
2774                  return $this->get_tag();
2775  
2776              case self::STATE_TEXT_NODE:
2777                  return '#text';
2778  
2779              case self::STATE_CDATA_NODE:
2780                  return '#cdata-section';
2781  
2782              case self::STATE_COMMENT:
2783                  return '#comment';
2784  
2785              case self::STATE_DOCTYPE:
2786                  return 'html';
2787  
2788              case self::STATE_PRESUMPTUOUS_TAG:
2789                  return '#presumptuous-tag';
2790  
2791              case self::STATE_FUNKY_COMMENT:
2792                  return '#funky-comment';
2793          }
2794      }
2795  
2796      /**
2797       * Indicates what kind of comment produced the comment node.
2798       *
2799       * Because there are different kinds of HTML syntax which produce
2800       * comments, the Tag Processor tracks and exposes this as a type
2801       * for the comment. Nominally only regular HTML comments exist as
2802       * they are commonly known, but a number of unrelated syntax errors
2803       * also produce comments.
2804       *
2805       * @see self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT
2806       * @see self::COMMENT_AS_CDATA_LOOKALIKE
2807       * @see self::COMMENT_AS_INVALID_HTML
2808       * @see self::COMMENT_AS_HTML_COMMENT
2809       * @see self::COMMENT_AS_PI_NODE_LOOKALIKE
2810       *
2811       * @since 6.5.0
2812       *
2813       * @return string|null
2814       */
2815  	public function get_comment_type() {
2816          if ( self::STATE_COMMENT !== $this->parser_state ) {
2817              return null;
2818          }
2819  
2820          return $this->comment_type;
2821      }
2822  
2823      /**
2824       * Returns the modifiable text for a matched token, or an empty string.
2825       *
2826       * Modifiable text is text content that may be read and changed without
2827       * changing the HTML structure of the document around it. This includes
2828       * the contents of `#text` nodes in the HTML as well as the inner
2829       * contents of HTML comments, Processing Instructions, and others, even
2830       * though these nodes aren't part of a parsed DOM tree. It also includes
2831       * the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any
2832       * other section in an HTML document which cannot contain HTML markup (DATA).
2833       *
2834       * If a token has no modifiable text then an empty string is returned to
2835       * avoid needless crashing or type errors. An empty string does not mean
2836       * that a token has modifiable text, and a token with modifiable text may
2837       * have an empty string (e.g. a comment with no contents).
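           *
           * Example (a sketch only; the input markup is illustrative):
           *
           *     $processor = new WP_HTML_Tag_Processor( '<p>a &amp; b</p><!-- c &amp; d -->' );
           *     while ( $processor->next_token() ) {
           *         // Character references are decoded in `#text` nodes but are
           *         // returned exactly as written for comments.
           *         $text = $processor->get_modifiable_text();
           *     }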
2838       *
2839       * @since 6.5.0
2840       *
2841       * @return string
2842       */
2843  	public function get_modifiable_text() {
2844          if ( null === $this->text_starts_at ) {
2845              return '';
2846          }
2847  
2848          $text = substr( $this->html, $this->text_starts_at, $this->text_length );
2849  
2850          // Comment data is not decoded.
2851          if (
2852              self::STATE_CDATA_NODE === $this->parser_state ||
2853              self::STATE_COMMENT === $this->parser_state ||
2854              self::STATE_DOCTYPE === $this->parser_state ||
2855              self::STATE_FUNKY_COMMENT === $this->parser_state
2856          ) {
2857              return $text;
2858          }
2859  
2860          $tag_name = $this->get_tag();
2861          if (
2862              // Script data is not decoded.
2863              'SCRIPT' === $tag_name ||
2864  
2865              // RAWTEXT data is not decoded.
2866              'IFRAME' === $tag_name ||
2867              'NOEMBED' === $tag_name ||
2868              'NOFRAMES' === $tag_name ||
2869              'STYLE' === $tag_name ||
2870              'XMP' === $tag_name
2871          ) {
2872              return $text;
2873          }
2874  
2875          $decoded = html_entity_decode( $text, ENT_QUOTES | ENT_HTML5 | ENT_SUBSTITUTE );
2876  
2877          /*
2878           * TEXTAREA skips a leading newline, but this newline may appear not only as the
2879           * literal character `\n`, but also as a character reference, such as in the
2880           * following markup: `<textarea>&#x0a;Content</textarea>`.
2881           *
2882           * For these cases it's important to first decode the text content before checking
2883           * for a leading newline and removing it.
2884           */
2885          if (
2886              self::STATE_MATCHED_TAG === $this->parser_state &&
2887              'TEXTAREA' === $tag_name &&
2888              strlen( $decoded ) > 0 &&
2889              "\n" === $decoded[0]
2890          ) {
2891              return substr( $decoded, 1 );
2892          }
2893  
2894          return $decoded;
2895      }
2896  
2897      /**
2898       * Updates or creates a new attribute on the currently matched tag with the passed value.
2899       *
2900       * For boolean attributes special handling is provided:
2901       *  - When `true` is passed as the value, then only the attribute name is added to the tag.
2902       *  - When `false` is passed, the attribute gets removed if it existed before.
2903       *
2904       * For string attributes, the value is escaped using the `esc_attr` function.
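           *
           * Example (a sketch only; the input markup is illustrative):
           *
           *     $processor = new WP_HTML_Tag_Processor( '<input type="checkbox" disabled>' );
           *     if ( $processor->next_tag( 'input' ) ) {
           *         $processor->set_attribute( 'checked', true );   // Adds the bare name `checked`.
           *         $processor->set_attribute( 'disabled', false ); // Removes `disabled`.
           *         $processor->set_attribute( 'type', 'radio' );   // Replaces `type="checkbox"`.
           *     }
           *     $html = $processor->get_updated_html();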
2905       *
2906       * @since 6.2.0
2907       * @since 6.2.1 Fix: Only create a single update for multiple calls with case-variant attribute names.
2908       *
2909       * @param string      $name  The attribute name to target.
2910       * @param string|bool $value The new attribute value.
2911       * @return bool Whether an attribute value was set.
2912       */
2913  	public function set_attribute( $name, $value ) {
2914          if (
2915              self::STATE_MATCHED_TAG !== $this->parser_state ||
2916              $this->is_closing_tag
2917          ) {
2918              return false;
2919          }
2920  
2921          /*
2922           * WordPress rejects more characters than are strictly forbidden
2923           * in HTML5. This is to prevent additional security risks deeper
2924           * in the WordPress and plugin stack. Specifically the
2925           * less-than (<), greater-than (>), and ampersand (&) aren't allowed.
2926           *
2927           * The use of a PCRE match enables looking for specific Unicode
2928           * code points without writing a UTF-8 decoder. Whereas scanning
2929           * for one-byte characters is trivial (with `strcspn`), scanning
2930           * for the longer byte sequences would be more complicated. Given
2931           * that this shouldn't be in the hot path for execution, it's a
2932           * reasonable compromise in efficiency without introducing a
2933           * noticeable impact on the overall system.
2934           *
2935           * @see https://html.spec.whatwg.org/#attributes-2
2936           *
2937           * @todo As the only regex pattern maybe we should take it out?
2938           *       Are Unicode patterns available broadly in Core?
2939           */
2940          if ( preg_match(
2941              '~[' .
2942                  // Syntax-like characters.
2943                  '"\'>&</ =' .
2944                  // Control characters.
2945                  '\x{00}-\x{1F}' .
2946                  // HTML noncharacters.
2947                  '\x{FDD0}-\x{FDEF}' .
2948                  '\x{FFFE}\x{FFFF}\x{1FFFE}\x{1FFFF}\x{2FFFE}\x{2FFFF}\x{3FFFE}\x{3FFFF}' .
2949                  '\x{4FFFE}\x{4FFFF}\x{5FFFE}\x{5FFFF}\x{6FFFE}\x{6FFFF}\x{7FFFE}\x{7FFFF}' .
2950                  '\x{8FFFE}\x{8FFFF}\x{9FFFE}\x{9FFFF}\x{AFFFE}\x{AFFFF}\x{BFFFE}\x{BFFFF}' .
2951                  '\x{CFFFE}\x{CFFFF}\x{DFFFE}\x{DFFFF}\x{EFFFE}\x{EFFFF}\x{FFFFE}\x{FFFFF}' .
2952                  '\x{10FFFE}\x{10FFFF}' .
2953              ']~Ssu',
2954              $name
2955          ) ) {
2956              _doing_it_wrong(
2957                  __METHOD__,
2958                  __( 'Invalid attribute name.' ),
2959                  '6.2.0'
2960              );
2961  
2962              return false;
2963          }
2964  
2965          /*
2966           * > The values "true" and "false" are not allowed on boolean attributes.
2967           * > To represent a false value, the attribute has to be omitted altogether.
2968           *     - HTML5 spec, https://html.spec.whatwg.org/#boolean-attributes
2969           */
2970          if ( false === $value ) {
2971              return $this->remove_attribute( $name );
2972          }
2973  
2974          if ( true === $value ) {
2975              $updated_attribute = $name;
2976          } else {
2977              $escaped_new_value = esc_attr( $value );
2978              $updated_attribute = "{$name}=\"{$escaped_new_value}\"";
2979          }
2980  
2981          /*
2982           * > There must never be two or more attributes on
2983           * > the same start tag whose names are an ASCII
2984           * > case-insensitive match for each other.
2985           *     - HTML 5 spec
2986           *
2987           * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
2988           */
2989          $comparable_name = strtolower( $name );
2990  
2991          if ( isset( $this->attributes[ $comparable_name ] ) ) {
2992              /*
2993               * Update an existing attribute.
2994               *
2995               * Example – set attribute id to "new" in <div id="initial_id" />:
2996               *
2997               *     <div id="initial_id"/>
2998               *          ^-------------^
2999               *          start         end
3000               *     replacement: `id="new"`
3001               *
3002               *     Result: <div id="new"/>
3003               */
3004              $existing_attribute                        = $this->attributes[ $comparable_name ];
3005              $this->lexical_updates[ $comparable_name ] = new WP_HTML_Text_Replacement(
3006                  $existing_attribute->start,
3007                  $existing_attribute->length,
3008                  $updated_attribute
3009              );
3010          } else {
3011              /*
3012               * Create a new attribute at the tag's name end.
3013               *
3014               * Example – add attribute id="new" to <div />:
3015               *
3016               *     <div/>
3017               *         ^
3018               *         start and end
3019               *     replacement: ` id="new"`
3020               *
3021               *     Result: <div id="new"/>
3022               */
3023              $this->lexical_updates[ $comparable_name ] = new WP_HTML_Text_Replacement(
3024                  $this->tag_name_starts_at + $this->tag_name_length,
3025                  0,
3026                  ' ' . $updated_attribute
3027              );
3028          }
3029  
3030          /*
3031           * Any calls to update the `class` attribute directly should wipe out any
3032           * enqueued class changes from `add_class` and `remove_class`.
3033           */
3034          if ( 'class' === $comparable_name && ! empty( $this->classname_updates ) ) {
3035              $this->classname_updates = array();
3036          }
3037  
3038          return true;
3039      }
3040  
3041      /**
3042       * Removes an attribute from the currently matched tag.
3043       *
3044       * @since 6.2.0
3045       *
3046       * @param string $name The attribute name to remove.
3047       * @return bool Whether an attribute was removed.
3048       */
3049  	public function remove_attribute( $name ) {
3050          if (
3051              self::STATE_MATCHED_TAG !== $this->parser_state ||
3052              $this->is_closing_tag
3053          ) {
3054              return false;
3055          }
3056  
3057          /*
3058           * > There must never be two or more attributes on
3059           * > the same start tag whose names are an ASCII
3060           * > case-insensitive match for each other.
3061           *     - HTML 5 spec
3062           *
3063           * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
3064           */
3065          $name = strtolower( $name );
3066  
3067          /*
3068           * Any calls to update the `class` attribute directly should wipe out any
3069           * enqueued class changes from `add_class` and `remove_class`.
3070           */
3071          if ( 'class' === $name && count( $this->classname_updates ) !== 0 ) {
3072              $this->classname_updates = array();
3073          }
3074  
3075          /*
3076           * If removing an attribute that didn't exist in the input
3077           * document, then remove the enqueued update and move on.
3078           *
3079           * For example, this might occur when calling `remove_attribute()`
3080           * after calling `set_attribute()` for the same attribute
3081           * and when that attribute wasn't originally present.
3082           */
3083          if ( ! isset( $this->attributes[ $name ] ) ) {
3084              if ( isset( $this->lexical_updates[ $name ] ) ) {
3085                  unset( $this->lexical_updates[ $name ] );
3086              }
3087              return false;
3088          }
3089  
3090          /*
3091           * Removes an existing tag attribute.
3092           *
3093           * Example – remove the attribute id from <div id="initial_id"/>:
3094           *    <div id="initial_id"/>
3095           *         ^-------------^
3096           *         start         end
3097           *    replacement: ``
3098           *
3099           *    Result: <div />
3100           */
3101          $this->lexical_updates[ $name ] = new WP_HTML_Text_Replacement(
3102              $this->attributes[ $name ]->start,
3103              $this->attributes[ $name ]->length,
3104              ''
3105          );
3106  
3107          // Removes any duplicated attributes if they were also present.
3108          if ( null !== $this->duplicate_attributes && array_key_exists( $name, $this->duplicate_attributes ) ) {
3109              foreach ( $this->duplicate_attributes[ $name ] as $attribute_token ) {
3110                  $this->lexical_updates[] = new WP_HTML_Text_Replacement(
3111                      $attribute_token->start,
3112                      $attribute_token->length,
3113                      ''
3114                  );
3115              }
3116          }
3117  
3118          return true;
3119      }
3120  
3121      /**
3122       * Adds a new class name to the currently matched tag.
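           *
           * Example (a sketch only; the markup and class names are illustrative):
           *
           *     $processor = new WP_HTML_Tag_Processor( '<div class="old">' );
           *     if ( $processor->next_tag() ) {
           *         // Class changes are enqueued here and written into the
           *         // `class` attribute when the updated HTML is requested.
           *         $processor->add_class( 'new' );
           *         $processor->remove_class( 'old' );
           *     }
           *     $html = $processor->get_updated_html();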
3123       *
3124       * @since 6.2.0
3125       *
3126       * @param string $class_name The class name to add.
3127       * @return bool Whether the class was set to be added.
3128       */
3129  	public function add_class( $class_name ) {
3130          if (
3131              self::STATE_MATCHED_TAG !== $this->parser_state ||
3132              $this->is_closing_tag
3133          ) {
3134              return false;
3135          }
3136  
3137          $this->classname_updates[ $class_name ] = self::ADD_CLASS;
3138  
3139          return true;
3140      }
3141  
3142      /**
3143       * Removes a class name from the currently matched tag.
3144       *
3145       * @since 6.2.0
3146       *
3147       * @param string $class_name The class name to remove.
3148       * @return bool Whether the class was set to be removed.
3149       */
3150  	public function remove_class( $class_name ) {
3151          if (
3152              self::STATE_MATCHED_TAG !== $this->parser_state ||
3153              $this->is_closing_tag
3154          ) {
3155              return false;
3156          }
3157  
3158          if ( null !== $this->tag_name_starts_at ) {
3159              $this->classname_updates[ $class_name ] = self::REMOVE_CLASS;
3160          }
3161  
3162          return true;
3163      }
3164  
3165      /**
3166       * Returns the string representation of the HTML Tag Processor.
3167       *
3168       * @since 6.2.0
3169       *
3170       * @see WP_HTML_Tag_Processor::get_updated_html()
3171       *
3172       * @return string The processed HTML.
3173       */
3174  	public function __toString() {
3175          return $this->get_updated_html();
3176      }
3177  
3178      /**
3179       * Applies all enqueued updates and returns the updated HTML document.
3180       *
3181       * @since 6.2.0
3182       * @since 6.2.1 Shifts the internal cursor corresponding to the applied updates.
3183       * @since 6.4.0 No longer calls subclass method `next_tag()` after updating HTML.
3184       *
3185       * @return string The processed HTML.
3186       */
3187  	public function get_updated_html() {
3188          $requires_no_updating = 0 === count( $this->classname_updates ) && 0 === count( $this->lexical_updates );
3189  
3190          /*
3191           * When there is nothing more to update and nothing has already been
3192           * updated, return the original document and avoid a string copy.
3193           */
3194          if ( $requires_no_updating ) {
3195              return $this->html;
3196          }
3197  
3198          /*
3199           * Keep track of the position right before the current tag. This will
3200           * be necessary for reparsing the current tag after updating the HTML.
3201           */
3202          $before_current_tag = $this->token_starts_at;
3203  
3204          /*
3205           * 1. Apply the enqueued edits and update all the pointers to reflect those changes.
3206           */
3207          $this->class_name_updates_to_attributes_updates();
3208          $before_current_tag += $this->apply_attributes_updates( $before_current_tag );
3209  
3210          /*
3211           * 2. Rewind to before the current tag and reparse to get updated attributes.
3212           *
3213           * At this point the internal cursor points to the end of the tag name.
3214           * Rewind before the tag name starts so that it's as if the cursor didn't
3215           * move; a call to `next_tag()` will reparse the recently-updated attributes
3216           * and additional calls to modify the attributes will apply at this same
3217           * location, but in order to avoid issues with subclasses that might add
3218           * behaviors to `next_tag()`, the internal methods should be called here
3219           * instead.
3220           *
3221           * It's important to note that in this specific place there will be no change
3222           * because the processor was already at a tag when this was called and it's
3223           * rewinding only to the beginning of this very tag before reprocessing it
3224           * and its attributes.
3225           *
3226           * <p>Previous HTML<em>More HTML</em></p>
3227           *                 ↑  │ back up by the length of the tag name plus the opening <
3228           *                 └←─┘ back up by strlen("em") + 1 ==> 3
3229           */
3230          $this->bytes_already_parsed = $before_current_tag;
3231          $this->base_class_next_token();
3232  
3233          return $this->html;
3234      }
3235  
3236      /**
3237       * Parses tag query input into internal search criteria.
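           *
           * Example of an array query as it might be passed through `next_tag()`
           * (a sketch only; `$processor` is assumed to be a `WP_HTML_Tag_Processor`
           * instance, and the tag and class names are illustrative):
           *
           *     $processor->next_tag(
           *         array(
           *             'tag_name'     => 'div',
           *             'class_name'   => 'wp-block',
           *             'match_offset' => 2,      // Find the second matching tag.
           *             'tag_closers'  => 'skip', // Do not stop on `</div>`.
           *         )
           *     );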
3238       *
3239       * @since 6.2.0
3240       *
3241       * @param array|string|null $query {
3242       *     Optional. Which tag name to find, having which class, etc. Default is to find any tag.
3243       *
3244       *     @type string|null $tag_name     Which tag to find, or `null` for "any tag."
3245       *     @type int|null    $match_offset Find the Nth tag matching all search criteria.
3246       *                                     1 for "first" tag, 3 for "third," etc.
3247       *                                     Defaults to first tag.
3248       *     @type string|null $class_name   Tag must contain this class name to match.
3249       *     @type string      $tag_closers  "visit" or "skip": whether to stop on tag closers, e.g. </div>.
3250       * }
3251       */
3252  	private function parse_query( $query ) {
3253          if ( null !== $query && $query === $this->last_query ) {
3254              return;
3255          }
3256  
3257          $this->last_query          = $query;
3258          $this->sought_tag_name     = null;
3259          $this->sought_class_name   = null;
3260          $this->sought_match_offset = 1;
3261          $this->stop_on_tag_closers = false;
3262  
3263          // A single string value means "find the tag of this name".
3264          if ( is_string( $query ) ) {
3265              $this->sought_tag_name = $query;
3266              return;
3267          }
3268  
3269          // An empty query parameter applies no restrictions on the search.
3270          if ( null === $query ) {
3271              return;
3272          }
3273  
3274          // If not using the string interface, an associative array is required.
3275          if ( ! is_array( $query ) ) {
3276              _doing_it_wrong(
3277                  __METHOD__,
3278                  __( 'The query argument must be an array or a tag name.' ),
3279                  '6.2.0'
3280              );
3281              return;
3282          }
3283  
3284          if ( isset( $query['tag_name'] ) && is_string( $query['tag_name'] ) ) {
3285              $this->sought_tag_name = $query['tag_name'];
3286          }
3287  
3288          if ( isset( $query['class_name'] ) && is_string( $query['class_name'] ) ) {
3289              $this->sought_class_name = $query['class_name'];
3290          }
3291  
3292          if ( isset( $query['match_offset'] ) && is_int( $query['match_offset'] ) && 0 < $query['match_offset'] ) {
3293              $this->sought_match_offset = $query['match_offset'];
3294          }
3295  
3296          if ( isset( $query['tag_closers'] ) ) {
3297              $this->stop_on_tag_closers = 'visit' === $query['tag_closers'];
3298          }
3299      }
3300  
3301  
3302      /**
3303       * Checks whether a given tag and its attributes match the search criteria.
3304       *
3305       * @since 6.2.0
3306       *
3307       * @return bool Whether the given tag and its attributes match the search criteria.
3308       */
3309  	private function matches() {
3310          if ( $this->is_closing_tag && ! $this->stop_on_tag_closers ) {
3311              return false;
3312          }
3313  
3314          // Does the tag name match the requested tag name in a case-insensitive manner?
3315          if ( null !== $this->sought_tag_name ) {
3316              /*
3317               * String (byte) length lookup is fast. If they aren't the
3318               * same length then they can't be the same string values.
3319               */
3320              if ( strlen( $this->sought_tag_name ) !== $this->tag_name_length ) {
3321                  return false;
3322              }
3323  
3324              /*
3325               * Check each character to determine if they are the same.
3326               * Defer calls to `strtoupper()` to avoid them when possible.
3327               * Calling `strcasecmp()` here tested slower than comparing each
3328               * character, so unless benchmarks show otherwise, it should
3329               * not be used.
3330               *
3331               * It's expected that most of the time that this runs, a
3332               * lower-case tag name will be supplied and the input will
3333               * contain lower-case tag names, thus normally bypassing
3334               * the case comparison code.
3335               */
3336              for ( $i = 0; $i < $this->tag_name_length; $i++ ) {
3337                  $html_char = $this->html[ $this->tag_name_starts_at + $i ];
3338                  $tag_char  = $this->sought_tag_name[ $i ];
3339  
3340                  if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) {
3341                      return false;
3342                  }
3343              }
3344          }
3345  
3346          if ( null !== $this->sought_class_name && ! $this->has_class( $this->sought_class_name ) ) {
3347              return false;
3348          }
3349  
3350          return true;
3351      }
3352  
3353      /**
3354       * Parser Ready State.
3355       *
3356       * Indicates that the parser is ready to run and waiting for a state transition.
3357       * It may not have started yet, or it may have just finished parsing a token and
3358       * is ready to find the next one.
3359       *
3360       * @since 6.5.0
3361       *
3362       * @access private
3363       */
3364      const STATE_READY = 'STATE_READY';
3365  
3366      /**
3367       * Parser Complete State.
3368       *
3369       * Indicates that the parser has reached the end of the document and there is
3370       * nothing left to scan. It finished parsing the last token completely.
3371       *
3372       * @since 6.5.0
3373       *
3374       * @access private
3375       */
3376      const STATE_COMPLETE = 'STATE_COMPLETE';
3377  
3378      /**
3379       * Parser Incomplete Input State.
3380       *
3381       * Indicates that the parser has reached the end of the document before finishing
3382       * a token. It started parsing a token but there is a possibility that the input
3383       * HTML document was truncated in the middle of a token.
3384       *
3385       * The parser is reset at the start of the incomplete token and has paused. There
3386       * is nothing more that can be scanned unless provided a more complete document.
3387       *
3388       * @since 6.5.0
3389       *
3390       * @access private
3391       */
3392      const STATE_INCOMPLETE_INPUT = 'STATE_INCOMPLETE_INPUT';
3393  
3394      /**
3395       * Parser Matched Tag State.
3396       *
3397       * Indicates that the parser has found an HTML tag and it's possible to get
3398       * the tag name and read or modify its attributes (if it's not a closing tag).
3399       *
3400       * @since 6.5.0
3401       *
3402       * @access private
3403       */
3404      const STATE_MATCHED_TAG = 'STATE_MATCHED_TAG';
3405  
3406      /**
3407       * Parser Text Node State.
3408       *
3409       * Indicates that the parser has found a text node and it's possible
3410       * to read and modify that text.
3411       *
3412       * @since 6.5.0
3413       *
3414       * @access private
3415       */
3416      const STATE_TEXT_NODE = 'STATE_TEXT_NODE';
3417  
3418      /**
3419       * Parser CDATA Node State.
3420       *
3421       * Indicates that the parser has found a CDATA node and it's possible
3422       * to read and modify its modifiable text. Note that in HTML there are
3423       * no CDATA nodes outside of foreign content (SVG and MathML). Outside
3424       * of foreign content, they are treated as HTML comments.
3425       *
3426       * @since 6.5.0
3427       *
3428       * @access private
3429       */
3430      const STATE_CDATA_NODE = 'STATE_CDATA_NODE';
3431  
3432      /**
3433       * Indicates that the parser has found an HTML comment and it's
3434       * possible to read and modify its modifiable text.
3435       *
3436       * @since 6.5.0
3437       *
3438       * @access private
3439       */
3440      const STATE_COMMENT = 'STATE_COMMENT';
3441  
3442      /**
3443       * Indicates that the parser has found a DOCTYPE node and it's
3444       * possible to read and modify its modifiable text.
3445       *
3446       * @since 6.5.0
3447       *
3448       * @access private
3449       */
3450      const STATE_DOCTYPE = 'STATE_DOCTYPE';
3451  
3452      /**
3453       * Indicates that the parser has found an empty tag closer `</>`.
3454       *
3455       * Note that in HTML there are no empty tag closers, and they
3456       * are ignored. Nonetheless, the Tag Processor still
3457       * recognizes them as they appear in the HTML stream.
3458       *
3459       * These were historically discussed as a "presumptuous tag
3460       * closer," which would close the nearest open tag, but were
3461       * dismissed in favor of explicitly-closing tags.
3462       *
3463       * @since 6.5.0
3464       *
3465       * @access private
3466       */
3467      const STATE_PRESUMPTUOUS_TAG = 'STATE_PRESUMPTUOUS_TAG';
3468  
3469      /**
3470       * Indicates that the parser has found a "funky comment"
3471       * and it's possible to read and modify its modifiable text.
3472       *
3473       * Example:
3474       *
3475       *     </%url>
3476       *     </{"wp-bit":"query/post-author"}>
3477       *     </2>
3478       *
3479       * Funky comments are tag closers with invalid tag names. Note
3480       * that in HTML these are turned into bogus comments. Nonetheless,
3481       * the Tag Processor recognizes them in a stream of HTML and
3482       * exposes them for inspection and modification.
3483       *
3484       * @since 6.5.0
3485       *
3486       * @access private
3487       */
3488      const STATE_FUNKY_COMMENT = 'STATE_WP_FUNKY';
3489  
3490      /**
3491       * Indicates that a comment was created when encountering abruptly-closed HTML comment.
3492       *
3493       * Example:
3494       *
3495       *     <!-->
3496       *     <!--->
3497       *
3498       * @since 6.5.0
3499       */
3500      const COMMENT_AS_ABRUPTLY_CLOSED_COMMENT = 'COMMENT_AS_ABRUPTLY_CLOSED_COMMENT';
3501  
3502      /**
3503       * Indicates that a comment would be parsed as a CDATA node,
3504       * were HTML to allow CDATA nodes outside of foreign content.
3505       *
3506       * Example:
3507       *
3508       *     <![CDATA[This is a CDATA node.]]>
3509       *
3510       * This is an HTML comment, but it looks like a CDATA node.
3511       *
3512       * @since 6.5.0
3513       */
3514      const COMMENT_AS_CDATA_LOOKALIKE = 'COMMENT_AS_CDATA_LOOKALIKE';
3515  
3516      /**
3517       * Indicates that a comment was created when encountering
3518       * normative HTML comment syntax.
3519       *
3520       * Example:
3521       *
3522       *     <!-- this is a comment -->
3523       *
3524       * @since 6.5.0
3525       */
3526      const COMMENT_AS_HTML_COMMENT = 'COMMENT_AS_HTML_COMMENT';
3527  
3528      /**
3529       * Indicates that a comment would be parsed as a Processing
3530       * Instruction node, were they to exist within HTML.
3531       *
3532       * Example:
3533       *
3534       *     <?wp __( 'Like' ) ?>
3535       *
3536       * This is an HTML comment, but it looks like a Processing Instruction node.
3537       *
3538       * @since 6.5.0
3539       */
3540      const COMMENT_AS_PI_NODE_LOOKALIKE = 'COMMENT_AS_PI_NODE_LOOKALIKE';
3541  
3542      /**
3543       * Indicates that a comment was created when encountering invalid
3544       * HTML input, a so-called "bogus comment."
3545       *
3546       * Example:
3547       *
3548       *     <?nothing special>
3549       *     <!{nothing special}>
3550       *
3551       * @since 6.5.0
3552       */
3553      const COMMENT_AS_INVALID_HTML = 'COMMENT_AS_INVALID_HTML';
3554  }

