PHP Cross Reference of WordPress Trunk (Updated Daily)
<?php
/**
 * HTML API: WP_HTML_Tag_Processor class
 *
 * Scans through an HTML document to find specific tags, then
 * transforms those tags by adding, removing, or updating the
 * values of the HTML attributes within that tag (opener).
 *
 * Does not fully parse HTML or _recurse_ into the HTML structure.
 * Instead this scans linearly through a document and only parses
 * the HTML tag openers.
 *
 * ### Possible future direction for this module
 *
 *  - Prune the whitespace when removing classes/attributes: e.g. "a b c" -> "c" not " c".
 *    This would increase the size of the changes for some operations but leave more
 *    natural-looking output HTML.
 *  - Properly decode HTML character references in `get_attribute()`. PHP's
 *    `html_entity_decode()` is wrong in a couple of ways: it doesn't account for the
 *    no-ambiguous-ampersand rule, and it improperly handles the way semicolons may
 *    or may not terminate a character reference.
 *
 * @package WordPress
 * @subpackage HTML-API
 * @since 6.2.0
 */

/**
 * Core class used to modify attributes in an HTML document for tags matching a query.
 *
 * ## Usage
 *
 * Use of this class requires three steps:
 *
 *  1. Create a new class instance with your input HTML document.
 *  2. Find the tag(s) you are looking for.
 *  3. Request changes to the attributes in those tag(s).
 *
 * Example:
 *
 *     $tags = new WP_HTML_Tag_Processor( $html );
 *     if ( $tags->next_tag( 'option' ) ) {
 *         $tags->set_attribute( 'selected', true );
 *     }
 *
 * ### Finding tags
 *
 * The `next_tag()` function moves the internal cursor through
 * your input HTML document until it finds a tag meeting all of
 * the supplied restrictions in the optional query argument. If
 * no argument is provided then it will find the next HTML tag,
 * regardless of what kind it is.
 *
 * If you want to _find whatever the next tag is_:
 *
 *     $tags->next_tag();
 *
 * | Goal                                                      | Query                                                                           |
 * |-----------------------------------------------------------|---------------------------------------------------------------------------------|
 * | Find any tag.                                             | `$tags->next_tag();`                                                            |
 * | Find next image tag.                                      | `$tags->next_tag( array( 'tag_name' => 'img' ) );`                              |
 * | Find next image tag (without passing the array).          | `$tags->next_tag( 'img' );`                                                     |
 * | Find next tag containing the `fullwidth` CSS class.       | `$tags->next_tag( array( 'class_name' => 'fullwidth' ) );`                      |
 * | Find next image tag containing the `fullwidth` CSS class. | `$tags->next_tag( array( 'tag_name' => 'img', 'class_name' => 'fullwidth' ) );` |
 *
 * If a tag was found meeting your criteria then `next_tag()`
 * will return `true` and you can proceed to modify it. If it
 * returns `false`, however, it failed to find the tag and
 * moved the cursor to the end of the file.
 *
 * Once the cursor reaches the end of the file the processor
 * is done and if you want to reach an earlier tag you will
 * need to recreate the processor and start over, as it's
 * unable to back up or move in reverse.
 *
 * See the section on bookmarks for an exception to this
 * no-backing-up rule.
 *
 * #### Custom queries
 *
 * Sometimes it's necessary to inspect an HTML tag more closely
 * than the query syntax here permits. In these cases one may
 * inspect the search results using the read-only functions
 * provided by the processor or external state or variables.
 *
 * Example:
 *
 *     // Paint up to the first five DIV or SPAN tags marked with the "jazzy" style.
 *     $remaining_count = 5;
 *     while ( $remaining_count > 0 && $tags->next_tag() ) {
 *         if (
 *              ( 'DIV' === $tags->get_tag() || 'SPAN' === $tags->get_tag() ) &&
 *              'jazzy' === $tags->get_attribute( 'data-style' )
 *         ) {
 *             $tags->add_class( 'theme-style-everest-jazz' );
 *             $remaining_count--;
 *         }
 *     }
 *
 * `get_attribute()` will return `null` if the attribute wasn't present
 * on the tag when it was called. It may return `""` (the empty string)
 * in cases where the attribute was present but its value was empty.
 * For boolean attributes, those whose name is present but no value is
 * given, it will return `true` (the only way to set `false` for an
 * attribute is to remove it).
 *
 * #### When matching fails
 *
 * When `next_tag()` returns `false` it could mean different things:
 *
 *  - The requested tag wasn't found in the input document.
 *  - The input document ended in the middle of an HTML syntax element.
 *
 * When a document ends in the middle of a syntax element it will pause
 * the processor. This is to make it possible in the future to extend the
 * input document and proceed - an important requirement for chunked
 * streaming parsing of a document.
 *
 * Example:
 *
 *     $processor = new WP_HTML_Tag_Processor( 'This <div is="a" partial="token' );
 *     false === $processor->next_tag();
 *
 * If a special element (see next section) is encountered but no closing tag
 * is found it will count as an incomplete tag. The parser will pause as if
 * the opening tag were incomplete.
 *
 * Example:
 *
 *     $processor = new WP_HTML_Tag_Processor( '<style>// there could be more styling to come' );
 *     false === $processor->next_tag();
 *
 *     $processor = new WP_HTML_Tag_Processor( '<style>// this is everything</style><div>' );
 *     true === $processor->next_tag( 'DIV' );
 *
 * #### Special elements
 *
 * Some HTML elements are handled in a special way; their start and end tags
 * act like a void tag. These are special because their contents can't contain
 * HTML markup. Everything inside these elements is handled in a special way
 * and content that _appears_ like HTML tags inside of them isn't. There can
 * be no nesting in these elements.
 *
 * In the following list, "raw text" means that all of the content in the HTML
 * until the matching closing tag is treated verbatim without any replacements
 * and without any parsing.
 *
 *  - IFRAME allows no content but requires a closing tag.
 *  - NOEMBED (deprecated) content is raw text.
 *  - NOFRAMES (deprecated) content is raw text.
 *  - SCRIPT content is plaintext apart from legacy rules allowing `</script>` inside an HTML comment.
 *  - STYLE content is raw text.
 *  - TITLE content is plain text but character references are decoded.
 *  - TEXTAREA content is plain text but character references are decoded.
 *  - XMP (deprecated) content is raw text.
 *
 * ### Modifying HTML attributes for a found tag
 *
 * Once you've found the start of an opening tag you can modify
 * any number of the attributes on that tag. You can set a new
 * value for an attribute, remove the entire attribute, or do
 * nothing and move on to the next opening tag.
 *
 * Example:
 *
 *     if ( $tags->next_tag( array( 'class_name' => 'wp-group-block' ) ) ) {
 *         $tags->set_attribute( 'title', 'This groups the contained content.' );
 *         $tags->remove_attribute( 'data-test-id' );
 *     }
 *
 * If `set_attribute()` is called for an existing attribute it will
 * overwrite the existing value. Similarly, calling `remove_attribute()`
 * for a non-existing attribute has no effect on the document. Both
 * of these methods are safe to call without knowing if a given attribute
 * exists beforehand.
 *
 * ### Modifying CSS classes for a found tag
 *
 * The tag processor treats the `class` attribute as a special case.
 * Because it's a common operation to add or remove CSS classes, this
 * interface adds helper methods to make that easier.
 *
 * As with attribute values, adding or removing CSS classes is a safe
 * operation that doesn't require checking if the attribute or class
 * exists before making changes. If removing the only class then the
 * entire `class` attribute will be removed.
 *
 * Example:
 *
 *     // from `<span>Yippee!</span>`
 *     //   to `<span class="is-active">Yippee!</span>`
 *     $tags->add_class( 'is-active' );
 *
 *     // from `<span class="excited">Yippee!</span>`
 *     //   to `<span class="excited is-active">Yippee!</span>`
 *     $tags->add_class( 'is-active' );
 *
 *     // from `<span class="is-active heavy-accent">Yippee!</span>`
 *     //   to `<span class="is-active heavy-accent">Yippee!</span>`
 *     $tags->add_class( 'is-active' );
 *
 *     // from `<input type="text" class="is-active rugby not-disabled" length="24">`
 *     //   to `<input type="text" class="is-active not-disabled" length="24">`
 *     $tags->remove_class( 'rugby' );
 *
 *     // from `<input type="text" class="rugby" length="24">`
 *     //   to `<input type="text" length="24">`
 *     $tags->remove_class( 'rugby' );
 *
 *     // from `<input type="text" length="24">`
 *     //   to `<input type="text" length="24">`
 *     $tags->remove_class( 'rugby' );
 *
 * When class changes are enqueued but a direct change to `class` is made via
 * `set_attribute` (or `remove_attribute`), the direct change takes precedence
 * over the changes made through `add_class` and `remove_class`.
 *
 * ### Bookmarks
 *
 * While scanning through the input HTML document it's possible to set
 * a named bookmark when a particular tag is found. Later on, after
 * continuing to scan other tags, it's possible to `seek` to one of
 * the set bookmarks and then proceed again from that point forward.
 *
 * Because bookmarks create processing overhead one should avoid
 * creating too many of them. As a rule, create only bookmarks
 * of known string literal names; avoid creating "mark_{$index}"
 * and so on. It's fine from a performance standpoint to create a
 * bookmark and update it frequently, such as within a loop.
 *
 *     $total_todos = 0;
 *     while ( $p->next_tag( array( 'tag_name' => 'UL', 'class_name' => 'todo' ) ) ) {
 *         $p->set_bookmark( 'list-start' );
 *         while ( $p->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
 *             if ( 'UL' === $p->get_tag() && $p->is_tag_closer() ) {
 *                 $p->set_bookmark( 'list-end' );
 *                 $p->seek( 'list-start' );
 *                 $p->set_attribute( 'data-contained-todos', (string) $total_todos );
 *                 $total_todos = 0;
 *                 $p->seek( 'list-end' );
 *                 break;
 *             }
 *
 *             if ( 'LI' === $p->get_tag() && ! $p->is_tag_closer() ) {
 *                 $total_todos++;
 *             }
 *         }
 *     }
 *
 * ## Tokens and finer-grained processing.
 *
 * It's possible to scan through every lexical token in the
 * HTML document using the `next_token()` function. This
 * alternative form takes no argument and provides no built-in
 * query syntax.
 *
 * Example:
 *
 *     $title = '(untitled)';
 *     $text  = '';
 *     while ( $processor->next_token() ) {
 *         switch ( $processor->get_token_name() ) {
 *             case '#text':
 *                 $text .= $processor->get_modifiable_text();
 *                 break;
 *
 *             case 'BR':
 *                 $text .= "\n";
 *                 break;
 *
 *             case 'TITLE':
 *                 $title = $processor->get_modifiable_text();
 *                 break;
 *         }
 *     }
 *     return trim( "# {$title}\n\n{$text}" );
 *
 * ### Tokens and _modifiable text_.
 *
 * #### Special "atomic" HTML elements.
 *
 * Not all HTML elements are able to contain other elements inside of them.
 * For instance, the contents inside a TITLE element are plaintext (except
 * that character references like `&amp;` will be decoded). This means that
 * if the string `<img>` appears inside a TITLE element, then it's not an
 * image tag, but rather it's text describing an image tag. Likewise, the
 * contents of a SCRIPT or STYLE element are handled entirely separately in
 * a browser than the contents of other elements because they represent a
 * different language than HTML.
 *
 * For these elements the Tag Processor treats the entire sequence as one,
 * from the opening tag, including its contents, through its closing tag.
 * This means that it's not possible to match the closing tag for a
 * SCRIPT element unless it's unexpected; the Tag Processor already matched
 * it when it found the opening tag.
 *
 * The inner contents of these elements are that element's _modifiable text_.
 *
 * The special elements are:
 *  - `SCRIPT` whose contents are treated as raw plaintext but supports a legacy
 *    style of including JavaScript inside of HTML comments to avoid accidentally
 *    closing the SCRIPT from inside a JavaScript string. E.g. `console.log( '</script>' )`.
 *  - `TITLE` and `TEXTAREA` whose contents are treated as plaintext and then any
 *    character references are decoded. E.g. `1 &lt; 2 &lt; 3` becomes `1 < 2 < 3`.
 *  - `IFRAME`, `NOSCRIPT`, `NOEMBED`, `NOFRAMES`, `STYLE` whose contents are treated as
 *    raw plaintext and left as-is. E.g. `1 &lt; 2 &lt; 3` remains `1 &lt; 2 &lt; 3`.
 *
 * #### Other tokens with modifiable text.
 *
 * There are also non-elements which are void/self-closing in nature and contain
 * modifiable text that is part of that individual syntax token itself.
 *
 *  - `#text` nodes, whose entire token _is_ the modifiable text.
 *  - HTML comments and tokens that become comments due to some syntax error. The
 *    text for these tokens is the portion of the comment inside of the syntax.
 *    E.g. for `<!-- comment -->` the text is `" comment "` (note the spaces are included).
 *  - `CDATA` sections, whose text is the content inside of the section itself. E.g. for
 *    `<![CDATA[some content]]>` the text is `"some content"` (with restrictions [1]).
 *  - "Funky comments," which are a special case of invalid closing tags whose name is
 *    invalid. The text for these nodes is the text that a browser would transform into
 *    an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`.
 *  - `DOCTYPE` declarations like `<!DOCTYPE html>` which have no closing tag.
 *  - XML Processing instruction nodes like `<?wp __( "Like" ); ?>` (with restrictions [2]).
 *  - The empty end tag `</>` which is ignored in the browser and DOM.
 *
 * [1]: There are no CDATA sections in HTML. When encountering `<![CDATA[`, everything
 *      until the next `>` becomes a bogus HTML comment, meaning there can be no CDATA
 *      section in an HTML document containing `>`. The Tag Processor will first find
 *      all valid and bogus HTML comments, and then if the comment _would_ have been a
 *      CDATA section _were they to exist_, it will indicate this as the type of comment.
 *
 * [2]: XML allows a broader range of characters in a processing instruction's target name
 *      and disallows "xml" as a name, since it's special. The Tag Processor only recognizes
 *      target names with an ASCII-representable subset of characters. It also exhibits the
 *      same constraint as with CDATA sections, in that `>` cannot exist within the token
 *      since Processing Instructions do not exist within HTML and their syntax transforms
 *      into a bogus comment in the DOM.
 *
 * ## Design and limitations
 *
 * The Tag Processor is designed to linearly scan HTML documents and tokenize
 * HTML tags and their attributes. It's designed to do this as efficiently as
 * possible without compromising parsing integrity. Therefore it will be
 * slower than some methods of modifying HTML, such as those incorporating
 * over-simplified PCRE patterns, but will not introduce the defects and
 * failures that those methods bring in, which lead to broken page renders
 * and often to security vulnerabilities. On the other hand, it will be faster
 * than full-blown HTML parsers such as DOMDocument and use considerably
 * less memory. It requires a negligible memory overhead, enough to consider
 * it a zero-overhead system.
 *
 * The performance characteristics are maintained by avoiding tree construction
 * and semantic cleanups which are specified in HTML5. Because of this, for
 * example, it's not possible for the Tag Processor to associate any given
 * opening tag with its corresponding closing tag, or to return the inner markup
 * inside an element. Systems may be built on top of the Tag Processor to do
 * this, but the Tag Processor is and should be constrained so it can remain an
 * efficient, low-level, and reliable HTML scanner.
 *
 * The Tag Processor's design incorporates a "garbage-in-garbage-out" philosophy.
 * HTML5 specifies that certain invalid content be transformed into different forms
 * for display, such as removing null bytes from an input document and replacing
 * invalid characters with the Unicode replacement character `U+FFFD` (visually "�").
 * Where errors or transformations exist within the HTML5 specification, the Tag Processor
 * leaves those invalid inputs untouched, passing them through to the final browser
 * to handle. While this implies that certain operations will be non-spec-compliant,
 * such as reading the value of an attribute with invalid content, it also preserves a
 * simplicity and efficiency for handling those error cases.
 *
 * Most operations within the Tag Processor are designed to minimize the difference
 * between an input and output document for any given change. For example, the
 * `add_class` and `remove_class` methods preserve whitespace and the class ordering
 * within the `class` attribute; and when encountering tags with duplicated attributes,
 * the Tag Processor will leave those invalid duplicate attributes where they are but
 * update the proper attribute which the browser will read for parsing its value. An
 * exception to this rule is that all attribute updates store their values as
 * double-quoted strings, meaning that attributes on input with single-quoted or
 * unquoted values will appear in the output with double-quotes.
 *
 * ### Scripting Flag
 *
 * The Tag Processor parses HTML with the "scripting flag" disabled. This means
 * that it doesn't run any scripts while parsing the page. In a browser with
 * JavaScript enabled, for example, the script can change the parse of the
 * document as it loads. On the server, however, evaluating JavaScript is not
 * only impractical, but also unwanted.
 *
 * Practically this means that the Tag Processor will descend into NOSCRIPT
 * elements and process their child tags. Were the scripting flag enabled, such
 * as in a typical browser, the contents of NOSCRIPT would be skipped entirely.
 *
 * This allows the HTML API to process the content that will be presented in
 * a browser when scripting is disabled, but it offers a different view of a
 * page than most browser sessions will experience. E.g. the tags inside the
 * NOSCRIPT disappear.
 *
 * ### Text Encoding
 *
 * The Tag Processor assumes that the input HTML document is encoded with a
 * text encoding compatible with 7-bit ASCII's '<', '>', '&', ';', '/', '=',
 * "'", '"', 'a' - 'z', 'A' - 'Z', and the whitespace characters ' ', tab,
 * carriage-return, newline, and form-feed.
 *
 * In practice, this includes almost every single-byte encoding as well as
 * UTF-8. Notably, however, it does not include UTF-16. If providing input
 * that's incompatible, then convert the encoding beforehand.
 *
 * @since 6.2.0
 * @since 6.2.1 Fix: Support for various invalid comments; attribute updates are case-insensitive.
 * @since 6.3.2 Fix: Skip HTML-like content inside rawtext elements such as STYLE.
 * @since 6.5.0 Pauses processor when input ends in an incomplete syntax token.
 *              Introduces "special" elements which act like void elements, e.g. TITLE, STYLE.
 *              Allows scanning through all tokens and processing modifiable text, where applicable.
 */
class WP_HTML_Tag_Processor {
	/**
	 * The maximum number of bookmarks allowed to exist at
	 * any given time.
	 *
	 * @since 6.2.0
	 * @var int
	 *
	 * @see WP_HTML_Tag_Processor::set_bookmark()
	 */
	const MAX_BOOKMARKS = 10;

	/**
	 * Maximum number of times seek() can be called.
	 * Prevents accidental infinite loops.
	 *
	 * @since 6.2.0
	 * @var int
	 *
	 * @see WP_HTML_Tag_Processor::seek()
	 */
	const MAX_SEEK_OPS = 1000;

	/**
	 * The HTML document to parse.
	 *
	 * @since 6.2.0
	 * @var string
	 */
	protected $html;

	/**
	 * The last query passed to next_tag().
	 *
	 * @since 6.2.0
	 * @var array|null
	 */
	private $last_query;

	/**
	 * The tag name this processor currently scans for.
	 *
	 * @since 6.2.0
	 * @var string|null
	 */
	private $sought_tag_name;

	/**
	 * The CSS class name this processor currently scans for.
	 *
	 * @since 6.2.0
	 * @var string|null
	 */
	private $sought_class_name;

	/**
	 * The match offset this processor currently scans for.
	 *
	 * @since 6.2.0
	 * @var int|null
	 */
	private $sought_match_offset;

	/**
	 * Whether to visit tag closers, e.g. </div>, when walking an input document.
	 *
	 * @since 6.2.0
	 * @var bool
	 */
	private $stop_on_tag_closers;

	/**
	 * Specifies mode of operation of the parser at any given time.
	 *
	 * | State           | Meaning                                                              |
	 * | ----------------|----------------------------------------------------------------------|
	 * | *Ready*         | The parser is ready to run.                                          |
	 * | *Complete*      | There is nothing left to parse.                                      |
	 * | *Incomplete*    | The HTML ended in the middle of a token; nothing more can be parsed. |
	 * | *Matched tag*   | Found an HTML tag; it's possible to modify its attributes.           |
	 * | *Text node*     | Found a #text node; this is plaintext and modifiable.                |
	 * | *CDATA node*    | Found a CDATA section; this is modifiable.                           |
	 * | *Comment*       | Found a comment or bogus comment; this is modifiable.                |
	 * | *Presumptuous*  | Found an empty tag closer: `</>`.                                    |
	 * | *Funky comment* | Found a tag closer with an invalid tag name; this is modifiable.     |
	 *
	 * @since 6.5.0
	 *
	 * @see WP_HTML_Tag_Processor::STATE_READY
	 * @see WP_HTML_Tag_Processor::STATE_COMPLETE
	 * @see WP_HTML_Tag_Processor::STATE_INCOMPLETE_INPUT
	 * @see WP_HTML_Tag_Processor::STATE_MATCHED_TAG
	 * @see WP_HTML_Tag_Processor::STATE_TEXT_NODE
	 * @see WP_HTML_Tag_Processor::STATE_CDATA_NODE
	 * @see WP_HTML_Tag_Processor::STATE_COMMENT
	 * @see WP_HTML_Tag_Processor::STATE_DOCTYPE
	 * @see WP_HTML_Tag_Processor::STATE_PRESUMPTUOUS_TAG
	 * @see WP_HTML_Tag_Processor::STATE_FUNKY_COMMENT
	 *
	 * @var string
	 */
	protected $parser_state = self::STATE_READY;

	/**
	 * What kind of syntax token became an HTML comment.
	 *
	 * Since there are many ways in which HTML syntax can create an HTML comment,
	 * this indicates which of those caused it. This allows the Tag Processor to
	 * represent more from the original input document than would appear in the DOM.
	 *
	 * @since 6.5.0
	 *
	 * @var string|null
	 */
	protected $comment_type = null;

	/**
	 * How many bytes from the original HTML document have been read and parsed.
	 *
	 * This value points to the latest byte offset in the input document which
	 * has been already parsed. It is the internal cursor for the Tag Processor
	 * and updates while scanning through the HTML tokens.
	 *
	 * @since 6.2.0
	 * @var int
	 */
	private $bytes_already_parsed = 0;

	/**
	 * Byte offset in input document where current token starts.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     01234
	 *     - token starts at 0
	 *
	 * @since 6.5.0
	 *
	 * @var int|null
	 */
	private $token_starts_at;

	/**
	 * Byte length of current token.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     012345678901234
	 *     - token length is 14 - 0 = 14
	 *
	 *     a <!-- comment --> is a token.
	 *     0123456789 123456789 123456789
	 *     - token length is 17 - 2 = 15
	 *
	 * @since 6.5.0
	 *
	 * @var int|null
	 */
	private $token_length;

	/**
	 * Byte offset in input document where current tag name starts.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     01234
	 *      - tag name starts at 1
	 *
	 * @since 6.2.0
	 *
	 * @var int|null
	 */
	private $tag_name_starts_at;

	/**
	 * Byte length of current tag name.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     01234
	 *      --- tag name length is 3
	 *
	 * @since 6.2.0
	 *
	 * @var int|null
	 */
	private $tag_name_length;

	/**
	 * Byte offset into input document where current modifiable text starts.
	 *
	 * @since 6.5.0
	 *
	 * @var int
	 */
	private $text_starts_at;

	/**
	 * Byte length of modifiable text.
	 *
	 * @since 6.5.0
	 *
	 * @var int
	 */
	private $text_length;

	/**
	 * Whether the current tag is an opening tag, e.g. <div>, or a closing tag, e.g. </div>.
	 *
	 * @var bool
	 */
	private $is_closing_tag;

	/**
	 * Lazily-built index of attributes found within an HTML tag, keyed by the attribute name.
	 *
	 * Example:
	 *
	 *     // Supposing the parser is working through this content
	 *     // and stops after recognizing the `id` attribute.
	 *     // <div id="test-4" class=outline title="data:text/plain;base64=asdk3nk1j3fo8">
	 *     //                 ^ parsing will continue from this point.
	 *     $this->attributes = array(
	 *         'id' => new WP_HTML_Attribute_Token( 'id', 9, 6, 5, 11, false )
	 *     );
	 *
	 *     // When picking up parsing again, or when asking to find the
	 *     // `class` attribute we will continue and add to this array.
	 *     $this->attributes = array(
	 *         'id'    => new WP_HTML_Attribute_Token( 'id', 9, 6, 5, 11, false ),
	 *         'class' => new WP_HTML_Attribute_Token( 'class', 23, 7, 17, 13, false )
	 *     );
	 *
	 *     // Note that only the `class` attribute value is stored in the index.
	 *     // That's because it is the only value used by this class at the moment.
	 *
	 * @since 6.2.0
	 * @var WP_HTML_Attribute_Token[]
	 */
	private $attributes = array();

	/**
	 * Tracks spans of duplicate attributes on a given tag, used for removing
	 * all copies of an attribute when calling `remove_attribute()`.
	 *
	 * @since 6.3.2
	 *
	 * @var (WP_HTML_Span[])[]|null
	 */
	private $duplicate_attributes = null;

	/**
	 * Which class names to add or remove from a tag.
	 *
	 * These are tracked separately from attribute updates because they are
	 * semantically distinct, whereas this interface exists for the common
	 * case of adding and removing class names while other attributes are
	 * generally modified as with DOM `setAttribute` calls.
	 *
	 * When modifying an HTML document these will eventually be collapsed
	 * into a single `set_attribute( 'class', $changes )` call.
	 *
	 * Example:
	 *
	 *     // Add the `wp-block-group` class, remove the `wp-group` class.
	 *     $classname_updates = array(
	 *         // Indexed by a comparable class name.
	 *         'wp-block-group' => WP_HTML_Tag_Processor::ADD_CLASS,
	 *         'wp-group'       => WP_HTML_Tag_Processor::REMOVE_CLASS
	 *     );
	 *
	 * @since 6.2.0
	 * @var bool[]
	 */
	private $classname_updates = array();

	/**
	 * Tracks a semantic location in the original HTML which
	 * shifts with updates as they are applied to the document.
	 *
	 * @since 6.2.0
	 * @var WP_HTML_Span[]
	 */
	protected $bookmarks = array();

	const ADD_CLASS    = true;
	const REMOVE_CLASS = false;
	const SKIP_CLASS   = null;

	/**
	 * Lexical replacements to apply to input HTML document.
	 *
	 * "Lexical" in this class refers to the part of this class which
	 * operates on pure text _as text_ and not as HTML. There's a line
	 * between the public interface, with HTML-semantic methods like
	 * `set_attribute` and `add_class`, and an internal state that tracks
	 * text offsets in the input document.
	 *
	 * When higher-level HTML methods are called, those have to transform their
	 * operations (such as setting an attribute's value) into text diffing
	 * operations (such as replacing the sub-string from indices A to B with
	 * some given new string). These text-diffing operations are the lexical
	 * updates.
	 *
	 * As new higher-level methods are added they need to collapse their
	 * operations into these lower-level lexical updates since that's the
	 * Tag Processor's internal language of change. Any code which creates
	 * these lexical updates must ensure that they do not cross HTML syntax
	 * boundaries, however, so these should never be exposed outside of this
	 * class or any classes which intentionally expand its functionality.
	 *
	 * These are enqueued while editing the document instead of being immediately
	 * applied to avoid processing overhead, string allocations, and string
	 * copies when applying many updates to a single document.
	 *
	 * Example:
	 *
	 *     // Replace an attribute stored with a new value, indices
	 *     // sourced from the lazily-parsed HTML recognizer.
	 *     $start  = $attributes['src']->start;
	 *     $length = $attributes['src']->length;
	 *     $modifications[] = new WP_HTML_Text_Replacement( $start, $length, $new_value );
	 *
	 *     // Correspondingly, something like this will appear in this array.
	 *     $lexical_updates = array(
	 *         new WP_HTML_Text_Replacement( 14, 28, 'https://my-site.my-domain/wp-content/uploads/2014/08/kittens.jpg' )
	 *     );
	 *
	 * @since 6.2.0
	 * @var WP_HTML_Text_Replacement[]
	 */
	protected $lexical_updates = array();

	/**
	 * Tracks and limits `seek()` calls to prevent accidental infinite loops.
	 *
	 * @since 6.2.0
	 * @var int
	 *
	 * @see WP_HTML_Tag_Processor::seek()
	 */
	protected $seek_count = 0;

	/**
	 * Constructor.
	 *
	 * @since 6.2.0
	 *
	 * @param string $html HTML to process.
	 */
	public function __construct( $html ) {
		$this->html = $html;
	}

	/**
	 * Finds the next tag matching the $query.
	 *
	 * @since 6.2.0
	 * @since 6.5.0 No longer processes incomplete tokens at end of document; pauses the processor at start of token.
	 *
	 * @param array|string|null $query {
	 *     Optional. Which tag name to find, having which class, etc. Default is to find any tag.
	 *
	 *     @type string|null $tag_name     Which tag to find, or `null` for "any tag."
	 *     @type int|null    $match_offset Find the Nth tag matching all search criteria.
	 *                                     1 for "first" tag, 3 for "third," etc.
	 *                                     Defaults to first tag.
	 *     @type string|null $class_name   Tag must contain this whole class name to match.
	 *     @type string|null $tag_closers  "visit" or "skip": whether to stop on tag closers, e.g. </div>.
	 * }
	 * @return bool Whether a tag was matched.
	 */
	public function next_tag( $query = null ) {
		$this->parse_query( $query );
		$already_found = 0;

		do {
			if ( false === $this->next_token() ) {
				return false;
			}

			if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
				continue;
			}

			if ( $this->matches() ) {
				++$already_found;
			}
		} while ( $already_found < $this->sought_match_offset );

		return true;
	}

	/**
	 * Finds the next token in the HTML document.
	 *
	 * An HTML document can be viewed as a stream of tokens,
	 * where tokens are things like HTML tags, HTML comments,
	 * text nodes, etc. This method finds the next token in
	 * the HTML document and returns whether it found one.
	 *
	 * If it starts parsing a token and reaches the end of the
	 * document then it will seek to the start of the last
	 * token and pause, returning `false` to indicate that it
	 * failed to find a complete token.
	 *
	 * Possible token types, based on the HTML specification:
	 *
	 *  - an HTML tag, whether opening, closing, or void.
	 *  - a text node - the plaintext inside tags.
	 *  - an HTML comment.
	 *  - a DOCTYPE declaration.
	 *  - a processing instruction, e.g. `<?xml version="1.0" ?>`.
	 *
	 * The Tag Processor currently only supports the tag token.
	 *
	 * @since 6.5.0
	 *
	 * @return bool Whether a token was parsed.
	 */
	public function next_token() {
		return $this->base_class_next_token();
	}

	/**
	 * Internal method which finds the next token in the HTML document.
	 *
	 * This method is a private internal function which implements the logic for
	 * finding the next token in a document. It exists so that the parser can update
	 * its state without affecting the location of the cursor in the document and
	 * without triggering subclass methods for things like `next_token()`, e.g. when
	 * applying patches before searching for the next token.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 *
	 * @return bool Whether a token was parsed.
	 */
	private function base_class_next_token() {
		$was_at = $this->bytes_already_parsed;
		$this->after_tag();

		// Don't proceed if there's nothing more to scan.
		if (
			self::STATE_COMPLETE === $this->parser_state ||
			self::STATE_INCOMPLETE_INPUT === $this->parser_state
		) {
			return false;
		}

		/*
		 * The next step in the parsing loop determines the parsing state;
		 * clear it so that state doesn't linger from the previous step.
		 */
		$this->parser_state = self::STATE_READY;

		if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
			$this->parser_state = self::STATE_COMPLETE;
			return false;
		}

		// Find the next tag if it exists.
		if ( false === $this->parse_next_tag() ) {
			if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
				$this->bytes_already_parsed = $was_at;
			}

			return false;
		}

		/*
		 * For legacy reasons the rest of this function handles tags and their
		 * attributes. If the processor has reached the end of the document
		 * or if it matched any other token then it should return here to avoid
		 * attempting to process tag-specific syntax.
		 */
		if (
			self::STATE_INCOMPLETE_INPUT !== $this->parser_state &&
			self::STATE_COMPLETE !== $this->parser_state &&
			self::STATE_MATCHED_TAG !== $this->parser_state
		) {
			return true;
		}

		// Parse all of its attributes.
		while ( $this->parse_next_attribute() ) {
			continue;
		}

		// Ensure that the tag closes before the end of the document.
		if (
			self::STATE_INCOMPLETE_INPUT === $this->parser_state ||
			$this->bytes_already_parsed >= strlen( $this->html )
		) {
			// Does this appropriately clear state (parsed attributes)?
			$this->parser_state         = self::STATE_INCOMPLETE_INPUT;
			$this->bytes_already_parsed = $was_at;

			return false;
		}

		$tag_ends_at = strpos( $this->html, '>', $this->bytes_already_parsed );
		if ( false === $tag_ends_at ) {
			$this->parser_state         = self::STATE_INCOMPLETE_INPUT;
			$this->bytes_already_parsed = $was_at;

			return false;
		}
		$this->parser_state         = self::STATE_MATCHED_TAG;
		$this->token_length         = $tag_ends_at - $this->token_starts_at;
		$this->bytes_already_parsed = $tag_ends_at + 1;

		/*
		 * For non-DATA sections which might contain text that looks like HTML tags but
		 * isn't, scan with the appropriate alternative mode. Looking at the first letter
		 * of the tag name as a pre-check avoids a string allocation when it's not needed.
		 */
		$t = $this->html[ $this->tag_name_starts_at ];
		if (
			$this->is_closing_tag ||
			! (
				'i' === $t || 'I' === $t ||
				'n' === $t || 'N' === $t ||
				's' === $t || 'S' === $t ||
				't' === $t || 'T' === $t ||
				'x' === $t || 'X' === $t
			)
		) {
			return true;
		}

		$tag_name = $this->get_tag();

		/*
		 * Preserve the opening tag pointers, as these will be overwritten
		 * when finding the closing tag. They will be reset after finding
		 * the closing tag to point to the opening of the special atomic
		 * tag sequence.
		 */
		$tag_name_starts_at   = $this->tag_name_starts_at;
		$tag_name_length      = $this->tag_name_length;
		$tag_ends_at          = $this->token_starts_at + $this->token_length;
		$attributes           = $this->attributes;
		$duplicate_attributes = $this->duplicate_attributes;

		// Find the closing tag if necessary.
		$found_closer = false;
		switch ( $tag_name ) {
			case 'SCRIPT':
				$found_closer = $this->skip_script_data();
				break;

			case 'TEXTAREA':
			case 'TITLE':
				$found_closer = $this->skip_rcdata( $tag_name );
				break;

			/*
			 * In the browser this list would include the NOSCRIPT element,
			 * but the Tag Processor is an environment with the scripting
			 * flag disabled, meaning that it needs to descend into the
			 * NOSCRIPT element to be able to properly process what will be
			 * sent to a browser.
			 *
			 * Note that this rule makes HTML5 syntax incompatible with XML,
			 * because the parsing of this token depends on client application.
			 * The NOSCRIPT element cannot be represented in the XHTML syntax.
			 */
			case 'IFRAME':
			case 'NOEMBED':
			case 'NOFRAMES':
			case 'STYLE':
			case 'XMP':
				$found_closer = $this->skip_rawtext( $tag_name );
				break;

			// No other tags should be treated in their entirety here.
			default:
				return true;
		}

		if ( ! $found_closer ) {
			$this->parser_state         = self::STATE_INCOMPLETE_INPUT;
			$this->bytes_already_parsed = $was_at;
			return false;
		}

		/*
		 * The values here look like they reference the opening tag but they reference
		 * the closing tag instead. This is why the opening tag values were stored
		 * above in a variable. It reads confusingly here, but that's because the
		 * functions that skip the contents have moved all the internal cursors past
		 * the inner content of the tag.
		 */
		$this->token_starts_at      = $was_at;
		$this->token_length         = $this->bytes_already_parsed - $this->token_starts_at;
		$this->text_starts_at       = $tag_ends_at + 1;
		$this->text_length          = $this->tag_name_starts_at - $this->text_starts_at;
		$this->tag_name_starts_at   = $tag_name_starts_at;
		$this->tag_name_length      = $tag_name_length;
		$this->attributes           = $attributes;
		$this->duplicate_attributes = $duplicate_attributes;

		return true;
	}

	/**
	 * Whether the processor paused because the input HTML document ended
	 * in the middle of a syntax element, such as in the middle of a tag.
	 *
	 * Example:
	 *
	 *     $processor = new WP_HTML_Tag_Processor( '<input type="text" value="Th' );
	 *     false      === $processor->next_tag();
	 *     true       === $processor->paused_at_incomplete_token();
	 *
	 * @since 6.5.0
	 *
	 * @return bool Whether the parse paused at the start of an incomplete token.
	 */
	public function paused_at_incomplete_token() {
		return self::STATE_INCOMPLETE_INPUT === $this->parser_state;
	}

	/**
	 * Generator for a foreach loop to step through each class name for the matched tag.
	 *
	 * This generator function is designed to be used inside a "foreach" loop.
	 *
	 * Example:
	 *
	 *     $p = new WP_HTML_Tag_Processor( "<div class='free &lt;egg&gt;\tlang-en'>" );
	 *     $p->next_tag();
	 *     foreach ( $p->class_list() as $class_name ) {
	 *         echo "{$class_name} ";
	 *     }
	 *     // Outputs: "free <egg> lang-en "
	 *
	 * @since 6.4.0
	 */
	public function class_list() {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return;
		}

		/** @var string $class contains the string value of the class attribute, with character references decoded. */
		$class = $this->get_attribute( 'class' );

		if ( ! is_string( $class ) ) {
			return;
		}

		$seen = array();

		$at = 0;
		while ( $at < strlen( $class ) ) {
			// Skip past any initial boundary characters.
			$at += strspn( $class, " \t\f\r\n", $at );
			if ( $at >= strlen( $class ) ) {
				return;
			}

			// Find the byte length until the next boundary.
			$length = strcspn( $class, " \t\f\r\n", $at );
			if ( 0 === $length ) {
				return;
			}

			/*
			 * CSS class names are case-insensitive in the ASCII range.
			 *
			 * @see https://www.w3.org/TR/CSS2/syndata.html#x1
			 */
			$name = strtolower( substr( $class, $at, $length ) );
			$at  += $length;

			/*
			 * It's expected that the number of class names for a given tag is relatively small.
			 * Given this, it is probably faster overall to scan an array for a value rather
			 * than to use the class name as a key and check if it's a key of $seen.
			 */
			if ( in_array( $name, $seen, true ) ) {
				continue;
			}

			$seen[] = $name;
			yield $name;
		}
	}

	/**
	 * Returns whether the matched tag contains the given ASCII case-insensitive class name.
	 *
	 * @since 6.4.0
	 *
	 * @param string $wanted_class Look for this CSS class name, ASCII case-insensitive.
	 * @return bool|null Whether the matched tag contains the given class name, or null if not matched.
	 */
	public function has_class( $wanted_class ) {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return null;
		}

		$wanted_class = strtolower( $wanted_class );

		foreach ( $this->class_list() as $class_name ) {
			if ( $class_name === $wanted_class ) {
				return true;
			}
		}

		return false;
	}

	/**
	 * Sets a bookmark in the HTML document.
	 *
	 * Bookmarks represent specific places or tokens in the HTML
	 * document, such as a tag opener or closer. When applying
	 * edits to a document, such as setting an attribute, the
	 * text offsets of that token may shift; the bookmark is
	 * kept updated with those shifts and remains stable unless
	 * the entire span of text in which the token sits is removed.
	 *
	 * Release bookmarks when they are no longer needed.
	 *
	 * Example:
	 *
	 *     <main><h2>Surprising fact you may not know!</h2></main>
	 *           ^  ^
	 *            \-|-- this `H2` opener bookmark tracks the token
	 *
	 *     <main class="clickbait"><h2>Surprising fact you may no…
	 *                             ^  ^
	 *                              \-|-- it shifts with edits
	 *
	 * Bookmarks provide the ability to seek to a previously-scanned
	 * place in the HTML document. This avoids the need to re-scan
	 * the entire document.
	 *
	 * Example:
	 *
	 *     <ul><li>One</li><li>Two</li><li>Three</li></ul>
	 *                                 ^^^^
	 *                                 want to note this last item
	 *
	 *     $p = new WP_HTML_Tag_Processor( $html );
	 *     $in_list = false;
	 *     while ( $p->next_tag( array( 'tag_closers' => $in_list ? 'visit' : 'skip' ) ) ) {
	 *         if ( 'UL' === $p->get_tag() ) {
	 *             if ( $p->is_tag_closer() ) {
	 *                 $in_list = false;
	 *                 $p->set_bookmark( 'resume' );
	 *                 if ( $p->seek( 'last-li' ) ) {
	 *                     $p->add_class( 'last-li' );
	 *                 }
	 *                 $p->seek( 'resume' );
	 *                 $p->release_bookmark( 'last-li' );
	 *                 $p->release_bookmark( 'resume' );
	 *             } else {
	 *                 $in_list = true;
	 *             }
	 *         }
	 *
	 *         if ( 'LI' === $p->get_tag() ) {
	 *             $p->set_bookmark( 'last-li' );
	 *         }
	 *     }
	 *
	 * Bookmarks intentionally hide the internal string offsets
	 * to which they refer. They are maintained internally as
	 * updates are applied to the HTML document and therefore
	 * retain their "position" - the location to which they
	 * originally pointed. The inability to use bookmarks with
	 * functions like `substr` is therefore intentional to guard
	 * against accidentally breaking the HTML.
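	 *
	 * A further sketch (not part of the original documentation; the
	 * markup and bookmark names are made up for illustration) of how
	 * a bookmark keeps pointing at its token after an earlier edit
	 * shifts the text that follows it:
	 *
	 *     $p = new WP_HTML_Tag_Processor( '<main><h2>Title</h2></main>' );
	 *     $p->next_tag( 'main' );
	 *     $p->set_bookmark( 'main' );
	 *     $p->next_tag( 'h2' );
	 *     $p->set_bookmark( 'heading' );
	 *
	 *     $p->seek( 'main' );
	 *     $p->add_class( 'clickbait' ); // Lengthens the MAIN opener.
	 *     $p->seek( 'heading' );        // Applies the pending edit, then moves to the H2.
	 *     'H2' === $p->get_tag();
	 *
	 *     $p->release_bookmark( 'main' );
	 *     $p->release_bookmark( 'heading' );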
	 *
	 * Because bookmarks allocate memory and require processing
	 * for every applied update, they are limited and require
	 * a name. They should not be created with programmatically-made
	 * names, such as "li_{$index}" with some loop. As a general
	 * rule they should only be created with string-literal names
	 * like "start-of-section" or "last-paragraph".
	 *
	 * Bookmarks are a powerful tool to enable complicated behavior.
	 * Consider double-checking that you need this tool if you are
	 * reaching for it, as inappropriate use could lead to broken
	 * HTML structure or unwanted processing overhead.
	 *
	 * @since 6.2.0
	 *
	 * @param string $name Identifies this particular bookmark.
	 * @return bool Whether the bookmark was successfully created.
	 */
	public function set_bookmark( $name ) {
		// It only makes sense to set a bookmark if the parser has paused on a concrete token.
		if (
			self::STATE_COMPLETE === $this->parser_state ||
			self::STATE_INCOMPLETE_INPUT === $this->parser_state
		) {
			return false;
		}

		if ( ! array_key_exists( $name, $this->bookmarks ) && count( $this->bookmarks ) >= static::MAX_BOOKMARKS ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Too many bookmarks: cannot create any more.' ),
				'6.2.0'
			);
			return false;
		}

		$this->bookmarks[ $name ] = new WP_HTML_Span( $this->token_starts_at, $this->token_length );

		return true;
	}

	/**
	 * Removes a bookmark that is no longer needed.
	 *
	 * Releasing a bookmark frees up the small
	 * performance overhead it requires.
	 *
	 * @param string $name Name of the bookmark to remove.
	 * @return bool Whether the bookmark already existed before removal.
	 */
	public function release_bookmark( $name ) {
		if ( ! array_key_exists( $name, $this->bookmarks ) ) {
			return false;
		}

		unset( $this->bookmarks[ $name ] );

		return true;
	}

	/**
	 * Skips contents of generic rawtext elements.
	 *
	 * @since 6.3.2
	 *
	 * @see https://html.spec.whatwg.org/#generic-raw-text-element-parsing-algorithm
	 *
	 * @param string $tag_name The uppercase tag name which will close the RAWTEXT region.
	 * @return bool Whether an end to the RAWTEXT region was found before the end of the document.
	 */
	private function skip_rawtext( $tag_name ) {
		/*
		 * These two functions distinguish themselves on whether character references are
		 * decoded, and since functionality to read the inner markup isn't supported, it's
		 * not necessary to implement these two functions separately.
		 */
		return $this->skip_rcdata( $tag_name );
	}

	/**
	 * Skips contents of RCDATA elements, namely title and textarea tags.
	 *
	 * @since 6.2.0
	 *
	 * @see https://html.spec.whatwg.org/multipage/parsing.html#rcdata-state
	 *
	 * @param string $tag_name The uppercase tag name which will close the RCDATA region.
	 * @return bool Whether an end to the RCDATA region was found before the end of the document.
	 */
	private function skip_rcdata( $tag_name ) {
		$html       = $this->html;
		$doc_length = strlen( $html );
		$tag_length = strlen( $tag_name );

		$at = $this->bytes_already_parsed;

		while ( false !== $at && $at < $doc_length ) {
			$at                       = strpos( $this->html, '</', $at );
			$this->tag_name_starts_at = $at;

			// Fail if there is no possible tag closer.
			if ( false === $at || ( $at + $tag_length ) >= $doc_length ) {
				return false;
			}

			$at += 2;

			/*
			 * Find a case-insensitive match to the tag name.
			 *
			 * Because tag names are limited to US-ASCII there is no
			 * need to perform any kind of Unicode normalization when
			 * comparing; any character which could be impacted by such
			 * normalization could not be part of a tag name.
			 */
			for ( $i = 0; $i < $tag_length; $i++ ) {
				$tag_char  = $tag_name[ $i ];
				$html_char = $html[ $at + $i ];

				if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) {
					$at += $i;
					continue 2;
				}
			}

			$at += $tag_length;

			$this->bytes_already_parsed = $at;

			if ( $at >= strlen( $html ) ) {
				return false;
			}

			/*
			 * Ensure that the tag name terminates to avoid matching on
			 * substrings of a longer tag name. For example, the sequence
			 * "</textarearug" should not match for "</textarea" even
			 * though "textarea" is found within the text.
			 */
			$c = $html[ $at ];
			if ( ' ' !== $c && "\t" !== $c && "\r" !== $c && "\n" !== $c && '/' !== $c && '>' !== $c ) {
				continue;
			}

			while ( $this->parse_next_attribute() ) {
				continue;
			}

			$at = $this->bytes_already_parsed;
			if ( $at >= strlen( $this->html ) ) {
				return false;
			}

			if ( '>' === $html[ $at ] ) {
				$this->bytes_already_parsed = $at + 1;
				return true;
			}

			if ( $at + 1 >= strlen( $this->html ) ) {
				return false;
			}

			if ( '/' === $html[ $at ] && '>' === $html[ $at + 1 ] ) {
				$this->bytes_already_parsed = $at + 2;
				return true;
			}
		}

		return false;
	}

	/**
	 * Skips contents of script tags.
	 *
	 * @since 6.2.0
	 *
	 * @return bool Whether the script tag was closed before the end of the document.
	 */
	private function skip_script_data() {
		$state      = 'unescaped';
		$html       = $this->html;
		$doc_length = strlen( $html );
		$at         = $this->bytes_already_parsed;

		while ( false !== $at && $at < $doc_length ) {
			$at += strcspn( $html, '-<', $at );

			/*
			 * For all script states a "-->" transitions
			 * back into the normal unescaped script mode,
			 * even if that's the current state.
			 */
			if (
				$at + 2 < $doc_length &&
				'-' === $html[ $at ] &&
				'-' === $html[ $at + 1 ] &&
				'>' === $html[ $at + 2 ]
			) {
				$at   += 3;
				$state = 'unescaped';
				continue;
			}

			// Everything of interest past here starts with "<".
			if ( $at + 1 >= $doc_length || '<' !== $html[ $at++ ] ) {
				continue;
			}

			/*
			 * Unlike with "-->", the "<!--" only transitions
			 * into the escaped mode if not already there.
			 *
			 * Inside the escaped modes it will be ignored; and
			 * should never break out of the double-escaped
			 * mode and back into the escaped mode.
			 *
			 * While this requires a mode change, it does not
			 * impact the parsing otherwise, so continue
			 * parsing after updating the state.
			 */
			if (
				$at + 2 < $doc_length &&
				'!' === $html[ $at ] &&
				'-' === $html[ $at + 1 ] &&
				'-' === $html[ $at + 2 ]
			) {
				$at   += 3;
				$state = 'unescaped' === $state ? 'escaped' : $state;
				continue;
			}

			if ( '/' === $html[ $at ] ) {
				$closer_potentially_starts_at = $at - 1;
				$is_closing                   = true;
				++$at;
			} else {
				$is_closing = false;
			}

			/*
			 * At this point the only remaining state-changes occur with the
			 * <script> and </script> tags; unless one of these appears next,
			 * proceed scanning to the next potential token in the text.
			 */
			if ( ! (
				$at + 6 < $doc_length &&
				( 's' === $html[ $at ] || 'S' === $html[ $at ] ) &&
				( 'c' === $html[ $at + 1 ] || 'C' === $html[ $at + 1 ] ) &&
				( 'r' === $html[ $at + 2 ] || 'R' === $html[ $at + 2 ] ) &&
				( 'i' === $html[ $at + 3 ] || 'I' === $html[ $at + 3 ] ) &&
				( 'p' === $html[ $at + 4 ] || 'P' === $html[ $at + 4 ] ) &&
				( 't' === $html[ $at + 5 ] || 'T' === $html[ $at + 5 ] )
			) ) {
				++$at;
				continue;
			}

			/*
			 * Ensure that the script tag terminates to avoid matching on
			 * substrings of a non-match. For example, the sequence
			 * "<script123" should not end a script region even though
			 * "<script" is found within the text.
			 */
			if ( $at + 6 >= $doc_length ) {
				continue;
			}
			$at += 6;
			$c   = $html[ $at ];
			if ( ' ' !== $c && "\t" !== $c && "\r" !== $c && "\n" !== $c && '/' !== $c && '>' !== $c ) {
				++$at;
				continue;
			}

			if ( 'escaped' === $state && ! $is_closing ) {
				$state = 'double-escaped';
				continue;
			}

			if ( 'double-escaped' === $state && $is_closing ) {
				$state = 'escaped';
				continue;
			}

			if ( $is_closing ) {
				$this->bytes_already_parsed = $closer_potentially_starts_at;
				$this->tag_name_starts_at   = $closer_potentially_starts_at;
				if ( $this->bytes_already_parsed >= $doc_length ) {
					return false;
				}

				while ( $this->parse_next_attribute() ) {
					continue;
				}

				if ( $this->bytes_already_parsed >= $doc_length ) {
					$this->parser_state = self::STATE_INCOMPLETE_INPUT;

					return false;
				}

				if ( '>' === $html[ $this->bytes_already_parsed ] ) {
					++$this->bytes_already_parsed;
					return true;
				}
			}

			++$at;
		}

		return false;
	}

	/**
	 * Parses the next tag.
	 *
	 * This will find and start parsing the next tag, including
	 * the opening `<`, the potential closer `/`, and the tag
	 * name. It does not parse the attributes or scan to the
	 * closing `>`; these are left for other methods.
	 *
	 * @since 6.2.0
	 * @since 6.2.1 Support abruptly-closed comments, invalid-tag-closer-comments, and empty elements.
	 *
	 * @return bool Whether a tag was found before the end of the document.
	 */
	private function parse_next_tag() {
		$this->after_tag();

		$html       = $this->html;
		$doc_length = strlen( $html );
		$was_at     = $this->bytes_already_parsed;
		$at         = $was_at;

		while ( false !== $at && $at < $doc_length ) {
			$at = strpos( $html, '<', $at );

			/*
			 * This does not imply an incomplete parse; it indicates that there
			 * can be nothing left in the document other than a #text node.
			 */
			if ( false === $at ) {
				$this->parser_state         = self::STATE_TEXT_NODE;
				$this->token_starts_at      = $was_at;
				$this->token_length         = strlen( $html ) - $was_at;
				$this->text_starts_at       = $was_at;
				$this->text_length          = $this->token_length;
				$this->bytes_already_parsed = strlen( $html );
				return true;
			}

			if ( $at > $was_at ) {
				/*
				 * A "<" normally starts a new HTML tag or syntax token, but in cases where the
				 * following character can't produce a valid token, the "<" is instead treated
				 * as plaintext and the parser should skip over it. This avoids a problem when
				 * following earlier practices of typing emoji with text, e.g. "<3". This
				 * should be a heart, not a tag. It's supposed to be rendered, not hidden.
				 *
				 * At this point the parser checks if this is one of those cases and if it is
				 * will continue searching for the next "<" in search of a token boundary.
				 *
				 * @see https://html.spec.whatwg.org/#tag-open-state
				 */
				if ( strlen( $html ) > $at + 1 ) {
					$next_character  = $html[ $at + 1 ];
					$at_another_node = (
						'!' === $next_character ||
						'/' === $next_character ||
						'?' === $next_character ||
						( 'A' <= $next_character && $next_character <= 'Z' ) ||
						( 'a' <= $next_character && $next_character <= 'z' )
					);
					if ( ! $at_another_node ) {
						++$at;
						continue;
					}
				}

				$this->parser_state         = self::STATE_TEXT_NODE;
				$this->token_starts_at      = $was_at;
				$this->token_length         = $at - $was_at;
				$this->text_starts_at       = $was_at;
				$this->text_length          = $this->token_length;
				$this->bytes_already_parsed = $at;
				return true;
			}

			$this->token_starts_at = $at;

			if ( $at + 1 < $doc_length && '/' === $this->html[ $at + 1 ] ) {
				$this->is_closing_tag = true;
				++$at;
			} else {
				$this->is_closing_tag = false;
			}

			/*
			 * HTML tag names must start with [a-zA-Z] otherwise they are not tags.
			 * For example, "<3" is rendered as text, not a tag opener. If at least
			 * one letter follows the "<" then _it is_ a tag, but if the following
			 * character is anything else it _is not a tag_.
			 *
			 * It's not uncommon to find non-tags starting with `<` in an HTML
			 * document, so it's good for performance to make this pre-check before
			 * continuing to attempt to parse a tag name.
			 *
			 * Reference:
			 * * https://html.spec.whatwg.org/multipage/parsing.html#data-state
			 * * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
			 */
			$tag_name_prefix_length = strspn( $html, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', $at + 1 );
			if ( $tag_name_prefix_length > 0 ) {
				++$at;
				$this->parser_state         = self::STATE_MATCHED_TAG;
				$this->tag_name_starts_at   = $at;
				$this->tag_name_length      = $tag_name_prefix_length + strcspn( $html, " \t\f\r\n/>", $at + $tag_name_prefix_length );
				$this->bytes_already_parsed = $at + $this->tag_name_length;
				return true;
			}

			/*
			 * Abort if no tag is found before the end of
			 * the document. There is nothing left to parse.
			 */
			if ( $at + 1 >= $doc_length ) {
				$this->parser_state = self::STATE_INCOMPLETE_INPUT;

				return false;
			}

			/*
			 * `<!` transitions to markup declaration open state
			 * https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state
			 */
			if ( ! $this->is_closing_tag && '!' === $html[ $at + 1 ] ) {
				/*
				 * `<!--` transitions to a comment state – apply further comment rules.
				 * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
				 */
				if (
					$doc_length > $at + 3 &&
					'-' === $html[ $at + 2 ] &&
					'-' === $html[ $at + 3 ]
				) {
					$closer_at = $at + 4;
					// If it's not possible to close the comment then there is nothing more to scan.
					if ( $doc_length <= $closer_at ) {
						$this->parser_state = self::STATE_INCOMPLETE_INPUT;

						return false;
					}

					// Abruptly-closed empty comments are a sequence of dashes followed by `>`.
					$span_of_dashes = strspn( $html, '-', $closer_at );
					if ( '>' === $html[ $closer_at + $span_of_dashes ] ) {
						/*
						 * @todo When implementing `set_modifiable_text()` ensure that updates to this token
						 *       don't break the syntax for short comments, e.g. `<!--->`. Unlike other comment
						 *       and bogus comment syntax, these leave no clear insertion point for text and
						 *       they need to be modified specially in order to contain text. E.g. to store
						 *       `?` as the modifiable text, the `<!--->` needs to become `<!--?-->`, which
						 *       involves inserting an additional `-` into the token after the modifiable text.
						 */
						$this->parser_state = self::STATE_COMMENT;
						$this->comment_type = self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT;
						$this->token_length = $closer_at + $span_of_dashes + 1 - $this->token_starts_at;

						// Only provide modifiable text if the token is long enough to contain it.
						if ( $span_of_dashes >= 2 ) {
							$this->comment_type   = self::COMMENT_AS_HTML_COMMENT;
							$this->text_starts_at = $this->token_starts_at + 4;
							$this->text_length    = $span_of_dashes - 2;
						}

						$this->bytes_already_parsed = $closer_at + $span_of_dashes + 1;
						return true;
					}

					/*
					 * Comments may be closed by either a --> or an invalid --!>.
					 * The first occurrence closes the comment.
					 *
					 * See https://html.spec.whatwg.org/#parse-error-incorrectly-closed-comment
					 */
					--$closer_at; // Pre-increment inside condition below reduces risk of accidental infinite looping.
					while ( ++$closer_at < $doc_length ) {
						$closer_at = strpos( $html, '--', $closer_at );
						if ( false === $closer_at ) {
							$this->parser_state = self::STATE_INCOMPLETE_INPUT;

							return false;
						}

						if ( $closer_at + 2 < $doc_length && '>' === $html[ $closer_at + 2 ] ) {
							$this->parser_state         = self::STATE_COMMENT;
							$this->comment_type         = self::COMMENT_AS_HTML_COMMENT;
							$this->token_length         = $closer_at + 3 - $this->token_starts_at;
							$this->text_starts_at       = $this->token_starts_at + 4;
							$this->text_length          = $closer_at - $this->text_starts_at;
							$this->bytes_already_parsed = $closer_at + 3;
							return true;
						}

						if (
							$closer_at + 3 < $doc_length &&
							'!' === $html[ $closer_at + 2 ] &&
							'>' === $html[ $closer_at + 3 ]
						) {
							$this->parser_state         = self::STATE_COMMENT;
							$this->comment_type         = self::COMMENT_AS_HTML_COMMENT;
							$this->token_length         = $closer_at + 4 - $this->token_starts_at;
							$this->text_starts_at       = $this->token_starts_at + 4;
							$this->text_length          = $closer_at - $this->text_starts_at;
							$this->bytes_already_parsed = $closer_at + 4;
							return true;
						}
					}
				}

				/*
				 * `<!DOCTYPE` transitions to DOCTYPE state – skip to the nearest >
				 * These are ASCII-case-insensitive.
				 * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
				 */
				if (
					$doc_length > $at + 8 &&
					( 'D' === $html[ $at + 2 ] || 'd' === $html[ $at + 2 ] ) &&
					( 'O' === $html[ $at + 3 ] || 'o' === $html[ $at + 3 ] ) &&
					( 'C' === $html[ $at + 4 ] || 'c' === $html[ $at + 4 ] ) &&
					( 'T' === $html[ $at + 5 ] || 't' === $html[ $at + 5 ] ) &&
					( 'Y' === $html[ $at + 6 ] || 'y' === $html[ $at + 6 ] ) &&
					( 'P' === $html[ $at + 7 ] || 'p' === $html[ $at + 7 ] ) &&
					( 'E' === $html[ $at + 8 ] || 'e' === $html[ $at + 8 ] )
				) {
					$closer_at = strpos( $html, '>', $at + 9 );
					if ( false === $closer_at ) {
						$this->parser_state = self::STATE_INCOMPLETE_INPUT;

						return false;
					}

					$this->parser_state         = self::STATE_DOCTYPE;
					$this->token_length         = $closer_at + 1 - $this->token_starts_at;
					$this->text_starts_at       = $this->token_starts_at + 9;
					$this->text_length          = $closer_at - $this->text_starts_at;
					$this->bytes_already_parsed = $closer_at + 1;
					return true;
				}

				/*
				 * Anything else here is an incorrectly-opened comment and transitions
				 * to the bogus comment state - skip to the nearest >. If no closer is
				 * found then the HTML was truncated inside the markup declaration.
				 */
				$closer_at = strpos( $html, '>', $at + 1 );
				if ( false === $closer_at ) {
					$this->parser_state = self::STATE_INCOMPLETE_INPUT;

					return false;
				}

				$this->parser_state         = self::STATE_COMMENT;
				$this->comment_type         = self::COMMENT_AS_INVALID_HTML;
				$this->token_length         = $closer_at + 1 - $this->token_starts_at;
				$this->text_starts_at       = $this->token_starts_at + 2;
				$this->text_length          = $closer_at - $this->text_starts_at;
				$this->bytes_already_parsed = $closer_at + 1;

				/*
				 * Identify nodes that would be CDATA if HTML had CDATA sections.
				 *
				 * This section must occur after identifying the bogus comment end
				 * because in an HTML parser it will span to the nearest `>`, even
				 * if there's no `]]>` as would be required in an XML document. It
				 * is therefore not possible to parse a CDATA section containing
				 * a `>` in the HTML syntax.
				 *
				 * Inside foreign elements there is a discrepancy between browsers
				 * and the specification on this.
				 *
				 * @todo Track whether the Tag Processor is inside a foreign element
				 *       and require the proper closing `]]>` in those cases.
				 */
				if (
					$this->token_length >= 10 &&
					'[' === $html[ $this->token_starts_at + 2 ] &&
					'C' === $html[ $this->token_starts_at + 3 ] &&
					'D' === $html[ $this->token_starts_at + 4 ] &&
					'A' === $html[ $this->token_starts_at + 5 ] &&
					'T' === $html[ $this->token_starts_at + 6 ] &&
					'A' === $html[ $this->token_starts_at + 7 ] &&
					'[' === $html[ $this->token_starts_at + 8 ] &&
					']' === $html[ $closer_at - 1 ] &&
					']' === $html[ $closer_at - 2 ]
				) {
					$this->parser_state    = self::STATE_COMMENT;
					$this->comment_type    = self::COMMENT_AS_CDATA_LOOKALIKE;
					$this->text_starts_at += 7;
					$this->text_length    -= 9;
				}

				return true;
			}

			/*
			 * </> is a missing end tag name, which is ignored.
			 *
			 * This was also known as the "presumptuous empty tag"
			 * in early discussions as it was proposed to close
			 * the nearest previous opening tag.
			 *
			 * See https://html.spec.whatwg.org/#parse-error-missing-end-tag-name
			 */
			if ( '>' === $html[ $at + 1 ] ) {
				// `<>` is interpreted as plaintext.
				if ( ! $this->is_closing_tag ) {
					++$at;
					continue;
				}

				$this->parser_state         = self::STATE_PRESUMPTUOUS_TAG;
				$this->token_length         = $at + 2 - $this->token_starts_at;
				$this->bytes_already_parsed = $at + 2;
				return true;
			}

			/*
			 * `<?` transitions to a bogus comment state – skip to the nearest >
			 * See https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
			 */
			if ( ! $this->is_closing_tag && '?' === $html[ $at + 1 ] ) {
				$closer_at = strpos( $html, '>', $at + 2 );
				if ( false === $closer_at ) {
					$this->parser_state = self::STATE_INCOMPLETE_INPUT;

					return false;
				}

				$this->parser_state         = self::STATE_COMMENT;
				$this->comment_type         = self::COMMENT_AS_INVALID_HTML;
				$this->token_length         = $closer_at + 1 - $this->token_starts_at;
				$this->text_starts_at       = $this->token_starts_at + 2;
				$this->text_length          = $closer_at - $this->text_starts_at;
				$this->bytes_already_parsed = $closer_at + 1;

				/*
				 * Identify a Processing Instruction node were HTML to have them.
				 *
				 * This section must occur after identifying the bogus comment end
				 * because in an HTML parser it will span to the nearest `>`, even
				 * if there's no `?>` as would be required in an XML document. It
				 * is therefore not possible to parse a Processing Instruction node
				 * containing a `>` in the HTML syntax.
				 *
				 * XML allows for more target names, but this code only identifies
				 * those with ASCII-representable target names. This means that it
				 * may identify some Processing Instruction nodes as bogus comments,
				 * but it will not misinterpret the HTML structure. By limiting the
				 * identification to these target names the Tag Processor can avoid
				 * the need to start parsing UTF-8 sequences.
1858 * 1859 * > NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] | 1860 * [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | 1861 * [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | 1862 * [#x10000-#xEFFFF] 1863 * > NameChar ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040] 1864 * 1865 * @see https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget 1866 */ 1867 if ( $this->token_length >= 5 && '?' === $html[ $closer_at - 1 ] ) { 1868 $comment_text = substr( $html, $this->token_starts_at + 2, $this->token_length - 4 ); 1869 $pi_target_length = strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_' ); 1870 1871 if ( 0 < $pi_target_length ) { 1872 $pi_target_length += strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.', $pi_target_length ); 1873 1874 $this->comment_type = self::COMMENT_AS_PI_NODE_LOOKALIKE; 1875 $this->tag_name_starts_at = $this->token_starts_at + 2; 1876 $this->tag_name_length = $pi_target_length; 1877 $this->text_starts_at += $pi_target_length; 1878 $this->text_length -= $pi_target_length + 1; 1879 } 1880 } 1881 1882 return true; 1883 } 1884 1885 /* 1886 * If a non-alpha starts the tag name in a tag closer it's a comment. 1887 * Find the first `>`, which closes the comment. 1888 * 1889 * This parser classifies these particular comments as special "funky comments" 1890 * which are made available for further processing. 1891 * 1892 * See https://html.spec.whatwg.org/#parse-error-invalid-first-character-of-tag-name 1893 */ 1894 if ( $this->is_closing_tag ) { 1895 // No chance of finding a closer. 
1896 if ( $at + 3 > $doc_length ) { 1897 return false; 1898 } 1899 1900 $closer_at = strpos( $html, '>', $at + 2 ); 1901 if ( false === $closer_at ) { 1902 $this->parser_state = self::STATE_INCOMPLETE_INPUT; 1903 1904 return false; 1905 } 1906 1907 $this->parser_state = self::STATE_FUNKY_COMMENT; 1908 $this->token_length = $closer_at + 1 - $this->token_starts_at; 1909 $this->text_starts_at = $this->token_starts_at + 2; 1910 $this->text_length = $closer_at - $this->text_starts_at; 1911 $this->bytes_already_parsed = $closer_at + 1; 1912 return true; 1913 } 1914 1915 ++$at; 1916 } 1917 1918 return false; 1919 } 1920 1921 /** 1922 * Parses the next attribute. 1923 * 1924 * @since 6.2.0 1925 * 1926 * @return bool Whether an attribute was found before the end of the document. 1927 */ 1928 private function parse_next_attribute() { 1929 // Skip whitespace and slashes. 1930 $this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n/", $this->bytes_already_parsed ); 1931 if ( $this->bytes_already_parsed >= strlen( $this->html ) ) { 1932 $this->parser_state = self::STATE_INCOMPLETE_INPUT; 1933 1934 return false; 1935 } 1936 1937 /* 1938 * Treat the equal sign as a part of the attribute 1939 * name if it is the first encountered byte. 1940 * 1941 * @see https://html.spec.whatwg.org/multipage/parsing.html#before-attribute-name-state 1942 */ 1943 $name_length = '=' === $this->html[ $this->bytes_already_parsed ] 1944 ? 1 + strcspn( $this->html, "=/> \t\f\r\n", $this->bytes_already_parsed + 1 ) 1945 : strcspn( $this->html, "=/> \t\f\r\n", $this->bytes_already_parsed ); 1946 1947 // No attribute, just tag closer. 
1948 if ( 0 === $name_length || $this->bytes_already_parsed + $name_length >= strlen( $this->html ) ) { 1949 return false; 1950 } 1951 1952 $attribute_start = $this->bytes_already_parsed; 1953 $attribute_name = substr( $this->html, $attribute_start, $name_length ); 1954 $this->bytes_already_parsed += $name_length; 1955 if ( $this->bytes_already_parsed >= strlen( $this->html ) ) { 1956 $this->parser_state = self::STATE_INCOMPLETE_INPUT; 1957 1958 return false; 1959 } 1960 1961 $this->skip_whitespace(); 1962 if ( $this->bytes_already_parsed >= strlen( $this->html ) ) { 1963 $this->parser_state = self::STATE_INCOMPLETE_INPUT; 1964 1965 return false; 1966 } 1967 1968 $has_value = '=' === $this->html[ $this->bytes_already_parsed ]; 1969 if ( $has_value ) { 1970 ++$this->bytes_already_parsed; 1971 $this->skip_whitespace(); 1972 if ( $this->bytes_already_parsed >= strlen( $this->html ) ) { 1973 $this->parser_state = self::STATE_INCOMPLETE_INPUT; 1974 1975 return false; 1976 } 1977 1978 switch ( $this->html[ $this->bytes_already_parsed ] ) { 1979 case "'": 1980 case '"': 1981 $quote = $this->html[ $this->bytes_already_parsed ]; 1982 $value_start = $this->bytes_already_parsed + 1; 1983 $value_length = strcspn( $this->html, $quote, $value_start ); 1984 $attribute_end = $value_start + $value_length + 1; 1985 $this->bytes_already_parsed = $attribute_end; 1986 break; 1987 1988 default: 1989 $value_start = $this->bytes_already_parsed; 1990 $value_length = strcspn( $this->html, "> \t\f\r\n", $value_start ); 1991 $attribute_end = $value_start + $value_length; 1992 $this->bytes_already_parsed = $attribute_end; 1993 } 1994 } else { 1995 $value_start = $this->bytes_already_parsed; 1996 $value_length = 0; 1997 $attribute_end = $attribute_start + $name_length; 1998 } 1999 2000 if ( $attribute_end >= strlen( $this->html ) ) { 2001 $this->parser_state = self::STATE_INCOMPLETE_INPUT; 2002 2003 return false; 2004 } 2005 2006 if ( $this->is_closing_tag ) { 2007 return true; 2008 } 2009 2010 
		/*
		 * > There must never be two or more attributes on
		 * > the same start tag whose names are an ASCII
		 * > case-insensitive match for each other.
		 *     - HTML 5 spec
		 *
		 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
		 */
		$comparable_name = strtolower( $attribute_name );

		// If an attribute is listed many times, only use the first declaration and ignore the rest.
		if ( ! array_key_exists( $comparable_name, $this->attributes ) ) {
			$this->attributes[ $comparable_name ] = new WP_HTML_Attribute_Token(
				$attribute_name,
				$value_start,
				$value_length,
				$attribute_start,
				$attribute_end - $attribute_start,
				! $has_value
			);

			return true;
		}

		/*
		 * Track the duplicate attributes so that removing the attribute
		 * removes every copy of it together.
		 *
		 * While `$this->duplicate_attributes` could always be stored as an `array()`,
		 * which would simplify the logic here, storing a `null` and only allocating
		 * an array when encountering duplicates avoids needless allocations in the
		 * common case of parsing tags with no duplicate attributes.
		 */
		$duplicate_span = new WP_HTML_Span( $attribute_start, $attribute_end - $attribute_start );
		if ( null === $this->duplicate_attributes ) {
			$this->duplicate_attributes = array( $comparable_name => array( $duplicate_span ) );
		} elseif ( ! array_key_exists( $comparable_name, $this->duplicate_attributes ) ) {
			$this->duplicate_attributes[ $comparable_name ] = array( $duplicate_span );
		} else {
			$this->duplicate_attributes[ $comparable_name ][] = $duplicate_span;
		}

		return true;
	}

	/**
	 * Move the internal cursor past any immediately following whitespace.
	 *
	 * @since 6.2.0
	 */
	private function skip_whitespace() {
		$this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n", $this->bytes_already_parsed );
	}

	/**
	 * Applies attribute updates and cleans up once a tag is fully parsed.
	 *
	 * @since 6.2.0
	 */
	private function after_tag() {
		/*
		 * There could be lexical updates enqueued for an attribute that
		 * also exists on the next tag. In order to avoid conflating the
		 * attributes across the two tags, lexical updates with names
		 * need to be flushed to raw lexical updates.
		 */
		$this->class_name_updates_to_attributes_updates();

		/*
		 * Purge updates if there are too many. The actual count isn't
		 * scientific, but a few values from 100 to a few thousand were
		 * tested to find a practically-useful limit.
		 *
		 * If the update queue grows too big, then the Tag Processor
		 * will spend more time iterating through the updates and lose
		 * the efficiency gains of deferring their application.
		 */
		if ( 1000 < count( $this->lexical_updates ) ) {
			$this->get_updated_html();
		}

		foreach ( $this->lexical_updates as $name => $update ) {
			/*
			 * Any updates appearing after the cursor should be applied
			 * before proceeding, otherwise they may be overlooked.
2094 */ 2095 if ( $update->start >= $this->bytes_already_parsed ) { 2096 $this->get_updated_html(); 2097 break; 2098 } 2099 2100 if ( is_int( $name ) ) { 2101 continue; 2102 } 2103 2104 $this->lexical_updates[] = $update; 2105 unset( $this->lexical_updates[ $name ] ); 2106 } 2107 2108 $this->token_starts_at = null; 2109 $this->token_length = null; 2110 $this->tag_name_starts_at = null; 2111 $this->tag_name_length = null; 2112 $this->text_starts_at = 0; 2113 $this->text_length = 0; 2114 $this->is_closing_tag = null; 2115 $this->attributes = array(); 2116 $this->comment_type = null; 2117 $this->duplicate_attributes = null; 2118 } 2119 2120 /** 2121 * Converts class name updates into tag attributes updates 2122 * (they are accumulated in different data formats for performance). 2123 * 2124 * @since 6.2.0 2125 * 2126 * @see WP_HTML_Tag_Processor::$lexical_updates 2127 * @see WP_HTML_Tag_Processor::$classname_updates 2128 */ 2129 private function class_name_updates_to_attributes_updates() { 2130 if ( count( $this->classname_updates ) === 0 ) { 2131 return; 2132 } 2133 2134 $existing_class = $this->get_enqueued_attribute_value( 'class' ); 2135 if ( null === $existing_class || true === $existing_class ) { 2136 $existing_class = ''; 2137 } 2138 2139 if ( false === $existing_class && isset( $this->attributes['class'] ) ) { 2140 $existing_class = substr( 2141 $this->html, 2142 $this->attributes['class']->value_starts_at, 2143 $this->attributes['class']->value_length 2144 ); 2145 } 2146 2147 if ( false === $existing_class ) { 2148 $existing_class = ''; 2149 } 2150 2151 /** 2152 * Updated "class" attribute value. 2153 * 2154 * This is incrementally built while scanning through the existing class 2155 * attribute, skipping removed classes on the way, and then appending 2156 * added classes at the end. Only when finished processing will the 2157 * value contain the final new value. 
		 *
		 * @var string $class
		 */
		$class = '';

		/**
		 * Tracks the cursor position in the existing
		 * class attribute value while parsing.
		 *
		 * @var int $at
		 */
		$at = 0;

		/**
		 * Indicates if there's any need to modify the existing class attribute.
		 *
		 * If a call to `add_class()` or `remove_class()` wouldn't impact
		 * the `class` attribute value then there's no need to rebuild it.
		 * For example, when adding a class that's already present or
		 * removing one that isn't.
		 *
		 * This flag enables a performance optimization when none of the enqueued
		 * class updates would impact the `class` attribute; namely, that the
		 * processor can continue without modifying the input document, as if
		 * none of the `add_class()` or `remove_class()` calls had been made.
		 *
		 * This flag is set upon the first change that requires a string update.
		 *
		 * @var bool $modified
		 */
		$modified = false;

		// Remove unwanted classes by copying only the ones being kept.
		$existing_class_length = strlen( $existing_class );
		while ( $at < $existing_class_length ) {
			// Skip to the first non-whitespace character.
			$ws_at     = $at;
			$ws_length = strspn( $existing_class, " \t\f\r\n", $ws_at );
			$at       += $ws_length;

			// Capture the class name: everything until the next whitespace.
			$name_length = strcspn( $existing_class, " \t\f\r\n", $at );
			if ( 0 === $name_length ) {
				// If no more class names are found then that's the end.
				break;
			}

			$name = substr( $existing_class, $at, $name_length );
			$at  += $name_length;

			// If this class is marked for removal, start processing the next one.
			$remove_class = (
				isset( $this->classname_updates[ $name ] ) &&
				self::REMOVE_CLASS === $this->classname_updates[ $name ]
			);

			// If a class has already been seen then skip it; it should not be added twice.
			if ( ! $remove_class ) {
				$this->classname_updates[ $name ] = self::SKIP_CLASS;
			}

			if ( $remove_class ) {
				$modified = true;
				continue;
			}

			/*
			 * Otherwise, append it to the new "class" attribute value.
			 *
			 * There are options for handling whitespace between class names.
			 * Preserving the existing whitespace produces fewer changes
			 * to the HTML content and should clarify the before/after
			 * content when debugging the modified output.
			 *
			 * This approach contrasts with normalizing the inter-class
			 * whitespace to a single space, which might appear cleaner
			 * in the output HTML but would produce a noisier change.
			 */
			$class .= substr( $existing_class, $ws_at, $ws_length );
			$class .= $name;
		}

		// Add new classes by appending those which haven't already been seen.
		foreach ( $this->classname_updates as $name => $operation ) {
			if ( self::ADD_CLASS === $operation ) {
				$modified = true;

				$class .= strlen( $class ) > 0 ? ' ' : '';
				$class .= $name;
			}
		}

		$this->classname_updates = array();
		if ( ! $modified ) {
			return;
		}

		if ( strlen( $class ) > 0 ) {
			$this->set_attribute( 'class', $class );
		} else {
			$this->remove_attribute( 'class' );
		}
	}

	/**
	 * Applies attribute updates to HTML document.
	 *
	 * @since 6.2.0
	 * @since 6.2.1 Accumulates shift for internal cursor and passed pointer.
	 * @since 6.3.0 Invalidate any bookmarks whose targets are overwritten.
	 *
	 * @param int $shift_this_point Accumulate and return shift for this position.
	 * @return int How many bytes the given pointer moved in response to the updates.
	 */
	private function apply_attributes_updates( $shift_this_point = 0 ) {
		if ( ! count( $this->lexical_updates ) ) {
			return 0;
		}

		$accumulated_shift_for_given_point = 0;

		/*
		 * Attribute updates can be enqueued in any order but updates
		 * to the document must occur in lexical order; that is, each
		 * replacement must be made before all others which follow it
		 * at later string indices in the input document.
		 *
		 * Sorting avoids making out-of-order replacements which
		 * can lead to mangled output, partially-duplicated
		 * attributes, and overwritten attributes.
		 */
		usort( $this->lexical_updates, array( self::class, 'sort_start_ascending' ) );

		$bytes_already_copied = 0;
		$output_buffer        = '';
		foreach ( $this->lexical_updates as $diff ) {
			$shift = strlen( $diff->text ) - $diff->length;

			// Adjust the cursor position by however much an update affects it.
			if ( $diff->start < $this->bytes_already_parsed ) {
				$this->bytes_already_parsed += $shift;
			}

			// Accumulate shift of the given pointer within this function call.
			if ( $diff->start <= $shift_this_point ) {
				$accumulated_shift_for_given_point += $shift;
			}

			$output_buffer       .= substr( $this->html, $bytes_already_copied, $diff->start - $bytes_already_copied );
			$output_buffer       .= $diff->text;
			$bytes_already_copied = $diff->start + $diff->length;
		}

		$this->html = $output_buffer . substr( $this->html, $bytes_already_copied );

		/*
		 * Adjust bookmark locations to account for how the text
		 * replacements adjust offsets in the input document.
		 */
		foreach ( $this->bookmarks as $bookmark_name => $bookmark ) {
			$bookmark_end = $bookmark->start + $bookmark->length;

			/*
			 * Each lexical update which appears before the bookmark's endpoints
			 * might shift the offsets for those endpoints.
Loop through each change 2323 * and accumulate the total shift for each bookmark, then apply that 2324 * shift after tallying the full delta. 2325 */ 2326 $head_delta = 0; 2327 $tail_delta = 0; 2328 2329 foreach ( $this->lexical_updates as $diff ) { 2330 $diff_end = $diff->start + $diff->length; 2331 2332 if ( $bookmark->start < $diff->start && $bookmark_end < $diff->start ) { 2333 break; 2334 } 2335 2336 if ( $bookmark->start >= $diff->start && $bookmark_end < $diff_end ) { 2337 $this->release_bookmark( $bookmark_name ); 2338 continue 2; 2339 } 2340 2341 $delta = strlen( $diff->text ) - $diff->length; 2342 2343 if ( $bookmark->start >= $diff->start ) { 2344 $head_delta += $delta; 2345 } 2346 2347 if ( $bookmark_end >= $diff_end ) { 2348 $tail_delta += $delta; 2349 } 2350 } 2351 2352 $bookmark->start += $head_delta; 2353 $bookmark->length += $tail_delta - $head_delta; 2354 } 2355 2356 $this->lexical_updates = array(); 2357 2358 return $accumulated_shift_for_given_point; 2359 } 2360 2361 /** 2362 * Checks whether a bookmark with the given name exists. 2363 * 2364 * @since 6.3.0 2365 * 2366 * @param string $bookmark_name Name to identify a bookmark that potentially exists. 2367 * @return bool Whether that bookmark exists. 2368 */ 2369 public function has_bookmark( $bookmark_name ) { 2370 return array_key_exists( $bookmark_name, $this->bookmarks ); 2371 } 2372 2373 /** 2374 * Move the internal cursor in the Tag Processor to a given bookmark's location. 2375 * 2376 * In order to prevent accidental infinite loops, there's a 2377 * maximum limit on the number of times seek() can be called. 2378 * 2379 * @since 6.2.0 2380 * 2381 * @param string $bookmark_name Jump to the place in the document identified by this bookmark name. 2382 * @return bool Whether the internal cursor was successfully moved to the bookmark's location. 2383 */ 2384 public function seek( $bookmark_name ) { 2385 if ( ! 
array_key_exists( $bookmark_name, $this->bookmarks ) ) { 2386 _doing_it_wrong( 2387 __METHOD__, 2388 __( 'Unknown bookmark name.' ), 2389 '6.2.0' 2390 ); 2391 return false; 2392 } 2393 2394 if ( ++$this->seek_count > static::MAX_SEEK_OPS ) { 2395 _doing_it_wrong( 2396 __METHOD__, 2397 __( 'Too many calls to seek() - this can lead to performance issues.' ), 2398 '6.2.0' 2399 ); 2400 return false; 2401 } 2402 2403 // Flush out any pending updates to the document. 2404 $this->get_updated_html(); 2405 2406 // Point this tag processor before the sought tag opener and consume it. 2407 $this->bytes_already_parsed = $this->bookmarks[ $bookmark_name ]->start; 2408 $this->parser_state = self::STATE_READY; 2409 return $this->next_token(); 2410 } 2411 2412 /** 2413 * Compare two WP_HTML_Text_Replacement objects. 2414 * 2415 * @since 6.2.0 2416 * 2417 * @param WP_HTML_Text_Replacement $a First attribute update. 2418 * @param WP_HTML_Text_Replacement $b Second attribute update. 2419 * @return int Comparison value for string order. 2420 */ 2421 private static function sort_start_ascending( $a, $b ) { 2422 $by_start = $a->start - $b->start; 2423 if ( 0 !== $by_start ) { 2424 return $by_start; 2425 } 2426 2427 $by_text = isset( $a->text, $b->text ) ? strcmp( $a->text, $b->text ) : 0; 2428 if ( 0 !== $by_text ) { 2429 return $by_text; 2430 } 2431 2432 /* 2433 * This code should be unreachable, because it implies the two replacements 2434 * start at the same location and contain the same text. 2435 */ 2436 return $a->length - $b->length; 2437 } 2438 2439 /** 2440 * Return the enqueued value for a given attribute, if one exists. 2441 * 2442 * Enqueued updates can take different data types: 2443 * - If an update is enqueued and is boolean, the return will be `true` 2444 * - If an update is otherwise enqueued, the return will be the string value of that update. 2445 * - If an attribute is enqueued to be removed, the return will be `null` to indicate that. 
2446 * - If no updates are enqueued, the return will be `false` to differentiate from "removed." 2447 * 2448 * @since 6.2.0 2449 * 2450 * @param string $comparable_name The attribute name in its comparable form. 2451 * @return string|boolean|null Value of enqueued update if present, otherwise false. 2452 */ 2453 private function get_enqueued_attribute_value( $comparable_name ) { 2454 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { 2455 return false; 2456 } 2457 2458 if ( ! isset( $this->lexical_updates[ $comparable_name ] ) ) { 2459 return false; 2460 } 2461 2462 $enqueued_text = $this->lexical_updates[ $comparable_name ]->text; 2463 2464 // Removed attributes erase the entire span. 2465 if ( '' === $enqueued_text ) { 2466 return null; 2467 } 2468 2469 /* 2470 * Boolean attribute updates are just the attribute name without a corresponding value. 2471 * 2472 * This value might differ from the given comparable name in that there could be leading 2473 * or trailing whitespace, and that the casing follows the name given in `set_attribute`. 2474 * 2475 * Example: 2476 * 2477 * $p->set_attribute( 'data-TEST-id', 'update' ); 2478 * 'update' === $p->get_enqueued_attribute_value( 'data-test-id' ); 2479 * 2480 * Detect this difference based on the absence of the `=`, which _must_ exist in any 2481 * attribute containing a value, e.g. `<input type="text" enabled />`. 2482 * ¹ ² 2483 * 1. Attribute with a string value. 2484 * 2. Boolean attribute whose value is `true`. 2485 */ 2486 $equals_at = strpos( $enqueued_text, '=' ); 2487 if ( false === $equals_at ) { 2488 return true; 2489 } 2490 2491 /* 2492 * Finally, a normal update's value will appear after the `=` and 2493 * be double-quoted, as performed incidentally by `set_attribute`. 2494 * 2495 * e.g. `type="text"` 2496 * ¹² ³ 2497 * 1. Equals is here. 2498 * 2. Double-quoting starts one after the equals sign. 2499 * 3. Double-quoting ends at the last character in the update. 
2500 */ 2501 $enqueued_value = substr( $enqueued_text, $equals_at + 2, -1 ); 2502 return html_entity_decode( $enqueued_value ); 2503 } 2504 2505 /** 2506 * Returns the value of a requested attribute from a matched tag opener if that attribute exists. 2507 * 2508 * Example: 2509 * 2510 * $p = new WP_HTML_Tag_Processor( '<div enabled class="test" data-test-id="14">Test</div>' ); 2511 * $p->next_tag( array( 'class_name' => 'test' ) ) === true; 2512 * $p->get_attribute( 'data-test-id' ) === '14'; 2513 * $p->get_attribute( 'enabled' ) === true; 2514 * $p->get_attribute( 'aria-label' ) === null; 2515 * 2516 * $p->next_tag() === false; 2517 * $p->get_attribute( 'class' ) === null; 2518 * 2519 * @since 6.2.0 2520 * 2521 * @param string $name Name of attribute whose value is requested. 2522 * @return string|true|null Value of attribute or `null` if not available. Boolean attributes return `true`. 2523 */ 2524 public function get_attribute( $name ) { 2525 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { 2526 return null; 2527 } 2528 2529 $comparable = strtolower( $name ); 2530 2531 /* 2532 * For every attribute other than `class` it's possible to perform a quick check if 2533 * there's an enqueued lexical update whose value takes priority over what's found in 2534 * the input document. 2535 * 2536 * The `class` attribute is special though because of the exposed helpers `add_class` 2537 * and `remove_class`. These form a builder for the `class` attribute, so an additional 2538 * check for enqueued class changes is required in addition to the check for any enqueued 2539 * attribute values. If any exist, those enqueued class changes must first be flushed out 2540 * into an attribute value update. 2541 */ 2542 if ( 'class' === $name ) { 2543 $this->class_name_updates_to_attributes_updates(); 2544 } 2545 2546 // Return any enqueued attribute value updates if they exist. 
2547 $enqueued_value = $this->get_enqueued_attribute_value( $comparable ); 2548 if ( false !== $enqueued_value ) { 2549 return $enqueued_value; 2550 } 2551 2552 if ( ! isset( $this->attributes[ $comparable ] ) ) { 2553 return null; 2554 } 2555 2556 $attribute = $this->attributes[ $comparable ]; 2557 2558 /* 2559 * This flag distinguishes an attribute with no value 2560 * from an attribute with an empty string value. For 2561 * unquoted attributes this could look very similar. 2562 * It refers to whether an `=` follows the name. 2563 * 2564 * e.g. <div boolean-attribute empty-attribute=></div> 2565 * ¹ ² 2566 * 1. Attribute `boolean-attribute` is `true`. 2567 * 2. Attribute `empty-attribute` is `""`. 2568 */ 2569 if ( true === $attribute->is_true ) { 2570 return true; 2571 } 2572 2573 $raw_value = substr( $this->html, $attribute->value_starts_at, $attribute->value_length ); 2574 2575 return html_entity_decode( $raw_value ); 2576 } 2577 2578 /** 2579 * Gets lowercase names of all attributes matching a given prefix in the current tag. 2580 * 2581 * Note that matching is case-insensitive. This is in accordance with the spec: 2582 * 2583 * > There must never be two or more attributes on 2584 * > the same start tag whose names are an ASCII 2585 * > case-insensitive match for each other. 2586 * - HTML 5 spec 2587 * 2588 * Example: 2589 * 2590 * $p = new WP_HTML_Tag_Processor( '<div data-ENABLED class="test" DATA-test-id="14">Test</div>' ); 2591 * $p->next_tag( array( 'class_name' => 'test' ) ) === true; 2592 * $p->get_attribute_names_with_prefix( 'data-' ) === array( 'data-enabled', 'data-test-id' ); 2593 * 2594 * $p->next_tag() === false; 2595 * $p->get_attribute_names_with_prefix( 'data-' ) === null; 2596 * 2597 * @since 6.2.0 2598 * 2599 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive 2600 * 2601 * @param string $prefix Prefix of requested attribute names. 
2602 * @return array|null List of attribute names, or `null` when no tag opener is matched. 2603 */ 2604 public function get_attribute_names_with_prefix( $prefix ) { 2605 if ( 2606 self::STATE_MATCHED_TAG !== $this->parser_state || 2607 $this->is_closing_tag 2608 ) { 2609 return null; 2610 } 2611 2612 $comparable = strtolower( $prefix ); 2613 2614 $matches = array(); 2615 foreach ( array_keys( $this->attributes ) as $attr_name ) { 2616 if ( str_starts_with( $attr_name, $comparable ) ) { 2617 $matches[] = $attr_name; 2618 } 2619 } 2620 return $matches; 2621 } 2622 2623 /** 2624 * Returns the uppercase name of the matched tag. 2625 * 2626 * Example: 2627 * 2628 * $p = new WP_HTML_Tag_Processor( '<div class="test">Test</div>' ); 2629 * $p->next_tag() === true; 2630 * $p->get_tag() === 'DIV'; 2631 * 2632 * $p->next_tag() === false; 2633 * $p->get_tag() === null; 2634 * 2635 * @since 6.2.0 2636 * 2637 * @return string|null Name of currently matched tag in input HTML, or `null` if none found. 2638 */ 2639 public function get_tag() { 2640 if ( null === $this->tag_name_starts_at ) { 2641 return null; 2642 } 2643 2644 $tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length ); 2645 2646 if ( self::STATE_MATCHED_TAG === $this->parser_state ) { 2647 return strtoupper( $tag_name ); 2648 } 2649 2650 if ( 2651 self::STATE_COMMENT === $this->parser_state && 2652 self::COMMENT_AS_PI_NODE_LOOKALIKE === $this->get_comment_type() 2653 ) { 2654 return $tag_name; 2655 } 2656 2657 return null; 2658 } 2659 2660 /** 2661 * Indicates if the currently matched tag contains the self-closing flag. 2662 * 2663 * No HTML elements ought to have the self-closing flag and for those, the self-closing 2664 * flag will be ignored. For void elements this is benign because they "self close" 2665 * automatically. For non-void HTML elements though problems will appear if someone 2666 * intends to use a self-closing element in place of that element with an empty body. 
2667 * For HTML foreign elements and custom elements the self-closing flag determines if 2668 * they self-close or not. 2669 * 2670 * This function does not determine if a tag is self-closing, 2671 * but only if the self-closing flag is present in the syntax. 2672 * 2673 * @since 6.3.0 2674 * 2675 * @return bool Whether the currently matched tag contains the self-closing flag. 2676 */ 2677 public function has_self_closing_flag() { 2678 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { 2679 return false; 2680 } 2681 2682 /* 2683 * The self-closing flag is the solidus at the _end_ of the tag, not the beginning. 2684 * 2685 * Example: 2686 * 2687 * <figure /> 2688 * ^ this appears one character before the end of the closing ">". 2689 */ 2690 return '/' === $this->html[ $this->token_starts_at + $this->token_length - 1 ]; 2691 } 2692 2693 /** 2694 * Indicates if the current tag token is a tag closer. 2695 * 2696 * Example: 2697 * 2698 * $p = new WP_HTML_Tag_Processor( '<div></div>' ); 2699 * $p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) ); 2700 * $p->is_tag_closer() === false; 2701 * 2702 * $p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) ); 2703 * $p->is_tag_closer() === true; 2704 * 2705 * @since 6.2.0 2706 * 2707 * @return bool Whether the current tag is a tag closer. 2708 */ 2709 public function is_tag_closer() { 2710 return ( 2711 self::STATE_MATCHED_TAG === $this->parser_state && 2712 $this->is_closing_tag 2713 ); 2714 } 2715 2716 /** 2717 * Indicates the kind of matched token, if any. 2718 * 2719 * This differs from `get_token_name()` in that it always 2720 * returns a static string indicating the type, whereas 2721 * `get_token_name()` may return values derived from the 2722 * token itself, such as a tag name or processing 2723 * instruction tag. 2724 * 2725 * Possible values: 2726 * - `#tag` when matched on a tag. 2727 * - `#text` when matched on a text node. 
2728 * - `#cdata-section` when matched on a CDATA node. 2729 * - `#comment` when matched on a comment. 2730 * - `#doctype` when matched on a DOCTYPE declaration. 2731 * - `#presumptuous-tag` when matched on an empty tag closer. 2732 * - `#funky-comment` when matched on a funky comment. 2733 * 2734 * @since 6.5.0 2735 * 2736 * @return string|null What kind of token is matched, or null. 2737 */ 2738 public function get_token_type() { 2739 switch ( $this->parser_state ) { 2740 case self::STATE_MATCHED_TAG: 2741 return '#tag'; 2742 2743 case self::STATE_DOCTYPE: 2744 return '#doctype'; 2745 2746 default: 2747 return $this->get_token_name(); 2748 } 2749 } 2750 2751 /** 2752 * Returns the node name represented by the token. 2753 * 2754 * This matches the DOM API value `nodeName`. Some values 2755 * are static, such as `#text` for a text node, while others 2756 * are dynamically generated from the token itself. 2757 * 2758 * Dynamic names: 2759 * - Uppercase tag name for tag matches. 2760 * - `html` for DOCTYPE declarations. 2761 * 2762 * Note that if the Tag Processor is not matched on a token 2763 * then this function will return `null`, either because it 2764 * hasn't yet found a token or because it reached the end 2765 * of the document without matching a token. 2766 * 2767 * @since 6.5.0 2768 * 2769 * @return string|null Name of the matched token. 
	 */
	public function get_token_name() {
		switch ( $this->parser_state ) {
			case self::STATE_MATCHED_TAG:
				return $this->get_tag();

			case self::STATE_TEXT_NODE:
				return '#text';

			case self::STATE_CDATA_NODE:
				return '#cdata-section';

			case self::STATE_COMMENT:
				return '#comment';

			case self::STATE_DOCTYPE:
				return 'html';

			case self::STATE_PRESUMPTUOUS_TAG:
				return '#presumptuous-tag';

			case self::STATE_FUNKY_COMMENT:
				return '#funky-comment';
		}
	}

	/**
	 * Indicates what kind of comment produced the comment node.
	 *
	 * Because there are different kinds of HTML syntax which produce
	 * comments, the Tag Processor tracks and exposes this as a type
	 * for the comment. Nominally only regular HTML comments exist as
	 * they are commonly known, but a number of unrelated syntax errors
	 * also produce comments.
	 *
	 * @see self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT
	 * @see self::COMMENT_AS_CDATA_LOOKALIKE
	 * @see self::COMMENT_AS_INVALID_HTML
	 * @see self::COMMENT_AS_HTML_COMMENT
	 * @see self::COMMENT_AS_PI_NODE_LOOKALIKE
	 *
	 * @since 6.5.0
	 *
	 * @return string|null
	 */
	public function get_comment_type() {
		if ( self::STATE_COMMENT !== $this->parser_state ) {
			return null;
		}

		return $this->comment_type;
	}

	/**
	 * Returns the modifiable text for a matched token, or an empty string.
	 *
	 * Modifiable text is text content that may be read and changed without
	 * changing the HTML structure of the document around it. This includes
	 * the contents of `#text` nodes in the HTML as well as the inner
	 * contents of HTML comments, Processing Instructions, and others, even
	 * though these nodes aren't part of a parsed DOM tree. It also includes
	 * the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any
	 * other section in an HTML document which cannot contain HTML markup (DATA).
	 *
	 * If a token has no modifiable text then an empty string is returned to
	 * avoid needless crashing or type errors. An empty string does not mean
	 * that a token has modifiable text, and a token with modifiable text may
	 * have an empty string (e.g. a comment with no contents).
	 *
	 * @since 6.5.0
	 *
	 * @return string
	 */
	public function get_modifiable_text() {
		if ( null === $this->text_starts_at ) {
			return '';
		}

		$text = substr( $this->html, $this->text_starts_at, $this->text_length );

		// Comment data is not decoded.
		if (
			self::STATE_CDATA_NODE === $this->parser_state ||
			self::STATE_COMMENT === $this->parser_state ||
			self::STATE_DOCTYPE === $this->parser_state ||
			self::STATE_FUNKY_COMMENT === $this->parser_state
		) {
			return $text;
		}

		$tag_name = $this->get_tag();
		if (
			// Script data is not decoded.
			'SCRIPT' === $tag_name ||

			// RAWTEXT data is not decoded.
			'IFRAME' === $tag_name ||
			'NOEMBED' === $tag_name ||
			'NOFRAMES' === $tag_name ||
			'STYLE' === $tag_name ||
			'XMP' === $tag_name
		) {
			return $text;
		}

		$decoded = html_entity_decode( $text, ENT_QUOTES | ENT_HTML5 | ENT_SUBSTITUTE );

		/*
		 * TEXTAREA skips a leading newline, but this newline may appear not only as the
		 * literal character `\n`, but also as a character reference, such as in the
		 * following markup: `<textarea>&#x0a;Content</textarea>`.
		 *
		 * For these cases it's important to first decode the text content before checking
		 * for a leading newline and removing it.
		 */
		if (
			self::STATE_MATCHED_TAG === $this->parser_state &&
			'TEXTAREA' === $tag_name &&
			strlen( $decoded ) > 0 &&
			"\n" === $decoded[0]
		) {
			return substr( $decoded, 1 );
		}

		return $decoded;
	}

	/**
	 * Updates or creates a new attribute on the currently matched tag with the passed value.
	 *
	 * For boolean attributes special handling is provided:
	 * - When `true` is passed as the value, then only the attribute name is added to the tag.
	 * - When `false` is passed, the attribute gets removed if it existed before.
	 *
	 * For string attributes, the value is escaped using the `esc_attr` function.
	 *
	 * @since 6.2.0
	 * @since 6.2.1 Fix: Only create a single update for multiple calls with case-variant attribute names.
	 *
	 * @param string      $name  The attribute name to target.
	 * @param string|bool $value The new attribute value.
	 * @return bool Whether an attribute value was set.
	 */
	public function set_attribute( $name, $value ) {
		if (
			self::STATE_MATCHED_TAG !== $this->parser_state ||
			$this->is_closing_tag
		) {
			return false;
		}

		/*
		 * WordPress rejects more characters than are strictly forbidden
		 * in HTML5. This is to prevent additional security risks deeper
		 * in the WordPress and plugin stack. Specifically, the
		 * less-than (<), greater-than (>), and ampersand (&) aren't allowed.
		 *
		 * The use of a PCRE match enables looking for specific Unicode
		 * code points without writing a UTF-8 decoder. Whereas scanning
		 * for one-byte characters is trivial (with `strcspn`), scanning
		 * for the longer byte sequences would be more complicated. Given
		 * that this shouldn't be in the hot path for execution, it's a
		 * reasonable compromise in efficiency without introducing a
		 * noticeable impact on the overall system.
		 *
		 * @see https://html.spec.whatwg.org/#attributes-2
		 *
		 * @todo As the only regex pattern maybe we should take it out?
		 *       Are Unicode patterns available broadly in Core?
		 */
		if ( preg_match(
			'~[' .
				// Syntax-like characters.
				'"\'>&</ =' .
				// Control characters.
				'\x{00}-\x{1F}' .
				// HTML noncharacters.
				'\x{FDD0}-\x{FDEF}' .
				'\x{FFFE}\x{FFFF}\x{1FFFE}\x{1FFFF}\x{2FFFE}\x{2FFFF}\x{3FFFE}\x{3FFFF}' .
				'\x{4FFFE}\x{4FFFF}\x{5FFFE}\x{5FFFF}\x{6FFFE}\x{6FFFF}\x{7FFFE}\x{7FFFF}' .
				'\x{8FFFE}\x{8FFFF}\x{9FFFE}\x{9FFFF}\x{AFFFE}\x{AFFFF}\x{BFFFE}\x{BFFFF}' .
				'\x{CFFFE}\x{CFFFF}\x{DFFFE}\x{DFFFF}\x{EFFFE}\x{EFFFF}\x{FFFFE}\x{FFFFF}' .
				'\x{10FFFE}\x{10FFFF}' .
			']~Ssu',
			$name
		) ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Invalid attribute name.' ),
				'6.2.0'
			);

			return false;
		}

		/*
		 * > The values "true" and "false" are not allowed on boolean attributes.
		 * > To represent a false value, the attribute has to be omitted altogether.
		 *     - HTML5 spec, https://html.spec.whatwg.org/#boolean-attributes
		 */
		if ( false === $value ) {
			return $this->remove_attribute( $name );
		}

		if ( true === $value ) {
			$updated_attribute = $name;
		} else {
			$escaped_new_value = esc_attr( $value );
			$updated_attribute = "{$name}=\"{$escaped_new_value}\"";
		}

		/*
		 * > There must never be two or more attributes on
		 * > the same start tag whose names are an ASCII
		 * > case-insensitive match for each other.
		 *     - HTML 5 spec
		 *
		 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
		 */
		$comparable_name = strtolower( $name );

		if ( isset( $this->attributes[ $comparable_name ] ) ) {
			/*
			 * Update an existing attribute.
			 *
			 * Example – set attribute id to "new" in <div id="initial_id" />:
			 *
			 *     <div id="initial_id"/>
			 *          ^-------------^
			 *          start         end
			 *     replacement: `id="new"`
			 *
			 *     Result: <div id="new"/>
			 */
			$existing_attribute                        = $this->attributes[ $comparable_name ];
			$this->lexical_updates[ $comparable_name ] = new WP_HTML_Text_Replacement(
				$existing_attribute->start,
				$existing_attribute->length,
				$updated_attribute
			);
		} else {
			/*
			 * Create a new attribute at the tag's name end.
			 *
			 * Example – add attribute id="new" to <div />:
			 *
			 *     <div/>
			 *         ^
			 *         start and end
			 *     replacement: ` id="new"`
			 *
			 *     Result: <div id="new"/>
			 */
			$this->lexical_updates[ $comparable_name ] = new WP_HTML_Text_Replacement(
				$this->tag_name_starts_at + $this->tag_name_length,
				0,
				' ' . $updated_attribute
			);
		}

		/*
		 * Any calls to update the `class` attribute directly should wipe out any
		 * enqueued class changes from `add_class` and `remove_class`.
		 */
		if ( 'class' === $comparable_name && ! empty( $this->classname_updates ) ) {
			$this->classname_updates = array();
		}

		return true;
	}

	/**
	 * Remove an attribute from the currently-matched tag.
	 *
	 * @since 6.2.0
	 *
	 * @param string $name The attribute name to remove.
	 * @return bool Whether an attribute was removed.
	 */
	public function remove_attribute( $name ) {
		if (
			self::STATE_MATCHED_TAG !== $this->parser_state ||
			$this->is_closing_tag
		) {
			return false;
		}

		/*
		 * > There must never be two or more attributes on
		 * > the same start tag whose names are an ASCII
		 * > case-insensitive match for each other.
		 *     - HTML 5 spec
		 *
		 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
		 */
		$name = strtolower( $name );

		/*
		 * Any calls to update the `class` attribute directly should wipe out any
		 * enqueued class changes from `add_class` and `remove_class`.
		 */
		if ( 'class' === $name && count( $this->classname_updates ) !== 0 ) {
			$this->classname_updates = array();
		}

		/*
		 * If updating an attribute that didn't exist in the input
		 * document, then remove the enqueued update and move on.
		 *
		 * For example, this might occur when calling `remove_attribute()`
		 * after calling `set_attribute()` for the same attribute
		 * and when that attribute wasn't originally present.
		 */
		if ( ! isset( $this->attributes[ $name ] ) ) {
			if ( isset( $this->lexical_updates[ $name ] ) ) {
				unset( $this->lexical_updates[ $name ] );
			}
			return false;
		}

		/*
		 * Removes an existing tag attribute.
		 *
		 * Example – remove the attribute id from <div id="initial_id"/>:
		 *
		 *     <div id="initial_id"/>
		 *          ^-------------^
		 *          start         end
		 *     replacement: ``
		 *
		 *     Result: <div />
		 */
		$this->lexical_updates[ $name ] = new WP_HTML_Text_Replacement(
			$this->attributes[ $name ]->start,
			$this->attributes[ $name ]->length,
			''
		);

		// Removes any duplicated attributes if they were also present.
		if ( null !== $this->duplicate_attributes && array_key_exists( $name, $this->duplicate_attributes ) ) {
			foreach ( $this->duplicate_attributes[ $name ] as $attribute_token ) {
				$this->lexical_updates[] = new WP_HTML_Text_Replacement(
					$attribute_token->start,
					$attribute_token->length,
					''
				);
			}
		}

		return true;
	}

	/**
	 * Adds a new class name to the currently matched tag.
	 *
	 * @since 6.2.0
	 *
	 * @param string $class_name The class name to add.
	 * @return bool Whether the class was set to be added.
	 */
	public function add_class( $class_name ) {
		if (
			self::STATE_MATCHED_TAG !== $this->parser_state ||
			$this->is_closing_tag
		) {
			return false;
		}

		$this->classname_updates[ $class_name ] = self::ADD_CLASS;

		return true;
	}

	/**
	 * Removes a class name from the currently matched tag.
	 *
	 * @since 6.2.0
	 *
	 * @param string $class_name The class name to remove.
	 * @return bool Whether the class was set to be removed.
	 */
	public function remove_class( $class_name ) {
		if (
			self::STATE_MATCHED_TAG !== $this->parser_state ||
			$this->is_closing_tag
		) {
			return false;
		}

		if ( null !== $this->tag_name_starts_at ) {
			$this->classname_updates[ $class_name ] = self::REMOVE_CLASS;
		}

		return true;
	}

	/**
	 * Returns the string representation of the HTML Tag Processor.
	 *
	 * @since 6.2.0
	 *
	 * @see WP_HTML_Tag_Processor::get_updated_html()
	 *
	 * @return string The processed HTML.
	 */
	public function __toString() {
		return $this->get_updated_html();
	}

	/**
	 * Returns the string representation of the HTML Tag Processor.
	 *
	 * @since 6.2.0
	 * @since 6.2.1 Shifts the internal cursor corresponding to the applied updates.
	 * @since 6.4.0 No longer calls subclass method `next_tag()` after updating HTML.
	 *
	 * @return string The processed HTML.
	 */
	public function get_updated_html() {
		$requires_no_updating = 0 === count( $this->classname_updates ) && 0 === count( $this->lexical_updates );

		/*
		 * When there is nothing more to update and nothing has already been
		 * updated, return the original document and avoid a string copy.
		 */
		if ( $requires_no_updating ) {
			return $this->html;
		}

		/*
		 * Keep track of the position right before the current tag. This will
		 * be necessary for reparsing the current tag after updating the HTML.
		 */
		$before_current_tag = $this->token_starts_at;

		/*
		 * 1. Apply the enqueued edits and update all the pointers to reflect those changes.
		 */
		$this->class_name_updates_to_attributes_updates();
		$before_current_tag += $this->apply_attributes_updates( $before_current_tag );

		/*
		 * 2. Rewind to before the current tag and reparse to get updated attributes.
		 *
		 * At this point the internal cursor points to the end of the tag name.
		 * Rewind before the tag name starts so that it's as if the cursor didn't
		 * move; a call to `next_tag()` will reparse the recently-updated attributes
		 * and additional calls to modify the attributes will apply at this same
		 * location, but in order to avoid issues with subclasses that might add
		 * behaviors to `next_tag()`, the internal methods should be called here
		 * instead.
		 *
		 * It's important to note that in this specific place there will be no change
		 * because the processor was already at a tag when this was called and it's
		 * rewinding only to the beginning of this very tag before reprocessing it
		 * and its attributes.
		 *
		 *     <p>Previous HTML<em>More HTML</em></p>
		 *                     ↑  │ back up by the length of the tag name plus the opening <
		 *                     └←─┘ back up by strlen("em") + 1 ==> 3
		 */
		$this->bytes_already_parsed = $before_current_tag;
		$this->base_class_next_token();

		return $this->html;
	}

	/**
	 * Parses tag query input into internal search criteria.
	 *
	 * @since 6.2.0
	 *
	 * @param array|string|null $query {
	 *     Optional. Which tag name to find, having which class, etc. Default is to find any tag.
	 *
	 *     @type string|null $tag_name     Which tag to find, or `null` for "any tag."
	 *     @type int|null    $match_offset Find the Nth tag matching all search criteria.
	 *                                     1 for "first" tag, 3 for "third," etc.
	 *                                     Defaults to first tag.
	 *     @type string|null $class_name   Tag must contain this class name to match.
	 *     @type string      $tag_closers  "visit" or "skip": whether to stop on tag closers, e.g. </div>.
	 * }
	 */
	private function parse_query( $query ) {
		if ( null !== $query && $query === $this->last_query ) {
			return;
		}

		$this->last_query          = $query;
		$this->sought_tag_name     = null;
		$this->sought_class_name   = null;
		$this->sought_match_offset = 1;
		$this->stop_on_tag_closers = false;

		// A single string value means "find the tag of this name".
		if ( is_string( $query ) ) {
			$this->sought_tag_name = $query;
			return;
		}

		// An empty query parameter applies no restrictions on the search.
		if ( null === $query ) {
			return;
		}

		// If not using the string interface, an associative array is required.
		if ( ! is_array( $query ) ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'The query argument must be an array or a tag name.' ),
				'6.2.0'
			);
			return;
		}

		if ( isset( $query['tag_name'] ) && is_string( $query['tag_name'] ) ) {
			$this->sought_tag_name = $query['tag_name'];
		}

		if ( isset( $query['class_name'] ) && is_string( $query['class_name'] ) ) {
			$this->sought_class_name = $query['class_name'];
		}

		if ( isset( $query['match_offset'] ) && is_int( $query['match_offset'] ) && 0 < $query['match_offset'] ) {
			$this->sought_match_offset = $query['match_offset'];
		}

		if ( isset( $query['tag_closers'] ) ) {
			$this->stop_on_tag_closers = 'visit' === $query['tag_closers'];
		}
	}

	/**
	 * Checks whether a given tag and its attributes match the search criteria.
	 *
	 * @since 6.2.0
	 *
	 * @return bool Whether the given tag and its attribute match the search criteria.
	 */
	private function matches() {
		if ( $this->is_closing_tag && ! $this->stop_on_tag_closers ) {
			return false;
		}

		// Does the tag name match the requested tag name in a case-insensitive manner?
		if ( null !== $this->sought_tag_name ) {
			/*
			 * String (byte) length lookup is fast. If they aren't the
			 * same length then they can't be the same string values.
			 */
			if ( strlen( $this->sought_tag_name ) !== $this->tag_name_length ) {
				return false;
			}

			/*
			 * Check each character to determine if they are the same.
			 * Defer calls to `strtoupper()` to avoid them when possible.
			 * Calling `strcasecmp()` here tested slower than comparing each
			 * character, so unless benchmarks show otherwise, it should
			 * not be used.
			 *
			 * It's expected that most of the time that this runs, a
			 * lower-case tag name will be supplied and the input will
			 * contain lower-case tag names, thus normally bypassing
			 * the case comparison code.
			 */
			for ( $i = 0; $i < $this->tag_name_length; $i++ ) {
				$html_char = $this->html[ $this->tag_name_starts_at + $i ];
				$tag_char  = $this->sought_tag_name[ $i ];

				if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) {
					return false;
				}
			}
		}

		if ( null !== $this->sought_class_name && ! $this->has_class( $this->sought_class_name ) ) {
			return false;
		}

		return true;
	}

	/**
	 * Parser Ready State.
	 *
	 * Indicates that the parser is ready to run and waiting for a state transition.
	 * It may not have started yet, or it may have just finished parsing a token and
	 * is ready to find the next one.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_READY = 'STATE_READY';

	/**
	 * Parser Complete State.
	 *
	 * Indicates that the parser has reached the end of the document and there is
	 * nothing left to scan. It finished parsing the last token completely.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_COMPLETE = 'STATE_COMPLETE';

	/**
	 * Parser Incomplete Input State.
	 *
	 * Indicates that the parser has reached the end of the document before finishing
	 * a token. It started parsing a token but there is a possibility that the input
	 * HTML document was truncated in the middle of a token.
	 *
	 * The parser is reset at the start of the incomplete token and has paused. There
	 * is nothing more that can be scanned unless provided a more complete document.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_INCOMPLETE_INPUT = 'STATE_INCOMPLETE_INPUT';

	/**
	 * Parser Matched Tag State.
	 *
	 * Indicates that the parser has found an HTML tag and it's possible to get
	 * the tag name and read or modify its attributes (if it's not a closing tag).
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_MATCHED_TAG = 'STATE_MATCHED_TAG';

	/**
	 * Parser Text Node State.
	 *
	 * Indicates that the parser has found a text node and it's possible
	 * to read and modify that text.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_TEXT_NODE = 'STATE_TEXT_NODE';

	/**
	 * Parser CDATA Node State.
	 *
	 * Indicates that the parser has found a CDATA node and it's possible
	 * to read and modify its modifiable text. Note that in HTML there are
	 * no CDATA nodes outside of foreign content (SVG and MathML). Outside
	 * of foreign content, they are treated as HTML comments.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_CDATA_NODE = 'STATE_CDATA_NODE';

	/**
	 * Indicates that the parser has found an HTML comment and it's
	 * possible to read and modify its modifiable text.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_COMMENT = 'STATE_COMMENT';

	/**
	 * Indicates that the parser has found a DOCTYPE node and it's
	 * possible to read and modify its modifiable text.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_DOCTYPE = 'STATE_DOCTYPE';

	/**
	 * Indicates that the parser has found an empty tag closer `</>`.
	 *
	 * Note that in HTML there are no empty tag closers, and they
	 * are ignored. Nonetheless, the Tag Processor still
	 * recognizes them as they appear in the HTML stream.
	 *
	 * These were historically discussed as a "presumptuous tag
	 * closer," which would close the nearest open tag, but were
	 * dismissed in favor of explicitly-closing tags.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_PRESUMPTUOUS_TAG = 'STATE_PRESUMPTUOUS_TAG';

	/**
	 * Indicates that the parser has found a "funky comment"
	 * and it's possible to read and modify its modifiable text.
	 *
	 * Example:
	 *
	 *     </%url>
	 *     </{"wp-bit":"query/post-author"}>
	 *     </2>
	 *
	 * Funky comments are tag closers with invalid tag names. Note
	 * that in HTML these are turned into bogus comments. Nonetheless,
	 * the Tag Processor recognizes them in a stream of HTML and
	 * exposes them for inspection and modification.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_FUNKY_COMMENT = 'STATE_WP_FUNKY';

	/**
	 * Indicates that a comment was created when encountering an abruptly-closed HTML comment.
	 *
	 * Example:
	 *
	 *     <!-->
	 *     <!--->
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_ABRUPTLY_CLOSED_COMMENT = 'COMMENT_AS_ABRUPTLY_CLOSED_COMMENT';

	/**
	 * Indicates that a comment would be parsed as a CDATA node,
	 * were HTML to allow CDATA nodes outside of foreign content.
	 *
	 * Example:
	 *
	 *     <![CDATA[This is a CDATA node.]]>
	 *
	 * This is an HTML comment, but it looks like a CDATA node.
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_CDATA_LOOKALIKE = 'COMMENT_AS_CDATA_LOOKALIKE';

	/**
	 * Indicates that a comment was created when encountering
	 * normative HTML comment syntax.
	 *
	 * Example:
	 *
	 *     <!-- this is a comment -->
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_HTML_COMMENT = 'COMMENT_AS_HTML_COMMENT';

	/**
	 * Indicates that a comment would be parsed as a Processing
	 * Instruction node, were they to exist within HTML.
	 *
	 * Example:
	 *
	 *     <?wp __( 'Like' ) ?>
	 *
	 * This is an HTML comment, but it looks like a Processing Instruction node.
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_PI_NODE_LOOKALIKE = 'COMMENT_AS_PI_NODE_LOOKALIKE';

	/**
	 * Indicates that a comment was created when encountering invalid
	 * HTML input, a so-called "bogus comment."
	 *
	 * Example:
	 *
	 *     <?nothing special>
	 *     <!{nothing special}>
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_INVALID_HTML = 'COMMENT_AS_INVALID_HTML';
}
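
The update, insert, and removal diagrams in `set_attribute()` and `remove_attribute()` all come down to replacing a byte span of the document with new text. The sketch below illustrates that idea on its own, without WordPress. The function `sketch_apply_replacements()` and its array shape (`start`, `length`, `text`) are hypothetical stand-ins chosen for illustration; they are not the class's internal `apply_attributes_updates()` or `WP_HTML_Text_Replacement`, whose implementations are not shown in this section.

```php
<?php
/**
 * Hypothetical helper (not part of WordPress): applies a list of
 * non-overlapping replacements to a string.
 *
 * Each replacement is an array with:
 *  - 'start'  byte offset where the replaced span begins,
 *  - 'length' byte length of the replaced span (0 means pure insertion),
 *  - 'text'   the text to put in place of the span.
 */
function sketch_apply_replacements( string $html, array $replacements ): string {
	// Sort by start offset so the output can be built left to right.
	usort(
		$replacements,
		static function ( array $a, array $b ): int {
			return $a['start'] <=> $b['start'];
		}
	);

	$output = '';
	$cursor = 0;
	foreach ( $replacements as $replacement ) {
		// Copy the untouched bytes that precede this replacement.
		$output .= substr( $html, $cursor, $replacement['start'] - $cursor );
		$output .= $replacement['text'];
		// Skip over the bytes that were replaced.
		$cursor = $replacement['start'] + $replacement['length'];
	}

	// Copy whatever remains after the last replacement.
	return $output . substr( $html, $cursor );
}

// Update an existing attribute: the span `id="initial_id"` starts at byte 5 and is 15 bytes long.
$updated = sketch_apply_replacements(
	'<div id="initial_id"/>',
	array( array( 'start' => 5, 'length' => 15, 'text' => 'id="new"' ) )
);
// $updated is '<div id="new"/>'

// Create a new attribute: a zero-length span right after the tag name (byte 4).
$created = sketch_apply_replacements(
	'<div/>',
	array( array( 'start' => 4, 'length' => 0, 'text' => ' id="new"' ) )
);
// $created is '<div id="new"/>'

// Remove an attribute: replace its span with an empty string.
$removed = sketch_apply_replacements(
	'<div id="initial_id"/>',
	array( array( 'start' => 5, 'length' => 15, 'text' => '' ) )
);
// $removed is '<div />'
```

Expressing every change as a (start, length, text) triple means edits can be queued while scanning and applied in a single pass, which appears to be why the class collects them in `lexical_updates` and only rewrites the document in `get_updated_html()`.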
Generated : Fri Apr 26 08:20:02 2024 | Cross-referenced by PHPXref |