PHP Cross Reference of WordPress Trunk (Updated Daily)
<?php
/**
 * HTML API: WP_HTML_Tag_Processor class
 *
 * Scans through an HTML document to find specific tags, then
 * transforms those tags by adding, removing, or updating the
 * values of the HTML attributes within that tag (opener).
 *
 * Does not fully parse HTML or _recurse_ into the HTML structure.
 * Instead this scans linearly through a document and only parses
 * the HTML tag openers.
 *
 * ### Possible future direction for this module
 *
 *  - Prune the whitespace when removing classes/attributes: e.g. "a b c" -> "c" not " c".
 *    This would increase the size of the changes for some operations but leave more
 *    natural-looking output HTML.
 *
 * @package WordPress
 * @subpackage HTML-API
 * @since 6.2.0
 */

/**
 * Core class used to modify attributes in an HTML document for tags matching a query.
 *
 * ## Usage
 *
 * Use of this class requires three steps:
 *
 *  1. Create a new class instance with your input HTML document.
 *  2. Find the tag(s) you are looking for.
 *  3. Request changes to the attributes in those tag(s).
 *
 * Example:
 *
 *     $tags = new WP_HTML_Tag_Processor( $html );
 *     if ( $tags->next_tag( 'option' ) ) {
 *         $tags->set_attribute( 'selected', true );
 *     }
 *
 * ### Finding tags
 *
 * The `next_tag()` function moves the internal cursor through
 * your input HTML document until it finds a tag meeting all of
 * the supplied restrictions in the optional query argument. If
 * no argument is provided then it will find the next HTML tag,
 * regardless of what kind it is.
 *
 * If you want to _find whatever the next tag is_:
 *
 *     $tags->next_tag();
 *
 * | Goal                                                      | Query                                                                           |
 * |-----------------------------------------------------------|---------------------------------------------------------------------------------|
 * | Find any tag.                                             | `$tags->next_tag();`                                                            |
 * | Find next image tag.                                      | `$tags->next_tag( array( 'tag_name' => 'img' ) );`                              |
 * | Find next image tag (without passing the array).          | `$tags->next_tag( 'img' );`                                                     |
 * | Find next tag containing the `fullwidth` CSS class.       | `$tags->next_tag( array( 'class_name' => 'fullwidth' ) );`                      |
 * | Find next image tag containing the `fullwidth` CSS class. | `$tags->next_tag( array( 'tag_name' => 'img', 'class_name' => 'fullwidth' ) );` |
 *
 * If a tag was found meeting your criteria then `next_tag()`
 * will return `true` and you can proceed to modify it. If it
 * returns `false`, however, it failed to find the tag and
 * moved the cursor to the end of the file.
 *
 * Once the cursor reaches the end of the file the processor
 * is done and if you want to reach an earlier tag you will
 * need to recreate the processor and start over, as it's
 * unable to back up or move in reverse.
 *
 * See the section on bookmarks for an exception to this
 * no-backing-up rule.
 *
 * #### Custom queries
 *
 * Sometimes it's necessary to inspect an HTML tag further than
 * the query syntax here permits. In these cases one may further
 * inspect the search results using the read-only functions
 * provided by the processor or external state or variables.
 *
 * Example:
 *
 *     // Paint up to the first five DIV or SPAN tags marked with the "jazzy" style.
 *     $remaining_count = 5;
 *     while ( $remaining_count > 0 && $tags->next_tag() ) {
 *         if (
 *              ( 'DIV' === $tags->get_tag() || 'SPAN' === $tags->get_tag() ) &&
 *              'jazzy' === $tags->get_attribute( 'data-style' )
 *         ) {
 *             $tags->add_class( 'theme-style-everest-jazz' );
 *             $remaining_count--;
 *         }
 *     }
 *
 * `get_attribute()` will return `null` if the attribute wasn't present
 * on the tag when it was called. It may return `""` (the empty string)
 * in cases where the attribute was present but its value was empty.
 * For boolean attributes, those whose name is present but no value is
 * given, it will return `true` (the only way to set `false` for an
 * attribute is to remove it).
 *
 * #### When matching fails
 *
 * When `next_tag()` returns `false` it could mean different things:
 *
 *  - The requested tag wasn't found in the input document.
 *  - The input document ended in the middle of an HTML syntax element.
 *
 * When a document ends in the middle of a syntax element it will pause
 * the processor. This is to make it possible in the future to extend the
 * input document and proceed - an important requirement for chunked
 * streaming parsing of a document.
 *
 * Example:
 *
 *     $processor = new WP_HTML_Tag_Processor( 'This <div is="a" partial="token' );
 *     false === $processor->next_tag();
 *
 * If a special element (see next section) is encountered but no closing tag
 * is found it will count as an incomplete tag. The parser will pause as if
 * the opening tag were incomplete.
 *
 * Example:
 *
 *     $processor = new WP_HTML_Tag_Processor( '<style>// there could be more styling to come' );
 *     false === $processor->next_tag();
 *
 *     $processor = new WP_HTML_Tag_Processor( '<style>// this is everything</style><div>' );
 *     true === $processor->next_tag( 'DIV' );
 *
 * #### Special self-contained elements
 *
 * Some HTML elements are handled in a special way; their start and end tags
 * act like a void tag. These are special because their contents can't contain
 * HTML markup. Everything inside these elements is handled in a special way
 * and content that _appears_ like HTML tags inside of them isn't. There can
 * be no nesting in these elements.
 *
 * In the following list, "raw text" means that all of the content in the HTML
 * until the matching closing tag is treated verbatim without any replacements
 * and without any parsing.
 *
 *  - IFRAME allows no content but requires a closing tag.
 *  - NOEMBED (deprecated) content is raw text.
 *  - NOFRAMES (deprecated) content is raw text.
 *  - SCRIPT content is plaintext apart from legacy rules allowing `</script>` inside an HTML comment.
 *  - STYLE content is raw text.
 *  - TITLE content is plain text but character references are decoded.
 *  - TEXTAREA content is plain text but character references are decoded.
 *  - XMP (deprecated) content is raw text.
 *
 * ### Modifying HTML attributes for a found tag
 *
 * Once you've found the start of an opening tag you can modify
 * any number of the attributes on that tag. You can set a new
 * value for an attribute, remove the entire attribute, or do
 * nothing and move on to the next opening tag.
 *
 * Example:
 *
 *     if ( $tags->next_tag( array( 'class_name' => 'wp-group-block' ) ) ) {
 *         $tags->set_attribute( 'title', 'This groups the contained content.' );
 *         $tags->remove_attribute( 'data-test-id' );
 *     }
 *
 * If `set_attribute()` is called for an existing attribute it will
 * overwrite the existing value. Similarly, calling `remove_attribute()`
 * for a non-existing attribute has no effect on the document. Both
 * of these methods are safe to call without knowing if a given attribute
 * exists beforehand.
 *
 * ### Modifying CSS classes for a found tag
 *
 * The tag processor treats the `class` attribute as a special case.
 * Because it's a common operation to add or remove CSS classes, this
 * interface adds helper methods to make that easier.
 *
 * As with attribute values, adding or removing CSS classes is a safe
 * operation that doesn't require checking if the attribute or class
 * exists before making changes. If removing the only class then the
 * entire `class` attribute will be removed.
 *
 * Example:
 *
 *     // from `<span>Yippee!</span>`
 *     //   to `<span class="is-active">Yippee!</span>`
 *     $tags->add_class( 'is-active' );
 *
 *     // from `<span class="excited">Yippee!</span>`
 *     //   to `<span class="excited is-active">Yippee!</span>`
 *     $tags->add_class( 'is-active' );
 *
 *     // from `<span class="is-active heavy-accent">Yippee!</span>`
 *     //   to `<span class="is-active heavy-accent">Yippee!</span>`
 *     $tags->add_class( 'is-active' );
 *
 *     // from `<input type="text" class="is-active rugby not-disabled" length="24">`
 *     //   to `<input type="text" class="is-active not-disabled" length="24">`
 *     $tags->remove_class( 'rugby' );
 *
 *     // from `<input type="text" class="rugby" length="24">`
 *     //   to `<input type="text" length="24">`
 *     $tags->remove_class( 'rugby' );
 *
 *     // from `<input type="text" length="24">`
 *     //   to `<input type="text" length="24">`
 *     $tags->remove_class( 'rugby' );
 *
 * When class changes are enqueued but a direct change to `class` is made via
 * `set_attribute` (or `remove_attribute`), the direct change takes precedence
 * over the changes made through `add_class` and `remove_class`.
 *
 * ### Bookmarks
 *
 * While scanning through the input HTML document it's possible to set
 * a named bookmark when a particular tag is found. Later on, after
 * continuing to scan other tags, it's possible to `seek` to one of
 * the set bookmarks and then proceed again from that point forward.
 *
 * Because bookmarks create processing overhead one should avoid
 * creating too many of them. As a rule, create only bookmarks
 * of known string literal names; avoid creating "mark_{$index}"
 * and so on. It's fine from a performance standpoint to create a
 * bookmark and update it frequently, such as within a loop.
 *
 *     $total_todos = 0;
 *     while ( $p->next_tag( array( 'tag_name' => 'UL', 'class_name' => 'todo' ) ) ) {
 *         $p->set_bookmark( 'list-start' );
 *         while ( $p->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
 *             if ( 'UL' === $p->get_tag() && $p->is_tag_closer() ) {
 *                 $p->set_bookmark( 'list-end' );
 *                 $p->seek( 'list-start' );
 *                 $p->set_attribute( 'data-contained-todos', (string) $total_todos );
 *                 $total_todos = 0;
 *                 $p->seek( 'list-end' );
 *                 break;
 *             }
 *
 *             if ( 'LI' === $p->get_tag() && ! $p->is_tag_closer() ) {
 *                 $total_todos++;
 *             }
 *         }
 *     }
 *
 * ## Tokens and finer-grained processing.
 *
 * It's possible to scan through every lexical token in the
 * HTML document using the `next_token()` function. This
 * alternative form takes no argument and provides no built-in
 * query syntax.
 *
 * Example:
 *
 *     $title = '(untitled)';
 *     $text  = '';
 *     while ( $processor->next_token() ) {
 *         switch ( $processor->get_token_name() ) {
 *             case '#text':
 *                 $text .= $processor->get_modifiable_text();
 *                 break;
 *
 *             case 'BR':
 *                 $text .= "\n";
 *                 break;
 *
 *             case 'TITLE':
 *                 $title = $processor->get_modifiable_text();
 *                 break;
 *         }
 *     }
 *     return trim( "# {$title}\n\n{$text}" );
 *
 * ### Tokens and _modifiable text_.
 *
 * #### Special "atomic" HTML elements.
 *
 * Not all HTML elements are able to contain other elements inside of them.
 * For instance, the contents inside a TITLE element are plaintext (except
 * that character references like &amp; will be decoded). This means that
 * if the string `<img>` appears inside a TITLE element, then it's not an
 * image tag, but rather it's text describing an image tag. Likewise, the
 * contents of a SCRIPT or STYLE element are handled entirely separately in
 * a browser than the contents of other elements because they represent a
 * different language than HTML.
 *
 * For these elements the Tag Processor treats the entire sequence as one,
 * from the opening tag, including its contents, through its closing tag.
 * This means that it's not possible to match the closing tag for a
 * SCRIPT element unless it's unexpected; the Tag Processor already matched
 * it when it found the opening tag.
 *
 * The inner contents of these elements are that element's _modifiable text_.
 *
 * The special elements are:
 *  - `SCRIPT` whose contents are treated as raw plaintext but supports a legacy
 *    style of including JavaScript inside of HTML comments to avoid accidentally
 *    closing the SCRIPT from inside a JavaScript string. E.g. `console.log( '</script>' )`.
 *  - `TITLE` and `TEXTAREA` whose contents are treated as plaintext and then any
 *    character references are decoded. E.g. `1 &lt; 2 &lt; 3` becomes `1 < 2 < 3`.
 *  - `IFRAME`, `NOSCRIPT`, `NOEMBED`, `NOFRAMES`, `STYLE` whose contents are treated as
 *    raw plaintext and left as-is. E.g. `1 &lt; 2 &lt; 3` remains `1 &lt; 2 &lt; 3`.
 *
 * #### Other tokens with modifiable text.
 *
 * There are also non-elements which are void/self-closing in nature and contain
 * modifiable text that is part of that individual syntax token itself.
 *
 *  - `#text` nodes, whose entire token _is_ the modifiable text.
 *  - HTML comments and tokens that become comments due to some syntax error. The
 *    text for these tokens is the portion of the comment inside of the syntax.
 *    E.g. for `<!-- comment -->` the text is `" comment "` (note the spaces are included).
 *  - `CDATA` sections, whose text is the content inside of the section itself. E.g. for
 *    `<![CDATA[some content]]>` the text is `"some content"` (with restrictions [1]).
 *  - "Funky comments," which are a special case of invalid closing tags whose name is
 *    invalid. The text for these nodes is the text that a browser would transform into
 *    an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`.
 *  - `DOCTYPE` declarations like `<!DOCTYPE html>` which have no closing tag.
 *  - XML Processing instruction nodes like `<?wp __( "Like" ); ?>` (with restrictions [2]).
 *  - The empty end tag `</>` which is ignored in the browser and DOM.
 *
 * [1]: There are no CDATA sections in HTML. When encountering `<![CDATA[`, everything
 *      until the next `>` becomes a bogus HTML comment, meaning there can be no CDATA
 *      section in an HTML document containing `>`. The Tag Processor will first find
 *      all valid and bogus HTML comments, and then if the comment _would_ have been a
 *      CDATA section _were they to exist_, it will indicate this as the type of comment.
 *
 * [2]: XML allows a broader range of characters in a processing instruction's target name
 *      and disallows "xml" as a name, since it's special. The Tag Processor only recognizes
 *      target names with an ASCII-representable subset of characters. It also exhibits the
 *      same constraint as with CDATA sections, in that `>` cannot exist within the token
 *      since Processing Instructions do not exist within HTML and their syntax transforms
 *      into a bogus comment in the DOM.
 *
 * ## Design and limitations
 *
 * The Tag Processor is designed to linearly scan HTML documents and tokenize
 * HTML tags and their attributes. It's designed to do this as efficiently as
 * possible without compromising parsing integrity. Therefore it will be
 * slower than some methods of modifying HTML, such as those incorporating
 * over-simplified PCRE patterns, but will not introduce the defects and
 * failures that those methods bring in, which lead to broken page renders
 * and often to security vulnerabilities. On the other hand, it will be faster
 * than full-blown HTML parsers such as DOMDocument and use considerably
 * less memory. It requires a negligible memory overhead, enough to consider
 * it a zero-overhead system.
 *
 * The performance characteristics are maintained by avoiding tree construction
 * and semantic cleanups which are specified in HTML5. Because of this, for
 * example, it's not possible for the Tag Processor to associate any given
 * opening tag with its corresponding closing tag, or to return the inner markup
 * inside an element. Systems may be built on top of the Tag Processor to do
 * this, but the Tag Processor is and should be constrained so it can remain an
 * efficient, low-level, and reliable HTML scanner.
 *
 * The Tag Processor's design incorporates a "garbage-in-garbage-out" philosophy.
 * HTML5 specifies that certain invalid content be transformed into different forms
 * for display, such as removing null bytes from an input document and replacing
 * invalid characters with the Unicode replacement character `U+FFFD` (visually "�").
 * Where errors or transformations exist within the HTML5 specification, the Tag Processor
 * leaves those invalid inputs untouched, passing them through to the final browser
 * to handle. While this implies that certain operations will be non-spec-compliant,
 * such as reading the value of an attribute with invalid content, it also preserves a
 * simplicity and efficiency for handling those error cases.
 *
 * Most operations within the Tag Processor are designed to minimize the difference
 * between an input and output document for any given change. For example, the
 * `add_class` and `remove_class` methods preserve whitespace and the class ordering
 * within the `class` attribute; and when encountering tags with duplicated attributes,
 * the Tag Processor will leave those invalid duplicate attributes where they are but
 * update the proper attribute which the browser will read for parsing its value. An
 * exception to this rule is that all attribute updates store their values as
 * double-quoted strings, meaning that attributes on input with single-quoted or
 * unquoted values will appear in the output with double-quotes.
 *
 * ### Scripting Flag
 *
 * The Tag Processor parses HTML with the "scripting flag" disabled. This means
 * that it doesn't run any scripts while parsing the page. In a browser with
 * JavaScript enabled, for example, the script can change the parse of the
 * document as it loads. On the server, however, evaluating JavaScript is not
 * only impractical, but also unwanted.
 *
 * Practically this means that the Tag Processor will descend into NOSCRIPT
 * elements and process their child tags. Were the scripting flag enabled, such
 * as in a typical browser, the contents of NOSCRIPT would be skipped entirely.
 *
 * This allows the HTML API to process the content that will be presented in
 * a browser when scripting is disabled, but it offers a different view of a
 * page than most browser sessions will experience. E.g. in a browser with
 * scripting enabled, the tags inside the NOSCRIPT disappear.
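 *
 * Example (a minimal sketch of the NOSCRIPT behavior described above; the
 * markup, the `fallback.png` file name, and the `loading` attribute are
 * illustrative assumptions, not taken from WordPress itself):
 *
 *     $processor = new WP_HTML_Tag_Processor( '<noscript><img src="fallback.png"></noscript>' );
 *
 *     // With the scripting flag disabled, the IMG inside NOSCRIPT is
 *     // visited like any other tag, so it can be found and modified.
 *     if ( $processor->next_tag( 'IMG' ) ) {
 *         $processor->set_attribute( 'loading', 'lazy' );
 *     }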
 *
 * ### Text Encoding
 *
 * The Tag Processor assumes that the input HTML document is encoded with a
 * text encoding compatible with 7-bit ASCII's '<', '>', '&', ';', '/', '=',
 * "'", '"', 'a' - 'z', 'A' - 'Z', and the whitespace characters ' ', tab,
 * carriage-return, newline, and form-feed.
 *
 * In practice, this includes almost every single-byte encoding as well as
 * UTF-8. Notably, however, it does not include UTF-16. If providing input
 * that's incompatible, then convert the encoding beforehand.
 *
 * @since 6.2.0
 * @since 6.2.1 Fix: Support for various invalid comments; attribute updates are case-insensitive.
 * @since 6.3.2 Fix: Skip HTML-like content inside rawtext elements such as STYLE.
 * @since 6.5.0 Pauses processor when input ends in an incomplete syntax token.
 *              Introduces "special" elements which act like void elements, e.g. TITLE, STYLE.
 *              Allows scanning through all tokens and processing modifiable text, where applicable.
 */
class WP_HTML_Tag_Processor {
	/**
	 * The maximum number of bookmarks allowed to exist at
	 * any given time.
	 *
	 * @since 6.2.0
	 * @var int
	 *
	 * @see WP_HTML_Tag_Processor::set_bookmark()
	 */
	const MAX_BOOKMARKS = 10;

	/**
	 * Maximum number of times seek() can be called.
	 * Prevents accidental infinite loops.
	 *
	 * @since 6.2.0
	 * @var int
	 *
	 * @see WP_HTML_Tag_Processor::seek()
	 */
	const MAX_SEEK_OPS = 1000;

	/**
	 * The HTML document to parse.
	 *
	 * @since 6.2.0
	 * @var string
	 */
	protected $html;

	/**
	 * The last query passed to next_tag().
	 *
	 * @since 6.2.0
	 * @var array|null
	 */
	private $last_query;

	/**
	 * The tag name this processor currently scans for.
	 *
	 * @since 6.2.0
	 * @var string|null
	 */
	private $sought_tag_name;

	/**
	 * The CSS class name this processor currently scans for.
	 *
	 * @since 6.2.0
	 * @var string|null
	 */
	private $sought_class_name;

	/**
	 * The match offset this processor currently scans for.
	 *
	 * @since 6.2.0
	 * @var int|null
	 */
	private $sought_match_offset;

	/**
	 * Whether to visit tag closers, e.g. </div>, when walking an input document.
	 *
	 * @since 6.2.0
	 * @var bool
	 */
	private $stop_on_tag_closers;

	/**
	 * Specifies mode of operation of the parser at any given time.
	 *
	 * | State           | Meaning                                                              |
	 * | ----------------|----------------------------------------------------------------------|
	 * | *Ready*         | The parser is ready to run.                                          |
	 * | *Complete*      | There is nothing left to parse.                                      |
	 * | *Incomplete*    | The HTML ended in the middle of a token; nothing more can be parsed. |
	 * | *Matched tag*   | Found an HTML tag; it's possible to modify its attributes.           |
	 * | *Text node*     | Found a #text node; this is plaintext and modifiable.                |
	 * | *CDATA node*    | Found a CDATA section; this is modifiable.                           |
	 * | *Comment*       | Found a comment or bogus comment; this is modifiable.                |
	 * | *Presumptuous*  | Found an empty tag closer: `</>`.                                    |
	 * | *Funky comment* | Found a tag closer with an invalid tag name; this is modifiable.     |
	 *
	 * @since 6.5.0
	 *
	 * @see WP_HTML_Tag_Processor::STATE_READY
	 * @see WP_HTML_Tag_Processor::STATE_COMPLETE
	 * @see WP_HTML_Tag_Processor::STATE_INCOMPLETE_INPUT
	 * @see WP_HTML_Tag_Processor::STATE_MATCHED_TAG
	 * @see WP_HTML_Tag_Processor::STATE_TEXT_NODE
	 * @see WP_HTML_Tag_Processor::STATE_CDATA_NODE
	 * @see WP_HTML_Tag_Processor::STATE_COMMENT
	 * @see WP_HTML_Tag_Processor::STATE_DOCTYPE
	 * @see WP_HTML_Tag_Processor::STATE_PRESUMPTUOUS_TAG
	 * @see WP_HTML_Tag_Processor::STATE_FUNKY_COMMENT
	 *
	 * @var string
	 */
	protected $parser_state = self::STATE_READY;

	/**
	 * Indicates if the document is in quirks mode or no-quirks mode.
	 *
	 * Impact on HTML parsing:
	 *
	 *  - In `NO_QUIRKS_MODE` (also known as "standard mode"):
	 *      - CSS class and ID selectors match byte-for-byte (case-sensitively).
	 *      - A TABLE start tag `<table>` implicitly closes any open `P` element.
	 *
	 *  - In `QUIRKS_MODE`:
	 *      - CSS class and ID selectors match in an ASCII case-insensitive manner.
	 *      - A TABLE start tag `<table>` opens a `TABLE` element as a child of a `P`
	 *        element if one is open.
	 *
	 * Quirks and no-quirks mode are thus mostly about styling, but have an impact when
	 * tables are found inside paragraph elements.
	 *
	 * @see self::QUIRKS_MODE
	 * @see self::NO_QUIRKS_MODE
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	protected $compat_mode = self::NO_QUIRKS_MODE;

	/**
	 * Indicates whether the parser is inside foreign content,
	 * e.g. inside an SVG or MathML element.
	 *
	 * One of 'html', 'svg', or 'math'.
	 *
	 * Several parsing rules change based on whether the parser
	 * is inside foreign content, including whether CDATA sections
	 * are allowed and whether a self-closing flag indicates that
	 * an element has no content.
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	private $parsing_namespace = 'html';

	/**
	 * What kind of syntax token became an HTML comment.
	 *
	 * Since there are many ways in which HTML syntax can create an HTML comment,
	 * this indicates which of those caused it. This allows the Tag Processor to
	 * represent more from the original input document than would appear in the DOM.
	 *
	 * @since 6.5.0
	 *
	 * @var string|null
	 */
	protected $comment_type = null;

	/**
	 * What kind of text the matched text node represents, if it was subdivided.
	 *
	 * @see self::TEXT_IS_NULL_SEQUENCE
	 * @see self::TEXT_IS_WHITESPACE
	 * @see self::TEXT_IS_GENERIC
	 * @see self::subdivide_text_appropriately
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	protected $text_node_classification = self::TEXT_IS_GENERIC;

	/**
	 * How many bytes from the original HTML document have been read and parsed.
	 *
	 * This value points to the latest byte offset in the input document which
	 * has been already parsed. It is the internal cursor for the Tag Processor
	 * and updates while scanning through the HTML tokens.
	 *
	 * @since 6.2.0
	 * @var int
	 */
	private $bytes_already_parsed = 0;

	/**
	 * Byte offset in input document where current token starts.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     01234
	 *     - token starts at 0
	 *
	 * @since 6.5.0
	 *
	 * @var int|null
	 */
	private $token_starts_at;

	/**
	 * Byte length of current token.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     012345678901234
	 *     - token length is 14 - 0 = 14
	 *
	 *     a <!-- comment --> is a token.
	 *     0123456789 123456789 123456789
	 *       - token length is 17 - 2 = 15
	 *
	 * @since 6.5.0
	 *
	 * @var int|null
	 */
	private $token_length;

	/**
	 * Byte offset in input document where current tag name starts.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     01234
	 *      - tag name starts at 1
	 *
	 * @since 6.2.0
	 *
	 * @var int|null
	 */
	private $tag_name_starts_at;

	/**
	 * Byte length of current tag name.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     01234
	 *      --- tag name length is 3
	 *
	 * @since 6.2.0
	 *
	 * @var int|null
	 */
	private $tag_name_length;

	/**
	 * Byte offset into input document where current modifiable text starts.
	 *
	 * @since 6.5.0
	 *
	 * @var int
	 */
	private $text_starts_at;

	/**
	 * Byte length of modifiable text.
	 *
	 * @since 6.5.0
	 *
	 * @var int
	 */
	private $text_length;

	/**
	 * Whether the current tag is an opening tag, e.g. <div>, or a closing tag, e.g. </div>.
	 *
	 * @var bool
	 */
	private $is_closing_tag;

	/**
	 * Lazily-built index of attributes found within an HTML tag, keyed by the attribute name.
	 *
	 * Example:
	 *
	 *     // Supposing the parser is working through this content
	 *     // and stops after recognizing the `id` attribute.
	 *     // <div id="test-4" class=outline title="data:text/plain;base64=asdk3nk1j3fo8">
	 *     //                 ^ parsing will continue from this point.
	 *     $this->attributes = array(
	 *         'id' => new WP_HTML_Attribute_Token( 'id', 9, 6, 5, 11, false )
	 *     );
	 *
	 *     // When picking up parsing again, or when asking to find the
	 *     // `class` attribute we will continue and add to this array.
	 *     $this->attributes = array(
	 *         'id'    => new WP_HTML_Attribute_Token( 'id', 9, 6, 5, 11, false ),
	 *         'class' => new WP_HTML_Attribute_Token( 'class', 23, 7, 17, 13, false )
	 *     );
	 *
	 *     // Note that only the `class` attribute value is stored in the index.
	 *     // That's because it is the only value used by this class at the moment.
	 *
	 * @since 6.2.0
	 * @var WP_HTML_Attribute_Token[]
	 */
	private $attributes = array();

	/**
	 * Tracks spans of duplicate attributes on a given tag, used for removing
	 * all copies of an attribute when calling `remove_attribute()`.
	 *
	 * @since 6.3.2
	 *
	 * @var (WP_HTML_Span[])[]|null
	 */
	private $duplicate_attributes = null;

	/**
	 * Which class names to add or remove from a tag.
	 *
	 * These are tracked separately from attribute updates because they are
	 * semantically distinct, whereas this interface exists for the common
	 * case of adding and removing class names while other attributes are
	 * generally modified as with DOM `setAttribute` calls.
	 *
	 * When modifying an HTML document these will eventually be collapsed
	 * into a single `set_attribute( 'class', $changes )` call.
	 *
	 * Example:
	 *
	 *     // Add the `wp-block-group` class, remove the `wp-group` class.
	 *     $classname_updates = array(
	 *         // Indexed by a comparable class name.
	 *         'wp-block-group' => WP_HTML_Tag_Processor::ADD_CLASS,
	 *         'wp-group'       => WP_HTML_Tag_Processor::REMOVE_CLASS
	 *     );
	 *
	 * @since 6.2.0
	 * @var bool[]
	 */
	private $classname_updates = array();

	/**
	 * Tracks a semantic location in the original HTML which
	 * shifts with updates as they are applied to the document.
	 *
	 * @since 6.2.0
	 * @var WP_HTML_Span[]
	 */
	protected $bookmarks = array();

	const ADD_CLASS    = true;
	const REMOVE_CLASS = false;
	const SKIP_CLASS   = null;

	/**
	 * Lexical replacements to apply to input HTML document.
	 *
	 * "Lexical" in this class refers to the part of this class which
	 * operates on pure text _as text_ and not as HTML. There's a line
	 * between the public interface, with HTML-semantic methods like
	 * `set_attribute` and `add_class`, and an internal state that tracks
	 * text offsets in the input document.
	 *
	 * When higher-level HTML methods are called, those have to transform their
	 * operations (such as setting an attribute's value) into text diffing
	 * operations (such as replacing the sub-string from indices A to B with
	 * some given new string). These text-diffing operations are the lexical
	 * updates.
	 *
	 * As new higher-level methods are added they need to collapse their
	 * operations into these lower-level lexical updates since that's the
	 * Tag Processor's internal language of change. Any code which creates
	 * these lexical updates must ensure that they do not cross HTML syntax
	 * boundaries, however, so these should never be exposed outside of this
	 * class or any classes which intentionally expand its functionality.
	 *
	 * These are enqueued while editing the document instead of being immediately
	 * applied to avoid processing overhead, string allocations, and string
	 * copies when applying many updates to a single document.
	 *
	 * Example:
	 *
	 *     // Replace an attribute stored with a new value, indices
	 *     // sourced from the lazily-parsed HTML recognizer.
	 *     $start  = $attributes['src']->start;
	 *     $length = $attributes['src']->length;
	 *     $modifications[] = new WP_HTML_Text_Replacement( $start, $length, $new_value );
	 *
	 *     // Correspondingly, something like this will appear in this array.
	 *     $lexical_updates = array(
	 *         WP_HTML_Text_Replacement( 14, 28, 'https://my-site.my-domain/wp-content/uploads/2014/08/kittens.jpg' )
	 *     );
	 *
	 * @since 6.2.0
	 * @var WP_HTML_Text_Replacement[]
	 */
	protected $lexical_updates = array();

	/**
	 * Tracks and limits `seek()` calls to prevent accidental infinite loops.
	 *
	 * @since 6.2.0
	 * @var int
	 *
	 * @see WP_HTML_Tag_Processor::seek()
	 */
	protected $seek_count = 0;

	/**
	 * Whether the parser should skip over an immediately-following linefeed
	 * character, as is the case with LISTING, PRE, and TEXTAREA.
	 *
	 * > If the next token is a U+000A LINE FEED (LF) character token, then
	 * > ignore that token and move on to the next one. (Newlines at the start
	 * > of [these] elements are ignored as an authoring convenience.)
822 * 823 * @since 6.7.0 824 * 825 * @var int|null 826 */ 827 private $skip_newline_at = null; 828 829 /** 830 * Constructor. 831 * 832 * @since 6.2.0 833 * 834 * @param string $html HTML to process. 835 */ 836 public function __construct( $html ) { 837 if ( ! is_string( $html ) ) { 838 _doing_it_wrong( 839 __METHOD__, 840 __( 'The HTML parameter must be a string.' ), 841 '6.9.0' 842 ); 843 $html = ''; 844 } 845 $this->html = $html; 846 } 847 848 /** 849 * Switches parsing mode into a new namespace, such as when 850 * encountering an SVG tag and entering foreign content. 851 * 852 * @since 6.7.0 853 * 854 * @param string $new_namespace One of 'html', 'svg', or 'math' indicating into what 855 * namespace the next tokens will be processed. 856 * @return bool Whether the namespace was valid and changed. 857 */ 858 public function change_parsing_namespace( string $new_namespace ): bool { 859 if ( ! in_array( $new_namespace, array( 'html', 'math', 'svg' ), true ) ) { 860 return false; 861 } 862 863 $this->parsing_namespace = $new_namespace; 864 return true; 865 } 866 867 /** 868 * Finds the next tag matching the $query. 869 * 870 * @since 6.2.0 871 * @since 6.5.0 No longer processes incomplete tokens at end of document; pauses the processor at start of token. 872 * 873 * @param array|string|null $query { 874 * Optional. Which tag name to find, having which class, etc. Default is to find any tag. 875 * 876 * @type string|null $tag_name Which tag to find, or `null` for "any tag." 877 * @type int|null $match_offset Find the Nth tag matching all search criteria. 878 * 1 for "first" tag, 3 for "third," etc. 879 * Defaults to first tag. 880 * @type string|null $class_name Tag must contain this whole class name to match. 881 * @type string|null $tag_closers "visit" or "skip": whether to stop on tag closers, e.g. </div>. 882 * } 883 * @return bool Whether a tag was matched. 
884 */ 885 public function next_tag( $query = null ): bool { 886 $this->parse_query( $query ); 887 $already_found = 0; 888 889 do { 890 if ( false === $this->next_token() ) { 891 return false; 892 } 893 894 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { 895 continue; 896 } 897 898 if ( $this->matches() ) { 899 ++$already_found; 900 } 901 } while ( $already_found < $this->sought_match_offset ); 902 903 return true; 904 } 905 906 /** 907 * Finds the next token in the HTML document. 908 * 909 * An HTML document can be viewed as a stream of tokens, 910 * where tokens are things like HTML tags, HTML comments, 911 * text nodes, etc. This method finds the next token in 912 * the HTML document and returns whether it found one. 913 * 914 * If it starts parsing a token and reaches the end of the 915 * document then it will seek to the start of the last 916 * token and pause, returning `false` to indicate that it 917 * failed to find a complete token. 918 * 919 * Possible token types, based on the HTML specification: 920 * 921 * - an HTML tag, whether opening, closing, or void. 922 * - a text node - the plaintext inside tags. 923 * - an HTML comment. 924 * - a DOCTYPE declaration. 925 * - a processing instruction, e.g. `<?xml version="1.0" ?>`. 926 * 927 * The Tag Processor currently only supports the tag token. 928 * 929 * @since 6.5.0 930 * @since 6.7.0 Recognizes CDATA sections within foreign content. 931 * 932 * @return bool Whether a token was parsed. 933 */ 934 public function next_token(): bool { 935 return $this->base_class_next_token(); 936 } 937 938 /** 939 * Internal method which finds the next token in the HTML document. 940 * 941 * This method is a protected internal function which implements the logic for 942 * finding the next token in a document. It exists so that the parser can update 943 * its state without affecting the location of the cursor in the document and 944 * without triggering subclass methods for things like `next_token()`, e.g. 
when 945 * applying patches before searching for the next token. 946 * 947 * @since 6.5.0 948 * 949 * @access private 950 * 951 * @return bool Whether a token was parsed. 952 */ 953 private function base_class_next_token(): bool { 954 $was_at = $this->bytes_already_parsed; 955 $this->after_tag(); 956 957 // Don't proceed if there's nothing more to scan. 958 if ( 959 self::STATE_COMPLETE === $this->parser_state || 960 self::STATE_INCOMPLETE_INPUT === $this->parser_state 961 ) { 962 return false; 963 } 964 965 /* 966 * The next step in the parsing loop determines the parsing state; 967 * clear it so that state doesn't linger from the previous step. 968 */ 969 $this->parser_state = self::STATE_READY; 970 971 if ( $this->bytes_already_parsed >= strlen( $this->html ) ) { 972 $this->parser_state = self::STATE_COMPLETE; 973 return false; 974 } 975 976 // Find the next tag if it exists. 977 if ( false === $this->parse_next_tag() ) { 978 if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) { 979 $this->bytes_already_parsed = $was_at; 980 } 981 982 return false; 983 } 984 985 /* 986 * For legacy reasons the rest of this function handles tags and their 987 * attributes. If the processor has reached the end of the document 988 * or if it matched any other token then it should return here to avoid 989 * attempting to process tag-specific syntax. 990 */ 991 if ( 992 self::STATE_INCOMPLETE_INPUT !== $this->parser_state && 993 self::STATE_COMPLETE !== $this->parser_state && 994 self::STATE_MATCHED_TAG !== $this->parser_state 995 ) { 996 return true; 997 } 998 999 // Parse all of its attributes. 1000 while ( $this->parse_next_attribute() ) { 1001 continue; 1002 } 1003 1004 // Ensure that the tag closes before the end of the document. 1005 if ( 1006 self::STATE_INCOMPLETE_INPUT === $this->parser_state || 1007 $this->bytes_already_parsed >= strlen( $this->html ) 1008 ) { 1009 // Does this appropriately clear state (parsed attributes)? 
1010 $this->parser_state = self::STATE_INCOMPLETE_INPUT; 1011 $this->bytes_already_parsed = $was_at; 1012 1013 return false; 1014 } 1015 1016 $tag_ends_at = strpos( $this->html, '>', $this->bytes_already_parsed ); 1017 if ( false === $tag_ends_at ) { 1018 $this->parser_state = self::STATE_INCOMPLETE_INPUT; 1019 $this->bytes_already_parsed = $was_at; 1020 1021 return false; 1022 } 1023 $this->parser_state = self::STATE_MATCHED_TAG; 1024 $this->bytes_already_parsed = $tag_ends_at + 1; 1025 $this->token_length = $this->bytes_already_parsed - $this->token_starts_at; 1026 1027 /* 1028 * Certain tags require additional processing. The first-letter pre-check 1029 * avoids unnecessary string allocation when comparing the tag names. 1030 * 1031 * - IFRAME 1032 * - LISTING (deprecated) 1033 * - NOEMBED (deprecated) 1034 * - NOFRAMES (deprecated) 1035 * - PRE 1036 * - SCRIPT 1037 * - STYLE 1038 * - TEXTAREA 1039 * - TITLE 1040 * - XMP (deprecated) 1041 */ 1042 if ( 1043 $this->is_closing_tag || 1044 'html' !== $this->parsing_namespace || 1045 1 !== strspn( $this->html, 'iIlLnNpPsStTxX', $this->tag_name_starts_at, 1 ) 1046 ) { 1047 return true; 1048 } 1049 1050 $tag_name = $this->get_tag(); 1051 1052 /* 1053 * For LISTING, PRE, and TEXTAREA, the first linefeed of an immediately-following 1054 * text node is ignored as an authoring convenience. 1055 * 1056 * @see static::skip_newline_at 1057 */ 1058 if ( 'LISTING' === $tag_name || 'PRE' === $tag_name ) { 1059 $this->skip_newline_at = $this->bytes_already_parsed; 1060 return true; 1061 } 1062 1063 /* 1064 * There are certain elements whose children are not DATA but are instead 1065 * RCDATA or RAWTEXT. These cannot contain other elements, and the contents 1066 * are parsed as plaintext, with character references decoded in RCDATA but 1067 * not in RAWTEXT. 
		 *
		 * These elements are described here as "self-contained" or special atomic
		 * elements whose end tag is consumed with the opening tag, and they will
		 * contain modifiable text inside of them.
		 *
		 * Preserve the opening tag pointers, as these will be overwritten
		 * when finding the closing tag. They will be reset after finding
		 * the closing tag to point to the opening of the special atomic
		 * tag sequence.
		 */
		$tag_name_starts_at   = $this->tag_name_starts_at;
		$tag_name_length      = $this->tag_name_length;
		$tag_ends_at          = $this->token_starts_at + $this->token_length;
		$attributes           = $this->attributes;
		$duplicate_attributes = $this->duplicate_attributes;

		// Find the closing tag if necessary.
		switch ( $tag_name ) {
			case 'SCRIPT':
				$found_closer = $this->skip_script_data();
				break;

			case 'TEXTAREA':
			case 'TITLE':
				$found_closer = $this->skip_rcdata( $tag_name );
				break;

			/*
			 * In the browser this list would include the NOSCRIPT element,
			 * but the Tag Processor is an environment with the scripting
			 * flag disabled, meaning that it needs to descend into the
			 * NOSCRIPT element to be able to properly process what will be
			 * sent to a browser.
			 *
			 * Note that this rule makes HTML5 syntax incompatible with XML,
			 * because the parsing of this token depends on client application.
			 * The NOSCRIPT element cannot be represented in the XHTML syntax.
			 */
			case 'IFRAME':
			case 'NOEMBED':
			case 'NOFRAMES':
			case 'STYLE':
			case 'XMP':
				$found_closer = $this->skip_rawtext( $tag_name );
				break;

			// No other tags should be treated in their entirety here.
			default:
				return true;
		}

		if ( ! $found_closer ) {
			$this->parser_state         = self::STATE_INCOMPLETE_INPUT;
			$this->bytes_already_parsed = $was_at;
			return false;
		}

		/*
		 * The values here look like they reference the opening tag but they reference
		 * the closing tag instead. This is why the opening tag values were stored
		 * above in a variable. It reads confusingly here, but that's because the
		 * functions that skip the contents have moved all the internal cursors past
		 * the inner content of the tag.
		 */
		$this->token_starts_at      = $was_at;
		$this->token_length         = $this->bytes_already_parsed - $this->token_starts_at;
		$this->text_starts_at       = $tag_ends_at;
		$this->text_length          = $this->tag_name_starts_at - $this->text_starts_at;
		$this->tag_name_starts_at   = $tag_name_starts_at;
		$this->tag_name_length      = $tag_name_length;
		$this->attributes           = $attributes;
		$this->duplicate_attributes = $duplicate_attributes;

		return true;
	}

	/**
	 * Whether the processor paused because the input HTML document ended
	 * in the middle of a syntax element, such as in the middle of a tag.
	 *
	 * Example:
	 *
	 *     $processor = new WP_HTML_Tag_Processor( '<input type="text" value="Th' );
	 *     false      === $processor->next_tag();
	 *     true       === $processor->paused_at_incomplete_token();
	 *
	 * @since 6.5.0
	 *
	 * @return bool Whether the parse paused at the start of an incomplete token.
	 */
	public function paused_at_incomplete_token(): bool {
		return self::STATE_INCOMPLETE_INPUT === $this->parser_state;
	}

	/**
	 * Generator for a foreach loop to step through each class name for the matched tag.
	 *
	 * This generator function is designed to be used inside a "foreach" loop.
	 *
	 * Example:
	 *
	 *     $p = new WP_HTML_Tag_Processor( "<div class='free &lt;egg&gt;\tlang-en'>" );
	 *     $p->next_tag();
	 *     foreach ( $p->class_list() as $class_name ) {
	 *         echo "{$class_name} ";
	 *     }
	 *     // Outputs: "free <egg> lang-en "
	 *
	 * @since 6.4.0
	 */
	public function class_list() {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return;
		}

		/** @var string $class contains the string value of the class attribute, with character references decoded. */
		$class = $this->get_attribute( 'class' );

		if ( ! is_string( $class ) ) {
			return;
		}

		$seen = array();

		$is_quirks = self::QUIRKS_MODE === $this->compat_mode;

		$at = 0;
		while ( $at < strlen( $class ) ) {
			// Skip past any initial boundary characters.
			$at += strspn( $class, " \t\f\r\n", $at );
			if ( $at >= strlen( $class ) ) {
				return;
			}

			// Find the byte length until the next boundary.
			$length = strcspn( $class, " \t\f\r\n", $at );
			if ( 0 === $length ) {
				return;
			}

			$name = str_replace( "\x00", "\u{FFFD}", substr( $class, $at, $length ) );
			if ( $is_quirks ) {
				$name = strtolower( $name );
			}
			$at += $length;

			/*
			 * It's expected that the number of class names for a given tag is relatively small.
			 * Given this, it is probably faster overall to scan an array for a value rather
			 * than to use the class name as a key and check if it's a key of $seen.
			 */
			if ( in_array( $name, $seen, true ) ) {
				continue;
			}

			$seen[] = $name;
			yield $name;
		}
	}


	/**
	 * Returns whether a matched tag contains the given class name, which is
	 * compared ASCII case-insensitively when the document is in quirks mode.
	 *
	 * @since 6.4.0
	 *
	 * @param string $wanted_class Look for this CSS class name; matched ASCII case-insensitively in quirks mode.
	 * @return bool|null Whether the matched tag contains the given class name, or null if not matched.
	 */
	public function has_class( $wanted_class ): ?bool {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return null;
		}

		$case_insensitive = self::QUIRKS_MODE === $this->compat_mode;

		$wanted_length = strlen( $wanted_class );
		foreach ( $this->class_list() as $class_name ) {
			if (
				strlen( $class_name ) === $wanted_length &&
				0 === substr_compare( $class_name, $wanted_class, 0, strlen( $wanted_class ), $case_insensitive )
			) {
				return true;
			}
		}

		return false;
	}


	/**
	 * Sets a bookmark in the HTML document.
	 *
	 * Bookmarks represent specific places or tokens in the HTML
	 * document, such as a tag opener or closer. When applying
	 * edits to a document, such as setting an attribute, the
	 * text offsets of that token may shift; the bookmark is
	 * kept updated with those shifts and remains stable unless
	 * the entire span of text in which the token sits is removed.
	 *
	 * Release bookmarks when they are no longer needed.
	 *
	 * Example:
	 *
	 *     <main><h2>Surprising fact you may not know!</h2></main>
	 *           ^  ^
	 *            \-|-- this `H2` opener bookmark tracks the token
	 *
	 *     <main class="clickbait"><h2>Surprising fact you may no…
	 *                             ^  ^
	 *                              \-|-- it shifts with edits
	 *
	 * Bookmarks provide the ability to seek to a previously-scanned
	 * place in the HTML document. This avoids the need to re-scan
	 * the entire document.
	 *
	 * Example:
	 *
	 *     <ul><li>One</li><li>Two</li><li>Three</li></ul>
	 *                                 ^^^^
	 *                                 want to note this last item
	 *
	 *     $p = new WP_HTML_Tag_Processor( $html );
	 *     $in_list = false;
	 *     while ( $p->next_tag( array( 'tag_closers' => $in_list ? 'visit' : 'skip' ) ) ) {
	 *         if ( 'UL' === $p->get_tag() ) {
	 *             if ( $p->is_tag_closer() ) {
	 *                 $in_list = false;
	 *                 $p->set_bookmark( 'resume' );
	 *                 if ( $p->seek( 'last-li' ) ) {
	 *                     $p->add_class( 'last-li' );
	 *                 }
	 *                 $p->seek( 'resume' );
	 *                 $p->release_bookmark( 'last-li' );
	 *                 $p->release_bookmark( 'resume' );
	 *             } else {
	 *                 $in_list = true;
	 *             }
	 *         }
	 *
	 *         if ( 'LI' === $p->get_tag() ) {
	 *             $p->set_bookmark( 'last-li' );
	 *         }
	 *     }
	 *
	 * Bookmarks intentionally hide the internal string offsets
	 * to which they refer. They are maintained internally as
	 * updates are applied to the HTML document and therefore
	 * retain their "position" - the location to which they
	 * originally pointed. The inability to use bookmarks with
	 * functions like `substr` is therefore intentional to guard
	 * against accidentally breaking the HTML.
	 *
	 * Because bookmarks allocate memory and require processing
	 * for every applied update, they are limited and require
	 * a name. They should not be created with programmatically-made
	 * names, such as "li_{$index}" with some loop. As a general
	 * rule they should only be created with string-literal names
	 * like "start-of-section" or "last-paragraph".
	 *
	 * Bookmarks are a powerful tool to enable complicated behavior.
	 * Consider double-checking that you need this tool if you are
	 * reaching for it, as inappropriate use could lead to broken
	 * HTML structure or unwanted processing overhead.
	 *
	 * @since 6.2.0
	 *
	 * @param string $name Identifies this particular bookmark.
	 * @return bool Whether the bookmark was successfully created.
	 */
	public function set_bookmark( $name ): bool {
		// It only makes sense to set a bookmark if the parser has paused on a concrete token.
1340 if ( 1341 self::STATE_COMPLETE === $this->parser_state || 1342 self::STATE_INCOMPLETE_INPUT === $this->parser_state 1343 ) { 1344 return false; 1345 } 1346 1347 if ( ! array_key_exists( $name, $this->bookmarks ) && count( $this->bookmarks ) >= static::MAX_BOOKMARKS ) { 1348 _doing_it_wrong( 1349 __METHOD__, 1350 __( 'Too many bookmarks: cannot create any more.' ), 1351 '6.2.0' 1352 ); 1353 return false; 1354 } 1355 1356 $this->bookmarks[ $name ] = new WP_HTML_Span( $this->token_starts_at, $this->token_length ); 1357 1358 return true; 1359 } 1360 1361 1362 /** 1363 * Removes a bookmark that is no longer needed. 1364 * 1365 * Releasing a bookmark frees up the small 1366 * performance overhead it requires. 1367 * 1368 * @param string $name Name of the bookmark to remove. 1369 * @return bool Whether the bookmark already existed before removal. 1370 */ 1371 public function release_bookmark( $name ): bool { 1372 if ( ! array_key_exists( $name, $this->bookmarks ) ) { 1373 return false; 1374 } 1375 1376 unset( $this->bookmarks[ $name ] ); 1377 1378 return true; 1379 } 1380 1381 /** 1382 * Skips contents of generic rawtext elements. 1383 * 1384 * @since 6.3.2 1385 * 1386 * @see https://html.spec.whatwg.org/#generic-raw-text-element-parsing-algorithm 1387 * 1388 * @param string $tag_name The uppercase tag name which will close the RAWTEXT region. 1389 * @return bool Whether an end to the RAWTEXT region was found before the end of the document. 1390 */ 1391 private function skip_rawtext( string $tag_name ): bool { 1392 /* 1393 * These two functions distinguish themselves on whether character references are 1394 * decoded, and since functionality to read the inner markup isn't supported, it's 1395 * not necessary to implement these two functions separately. 1396 */ 1397 return $this->skip_rcdata( $tag_name ); 1398 } 1399 1400 /** 1401 * Skips contents of RCDATA elements, namely title and textarea tags. 
1402 * 1403 * @since 6.2.0 1404 * 1405 * @see https://html.spec.whatwg.org/multipage/parsing.html#rcdata-state 1406 * 1407 * @param string $tag_name The uppercase tag name which will close the RCDATA region. 1408 * @return bool Whether an end to the RCDATA region was found before the end of the document. 1409 */ 1410 private function skip_rcdata( string $tag_name ): bool { 1411 $html = $this->html; 1412 $doc_length = strlen( $html ); 1413 $tag_length = strlen( $tag_name ); 1414 1415 $at = $this->bytes_already_parsed; 1416 1417 while ( false !== $at && $at < $doc_length ) { 1418 $at = strpos( $this->html, '</', $at ); 1419 $this->tag_name_starts_at = $at; 1420 1421 // Fail if there is no possible tag closer. 1422 if ( false === $at || ( $at + $tag_length ) >= $doc_length ) { 1423 return false; 1424 } 1425 1426 $at += 2; 1427 1428 /* 1429 * Find a case-insensitive match to the tag name. 1430 * 1431 * Because tag names are limited to US-ASCII there is no 1432 * need to perform any kind of Unicode normalization when 1433 * comparing; any character which could be impacted by such 1434 * normalization could not be part of a tag name. 1435 */ 1436 for ( $i = 0; $i < $tag_length; $i++ ) { 1437 $tag_char = $tag_name[ $i ]; 1438 $html_char = $html[ $at + $i ]; 1439 1440 if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) { 1441 $at += $i; 1442 continue 2; 1443 } 1444 } 1445 1446 $at += $tag_length; 1447 $this->bytes_already_parsed = $at; 1448 1449 if ( $at >= strlen( $html ) ) { 1450 return false; 1451 } 1452 1453 /* 1454 * Ensure that the tag name terminates to avoid matching on 1455 * substrings of a longer tag name. For example, the sequence 1456 * "</textarearug" should not match for "</textarea" even 1457 * though "textarea" is found within the text. 
			 */
			$c = $html[ $at ];
			if ( ' ' !== $c && "\t" !== $c && "\r" !== $c && "\n" !== $c && '/' !== $c && '>' !== $c ) {
				continue;
			}

			while ( $this->parse_next_attribute() ) {
				continue;
			}

			$at = $this->bytes_already_parsed;
			if ( $at >= strlen( $this->html ) ) {
				return false;
			}

			if ( '>' === $html[ $at ] ) {
				$this->bytes_already_parsed = $at + 1;
				return true;
			}

			if ( $at + 1 >= strlen( $this->html ) ) {
				return false;
			}

			if ( '/' === $html[ $at ] && '>' === $html[ $at + 1 ] ) {
				$this->bytes_already_parsed = $at + 2;
				return true;
			}
		}

		return false;
	}

	/**
	 * Skips contents of script tags.
	 *
	 * @since 6.2.0
	 *
	 * @return bool Whether the script tag was closed before the end of the document.
	 */
	private function skip_script_data(): bool {
		$state      = 'unescaped';
		$html       = $this->html;
		$doc_length = strlen( $html );
		$at         = $this->bytes_already_parsed;

		while ( false !== $at && $at < $doc_length ) {
			$at += strcspn( $html, '-<', $at );

			/*
			 * Optimization: Terminating a complete script element requires at least eight
			 * additional bytes in the document. Some checks below may cause local escaped
			 * state transitions when processing shorter strings, but those transitions are
			 * irrelevant if the script tag is incomplete and the function must return false.
			 *
			 * This may need updating if those transitions become significant or exported from
			 * this function in some way, such as when building safe methods to embed JavaScript
			 * or data inside a SCRIPT element.
			 *
			 *     $at may be here.
			 *        ↓
			 *     ...</script>
			 *         ╰──┬───╯
			 *     $at + 8 additional bytes are required for a non-false return value.
			 *
			 * This single check eliminates the need to check lengths for the shorter spans:
			 *
			 *           $at may be here.
			 *                  ↓
			 *     <script><!-- --></script>
			 *                   ├╯
			 *             $at + 2 additional characters does not require a length check.
			 *
			 * The transition from "escaped" to "unescaped" is not relevant if the document ends:
			 *
			 *           $at may be here.
			 *                  ↓
			 *     <script><!-- -->[[END-OF-DOCUMENT]]
			 *                   ╰──┬───╯
			 *             $at + 8 additional bytes is not satisfied, return false.
			 */
			if ( $at + 8 >= $doc_length ) {
				return false;
			}

			/*
			 * For all script states a "-->" transitions
			 * back into the normal unescaped script mode,
			 * even if that's the current state.
			 */
			if (
				'-' === $html[ $at ] &&
				'-' === $html[ $at + 1 ] &&
				'>' === $html[ $at + 2 ]
			) {
				$at   += 3;
				$state = 'unescaped';
				continue;
			}

			/*
			 * Everything of interest past here starts with "<".
			 * Check this character and advance position regardless.
			 */
			if ( '<' !== $html[ $at++ ] ) {
				continue;
			}

			/*
			 * "<!--" only transitions from _unescaped_ to _escaped_. This byte sequence is only
			 * significant in the _unescaped_ state and is ignored in any other state.
			 */
			if (
				'unescaped' === $state &&
				'!' === $html[ $at ] &&
				'-' === $html[ $at + 1 ] &&
				'-' === $html[ $at + 2 ]
			) {
				$at += 3;

				/*
				 * The parser is ready to enter the _escaped_ state, but may remain in the
				 * _unescaped_ state. This occurs when "<!--" is immediately followed by a
				 * sequence of 0 or more "-" followed by ">". This is similar to abruptly closed
				 * HTML comments like "<!-->" or "<!--->".
				 *
				 * Note that this check may advance the position significantly and requires a
				 * length check to prevent bad offsets on inputs like `<script><!---------`.
1586 */ 1587 $at += strspn( $html, '-', $at ); 1588 if ( $at < $doc_length && '>' === $html[ $at ] ) { 1589 ++$at; 1590 continue; 1591 } 1592 1593 $state = 'escaped'; 1594 continue; 1595 } 1596 1597 if ( '/' === $html[ $at ] ) { 1598 $closer_potentially_starts_at = $at - 1; 1599 $is_closing = true; 1600 ++$at; 1601 } else { 1602 $is_closing = false; 1603 } 1604 1605 /* 1606 * At this point the only remaining state-changes occur with the 1607 * <script> and </script> tags; unless one of these appears next, 1608 * proceed scanning to the next potential token in the text. 1609 */ 1610 if ( ! ( 1611 ( 's' === $html[ $at ] || 'S' === $html[ $at ] ) && 1612 ( 'c' === $html[ $at + 1 ] || 'C' === $html[ $at + 1 ] ) && 1613 ( 'r' === $html[ $at + 2 ] || 'R' === $html[ $at + 2 ] ) && 1614 ( 'i' === $html[ $at + 3 ] || 'I' === $html[ $at + 3 ] ) && 1615 ( 'p' === $html[ $at + 4 ] || 'P' === $html[ $at + 4 ] ) && 1616 ( 't' === $html[ $at + 5 ] || 'T' === $html[ $at + 5 ] ) 1617 ) ) { 1618 ++$at; 1619 continue; 1620 } 1621 1622 /* 1623 * Ensure that the script tag terminates to avoid matching on 1624 * substrings of a non-match. For example, the sequence 1625 * "<script123" should not end a script region even though 1626 * "<script" is found within the text. 1627 */ 1628 $at += 6; 1629 $c = $html[ $at ]; 1630 if ( 1631 /** 1632 * These characters trigger state transitions of interest: 1633 * 1634 * - @see {https://html.spec.whatwg.org/multipage/parsing.html#script-data-end-tag-name-state} 1635 * - @see {https://html.spec.whatwg.org/multipage/parsing.html#script-data-escaped-end-tag-name-state} 1636 * - @see {https://html.spec.whatwg.org/multipage/parsing.html#script-data-double-escape-start-state} 1637 * - @see {https://html.spec.whatwg.org/multipage/parsing.html#script-data-double-escape-end-state} 1638 * 1639 * The "\r" character is not present in the above references. However, "\r" must be 1640 * treated the same as "\n". 
This is because the HTML Standard requires newline 1641 * normalization during preprocessing which applies this replacement. 1642 * 1643 * - @see https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream 1644 * - @see https://infra.spec.whatwg.org/#normalize-newlines 1645 */ 1646 '>' !== $c && 1647 ' ' !== $c && 1648 "\n" !== $c && 1649 '/' !== $c && 1650 "\t" !== $c && 1651 "\f" !== $c && 1652 "\r" !== $c 1653 ) { 1654 continue; 1655 } 1656 1657 if ( 'escaped' === $state && ! $is_closing ) { 1658 $state = 'double-escaped'; 1659 continue; 1660 } 1661 1662 if ( 'double-escaped' === $state && $is_closing ) { 1663 $state = 'escaped'; 1664 continue; 1665 } 1666 1667 if ( $is_closing ) { 1668 $this->bytes_already_parsed = $closer_potentially_starts_at; 1669 $this->tag_name_starts_at = $closer_potentially_starts_at; 1670 if ( $this->bytes_already_parsed >= $doc_length ) { 1671 return false; 1672 } 1673 1674 while ( $this->parse_next_attribute() ) { 1675 continue; 1676 } 1677 1678 if ( $this->bytes_already_parsed >= $doc_length ) { 1679 return false; 1680 } 1681 1682 if ( '>' === $html[ $this->bytes_already_parsed ] ) { 1683 ++$this->bytes_already_parsed; 1684 return true; 1685 } 1686 } 1687 1688 ++$at; 1689 } 1690 1691 return false; 1692 } 1693 1694 /** 1695 * Parses the next tag. 1696 * 1697 * This will find and start parsing the next tag, including 1698 * the opening `<`, the potential closer `/`, and the tag 1699 * name. It does not parse the attributes or scan to the 1700 * closing `>`; these are left for other methods. 1701 * 1702 * @since 6.2.0 1703 * @since 6.2.1 Support abruptly-closed comments, invalid-tag-closer-comments, and empty elements. 1704 * 1705 * @return bool Whether a tag was found before the end of the document. 
1706 */ 1707 private function parse_next_tag(): bool { 1708 $this->after_tag(); 1709 1710 $html = $this->html; 1711 $doc_length = strlen( $html ); 1712 $was_at = $this->bytes_already_parsed; 1713 $at = $was_at; 1714 1715 while ( $at < $doc_length ) { 1716 $at = strpos( $html, '<', $at ); 1717 if ( false === $at ) { 1718 break; 1719 } 1720 1721 if ( $at > $was_at ) { 1722 /* 1723 * A "<" normally starts a new HTML tag or syntax token, but in cases where the 1724 * following character can't produce a valid token, the "<" is instead treated 1725 * as plaintext and the parser should skip over it. This avoids a problem when 1726 * following earlier practices of typing emoji with text, e.g. "<3". This 1727 * should be a heart, not a tag. It's supposed to be rendered, not hidden. 1728 * 1729 * At this point the parser checks if this is one of those cases and if it is 1730 * will continue searching for the next "<" in search of a token boundary. 1731 * 1732 * @see https://html.spec.whatwg.org/#tag-open-state 1733 */ 1734 if ( 1 !== strspn( $html, '!/?abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', $at + 1, 1 ) ) { 1735 ++$at; 1736 continue; 1737 } 1738 1739 $this->parser_state = self::STATE_TEXT_NODE; 1740 $this->token_starts_at = $was_at; 1741 $this->token_length = $at - $was_at; 1742 $this->text_starts_at = $was_at; 1743 $this->text_length = $this->token_length; 1744 $this->bytes_already_parsed = $at; 1745 return true; 1746 } 1747 1748 $this->token_starts_at = $at; 1749 1750 if ( $at + 1 < $doc_length && '/' === $this->html[ $at + 1 ] ) { 1751 $this->is_closing_tag = true; 1752 ++$at; 1753 } else { 1754 $this->is_closing_tag = false; 1755 } 1756 1757 /* 1758 * HTML tag names must start with [a-zA-Z] otherwise they are not tags. 1759 * For example, "<3" is rendered as text, not a tag opener. If at least 1760 * one letter follows the "<" then _it is_ a tag, but if the following 1761 * character is anything else it _is not a tag_. 
1762 * 1763 * It's not uncommon to find non-tags starting with `<` in an HTML 1764 * document, so it's good for performance to make this pre-check before 1765 * continuing to attempt to parse a tag name. 1766 * 1767 * Reference: 1768 * * https://html.spec.whatwg.org/multipage/parsing.html#data-state 1769 * * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state 1770 */ 1771 $tag_name_prefix_length = strspn( $html, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', $at + 1 ); 1772 if ( $tag_name_prefix_length > 0 ) { 1773 ++$at; 1774 $this->parser_state = self::STATE_MATCHED_TAG; 1775 $this->tag_name_starts_at = $at; 1776 $this->tag_name_length = $tag_name_prefix_length + strcspn( $html, " \t\f\r\n/>", $at + $tag_name_prefix_length ); 1777 $this->bytes_already_parsed = $at + $this->tag_name_length; 1778 return true; 1779 } 1780 1781 /* 1782 * Abort if no tag is found before the end of 1783 * the document. There is nothing left to parse. 1784 */ 1785 if ( $at + 1 >= $doc_length ) { 1786 $this->parser_state = self::STATE_INCOMPLETE_INPUT; 1787 1788 return false; 1789 } 1790 1791 /* 1792 * `<!` transitions to markup declaration open state 1793 * https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state 1794 */ 1795 if ( ! $this->is_closing_tag && '!' === $html[ $at + 1 ] ) { 1796 /* 1797 * `<!--` transitions to a comment state – apply further comment rules. 1798 * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state 1799 */ 1800 if ( 0 === substr_compare( $html, '--', $at + 2, 2 ) ) { 1801 $closer_at = $at + 4; 1802 // If it's not possible to close the comment then there is nothing more to scan. 1803 if ( $doc_length <= $closer_at ) { 1804 $this->parser_state = self::STATE_INCOMPLETE_INPUT; 1805 1806 return false; 1807 } 1808 1809 // Abruptly-closed empty comments are a sequence of dashes followed by `>`. 
					$span_of_dashes = strspn( $html, '-', $closer_at );
					if ( '>' === $html[ $closer_at + $span_of_dashes ] ) {
						/*
						 * @todo When implementing `set_modifiable_text()` ensure that updates to this token
						 *       don't break the syntax for short comments, e.g. `<!--->`. Unlike other comment
						 *       and bogus comment syntax, these leave no clear insertion point for text and
						 *       they need to be modified specially in order to contain text. E.g. to store
						 *       `?` as the modifiable text, the `<!--->` needs to become `<!--?-->`, which
						 *       involves inserting an additional `-` into the token after the modifiable text.
						 */
						$this->parser_state = self::STATE_COMMENT;
						$this->comment_type = self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT;
						$this->token_length = $closer_at + $span_of_dashes + 1 - $this->token_starts_at;

						// Only provide modifiable text if the token is long enough to contain it.
						if ( $span_of_dashes >= 2 ) {
							$this->comment_type   = self::COMMENT_AS_HTML_COMMENT;
							$this->text_starts_at = $this->token_starts_at + 4;
							$this->text_length    = $span_of_dashes - 2;
						}

						$this->bytes_already_parsed = $closer_at + $span_of_dashes + 1;
						return true;
					}

					/*
					 * Comments may be closed by either a --> or an invalid --!>.
					 * The first occurrence closes the comment.
					 *
					 * See https://html.spec.whatwg.org/#parse-error-incorrectly-closed-comment
					 */
					--$closer_at; // Pre-increment inside condition below reduces risk of accidental infinite looping.
					while ( ++$closer_at < $doc_length ) {
						$closer_at = strpos( $html, '--', $closer_at );
						if ( false === $closer_at ) {
							$this->parser_state = self::STATE_INCOMPLETE_INPUT;

							return false;
						}

						if ( $closer_at + 2 < $doc_length && '>' === $html[ $closer_at + 2 ] ) {
							$this->parser_state         = self::STATE_COMMENT;
							$this->comment_type         = self::COMMENT_AS_HTML_COMMENT;
							$this->token_length         = $closer_at + 3 - $this->token_starts_at;
							$this->text_starts_at       = $this->token_starts_at + 4;
							$this->text_length          = $closer_at - $this->text_starts_at;
							$this->bytes_already_parsed = $closer_at + 3;
							return true;
						}

						if (
							$closer_at + 3 < $doc_length &&
							'!' === $html[ $closer_at + 2 ] &&
							'>' === $html[ $closer_at + 3 ]
						) {
							$this->parser_state         = self::STATE_COMMENT;
							$this->comment_type         = self::COMMENT_AS_HTML_COMMENT;
							$this->token_length         = $closer_at + 4 - $this->token_starts_at;
							$this->text_starts_at       = $this->token_starts_at + 4;
							$this->text_length          = $closer_at - $this->text_starts_at;
							$this->bytes_already_parsed = $closer_at + 4;
							return true;
						}
					}
				}

				/*
				 * `<!DOCTYPE` transitions to DOCTYPE state – skip to the nearest >
				 * These are ASCII-case-insensitive.
				 * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
				 */
				if (
					$doc_length > $at + 8 &&
					( 'D' === $html[ $at + 2 ] || 'd' === $html[ $at + 2 ] ) &&
					( 'O' === $html[ $at + 3 ] || 'o' === $html[ $at + 3 ] ) &&
					( 'C' === $html[ $at + 4 ] || 'c' === $html[ $at + 4 ] ) &&
					( 'T' === $html[ $at + 5 ] || 't' === $html[ $at + 5 ] ) &&
					( 'Y' === $html[ $at + 6 ] || 'y' === $html[ $at + 6 ] ) &&
					( 'P' === $html[ $at + 7 ] || 'p' === $html[ $at + 7 ] ) &&
					( 'E' === $html[ $at + 8 ] || 'e' === $html[ $at + 8 ] )
				) {
					$closer_at = strpos( $html, '>', $at + 9 );
					if ( false === $closer_at ) {
						$this->parser_state = self::STATE_INCOMPLETE_INPUT;

						return false;
					}

					$this->parser_state         = self::STATE_DOCTYPE;
					$this->token_length         = $closer_at + 1 - $this->token_starts_at;
					$this->text_starts_at       = $this->token_starts_at + 9;
					$this->text_length          = $closer_at - $this->text_starts_at;
					$this->bytes_already_parsed = $closer_at + 1;
					return true;
				}

				if (
					'html' !== $this->parsing_namespace &&
					strlen( $html ) > $at + 8 &&
					'[' === $html[ $at + 2 ] &&
					'C' === $html[ $at + 3 ] &&
					'D' === $html[ $at + 4 ] &&
					'A' === $html[ $at + 5 ] &&
					'T' === $html[ $at + 6 ] &&
					'A' === $html[ $at + 7 ] &&
					'[' === $html[ $at + 8 ]
				) {
					$closer_at = strpos( $html, ']]>', $at + 9 );
					if ( false === $closer_at ) {
						$this->parser_state = self::STATE_INCOMPLETE_INPUT;

						return false;
					}

					$this->parser_state         = self::STATE_CDATA_NODE;
					$this->text_starts_at       = $at + 9;
					$this->text_length          = $closer_at - $this->text_starts_at;
					$this->token_length         = $closer_at + 3 - $this->token_starts_at;
					$this->bytes_already_parsed = $closer_at + 3;
					return true;
				}

				/*
				 * Anything else here is an incorrectly-opened comment and transitions
				 * to the bogus comment state - skip to the nearest >. If no closer is
				 * found then the HTML was truncated inside the markup declaration.
				 */
				$closer_at = strpos( $html, '>', $at + 1 );
				if ( false === $closer_at ) {
					$this->parser_state = self::STATE_INCOMPLETE_INPUT;

					return false;
				}

				$this->parser_state         = self::STATE_COMMENT;
				$this->comment_type         = self::COMMENT_AS_INVALID_HTML;
				$this->token_length         = $closer_at + 1 - $this->token_starts_at;
				$this->text_starts_at       = $this->token_starts_at + 2;
				$this->text_length          = $closer_at - $this->text_starts_at;
				$this->bytes_already_parsed = $closer_at + 1;

				/*
				 * Identify nodes that would be CDATA if HTML had CDATA sections.
				 *
				 * This section must occur after identifying the bogus comment end
				 * because in an HTML parser it will span to the nearest `>`, even
				 * if there's no `]]>` as would be required in an XML document. It
				 * is therefore not possible to parse a CDATA section containing
				 * a `>` in the HTML syntax.
				 *
				 * Inside foreign elements there is a discrepancy between browsers
				 * and the specification on this.
				 *
				 * @todo Track whether the Tag Processor is inside a foreign element
				 *       and require the proper closing `]]>` in those cases.
				 */
				if (
					$this->token_length >= 10 &&
					'[' === $html[ $this->token_starts_at + 2 ] &&
					'C' === $html[ $this->token_starts_at + 3 ] &&
					'D' === $html[ $this->token_starts_at + 4 ] &&
					'A' === $html[ $this->token_starts_at + 5 ] &&
					'T' === $html[ $this->token_starts_at + 6 ] &&
					'A' === $html[ $this->token_starts_at + 7 ] &&
					'[' === $html[ $this->token_starts_at + 8 ] &&
					']' === $html[ $closer_at - 1 ] &&
					']' === $html[ $closer_at - 2 ]
				) {
					$this->parser_state    = self::STATE_COMMENT;
					$this->comment_type    = self::COMMENT_AS_CDATA_LOOKALIKE;
					$this->text_starts_at += 7;
					$this->text_length    -= 9;
				}

				return true;
			}

			/*
			 * </> is a missing end tag name, which is ignored.
			 *
			 * This was also known as the "presumptuous empty tag"
			 * in early discussions as it was proposed to close
			 * the nearest previous opening tag.
			 *
			 * See https://html.spec.whatwg.org/#parse-error-missing-end-tag-name
			 */
			if ( '>' === $html[ $at + 1 ] ) {
				// `<>` is interpreted as plaintext.
				if ( ! $this->is_closing_tag ) {
					++$at;
					continue;
				}

				$this->parser_state         = self::STATE_PRESUMPTUOUS_TAG;
				$this->token_length         = $at + 2 - $this->token_starts_at;
				$this->bytes_already_parsed = $at + 2;
				return true;
			}

			/*
			 * `<?` transitions to a bogus comment state – skip to the nearest >
			 * See https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
			 */
			if ( ! $this->is_closing_tag && '?' === $html[ $at + 1 ] ) {
				$closer_at = strpos( $html, '>', $at + 2 );
				if ( false === $closer_at ) {
					$this->parser_state = self::STATE_INCOMPLETE_INPUT;

					return false;
				}

				$this->parser_state         = self::STATE_COMMENT;
				$this->comment_type         = self::COMMENT_AS_INVALID_HTML;
				$this->token_length         = $closer_at + 1 - $this->token_starts_at;
				$this->text_starts_at       = $this->token_starts_at + 2;
				$this->text_length          = $closer_at - $this->text_starts_at;
				$this->bytes_already_parsed = $closer_at + 1;

				/*
				 * Identify a Processing Instruction node were HTML to have them.
				 *
				 * This section must occur after identifying the bogus comment end
				 * because in an HTML parser it will span to the nearest `>`, even
				 * if there's no `?>` as would be required in an XML document. It
				 * is therefore not possible to parse a Processing Instruction node
				 * containing a `>` in the HTML syntax.
				 *
				 * XML allows for more target names, but this code only identifies
				 * those with ASCII-representable target names. This means that it
				 * may identify some Processing Instruction nodes as bogus comments,
				 * but it will not misinterpret the HTML structure. By limiting the
				 * identification to these target names the Tag Processor can avoid
				 * the need to start parsing UTF-8 sequences.
				 *
				 * > NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
				 *                     [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
				 *                     [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
				 *                     [#x10000-#xEFFFF]
				 * > NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
				 *
				 * @todo Processing instruction nodes in SGML may contain any kind of markup. XML defines a
				 *       special case with `<?xml ... ?>` syntax, but the `?` is part of the bogus comment.
				 *
				 * @see https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget
				 */
				if ( $this->token_length >= 5 && '?' === $html[ $closer_at - 1 ] ) {
					$comment_text     = substr( $html, $this->token_starts_at + 2, $this->token_length - 4 );
					$pi_target_length = strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_' );

					if ( 0 < $pi_target_length ) {
						$pi_target_length += strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.', $pi_target_length );

						$this->comment_type       = self::COMMENT_AS_PI_NODE_LOOKALIKE;
						$this->tag_name_starts_at = $this->token_starts_at + 2;
						$this->tag_name_length    = $pi_target_length;
						$this->text_starts_at    += $pi_target_length;
						$this->text_length       -= $pi_target_length + 1;
					}
				}

				return true;
			}

			/*
			 * If a non-alpha starts the tag name in a tag closer it's a comment.
			 * Find the first `>`, which closes the comment.
			 *
			 * This parser classifies these particular comments as special "funky comments"
			 * which are made available for further processing.
			 *
			 * See https://html.spec.whatwg.org/#parse-error-invalid-first-character-of-tag-name
			 */
			if ( $this->is_closing_tag ) {
				// No chance of finding a closer.
				if ( $at + 3 > $doc_length ) {
					$this->parser_state = self::STATE_INCOMPLETE_INPUT;

					return false;
				}

				$closer_at = strpos( $html, '>', $at + 2 );
				if ( false === $closer_at ) {
					$this->parser_state = self::STATE_INCOMPLETE_INPUT;

					return false;
				}

				$this->parser_state         = self::STATE_FUNKY_COMMENT;
				$this->token_length         = $closer_at + 1 - $this->token_starts_at;
				$this->text_starts_at       = $this->token_starts_at + 2;
				$this->text_length          = $closer_at - $this->text_starts_at;
				$this->bytes_already_parsed = $closer_at + 1;
				return true;
			}

			++$at;
		}

		/*
		 * This does not imply an incomplete parse; it indicates that there
		 * can be nothing left in the document other than a #text node.
		 */
		$this->parser_state         = self::STATE_TEXT_NODE;
		$this->token_starts_at      = $was_at;
		$this->token_length         = $doc_length - $was_at;
		$this->text_starts_at       = $was_at;
		$this->text_length          = $this->token_length;
		$this->bytes_already_parsed = $doc_length;
		return true;
	}

	/**
	 * Parses the next attribute.
	 *
	 * @since 6.2.0
	 *
	 * @return bool Whether an attribute was found before the end of the document.
	 */
	private function parse_next_attribute(): bool {
		$doc_length = strlen( $this->html );

		// Skip whitespace and slashes.
		$this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n/", $this->bytes_already_parsed );
		if ( $this->bytes_already_parsed >= $doc_length ) {
			$this->parser_state = self::STATE_INCOMPLETE_INPUT;

			return false;
		}

		/*
		 * Treat the equal sign as a part of the attribute
		 * name if it is the first encountered byte.
		 *
		 * @see https://html.spec.whatwg.org/multipage/parsing.html#before-attribute-name-state
		 */
		$name_length = '=' === $this->html[ $this->bytes_already_parsed ]
			? 1 + strcspn( $this->html, "=/> \t\f\r\n", $this->bytes_already_parsed + 1 )
			: strcspn( $this->html, "=/> \t\f\r\n", $this->bytes_already_parsed );

		// No attribute, just tag closer.
		if ( 0 === $name_length || $this->bytes_already_parsed + $name_length >= $doc_length ) {
			return false;
		}

		$attribute_start             = $this->bytes_already_parsed;
		$attribute_name              = substr( $this->html, $attribute_start, $name_length );
		$this->bytes_already_parsed += $name_length;
		if ( $this->bytes_already_parsed >= $doc_length ) {
			$this->parser_state = self::STATE_INCOMPLETE_INPUT;

			return false;
		}

		$this->skip_whitespace();
		if ( $this->bytes_already_parsed >= $doc_length ) {
			$this->parser_state = self::STATE_INCOMPLETE_INPUT;

			return false;
		}

		$has_value = '=' === $this->html[ $this->bytes_already_parsed ];
		if ( $has_value ) {
			++$this->bytes_already_parsed;
			$this->skip_whitespace();
			if ( $this->bytes_already_parsed >= $doc_length ) {
				$this->parser_state = self::STATE_INCOMPLETE_INPUT;

				return false;
			}

			switch ( $this->html[ $this->bytes_already_parsed ] ) {
				case "'":
				case '"':
					$quote                      = $this->html[ $this->bytes_already_parsed ];
					$value_start                = $this->bytes_already_parsed + 1;
					$end_quote_at               = strpos( $this->html, $quote, $value_start );
					$end_quote_at               = false === $end_quote_at ? $doc_length : $end_quote_at;
					$value_length               = $end_quote_at - $value_start;
					$attribute_end              = $end_quote_at + 1;
					$this->bytes_already_parsed = $attribute_end;
					break;

				default:
					$value_start                = $this->bytes_already_parsed;
					$value_length               = strcspn( $this->html, "> \t\f\r\n", $value_start );
					$attribute_end              = $value_start + $value_length;
					$this->bytes_already_parsed = $attribute_end;
			}
		} else {
			$value_start   = $this->bytes_already_parsed;
			$value_length  = 0;
			$attribute_end = $attribute_start + $name_length;
		}

		if ( $attribute_end >= $doc_length ) {
			$this->parser_state = self::STATE_INCOMPLETE_INPUT;

			return false;
		}

		if ( $this->is_closing_tag ) {
			return true;
		}

		/*
		 * > There must never be two or more attributes on
		 * > the same start tag whose names are an ASCII
		 * > case-insensitive match for each other.
		 *     - HTML 5 spec
		 *
		 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
		 */
		$comparable_name = strtolower( $attribute_name );

		// If an attribute is listed many times, only use the first declaration and ignore the rest.
		if ( ! isset( $this->attributes[ $comparable_name ] ) ) {
			$this->attributes[ $comparable_name ] = new WP_HTML_Attribute_Token(
				$attribute_name,
				$value_start,
				$value_length,
				$attribute_start,
				$attribute_end - $attribute_start,
				! $has_value
			);

			return true;
		}

		/*
		 * Track the duplicate attributes so if we remove it, all disappear together.
		 *
		 * While `$this->duplicate_attributes` could always be stored as an `array()`,
		 * which would simplify the logic here, storing a `null` and only allocating
		 * an array when encountering duplicates avoids needless allocations in the
		 * normative case of parsing tags with no duplicate attributes.
		 */
		$duplicate_span = new WP_HTML_Span( $attribute_start, $attribute_end - $attribute_start );
		if ( null === $this->duplicate_attributes ) {
			$this->duplicate_attributes = array( $comparable_name => array( $duplicate_span ) );
		} elseif ( ! isset( $this->duplicate_attributes[ $comparable_name ] ) ) {
			$this->duplicate_attributes[ $comparable_name ] = array( $duplicate_span );
		} else {
			$this->duplicate_attributes[ $comparable_name ][] = $duplicate_span;
		}

		return true;
	}

	/**
	 * Move the internal cursor past any immediate successive whitespace.
	 *
	 * @since 6.2.0
	 */
	private function skip_whitespace(): void {
		$this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n", $this->bytes_already_parsed );
	}

	/**
	 * Applies attribute updates and cleans up once a tag is fully parsed.
	 *
	 * @since 6.2.0
	 */
	private function after_tag(): void {
		/*
		 * There could be lexical updates enqueued for an attribute that
		 * also exists on the next tag. In order to avoid conflating the
		 * attributes across the two tags, lexical updates with names
		 * need to be flushed to raw lexical updates.
		 */
		$this->class_name_updates_to_attributes_updates();

		/*
		 * Purge updates if there are too many. The actual count isn't
		 * scientific, but a few values from 100 to a few thousand were
		 * tested to find a practically-useful limit.
		 *
		 * If the update queue grows too big, then the Tag Processor
		 * will spend more time iterating through them and lose the
		 * efficiency gains of deferring applying them.
		 */
		if ( 1000 < count( $this->lexical_updates ) ) {
			$this->get_updated_html();
		}

		foreach ( $this->lexical_updates as $name => $update ) {
			/*
			 * Any updates appearing after the cursor should be applied
			 * before proceeding, otherwise they may be overlooked.
			 */
			if ( $update->start >= $this->bytes_already_parsed ) {
				$this->get_updated_html();
				break;
			}

			if ( is_int( $name ) ) {
				continue;
			}

			$this->lexical_updates[] = $update;
			unset( $this->lexical_updates[ $name ] );
		}

		$this->token_starts_at          = null;
		$this->token_length             = null;
		$this->tag_name_starts_at       = null;
		$this->tag_name_length          = null;
		$this->text_starts_at           = 0;
		$this->text_length              = 0;
		$this->is_closing_tag           = null;
		$this->attributes               = array();
		$this->comment_type             = null;
		$this->text_node_classification = self::TEXT_IS_GENERIC;
		$this->duplicate_attributes     = null;
	}

	/**
	 * Converts class name updates into tag attributes updates
	 * (they are accumulated in different data formats for performance).
	 *
	 * @since 6.2.0
	 *
	 * @see WP_HTML_Tag_Processor::$lexical_updates
	 * @see WP_HTML_Tag_Processor::$classname_updates
	 */
	private function class_name_updates_to_attributes_updates(): void {
		if ( count( $this->classname_updates ) === 0 ) {
			return;
		}

		$existing_class = $this->get_enqueued_attribute_value( 'class' );
		if ( null === $existing_class || true === $existing_class ) {
			$existing_class = '';
		}

		if ( false === $existing_class && isset( $this->attributes['class'] ) ) {
			$existing_class = substr(
				$this->html,
				$this->attributes['class']->value_starts_at,
				$this->attributes['class']->value_length
			);
		}

		if ( false === $existing_class ) {
			$existing_class = '';
		}

		/**
		 * Updated "class" attribute value.
		 *
		 * This is incrementally built while scanning through the existing class
		 * attribute, skipping removed classes on the way, and then appending
		 * added classes at the end. Only when finished processing will the
		 * value contain the final new value.
		 *
		 * @var string $class
		 */
		$class = '';

		/**
		 * Tracks the cursor position in the existing
		 * class attribute value while parsing.
		 *
		 * @var int $at
		 */
		$at = 0;

		/**
		 * Indicates if there's any need to modify the existing class attribute.
		 *
		 * If a call to `add_class()` and `remove_class()` wouldn't impact
		 * the `class` attribute value then there's no need to rebuild it.
		 * For example, when adding a class that's already present or
		 * removing one that isn't.
		 *
		 * This flag enables a performance optimization when none of the enqueued
		 * class updates would impact the `class` attribute; namely, that the
		 * processor can continue without modifying the input document, as if
		 * none of the `add_class()` or `remove_class()` calls had been made.
		 *
		 * This flag is set upon the first change that requires a string update.
		 *
		 * @var bool $modified
		 */
		$modified = false;

		$seen      = array();
		$to_remove = array();
		$is_quirks = self::QUIRKS_MODE === $this->compat_mode;
		if ( $is_quirks ) {
			foreach ( $this->classname_updates as $updated_name => $action ) {
				if ( self::REMOVE_CLASS === $action ) {
					$to_remove[] = strtolower( $updated_name );
				}
			}
		} else {
			foreach ( $this->classname_updates as $updated_name => $action ) {
				if ( self::REMOVE_CLASS === $action ) {
					$to_remove[] = $updated_name;
				}
			}
		}

		// Remove unwanted classes by only copying the new ones.
		$existing_class_length = strlen( $existing_class );
		while ( $at < $existing_class_length ) {
			// Skip to the first non-whitespace character.
			$ws_at     = $at;
			$ws_length = strspn( $existing_class, " \t\f\r\n", $ws_at );
			$at       += $ws_length;

			// Capture the class name – it's everything until the next whitespace.
			$name_length = strcspn( $existing_class, " \t\f\r\n", $at );
			if ( 0 === $name_length ) {
				// If no more class names are found then that's the end.
				break;
			}

			$name                  = substr( $existing_class, $at, $name_length );
			$comparable_class_name = $is_quirks ? strtolower( $name ) : $name;
			$at                   += $name_length;

			// If this class is marked for removal, remove it and move on to the next one.
			if ( in_array( $comparable_class_name, $to_remove, true ) ) {
				$modified = true;
				continue;
			}

			// If a class has already been seen then skip it; it should not be added twice.
			if ( in_array( $comparable_class_name, $seen, true ) ) {
				continue;
			}

			$seen[] = $comparable_class_name;

			/*
			 * Otherwise, append it to the new "class" attribute value.
			 *
			 * There are options for handling whitespace between class names.
			 * Preserving the existing whitespace produces fewer changes
			 * to the HTML content and should clarify the before/after
			 * content when debugging the modified output.
			 *
			 * This approach contrasts with normalizing the inter-class
			 * whitespace to a single space, which might appear cleaner
			 * in the output HTML but produces a noisier change.
			 */
			if ( '' !== $class ) {
				$class .= substr( $existing_class, $ws_at, $ws_length );
			}
			$class .= $name;
		}

		// Add new classes by appending those which haven't already been seen.
		foreach ( $this->classname_updates as $name => $operation ) {
			$comparable_name = $is_quirks ? strtolower( $name ) : $name;
			if ( self::ADD_CLASS === $operation && ! in_array( $comparable_name, $seen, true ) ) {
				$modified = true;

				$class .= strlen( $class ) > 0 ? ' ' : '';
				$class .= $name;
			}
		}

		$this->classname_updates = array();
		if ( ! $modified ) {
			return;
		}

		if ( strlen( $class ) > 0 ) {
			$this->set_attribute( 'class', $class );
		} else {
			$this->remove_attribute( 'class' );
		}
	}

	/**
	 * Applies attribute updates to HTML document.
	 *
	 * @since 6.2.0
	 * @since 6.2.1 Accumulates shift for internal cursor and passed pointer.
	 * @since 6.3.0 Invalidate any bookmarks whose targets are overwritten.
	 *
	 * @param int $shift_this_point Accumulate and return shift for this position.
	 * @return int How many bytes the given pointer moved in response to the updates.
	 */
	private function apply_attributes_updates( int $shift_this_point ): int {
		if ( ! count( $this->lexical_updates ) ) {
			return 0;
		}

		$accumulated_shift_for_given_point = 0;

		/*
		 * Attribute updates can be enqueued in any order but updates
		 * to the document must occur in lexical order; that is, each
		 * replacement must be made before all others which follow it
		 * at later string indices in the input document.
		 *
		 * Sorting avoids making out-of-order replacements which
		 * can lead to mangled output, partially-duplicated
		 * attributes, and overwritten attributes.
		 */
		usort( $this->lexical_updates, array( self::class, 'sort_start_ascending' ) );

		$bytes_already_copied = 0;
		$output_buffer        = '';
		foreach ( $this->lexical_updates as $diff ) {
			$shift = strlen( $diff->text ) - $diff->length;

			// Adjust the cursor position by however much an update affects it.
			if ( $diff->start < $this->bytes_already_parsed ) {
				$this->bytes_already_parsed += $shift;
			}

			// Accumulate shift of the given pointer within this function call.
			if ( $diff->start < $shift_this_point ) {
				$accumulated_shift_for_given_point += $shift;
			}

			$output_buffer       .= substr( $this->html, $bytes_already_copied, $diff->start - $bytes_already_copied );
			$output_buffer       .= $diff->text;
			$bytes_already_copied = $diff->start + $diff->length;
		}

		$this->html = $output_buffer . substr( $this->html, $bytes_already_copied );

		/*
		 * Adjust bookmark locations to account for how the text
		 * replacements adjust offsets in the input document.
		 */
		foreach ( $this->bookmarks as $bookmark_name => $bookmark ) {
			$bookmark_end = $bookmark->start + $bookmark->length;

			/*
			 * Each lexical update which appears before the bookmark's endpoints
			 * might shift the offsets for those endpoints. Loop through each change
			 * and accumulate the total shift for each bookmark, then apply that
			 * shift after tallying the full delta.
			 */
			$head_delta = 0;
			$tail_delta = 0;

			foreach ( $this->lexical_updates as $diff ) {
				$diff_end = $diff->start + $diff->length;

				if ( $bookmark->start < $diff->start && $bookmark_end < $diff->start ) {
					break;
				}

				if ( $bookmark->start >= $diff->start && $bookmark_end < $diff_end ) {
					$this->release_bookmark( $bookmark_name );
					continue 2;
				}

				$delta = strlen( $diff->text ) - $diff->length;

				if ( $bookmark->start >= $diff->start ) {
					$head_delta += $delta;
				}

				if ( $bookmark_end >= $diff_end ) {
					$tail_delta += $delta;
				}
			}

			$bookmark->start  += $head_delta;
			$bookmark->length += $tail_delta - $head_delta;
		}

		$this->lexical_updates = array();

		return $accumulated_shift_for_given_point;
	}

	/**
	 * Checks whether a bookmark with the given name exists.
	 *
	 * @since 6.3.0
	 *
	 * @param string $bookmark_name Name to identify a bookmark that potentially exists.
	 * @return bool Whether that bookmark exists.
	 */
	public function has_bookmark( $bookmark_name ): bool {
		return array_key_exists( $bookmark_name, $this->bookmarks );
	}

	/**
	 * Move the internal cursor in the Tag Processor to a given bookmark's location.
	 *
	 * In order to prevent accidental infinite loops, there's a
	 * maximum limit on the number of times seek() can be called.
	 *
	 * @since 6.2.0
	 *
	 * @param string $bookmark_name Jump to the place in the document identified by this bookmark name.
	 * @return bool Whether the internal cursor was successfully moved to the bookmark's location.
	 */
	public function seek( $bookmark_name ): bool {
		if ( ! array_key_exists( $bookmark_name, $this->bookmarks ) ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Unknown bookmark name.' ),
				'6.2.0'
			);
			return false;
		}

		$existing_bookmark = $this->bookmarks[ $bookmark_name ];

		if (
			$this->token_starts_at === $existing_bookmark->start &&
			$this->token_length === $existing_bookmark->length
		) {
			return true;
		}

		if ( ++$this->seek_count > static::MAX_SEEK_OPS ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Too many calls to seek() - this can lead to performance issues.' ),
				'6.2.0'
			);
			return false;
		}

		// Flush out any pending updates to the document.
		$this->get_updated_html();

		// Point this tag processor before the sought tag opener and consume it.
		$this->bytes_already_parsed = $this->bookmarks[ $bookmark_name ]->start;
		$this->parser_state         = self::STATE_READY;
		return $this->next_token();
	}

	/**
	 * Compare two WP_HTML_Text_Replacement objects.
	 *
	 * @since 6.2.0
	 *
	 * @param WP_HTML_Text_Replacement $a First attribute update.
	 * @param WP_HTML_Text_Replacement $b Second attribute update.
	 * @return int Comparison value for string order.
	 */
	private static function sort_start_ascending( WP_HTML_Text_Replacement $a, WP_HTML_Text_Replacement $b ): int {
		$by_start = $a->start - $b->start;
		if ( 0 !== $by_start ) {
			return $by_start;
		}

		$by_text = isset( $a->text, $b->text ) ? strcmp( $a->text, $b->text ) : 0;
		if ( 0 !== $by_text ) {
			return $by_text;
		}

		/*
		 * This code should be unreachable, because it implies the two replacements
		 * start at the same location and contain the same text.
		 */
		return $a->length - $b->length;
	}

	/**
	 * Return the enqueued value for a given attribute, if one exists.
	 *
	 * Enqueued updates can take different data types:
	 *  - If an update is enqueued and is boolean, the return will be `true`
	 *  - If an update is otherwise enqueued, the return will be the string value of that update.
	 *  - If an attribute is enqueued to be removed, the return will be `null` to indicate that.
	 *  - If no updates are enqueued, the return will be `false` to differentiate from "removed."
	 *
	 * @since 6.2.0
	 *
	 * @param string $comparable_name The attribute name in its comparable form.
	 * @return string|boolean|null Value of enqueued update if present, otherwise false.
	 */
	private function get_enqueued_attribute_value( string $comparable_name ) {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return false;
		}

		if ( ! isset( $this->lexical_updates[ $comparable_name ] ) ) {
			return false;
		}

		$enqueued_text = $this->lexical_updates[ $comparable_name ]->text;

		// Removed attributes erase the entire span.
		if ( '' === $enqueued_text ) {
			return null;
		}

		/*
		 * Boolean attribute updates are just the attribute name without a corresponding value.
2703 * 2704 * This value might differ from the given comparable name in that there could be leading 2705 * or trailing whitespace, and that the casing follows the name given in `set_attribute`. 2706 * 2707 * Example: 2708 * 2709 * $p->set_attribute( 'data-TEST-id', 'update' ); 2710 * 'update' === $p->get_enqueued_attribute_value( 'data-test-id' ); 2711 * 2712 * Detect this difference based on the absence of the `=`, which _must_ exist in any 2713 * attribute containing a value, e.g. `<input type="text" enabled />`. 2714 * ¹ ² 2715 * 1. Attribute with a string value. 2716 * 2. Boolean attribute whose value is `true`. 2717 */ 2718 $equals_at = strpos( $enqueued_text, '=' ); 2719 if ( false === $equals_at ) { 2720 return true; 2721 } 2722 2723 /* 2724 * Finally, a normal update's value will appear after the `=` and 2725 * be double-quoted, as performed incidentally by `set_attribute`. 2726 * 2727 * e.g. `type="text"` 2728 * ¹² ³ 2729 * 1. Equals is here. 2730 * 2. Double-quoting starts one after the equals sign. 2731 * 3. Double-quoting ends at the last character in the update. 2732 */ 2733 $enqueued_value = substr( $enqueued_text, $equals_at + 2, -1 ); 2734 return WP_HTML_Decoder::decode_attribute( $enqueued_value ); 2735 } 2736 2737 /** 2738 * Returns the value of a requested attribute from a matched tag opener if that attribute exists. 2739 * 2740 * Example: 2741 * 2742 * $p = new WP_HTML_Tag_Processor( '<div enabled class="test" data-test-id="14">Test</div>' ); 2743 * $p->next_tag( array( 'class_name' => 'test' ) ) === true; 2744 * $p->get_attribute( 'data-test-id' ) === '14'; 2745 * $p->get_attribute( 'enabled' ) === true; 2746 * $p->get_attribute( 'aria-label' ) === null; 2747 * 2748 * $p->next_tag() === false; 2749 * $p->get_attribute( 'class' ) === null; 2750 * 2751 * @since 6.2.0 2752 * 2753 * @param string $name Name of attribute whose value is requested. 2754 * @return string|true|null Value of attribute or `null` if not available. 
	 *                          Boolean attributes return `true`.
	 */
	public function get_attribute( $name ) {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return null;
		}

		$comparable = strtolower( $name );

		/*
		 * For every attribute other than `class` it's possible to perform a quick check if
		 * there's an enqueued lexical update whose value takes priority over what's found in
		 * the input document.
		 *
		 * The `class` attribute is special though because of the exposed helpers `add_class`
		 * and `remove_class`. These form a builder for the `class` attribute, so an additional
		 * check for enqueued class changes is required in addition to the check for any enqueued
		 * attribute values. If any exist, those enqueued class changes must first be flushed out
		 * into an attribute value update.
		 */
		if ( 'class' === $name ) {
			$this->class_name_updates_to_attributes_updates();
		}

		// Return any enqueued attribute value updates if they exist.
		$enqueued_value = $this->get_enqueued_attribute_value( $comparable );
		if ( false !== $enqueued_value ) {
			return $enqueued_value;
		}

		if ( ! isset( $this->attributes[ $comparable ] ) ) {
			return null;
		}

		$attribute = $this->attributes[ $comparable ];

		/*
		 * This flag distinguishes an attribute with no value
		 * from an attribute with an empty string value. For
		 * unquoted attributes this could look very similar.
		 * It refers to whether an `=` follows the name.
		 *
		 * e.g. <div boolean-attribute empty-attribute=></div>
		 *           ¹                 ²
		 *        1. Attribute `boolean-attribute` is `true`.
		 *        2. Attribute `empty-attribute` is `""`.
		 */
		if ( true === $attribute->is_true ) {
			return true;
		}

		$raw_value = substr( $this->html, $attribute->value_starts_at, $attribute->value_length );

		return WP_HTML_Decoder::decode_attribute( $raw_value );
	}

	/**
	 * Gets lowercase names of all attributes matching a given prefix in the current tag.
	 *
	 * Note that matching is case-insensitive. This is in accordance with the spec:
	 *
	 * > There must never be two or more attributes on
	 * > the same start tag whose names are an ASCII
	 * > case-insensitive match for each other.
	 *     - HTML 5 spec
	 *
	 * Example:
	 *
	 *     $p = new WP_HTML_Tag_Processor( '<div data-ENABLED class="test" DATA-test-id="14">Test</div>' );
	 *     $p->next_tag( array( 'class_name' => 'test' ) ) === true;
	 *     $p->get_attribute_names_with_prefix( 'data-' ) === array( 'data-enabled', 'data-test-id' );
	 *
	 *     $p->next_tag() === false;
	 *     $p->get_attribute_names_with_prefix( 'data-' ) === null;
	 *
	 * @since 6.2.0
	 *
	 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
	 *
	 * @param string $prefix Prefix of requested attribute names.
	 * @return array|null List of attribute names, or `null` when no tag opener is matched.
	 */
	public function get_attribute_names_with_prefix( $prefix ): ?array {
		if (
			self::STATE_MATCHED_TAG !== $this->parser_state ||
			$this->is_closing_tag
		) {
			return null;
		}

		$comparable = strtolower( $prefix );

		$matches = array();
		foreach ( array_keys( $this->attributes ) as $attr_name ) {
			if ( str_starts_with( $attr_name, $comparable ) ) {
				$matches[] = $attr_name;
			}
		}
		return $matches;
	}

	/**
	 * Returns the namespace of the matched token.
	 *
	 * @since 6.7.0
	 *
	 * @return string One of 'html', 'math', or 'svg'.
	 */
	public function get_namespace(): string {
		return $this->parsing_namespace;
	}

	/**
	 * Returns the uppercase name of the matched tag.
	 *
	 * Example:
	 *
	 *     $p = new WP_HTML_Tag_Processor( '<div class="test">Test</div>' );
	 *     $p->next_tag() === true;
	 *     $p->get_tag() === 'DIV';
	 *
	 *     $p->next_tag() === false;
	 *     $p->get_tag() === null;
	 *
	 * @since 6.2.0
	 *
	 * @return string|null Name of currently matched tag in input HTML, or `null` if none found.
	 */
	public function get_tag(): ?string {
		if ( null === $this->tag_name_starts_at ) {
			return null;
		}

		$tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length );

		if ( self::STATE_MATCHED_TAG === $this->parser_state ) {
			return strtoupper( $tag_name );
		}

		if (
			self::STATE_COMMENT === $this->parser_state &&
			self::COMMENT_AS_PI_NODE_LOOKALIKE === $this->get_comment_type()
		) {
			return $tag_name;
		}

		return null;
	}

	/**
	 * Returns the adjusted tag name for a given token, taking into
	 * account the current parsing context, whether HTML, SVG, or MathML.
	 *
	 * @since 6.7.0
	 *
	 * @return string|null Name of current tag name.
	 */
	public function get_qualified_tag_name(): ?string {
		$tag_name = $this->get_tag();
		if ( null === $tag_name ) {
			return null;
		}

		if ( 'html' === $this->get_namespace() ) {
			return $tag_name;
		}

		$lower_tag_name = strtolower( $tag_name );
		if ( 'math' === $this->get_namespace() ) {
			return $lower_tag_name;
		}

		if ( 'svg' === $this->get_namespace() ) {
			switch ( $lower_tag_name ) {
				case 'altglyph':
					return 'altGlyph';

				case 'altglyphdef':
					return 'altGlyphDef';

				case 'altglyphitem':
					return 'altGlyphItem';

				case 'animatecolor':
					return 'animateColor';

				case 'animatemotion':
					return 'animateMotion';

				case 'animatetransform':
					return 'animateTransform';

				case 'clippath':
					return 'clipPath';

				case 'feblend':
					return 'feBlend';

				case 'fecolormatrix':
					return 'feColorMatrix';

				case 'fecomponenttransfer':
					return 'feComponentTransfer';

				case 'fecomposite':
					return 'feComposite';

				case 'feconvolvematrix':
					return 'feConvolveMatrix';

				case 'fediffuselighting':
					return 'feDiffuseLighting';

				case 'fedisplacementmap':
					return 'feDisplacementMap';

				case 'fedistantlight':
					return 'feDistantLight';

				case 'fedropshadow':
					return 'feDropShadow';

				case 'feflood':
					return 'feFlood';

				case 'fefunca':
					return 'feFuncA';

				case 'fefuncb':
					return 'feFuncB';

				case 'fefuncg':
					return 'feFuncG';

				case 'fefuncr':
					return 'feFuncR';

				case 'fegaussianblur':
					return 'feGaussianBlur';

				case 'feimage':
					return 'feImage';

				case 'femerge':
					return 'feMerge';

				case 'femergenode':
					return 'feMergeNode';

				case 'femorphology':
					return 'feMorphology';

				case 'feoffset':
					return 'feOffset';

				case 'fepointlight':
					return 'fePointLight';

				case 'fespecularlighting':
					return 'feSpecularLighting';

				case 'fespotlight':
					return 'feSpotLight';

				case 'fetile':
					return 'feTile';

				case 'feturbulence':
					return 'feTurbulence';

				case 'foreignobject':
					return 'foreignObject';

				case 'glyphref':
					return 'glyphRef';

				case 'lineargradient':
					return 'linearGradient';

				case 'radialgradient':
					return 'radialGradient';

				case 'textpath':
					return 'textPath';

				default:
					return $lower_tag_name;
			}
		}

		// This unnecessary return prevents tools from inaccurately reporting type errors.
		return $tag_name;
	}

	/**
	 * Returns the adjusted attribute name for a given attribute, taking into
	 * account the current parsing context, whether HTML, SVG, or MathML.
	 *
	 * @since 6.7.0
	 *
	 * @param string $attribute_name Which attribute to adjust.
	 *
	 * @return string|null
	 */
	public function get_qualified_attribute_name( $attribute_name ): ?string {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return null;
		}

		$namespace  = $this->get_namespace();
		$lower_name = strtolower( $attribute_name );

		if ( 'math' === $namespace && 'definitionurl' === $lower_name ) {
			return 'definitionURL';
		}

		if ( 'svg' === $this->get_namespace() ) {
			switch ( $lower_name ) {
				case 'attributename':
					return 'attributeName';

				case 'attributetype':
					return 'attributeType';

				case 'basefrequency':
					return 'baseFrequency';

				case 'baseprofile':
					return 'baseProfile';

				case 'calcmode':
					return 'calcMode';

				case 'clippathunits':
					return 'clipPathUnits';

				case 'diffuseconstant':
					return 'diffuseConstant';

				case 'edgemode':
					return 'edgeMode';

				case 'filterunits':
					return 'filterUnits';

				case 'glyphref':
					return 'glyphRef';

				case 'gradienttransform':
					return 'gradientTransform';

				case 'gradientunits':
					return 'gradientUnits';

				case 'kernelmatrix':
					return 'kernelMatrix';

				case 'kernelunitlength':
					return 'kernelUnitLength';

				case 'keypoints':
					return 'keyPoints';

				case 'keysplines':
					return 'keySplines';

				case 'keytimes':
					return 'keyTimes';

				case 'lengthadjust':
					return 'lengthAdjust';

				case 'limitingconeangle':
					return 'limitingConeAngle';

				case 'markerheight':
					return 'markerHeight';

				case 'markerunits':
					return 'markerUnits';

				case 'markerwidth':
					return 'markerWidth';

				case 'maskcontentunits':
					return 'maskContentUnits';

				case 'maskunits':
					return 'maskUnits';

				case 'numoctaves':
					return 'numOctaves';

				case 'pathlength':
					return 'pathLength';

				case 'patterncontentunits':
					return 'patternContentUnits';

				case 'patterntransform':
					return 'patternTransform';

				case 'patternunits':
					return 'patternUnits';

				case 'pointsatx':
					return 'pointsAtX';

				case 'pointsaty':
					return 'pointsAtY';

				case 'pointsatz':
					return 'pointsAtZ';

				case 'preservealpha':
					return 'preserveAlpha';

				case 'preserveaspectratio':
					return 'preserveAspectRatio';

				case 'primitiveunits':
					return 'primitiveUnits';

				case 'refx':
					return 'refX';

				case 'refy':
					return 'refY';

				case 'repeatcount':
					return 'repeatCount';

				case 'repeatdur':
					return 'repeatDur';

				case 'requiredextensions':
					return 'requiredExtensions';

				case 'requiredfeatures':
					return 'requiredFeatures';

				case 'specularconstant':
					return 'specularConstant';

				case 'specularexponent':
					return 'specularExponent';

				case 'spreadmethod':
					return 'spreadMethod';

				case 'startoffset':
					return 'startOffset';

				case 'stddeviation':
					return 'stdDeviation';

				case 'stitchtiles':
					return 'stitchTiles';

				case 'surfacescale':
					return 'surfaceScale';

				case 'systemlanguage':
					return 'systemLanguage';

				case 'tablevalues':
					return 'tableValues';

				case 'targetx':
					return 'targetX';

				case 'targety':
					return 'targetY';

				case 'textlength':
					return 'textLength';

				case 'viewbox':
					return 'viewBox';

				case 'viewtarget':
					return 'viewTarget';

				case 'xchannelselector':
					return 'xChannelSelector';

				case 'ychannelselector':
					return 'yChannelSelector';

				case 'zoomandpan':
					return 'zoomAndPan';
			}
		}

		if ( 'html' !== $namespace ) {
			switch ( $lower_name ) {
				case 'xlink:actuate':
					return 'xlink actuate';

				case 'xlink:arcrole':
					return 'xlink arcrole';

				case 'xlink:href':
					return 'xlink href';

				case 'xlink:role':
					return 'xlink role';

				case 'xlink:show':
					return 'xlink show';

				case 'xlink:title':
					return 'xlink title';

				case 'xlink:type':
					return 'xlink type';

				case 'xml:lang':
					return 'xml lang';

				case 'xml:space':
					return 'xml space';

				case 'xmlns':
					return 'xmlns';

				case 'xmlns:xlink':
					return 'xmlns xlink';
			}
		}

		return $attribute_name;
	}

	/**
	 * Indicates if the currently matched tag contains the self-closing flag.
	 *
	 * No HTML elements ought to have the self-closing flag and for those, the self-closing
	 * flag will be ignored. For void elements this is benign because they "self close"
	 * automatically. For non-void HTML elements though problems will appear if someone
	 * intends to use a self-closing element in place of that element with an empty body.
	 * For HTML foreign elements and custom elements the self-closing flag determines if
	 * they self-close or not.
	 *
	 * This function does not determine if a tag is self-closing,
	 * but only if the self-closing flag is present in the syntax.
	 *
	 * @since 6.3.0
	 *
	 * @return bool Whether the currently matched tag contains the self-closing flag.
	 */
	public function has_self_closing_flag(): bool {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return false;
		}

		/*
		 * The self-closing flag is the solidus at the _end_ of the tag, not the beginning.
		 *
		 * Example:
		 *
		 *     <figure />
		 *             ^ this appears one character before the end of the closing ">".
		 */
		return '/' === $this->html[ $this->token_starts_at + $this->token_length - 2 ];
	}

	/**
	 * Indicates if the current tag token is a tag closer.
	 *
	 * Example:
	 *
	 *     $p = new WP_HTML_Tag_Processor( '<div></div>' );
	 *     $p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) );
	 *     $p->is_tag_closer() === false;
	 *
	 *     $p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) );
	 *     $p->is_tag_closer() === true;
	 *
	 * @since 6.2.0
	 * @since 6.7.0 Reports all BR tags as opening tags.
	 *
	 * @return bool Whether the current tag is a tag closer.
	 */
	public function is_tag_closer(): bool {
		return (
			self::STATE_MATCHED_TAG === $this->parser_state &&
			$this->is_closing_tag &&

			/*
			 * The BR tag can only exist as an opening tag. If something like `</br>`
			 * appears then the HTML parser will treat it as an opening tag with no
			 * attributes. The BR tag is unique in this way.
			 *
			 * @see https://html.spec.whatwg.org/#parsing-main-inbody
			 */
			'BR' !== $this->get_tag()
		);
	}

	/**
	 * Indicates the kind of matched token, if any.
	 *
	 * This differs from `get_token_name()` in that it always
	 * returns a static string indicating the type, whereas
	 * `get_token_name()` may return values derived from the
	 * token itself, such as a tag name or processing
	 * instruction tag.
	 *
	 * Possible values:
	 *  - `#tag` when matched on a tag.
	 *  - `#text` when matched on a text node.
	 *  - `#cdata-section` when matched on a CDATA node.
	 *  - `#comment` when matched on a comment.
	 *  - `#doctype` when matched on a DOCTYPE declaration.
	 *  - `#presumptuous-tag` when matched on an empty tag closer.
	 *  - `#funky-comment` when matched on a funky comment.
	 *
	 * @since 6.5.0
	 *
	 * @return string|null What kind of token is matched, or null.
	 */
	public function get_token_type(): ?string {
		switch ( $this->parser_state ) {
			case self::STATE_MATCHED_TAG:
				return '#tag';

			case self::STATE_DOCTYPE:
				return '#doctype';

			default:
				return $this->get_token_name();
		}
	}

	/**
	 * Returns the node name represented by the token.
	 *
	 * This matches the DOM API value `nodeName`. Some values
	 * are static, such as `#text` for a text node, while others
	 * are dynamically generated from the token itself.
	 *
	 * Dynamic names:
	 *  - Uppercase tag name for tag matches.
	 *  - `html` for DOCTYPE declarations.
	 *
	 * Note that if the Tag Processor is not matched on a token
	 * then this function will return `null`, either because it
	 * hasn't yet found a token or because it reached the end
	 * of the document without matching a token.
	 *
	 * @since 6.5.0
	 *
	 * @return string|null Name of the matched token.
	 */
	public function get_token_name(): ?string {
		switch ( $this->parser_state ) {
			case self::STATE_MATCHED_TAG:
				return $this->get_tag();

			case self::STATE_TEXT_NODE:
				return '#text';

			case self::STATE_CDATA_NODE:
				return '#cdata-section';

			case self::STATE_COMMENT:
				return '#comment';

			case self::STATE_DOCTYPE:
				return 'html';

			case self::STATE_PRESUMPTUOUS_TAG:
				return '#presumptuous-tag';

			case self::STATE_FUNKY_COMMENT:
				return '#funky-comment';
		}

		return null;
	}

	/**
	 * Indicates what kind of comment produced the comment node.
	 *
	 * Because there are different kinds of HTML syntax which produce
	 * comments, the Tag Processor tracks and exposes this as a type
	 * for the comment. Nominally only regular HTML comments exist as
	 * they are commonly known, but a number of unrelated syntax errors
	 * also produce comments.
	 *
	 * @see self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT
	 * @see self::COMMENT_AS_CDATA_LOOKALIKE
	 * @see self::COMMENT_AS_INVALID_HTML
	 * @see self::COMMENT_AS_HTML_COMMENT
	 * @see self::COMMENT_AS_PI_NODE_LOOKALIKE
	 *
	 * @since 6.5.0
	 *
	 * @return string|null
	 */
	public function get_comment_type(): ?string {
		if ( self::STATE_COMMENT !== $this->parser_state ) {
			return null;
		}

		return $this->comment_type;
	}

	/**
	 * Returns the text of a matched comment or null if not on a comment type node.
	 *
	 * This method returns the entire text content of a comment node as it
	 * would appear in the browser.
	 *
	 * This differs from {@see ::get_modifiable_text()} in that certain comment
	 * types in the HTML API cannot allow their entire comment text content to
	 * be modified. Namely, "bogus comments" of the form `<?not allowed in html>`
	 * will create a comment whose text content starts with `?`. Note that if
	 * that character were modified, it would be possible to change the node
	 * type.
	 *
	 * @since 6.7.0
	 *
	 * @return string|null The comment text as it would appear in the browser or null
	 *                     if not on a comment type node.
	 */
	public function get_full_comment_text(): ?string {
		if ( self::STATE_FUNKY_COMMENT === $this->parser_state ) {
			return $this->get_modifiable_text();
		}

		if ( self::STATE_COMMENT !== $this->parser_state ) {
			return null;
		}

		switch ( $this->get_comment_type() ) {
			case self::COMMENT_AS_HTML_COMMENT:
			case self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT:
				return $this->get_modifiable_text();

			case self::COMMENT_AS_CDATA_LOOKALIKE:
				return "[CDATA[{$this->get_modifiable_text()}]]";

			case self::COMMENT_AS_PI_NODE_LOOKALIKE:
				return "?{$this->get_tag()}{$this->get_modifiable_text()}?";

			/*
			 * This represents "bogus comments state" from HTML tokenization.
			 * This can be entered by `<?` or `<!`, where `?` is included in
			 * the comment text but `!` is not.
			 */
			case self::COMMENT_AS_INVALID_HTML:
				$preceding_character = $this->html[ $this->text_starts_at - 1 ];
				$comment_start       = '?' === $preceding_character ? '?' : '';
				return "{$comment_start}{$this->get_modifiable_text()}";
		}

		return null;
	}

	/**
	 * Subdivides a matched text node, splitting NULL byte sequences and decoded whitespace as
	 * distinct nodes prefixes.
	 *
	 * Note that once anything that's neither a NULL byte nor decoded whitespace is
	 * encountered, then the remainder of the text node is left intact as generic text.
	 *
	 *  - The HTML Processor uses this to apply distinct rules for different kinds of text.
	 *  - Inter-element whitespace can be detected and skipped with this method.
	 *
	 * Text nodes aren't eagerly subdivided because there's no need to split them unless
	 * decisions are being made on NULL byte sequences or whitespace-only text.
	 *
	 * Example:
	 *
	 *     $processor = new WP_HTML_Tag_Processor( "\x00Apples & Oranges" );
	 *     true  === $processor->next_token();                   // Text is "Apples & Oranges".
	 *     true  === $processor->subdivide_text_appropriately(); // Text is "".
	 *     true  === $processor->next_token();                   // Text is "Apples & Oranges".
	 *     false === $processor->subdivide_text_appropriately();
	 *
	 *     $processor = new WP_HTML_Tag_Processor( " \r\n\tMore" );
	 *     true  === $processor->next_token();                   // Text is " ␉More".
	 *     true  === $processor->subdivide_text_appropriately(); // Text is " ␉".
	 *     true  === $processor->next_token();                   // Text is "More".
	 *     false === $processor->subdivide_text_appropriately();
	 *
	 * @since 6.7.0
	 *
	 * @return bool Whether the text node was subdivided.
	 */
	public function subdivide_text_appropriately(): bool {
		if ( self::STATE_TEXT_NODE !== $this->parser_state ) {
			return false;
		}

		$this->text_node_classification = self::TEXT_IS_GENERIC;

		/*
		 * NULL bytes are treated categorically different than numeric character
		 * references whose number is zero. `&#x00;` is not the same as `"\x00"`.
		 */
		$leading_nulls = strspn( $this->html, "\x00", $this->text_starts_at, $this->text_length );
		if ( $leading_nulls > 0 ) {
			$this->token_length             = $leading_nulls;
			$this->text_length              = $leading_nulls;
			$this->bytes_already_parsed     = $this->token_starts_at + $leading_nulls;
			$this->text_node_classification = self::TEXT_IS_NULL_SEQUENCE;
			return true;
		}

		/*
		 * Start a decoding loop to determine the point at which the
		 * text subdivides. This entails raw whitespace bytes and any
		 * character reference that decodes to the same.
		 */
		$at  = $this->text_starts_at;
		$end = $this->text_starts_at + $this->text_length;
		while ( $at < $end ) {
			$skipped = strspn( $this->html, " \t\f\r\n", $at, $end - $at );
			$at     += $skipped;

			if ( $at < $end && '&' === $this->html[ $at ] ) {
				$matched_byte_length = null;
				$replacement         = WP_HTML_Decoder::read_character_reference( 'data', $this->html, $at, $matched_byte_length );
				if ( isset( $replacement ) && 1 === strspn( $replacement, " \t\f\r\n" ) ) {
					$at += $matched_byte_length;
					continue;
				}
			}

			break;
		}

		if ( $at > $this->text_starts_at ) {
			$new_length                     = $at - $this->text_starts_at;
			$this->text_length              = $new_length;
			$this->token_length             = $new_length;
			$this->bytes_already_parsed     = $at;
			$this->text_node_classification = self::TEXT_IS_WHITESPACE;
			return true;
		}

		return false;
	}

	/**
	 * Returns the modifiable text for a matched token, or an empty string.
	 *
	 * Modifiable text is text content that may be read and changed without
	 * changing the HTML structure of the document around it. This includes
	 * the contents of `#text` nodes in the HTML as well as the inner
	 * contents of HTML comments, Processing Instructions, and others, even
	 * though these nodes aren't part of a parsed DOM tree. They also contain
	 * the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any
	 * other section in an HTML document which cannot contain HTML markup (DATA).
	 *
	 * If a token has no modifiable text then an empty string is returned to
	 * avoid needless crashing or type errors. An empty string does not mean
	 * that a token has modifiable text, and a token with modifiable text may
	 * have an empty string (e.g. a comment with no contents).
	 *
	 * Limitations:
	 *
	 *  - This function will not strip the leading newline appropriately
	 *    after seeking into a LISTING or PRE element. To ensure that the
	 *    newline is treated properly, seek to the LISTING or PRE opening
	 *    tag instead of to the first text node inside the element.
	 *
	 * @since 6.5.0
	 * @since 6.7.0 Replaces NULL bytes (U+0000) and newlines appropriately.
	 *
	 * @return string
	 */
	public function get_modifiable_text(): string {
		$has_enqueued_update = isset( $this->lexical_updates['modifiable text'] );

		if ( ! $has_enqueued_update && ( null === $this->text_starts_at || 0 === $this->text_length ) ) {
			return '';
		}

		$text = $has_enqueued_update
			? $this->lexical_updates['modifiable text']->text
			: substr( $this->html, $this->text_starts_at, $this->text_length );

		/*
		 * Pre-processing the input stream would normally happen before
		 * any parsing is done, but deferring it means it's possible to
		 * skip in most cases. When getting the modifiable text, however
		 * it's important to apply the pre-processing steps, which is
		 * normalizing newlines.
		 *
		 * @see https://html.spec.whatwg.org/#preprocessing-the-input-stream
		 * @see https://infra.spec.whatwg.org/#normalize-newlines
		 */
		$text = str_replace( "\r\n", "\n", $text );
		$text = str_replace( "\r", "\n", $text );

		// Comment data is not decoded.
		if (
			self::STATE_CDATA_NODE === $this->parser_state ||
			self::STATE_COMMENT === $this->parser_state ||
			self::STATE_DOCTYPE === $this->parser_state ||
			self::STATE_FUNKY_COMMENT === $this->parser_state
		) {
			return str_replace( "\x00", "\u{FFFD}", $text );
		}

		$tag_name = $this->get_token_name();
		if (
			// Script data is not decoded.
			'SCRIPT' === $tag_name ||

			// RAWTEXT data is not decoded.
			'IFRAME' === $tag_name ||
			'NOEMBED' === $tag_name ||
			'NOFRAMES' === $tag_name ||
			'STYLE' === $tag_name ||
			'XMP' === $tag_name
		) {
			return str_replace( "\x00", "\u{FFFD}", $text );
		}

		$decoded = WP_HTML_Decoder::decode_text_node( $text );

		/*
		 * Skip the first line feed after LISTING, PRE, and TEXTAREA opening tags.
		 *
		 * Note that this first newline may come in the form of a character
		 * reference, such as `&#x0a;`, and so it's important to perform
		 * this transformation only after decoding the raw text content.
		 */
		if (
			( "\n" === ( $decoded[0] ?? '' ) ) &&
			( ( $this->skip_newline_at === $this->token_starts_at && '#text' === $tag_name ) || 'TEXTAREA' === $tag_name )
		) {
			$decoded = substr( $decoded, 1 );
		}

		/*
		 * Only in normative text nodes does the NULL byte (U+0000) get removed.
		 * In all other contexts it's replaced by the replacement character (U+FFFD)
		 * for security reasons (to avoid joining together strings that were safe
		 * when separated, but not when joined).
		 *
		 * @todo Inside HTML integration points and MathML integration points, the
		 *       text is processed according to the insertion mode, not according
		 *       to the foreign content rules. This should strip the NULL bytes.
		 */
		return ( '#text' === $tag_name && 'html' === $this->get_namespace() )
			? str_replace( "\x00", '', $decoded )
			: str_replace( "\x00", "\u{FFFD}", $decoded );
	}

	/**
	 * Sets the modifiable text for the matched token, if matched.
	 *
	 * Modifiable text is text content that may be read and changed without
	 * changing the HTML structure of the document around it. This includes
	 * the contents of `#text` nodes in the HTML as well as the inner
	 * contents of HTML comments, Processing Instructions, and others, even
	 * though these nodes aren't part of a parsed DOM tree. They also contain
	 * the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any
	 * other section in an HTML document which cannot contain HTML markup (DATA).
	 *
	 * Not all modifiable text may be set by this method, and not all content
	 * may be set as modifiable text. In the case that this fails it will return
	 * `false` indicating as much.
	 * For instance, it will not allow inserting the
	 * string `</script` into a SCRIPT element, because the rules for escaping
	 * that safely are complicated. Similarly, it will not allow setting content
	 * into a comment which would prematurely terminate the comment.
	 *
	 * Example:
	 *
	 *     // Add a preface to all STYLE contents.
	 *     while ( $processor->next_tag( 'STYLE' ) ) {
	 *         $style = $processor->get_modifiable_text();
	 *         $processor->set_modifiable_text( "// Made with love on the World Wide Web\n{$style}" );
	 *     }
	 *
	 *     // Replace smiley text with Emoji smilies.
	 *     while ( $processor->next_token() ) {
	 *         if ( '#text' !== $processor->get_token_name() ) {
	 *             continue;
	 *         }
	 *
	 *         $chunk = $processor->get_modifiable_text();
	 *         if ( ! str_contains( $chunk, ':)' ) ) {
	 *             continue;
	 *         }
	 *
	 *         $processor->set_modifiable_text( str_replace( ':)', '🙂', $chunk ) );
	 *     }
	 *
	 * @since 6.7.0
	 *
	 * @param string $plaintext_content New text content to represent in the matched token.
	 *
	 * @return bool Whether the text was able to update.
	 */
	public function set_modifiable_text( string $plaintext_content ): bool {
		if ( self::STATE_TEXT_NODE === $this->parser_state ) {
			$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
				$this->text_starts_at,
				$this->text_length,
				htmlspecialchars( $plaintext_content, ENT_QUOTES | ENT_HTML5 )
			);

			return true;
		}

		// Comment data is not encoded.
		if (
			self::STATE_COMMENT === $this->parser_state &&
			self::COMMENT_AS_HTML_COMMENT === $this->comment_type
		) {
			// Check if the text could close the comment.
			if ( 1 === preg_match( '/--!?>/', $plaintext_content ) ) {
				return false;
			}

			$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
				$this->text_starts_at,
				$this->text_length,
				$plaintext_content
			);

			return true;
		}

		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return false;
		}

		switch ( $this->get_tag() ) {
			case 'SCRIPT':
				/**
				 * This is over-protective, but ensures the update doesn't break
				 * the HTML structure of the SCRIPT element.
				 *
				 * More thorough analysis could track the HTML tokenizer states
				 * and to ensure that the SCRIPT element closes at the expected
				 * SCRIPT close tag as is done in {@see ::skip_script_data()}.
				 *
				 * A SCRIPT element could be closed prematurely by contents
				 * like `</script>`. A SCRIPT element could be prevented from
				 * closing by contents like `<!--<script>`.
				 *
				 * The following strings are essential for dangerous content,
				 * although they are insufficient on their own. This trade-off
				 * prevents dangerous scripts from being sent to the browser.
				 * It is also unlikely to produce HTML that may confuse more
				 * basic HTML tooling.
				 */
				if (
					false !== stripos( $plaintext_content, '</script' ) ||
					false !== stripos( $plaintext_content, '<script' )
				) {
					return false;
				}

				$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
					$this->text_starts_at,
					$this->text_length,
					$plaintext_content
				);

				return true;

			case 'STYLE':
				$plaintext_content = preg_replace_callback(
					'~</(?P<TAG_NAME>style)~i',
					static function ( $tag_match ) {
						return "\\3c\\2f{$tag_match['TAG_NAME']}";
					},
					$plaintext_content
				);

				$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
					$this->text_starts_at,
					$this->text_length,
					$plaintext_content
				);

				return true;

			case 'TEXTAREA':
			case 'TITLE':
				$plaintext_content = preg_replace_callback(
					"~</(?P<TAG_NAME>{$this->get_tag()})~i",
					static function ( $tag_match ) {
						return "&lt;/{$tag_match['TAG_NAME']}";
					},
					$plaintext_content
				);

				/*
				 * These don't _need_ to be escaped, but since they are decoded it's
				 * safe to leave them escaped and this can prevent other code from
				 * naively detecting tags within the contents.
				 *
				 * @todo It would be useful to prefix a multiline replacement text
				 *       with a newline, but not necessary. This is for aesthetics.
				 */
				$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
					$this->text_starts_at,
					$this->text_length,
					$plaintext_content
				);

				return true;
		}

		return false;
	}

	/**
	 * Updates or creates a new attribute on the currently matched tag with the passed value.
	 *
	 * For boolean attributes special handling is provided:
	 *  - When `true` is passed as the value, then only the attribute name is added to the tag.
	 *  - When `false` is passed, the attribute gets removed if it existed before.
	 *
	 * For string attributes, the value is escaped using the `esc_attr` function
	 * (URL attributes are escaped using `esc_url` instead).
	 *
	 * @since 6.2.0
	 * @since 6.2.1 Fix: Only create a single update for multiple calls with case-variant attribute names.
	 *
	 * @param string      $name  The attribute name to target.
	 * @param string|bool $value The new attribute value.
	 * @return bool Whether an attribute value was set.
	 */
	public function set_attribute( $name, $value ): bool {
		if (
			self::STATE_MATCHED_TAG !== $this->parser_state ||
			$this->is_closing_tag
		) {
			return false;
		}

		/*
		 * WordPress rejects more characters than are strictly forbidden
		 * in HTML5. This is to prevent additional security risks deeper
		 * in the WordPress and plugin stack. Specifically the
		 * less-than (<), greater-than (>), and ampersand (&) aren't allowed.
		 *
		 * The use of a PCRE match enables looking for specific Unicode
		 * code points without writing a UTF-8 decoder. Whereas scanning
		 * for one-byte characters is trivial (with `strcspn`), scanning
		 * for the longer byte sequences would be more complicated. Given
		 * that this shouldn't be in the hot path for execution, it's a
		 * reasonable compromise in efficiency without introducing a
		 * noticeable impact on the overall system.
		 *
		 * @see https://html.spec.whatwg.org/#attributes-2
		 *
		 * @todo As the only regex pattern maybe we should take it out?
		 *       Are Unicode patterns available broadly in Core?
		 */
		if ( preg_match(
			'~[' .
				// Syntax-like characters.
				'"\'>&</ =' .
				// Control characters.
				'\x{00}-\x{1F}' .
				// HTML noncharacters.
				'\x{FDD0}-\x{FDEF}' .
				'\x{FFFE}\x{FFFF}\x{1FFFE}\x{1FFFF}\x{2FFFE}\x{2FFFF}\x{3FFFE}\x{3FFFF}' .
				'\x{4FFFE}\x{4FFFF}\x{5FFFE}\x{5FFFF}\x{6FFFE}\x{6FFFF}\x{7FFFE}\x{7FFFF}' .
				'\x{8FFFE}\x{8FFFF}\x{9FFFE}\x{9FFFF}\x{AFFFE}\x{AFFFF}\x{BFFFE}\x{BFFFF}' .
				'\x{CFFFE}\x{CFFFF}\x{DFFFE}\x{DFFFF}\x{EFFFE}\x{EFFFF}\x{FFFFE}\x{FFFFF}' .
				'\x{10FFFE}\x{10FFFF}' .
			']~Ssu',
			$name
		) ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Invalid attribute name.' ),
				'6.2.0'
			);

			return false;
		}

		/*
		 * > The values "true" and "false" are not allowed on boolean attributes.
		 * > To represent a false value, the attribute has to be omitted altogether.
		 *     - HTML5 spec, https://html.spec.whatwg.org/#boolean-attributes
		 */
		if ( false === $value ) {
			return $this->remove_attribute( $name );
		}

		if ( true === $value ) {
			$updated_attribute = $name;
		} else {
			$comparable_name = strtolower( $name );

			/*
			 * Escape URL attributes.
			 *
			 * @see https://html.spec.whatwg.org/#attributes-3
			 */
			$escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) ? esc_url( $value ) : esc_attr( $value );

			// If the escaping functions wiped out the update, reject it and indicate it was rejected.
			if ( '' === $escaped_new_value && '' !== $value ) {
				return false;
			}

			$updated_attribute = "{$name}=\"{$escaped_new_value}\"";
		}

		/*
		 * > There must never be two or more attributes on
		 * > the same start tag whose names are an ASCII
		 * > case-insensitive match for each other.
		 *     - HTML 5 spec
		 *
		 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
		 */
		$comparable_name = strtolower( $name );

		if ( isset( $this->attributes[ $comparable_name ] ) ) {
			/*
			 * Update an existing attribute.
			 *
			 * Example – set attribute id to "new" in <div id="initial_id" />:
			 *
			 *     <div id="initial_id"/>
			 *          ^-------------^
			 *          start         end
			 *     replacement: `id="new"`
			 *
			 *     Result: <div id="new"/>
			 */
			$existing_attribute                        = $this->attributes[ $comparable_name ];
			$this->lexical_updates[ $comparable_name ] = new WP_HTML_Text_Replacement(
				$existing_attribute->start,
				$existing_attribute->length,
				$updated_attribute
			);
		} else {
			/*
			 * Create a new attribute at the tag's name end.
			 *
			 * Example – add attribute id="new" to <div />:
			 *
			 *     <div/>
			 *         ^
			 *         start and end
			 *     replacement: ` id="new"`
			 *
			 *     Result: <div id="new"/>
			 */
			$this->lexical_updates[ $comparable_name ] = new WP_HTML_Text_Replacement(
				$this->tag_name_starts_at + $this->tag_name_length,
				0,
				' ' . $updated_attribute
			);
		}

		/*
		 * Any calls to update the `class` attribute directly should wipe out any
		 * enqueued class changes from `add_class` and `remove_class`.
		 */
		if ( 'class' === $comparable_name && ! empty( $this->classname_updates ) ) {
			$this->classname_updates = array();
		}

		return true;
	}

	/**
	 * Remove an attribute from the currently-matched tag.
	 *
	 * @since 6.2.0
	 *
	 * @param string $name The attribute name to remove.
	 * @return bool Whether an attribute was removed.
	 */
	public function remove_attribute( $name ): bool {
		if (
			self::STATE_MATCHED_TAG !== $this->parser_state ||
			$this->is_closing_tag
		) {
			return false;
		}

		/*
		 * > There must never be two or more attributes on
		 * > the same start tag whose names are an ASCII
		 * > case-insensitive match for each other.
		 *     - HTML 5 spec
		 *
		 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
		 */
		$name = strtolower( $name );

		/*
		 * Any calls to update the `class` attribute directly should wipe out any
		 * enqueued class changes from `add_class` and `remove_class`.
		 */
		if ( 'class' === $name && count( $this->classname_updates ) !== 0 ) {
			$this->classname_updates = array();
		}

		/*
		 * If updating an attribute that didn't exist in the input
		 * document, then remove the enqueued update and move on.
		 *
		 * For example, this might occur when calling `remove_attribute()`
		 * after calling `set_attribute()` for the same attribute
		 * and when that attribute wasn't originally present.
		 */
		if ( ! isset( $this->attributes[ $name ] ) ) {
			if ( isset( $this->lexical_updates[ $name ] ) ) {
				unset( $this->lexical_updates[ $name ] );
			}
			return false;
		}

		/*
		 * Removes an existing tag attribute.
		 *
		 * Example – remove the attribute id from <div id="initial_id"/>:
		 *
		 *     <div id="initial_id"/>
		 *          ^-------------^
		 *          start         end
		 *     replacement: ``
		 *
		 *     Result: <div />
		 */
		$this->lexical_updates[ $name ] = new WP_HTML_Text_Replacement(
			$this->attributes[ $name ]->start,
			$this->attributes[ $name ]->length,
			''
		);

		// Removes any duplicated attributes if they were also present.
		foreach ( $this->duplicate_attributes[ $name ] ?? array() as $attribute_token ) {
			$this->lexical_updates[] = new WP_HTML_Text_Replacement(
				$attribute_token->start,
				$attribute_token->length,
				''
			);
		}

		return true;
	}

	/**
	 * Adds a new class name to the currently matched tag.
	 *
	 * @since 6.2.0
	 *
	 * @param string $class_name The class name to add.
	 * @return bool Whether the class was set to be added.
	 */
	public function add_class( $class_name ): bool {
		if (
			self::STATE_MATCHED_TAG !== $this->parser_state ||
			$this->is_closing_tag
		) {
			return false;
		}

		if ( self::QUIRKS_MODE !== $this->compat_mode ) {
			$this->classname_updates[ $class_name ] = self::ADD_CLASS;
			return true;
		}

		/*
		 * Because class names are matched ASCII-case-insensitively in quirks mode,
		 * this needs to see if a case variant of the given class name is already
		 * enqueued and update that existing entry, if so. This picks the casing of
		 * the first-provided class name for all lexical variations.
		 */
		$class_name_length = strlen( $class_name );
		foreach ( $this->classname_updates as $updated_name => $action ) {
			if (
				strlen( $updated_name ) === $class_name_length &&
				0 === substr_compare( $updated_name, $class_name, 0, $class_name_length, true )
			) {
				$this->classname_updates[ $updated_name ] = self::ADD_CLASS;
				return true;
			}
		}

		$this->classname_updates[ $class_name ] = self::ADD_CLASS;
		return true;
	}

	/**
	 * Removes a class name from the currently matched tag.
	 *
	 * @since 6.2.0
	 *
	 * @param string $class_name The class name to remove.
	 * @return bool Whether the class was set to be removed.
	 */
	public function remove_class( $class_name ): bool {
		if (
			self::STATE_MATCHED_TAG !== $this->parser_state ||
			$this->is_closing_tag
		) {
			return false;
		}

		if ( self::QUIRKS_MODE !== $this->compat_mode ) {
			$this->classname_updates[ $class_name ] = self::REMOVE_CLASS;
			return true;
		}

		/*
		 * Because class names are matched ASCII-case-insensitively in quirks mode,
		 * this needs to see if a case variant of the given class name is already
		 * enqueued and update that existing entry, if so. This picks the casing of
		 * the first-provided class name for all lexical variations.
		 */
		$class_name_length = strlen( $class_name );
		foreach ( $this->classname_updates as $updated_name => $action ) {
			if (
				strlen( $updated_name ) === $class_name_length &&
				0 === substr_compare( $updated_name, $class_name, 0, $class_name_length, true )
			) {
				$this->classname_updates[ $updated_name ] = self::REMOVE_CLASS;
				return true;
			}
		}

		$this->classname_updates[ $class_name ] = self::REMOVE_CLASS;
		return true;
	}

	/**
	 * Returns the string representation of the HTML Tag Processor.
	 *
	 * @since 6.2.0
	 *
	 * @see WP_HTML_Tag_Processor::get_updated_html()
	 *
	 * @return string The processed HTML.
	 */
	public function __toString(): string {
		return $this->get_updated_html();
	}

	/**
	 * Returns the string representation of the HTML Tag Processor.
	 *
	 * @since 6.2.0
	 * @since 6.2.1 Shifts the internal cursor corresponding to the applied updates.
	 * @since 6.4.0 No longer calls subclass method `next_tag()` after updating HTML.
	 *
	 * @return string The processed HTML.
	 */
	public function get_updated_html(): string {
		$requires_no_updating = 0 === count( $this->classname_updates ) && 0 === count( $this->lexical_updates );

		/*
		 * When there is nothing more to update and nothing has already been
		 * updated, return the original document and avoid a string copy.
		 */
		if ( $requires_no_updating ) {
			return $this->html;
		}

		/*
		 * Keep track of the position right before the current tag. This will
		 * be necessary for reparsing the current tag after updating the HTML.
		 */
		$before_current_tag = $this->token_starts_at ?? 0;

		/*
		 * 1. Apply the enqueued edits and update all the pointers to reflect those changes.
		 */
		$this->class_name_updates_to_attributes_updates();
		$before_current_tag += $this->apply_attributes_updates( $before_current_tag );

		/*
		 * 2. Rewind to before the current tag and reparse to get updated attributes.
		 *
		 * At this point the internal cursor points to the end of the tag name.
		 * Rewind before the tag name starts so that it's as if the cursor didn't
		 * move; a call to `next_tag()` will reparse the recently-updated attributes
		 * and additional calls to modify the attributes will apply at this same
		 * location, but in order to avoid issues with subclasses that might add
		 * behaviors to `next_tag()`, the internal methods should be called here
		 * instead.
		 *
		 * It's important to note that in this specific place there will be no change
		 * because the processor was already at a tag when this was called and it's
		 * rewinding only to the beginning of this very tag before reprocessing it
		 * and its attributes.
		 *
		 *     <p>Previous HTML<em>More HTML</em></p>
		 *                     ↑  │ back up by the length of the tag name plus the opening <
		 *                     └←─┘ back up by strlen("em") + 1 ==> 3
		 */
		$this->bytes_already_parsed = $before_current_tag;
		$this->base_class_next_token();

		return $this->html;
	}

	/**
	 * Parses tag query input into internal search criteria.
	 *
	 * @since 6.2.0
	 *
	 * @param array|string|null $query {
	 *     Optional. Which tag name to find, having which class, etc. Default is to find any tag.
	 *
	 *     @type string|null $tag_name     Which tag to find, or `null` for "any tag."
	 *     @type int|null    $match_offset Find the Nth tag matching all search criteria.
	 *                                     1 for "first" tag, 3 for "third," etc.
	 *                                     Defaults to first tag.
	 *     @type string|null $class_name   Tag must contain this class name to match.
	 *     @type string      $tag_closers  "visit" or "skip": whether to stop on tag closers, e.g. </div>.
	 * }
	 */
	private function parse_query( $query ) {
		if ( null !== $query && $query === $this->last_query ) {
			return;
		}

		$this->last_query          = $query;
		$this->sought_tag_name     = null;
		$this->sought_class_name   = null;
		$this->sought_match_offset = 1;
		$this->stop_on_tag_closers = false;

		// A single string value means "find the tag of this name".
		if ( is_string( $query ) ) {
			$this->sought_tag_name = $query;
			return;
		}

		// An empty query parameter applies no restrictions on the search.
		if ( null === $query ) {
			return;
		}

		// If not using the string interface, an associative array is required.
		if ( ! is_array( $query ) ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'The query argument must be an array or a tag name.' ),
				'6.2.0'
			);
			return;
		}

		if ( isset( $query['tag_name'] ) && is_string( $query['tag_name'] ) ) {
			$this->sought_tag_name = $query['tag_name'];
		}

		if ( isset( $query['class_name'] ) && is_string( $query['class_name'] ) ) {
			$this->sought_class_name = $query['class_name'];
		}

		if ( isset( $query['match_offset'] ) && is_int( $query['match_offset'] ) && 0 < $query['match_offset'] ) {
			$this->sought_match_offset = $query['match_offset'];
		}

		if ( isset( $query['tag_closers'] ) ) {
			$this->stop_on_tag_closers = 'visit' === $query['tag_closers'];
		}
	}

	/**
	 * Checks whether a given tag and its attributes match the search criteria.
	 *
	 * @since 6.2.0
	 *
	 * @return bool Whether the given tag and its attribute match the search criteria.
	 */
	private function matches(): bool {
		if ( $this->is_closing_tag && ! $this->stop_on_tag_closers ) {
			return false;
		}

		// Does the tag name match the requested tag name in a case-insensitive manner?
		if (
			isset( $this->sought_tag_name ) &&
			(
				strlen( $this->sought_tag_name ) !== $this->tag_name_length ||
				0 !== substr_compare( $this->html, $this->sought_tag_name, $this->tag_name_starts_at, $this->tag_name_length, true )
			)
		) {
			return false;
		}

		if ( null !== $this->sought_class_name && ! $this->has_class( $this->sought_class_name ) ) {
			return false;
		}

		return true;
	}

	/**
	 * Gets DOCTYPE declaration info from a DOCTYPE token.
	 *
	 * DOCTYPE tokens may appear in many places in an HTML document. In most places, they are
	 * simply ignored. The main parsing functions find the basic shape of DOCTYPE tokens but
	 * do not perform detailed parsing.
	 *
	 * This method can be called to perform a full parse of the DOCTYPE token and retrieve
	 * its information.
	 *
	 * @return WP_HTML_Doctype_Info|null The DOCTYPE declaration information or `null` if not
	 *                                   currently at a DOCTYPE node.
	 */
	public function get_doctype_info(): ?WP_HTML_Doctype_Info {
		if ( self::STATE_DOCTYPE !== $this->parser_state ) {
			return null;
		}

		return WP_HTML_Doctype_Info::from_doctype_token( substr( $this->html, $this->token_starts_at, $this->token_length ) );
	}

	/**
	 * Parser Ready State.
	 *
	 * Indicates that the parser is ready to run and waiting for a state transition.
	 * It may not have started yet, or it may have just finished parsing a token and
	 * is ready to find the next one.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_READY = 'STATE_READY';

	/**
	 * Parser Complete State.
	 *
	 * Indicates that the parser has reached the end of the document and there is
	 * nothing left to scan. It finished parsing the last token completely.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_COMPLETE = 'STATE_COMPLETE';

	/**
	 * Parser Incomplete Input State.
	 *
	 * Indicates that the parser has reached the end of the document before finishing
	 * a token. It started parsing a token but there is a possibility that the input
	 * HTML document was truncated in the middle of a token.
	 *
	 * The parser is reset at the start of the incomplete token and has paused. There
	 * is nothing more that can be scanned unless provided a more complete document.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_INCOMPLETE_INPUT = 'STATE_INCOMPLETE_INPUT';

	/**
	 * Parser Matched Tag State.
	 *
	 * Indicates that the parser has found an HTML tag and it's possible to get
	 * the tag name and read or modify its attributes (if it's not a closing tag).
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_MATCHED_TAG = 'STATE_MATCHED_TAG';

	/**
	 * Parser Text Node State.
	 *
	 * Indicates that the parser has found a text node and it's possible
	 * to read and modify that text.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_TEXT_NODE = 'STATE_TEXT_NODE';

	/**
	 * Parser CDATA Node State.
	 *
	 * Indicates that the parser has found a CDATA node and it's possible
	 * to read and modify its modifiable text. Note that in HTML there are
	 * no CDATA nodes outside of foreign content (SVG and MathML). Outside
	 * of foreign content, they are treated as HTML comments.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_CDATA_NODE = 'STATE_CDATA_NODE';

	/**
	 * Indicates that the parser has found an HTML comment and it's
	 * possible to read and modify its modifiable text.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_COMMENT = 'STATE_COMMENT';

	/**
	 * Indicates that the parser has found a DOCTYPE node and it's
	 * possible to read its DOCTYPE information via `get_doctype_info()`.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_DOCTYPE = 'STATE_DOCTYPE';

	/**
	 * Indicates that the parser has found an empty tag closer `</>`.
	 *
	 * Note that in HTML there are no empty tag closers, and they
	 * are ignored. Nonetheless, the Tag Processor still
	 * recognizes them as they appear in the HTML stream.
	 *
	 * These were historically discussed as a "presumptuous tag
	 * closer," which would close the nearest open tag, but were
	 * dismissed in favor of explicitly-closing tags.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_PRESUMPTUOUS_TAG = 'STATE_PRESUMPTUOUS_TAG';

	/**
	 * Indicates that the parser has found a "funky comment"
	 * and it's possible to read and modify its modifiable text.
	 *
	 * Example:
	 *
	 *     </%url>
	 *     </{"wp-bit":"query/post-author"}>
	 *     </2>
	 *
	 * Funky comments are tag closers with invalid tag names. Note
	 * that in HTML these are turned into bogus comments. Nonetheless,
	 * the Tag Processor recognizes them in a stream of HTML and
	 * exposes them for inspection and modification.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_FUNKY_COMMENT = 'STATE_WP_FUNKY';

	/**
	 * Indicates that a comment was created when encountering an abruptly-closed HTML comment.
	 *
	 * Example:
	 *
	 *     <!-->
	 *     <!--->
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_ABRUPTLY_CLOSED_COMMENT = 'COMMENT_AS_ABRUPTLY_CLOSED_COMMENT';

	/**
	 * Indicates that a comment would be parsed as a CDATA node,
	 * were HTML to allow CDATA nodes outside of foreign content.
	 *
	 * Example:
	 *
	 *     <![CDATA[This is a CDATA node.]]>
	 *
	 * This is an HTML comment, but it looks like a CDATA node.
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_CDATA_LOOKALIKE = 'COMMENT_AS_CDATA_LOOKALIKE';

	/**
	 * Indicates that a comment was created when encountering
	 * normative HTML comment syntax.
	 *
	 * Example:
	 *
	 *     <!-- this is a comment -->
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_HTML_COMMENT = 'COMMENT_AS_HTML_COMMENT';

	/**
	 * Indicates that a comment would be parsed as a Processing
	 * Instruction node, were they to exist within HTML.
	 *
	 * Example:
	 *
	 *     <?wp __( 'Like' ) ?>
	 *
	 * This is an HTML comment, but it looks like a Processing Instruction node.
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_PI_NODE_LOOKALIKE = 'COMMENT_AS_PI_NODE_LOOKALIKE';

	/**
	 * Indicates that a comment was created when encountering invalid
	 * HTML input, a so-called "bogus comment."
	 *
	 * Example:
	 *
	 *     <?nothing special>
	 *     <!{nothing special}>
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_INVALID_HTML = 'COMMENT_AS_INVALID_HTML';

	/**
	 * No-quirks mode document compatibility mode.
	 *
	 * > In no-quirks mode, the behavior is (hopefully) the desired behavior
	 * > described by the modern HTML and CSS specifications.
	 *
	 * @see self::$compat_mode
	 * @see https://developer.mozilla.org/en-US/docs/Web/HTML/Quirks_Mode_and_Standards_Mode
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	const NO_QUIRKS_MODE = 'no-quirks-mode';

	/**
	 * Quirks mode document compatibility mode.
	 *
	 * > In quirks mode, layout emulates behavior in Navigator 4 and Internet
	 * > Explorer 5. This is essential in order to support websites that were
	 * > built before the widespread adoption of web standards.
	 *
	 * @see self::$compat_mode
	 * @see https://developer.mozilla.org/en-US/docs/Web/HTML/Quirks_Mode_and_Standards_Mode
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	const QUIRKS_MODE = 'quirks-mode';

	/**
	 * Indicates that a span of text may contain any combination of significant
	 * kinds of characters: NULL bytes, whitespace, and others.
	 *
	 * @see self::$text_node_classification
	 * @see self::subdivide_text_appropriately
	 *
	 * @since 6.7.0
	 */
	const TEXT_IS_GENERIC = 'TEXT_IS_GENERIC';

	/**
	 * Indicates that a span of text comprises a sequence only of NULL bytes.
	 *
	 * @see self::$text_node_classification
	 * @see self::subdivide_text_appropriately
	 *
	 * @since 6.7.0
	 */
	const TEXT_IS_NULL_SEQUENCE = 'TEXT_IS_NULL_SEQUENCE';

	/**
	 * Indicates that a span of decoded text comprises only whitespace.
	 *
	 * @see self::$text_node_classification
	 * @see self::subdivide_text_appropriately
	 *
	 * @since 6.7.0
	 */
	const TEXT_IS_WHITESPACE = 'TEXT_IS_WHITESPACE';
}
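The attribute and text setters in this class do not edit the document directly. They enqueue `WP_HTML_Text_Replacement( start, length, text )` entries in `lexical_updates`, which are applied later. The following is only a rough, self-contained sketch of how such byte-span replacements might be applied in one left-to-right pass. The function name `sketch_apply_replacements()` and the array shape are hypothetical and not part of WordPress; the real work is done by the private `apply_attributes_updates()` method, which is not shown in this section and may differ.

```php
<?php
/**
 * Hypothetical helper (not a WordPress API): applies queued
 * (start, length, text) replacements to a string in a single pass.
 * Assumes the spans do not overlap.
 *
 * @param string $html    Original document.
 * @param array  $updates List of arrays with 'start', 'length', and 'text' keys.
 * @return string Updated document.
 */
function sketch_apply_replacements( string $html, array $updates ): string {
	// Sort by start offset so the output can be built left to right.
	usort(
		$updates,
		static function ( array $a, array $b ): int {
			return $a['start'] <=> $b['start'];
		}
	);

	$output = '';
	$cursor = 0;
	foreach ( $updates as $update ) {
		// Copy the untouched bytes that precede this replacement.
		$output .= substr( $html, $cursor, $update['start'] - $cursor );
		$output .= $update['text'];
		$cursor  = $update['start'] + $update['length'];
	}

	// Copy whatever follows the last replacement.
	return $output . substr( $html, $cursor );
}

// Update an existing attribute: `id="initial_id"` starts at byte 5 and spans 15 bytes.
echo sketch_apply_replacements(
	'<div id="initial_id"/>',
	array( array( 'start' => 5, 'length' => 15, 'text' => 'id="new"' ) )
), "\n"; // prints <div id="new"/>

// Create a new attribute: a zero-length span at the end of the tag name (byte 4).
echo sketch_apply_replacements(
	'<div/>',
	array( array( 'start' => 4, 'length' => 0, 'text' => ' id="new"' ) )
), "\n"; // prints <div id="new"/>
```

Both calls mirror the two diagrams in `set_attribute()`: replacing the span of an existing attribute, and inserting with a zero-length span when the attribute is new.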
Generated : Thu Oct 2 08:20:03 2025 | Cross-referenced by PHPXref |