PHP Cross Reference of WordPress Trunk (Updated Daily)
<?php
/**
 * HTML API: WP_HTML_Tag_Processor class
 *
 * Scans through an HTML document to find specific tags, then
 * transforms those tags by adding, removing, or updating the
 * values of the HTML attributes within that tag (opener).
 *
 * Does not fully parse HTML or _recurse_ into the HTML structure.
 * Instead this scans linearly through a document and only parses
 * the HTML tag openers.
 *
 * ### Possible future direction for this module
 *
 *  - Prune the whitespace when removing classes/attributes: e.g. "a b c" -> "c" not " c".
 *    This would increase the size of the changes for some operations but leave more
 *    natural-looking output HTML.
 *
 * @package WordPress
 * @subpackage HTML-API
 * @since 6.2.0
 */

/**
 * Core class used to modify attributes in an HTML document for tags matching a query.
 *
 * ## Usage
 *
 * Use of this class requires three steps:
 *
 *  1. Create a new class instance with your input HTML document.
 *  2. Find the tag(s) you are looking for.
 *  3. Request changes to the attributes in those tag(s).
 *
 * Example:
 *
 *     $tags = new WP_HTML_Tag_Processor( $html );
 *     if ( $tags->next_tag( 'option' ) ) {
 *         $tags->set_attribute( 'selected', true );
 *     }
 *
 * ### Finding tags
 *
 * The `next_tag()` function moves the internal cursor through
 * your input HTML document until it finds a tag meeting all of
 * the supplied restrictions in the optional query argument. If
 * no argument is provided then it will find the next HTML tag,
 * regardless of what kind it is.
 *
 * If you want to _find whatever the next tag is_:
 *
 *     $tags->next_tag();
 *
 * | Goal                                                      | Query                                                                           |
 * |-----------------------------------------------------------|---------------------------------------------------------------------------------|
 * | Find any tag.                                             | `$tags->next_tag();`                                                            |
 * | Find next image tag.                                      | `$tags->next_tag( array( 'tag_name' => 'img' ) );`                              |
 * | Find next image tag (without passing the array).          | `$tags->next_tag( 'img' );`                                                     |
 * | Find next tag containing the `fullwidth` CSS class.       | `$tags->next_tag( array( 'class_name' => 'fullwidth' ) );`                      |
 * | Find next image tag containing the `fullwidth` CSS class. | `$tags->next_tag( array( 'tag_name' => 'img', 'class_name' => 'fullwidth' ) );` |
 *
 * If a tag was found meeting your criteria then `next_tag()`
 * will return `true` and you can proceed to modify it. If it
 * returns `false`, however, it failed to find the tag and
 * moved the cursor to the end of the file.
 *
 * Once the cursor reaches the end of the file the processor
 * is done and if you want to reach an earlier tag you will
 * need to recreate the processor and start over, as it's
 * unable to back up or move in reverse.
 *
 * See the section on bookmarks for an exception to this
 * no-backing-up rule.
 *
 * #### Custom queries
 *
 * Sometimes it's necessary to inspect an HTML tag further than
 * the query syntax here permits. In these cases one may further
 * inspect the search results using the read-only functions
 * provided by the processor or external state or variables.
 *
 * Example:
 *
 *     // Paint up to the first five DIV or SPAN tags marked with the "jazzy" style.
 *     $remaining_count = 5;
 *     while ( $remaining_count > 0 && $tags->next_tag() ) {
 *         if (
 *              ( 'DIV' === $tags->get_tag() || 'SPAN' === $tags->get_tag() ) &&
 *              'jazzy' === $tags->get_attribute( 'data-style' )
 *         ) {
 *             $tags->add_class( 'theme-style-everest-jazz' );
 *             $remaining_count--;
 *         }
 *     }
 *
 * `get_attribute()` will return `null` if the attribute wasn't present
 * on the tag when it was called. It may return `""` (the empty string)
 * in cases where the attribute was present but its value was empty.
 * For boolean attributes, those whose name is present but no value is
 * given, it will return `true` (the only way to set `false` for an
 * attribute is to remove it).
 *
 * #### When matching fails
 *
 * When `next_tag()` returns `false` it could mean different things:
 *
 *  - The requested tag wasn't found in the input document.
 *  - The input document ended in the middle of an HTML syntax element.
 *
 * When a document ends in the middle of a syntax element it will pause
 * the processor. This is to make it possible in the future to extend the
 * input document and proceed - an important requirement for chunked
 * streaming parsing of a document.
 *
 * Example:
 *
 *     $processor = new WP_HTML_Tag_Processor( 'This <div is="a" partial="token' );
 *     false === $processor->next_tag();
 *
 * If a special element (see next section) is encountered but no closing tag
 * is found it will count as an incomplete tag. The parser will pause as if
 * the opening tag were incomplete.
 *
 * Example:
 *
 *     $processor = new WP_HTML_Tag_Processor( '<style>// there could be more styling to come' );
 *     false === $processor->next_tag();
 *
 *     $processor = new WP_HTML_Tag_Processor( '<style>// this is everything</style><div>' );
 *     true === $processor->next_tag( 'DIV' );
 *
 * #### Special self-contained elements
 *
 * Some HTML elements are handled in a special way; their start and end tags
 * act like a void tag. These are special because their contents can't contain
 * HTML markup. Everything inside these elements is handled in a special way
 * and content that _appears_ like HTML tags inside of them isn't. There can
 * be no nesting in these elements.
 *
 * In the following list, "raw text" means that all of the content in the HTML
 * until the matching closing tag is treated verbatim without any replacements
 * and without any parsing.
 *
 *  - IFRAME allows no content but requires a closing tag.
 *  - NOEMBED (deprecated) content is raw text.
 *  - NOFRAMES (deprecated) content is raw text.
 *  - SCRIPT content is plaintext apart from legacy rules allowing `</script>` inside an HTML comment.
 *  - STYLE content is raw text.
 *  - TITLE content is plain text but character references are decoded.
 *  - TEXTAREA content is plain text but character references are decoded.
 *  - XMP (deprecated) content is raw text.
 *
 * ### Modifying HTML attributes for a found tag
 *
 * Once you've found the start of an opening tag you can modify
 * any number of the attributes on that tag. You can set a new
 * value for an attribute, remove the entire attribute, or do
 * nothing and move on to the next opening tag.
 *
 * Example:
 *
 *     if ( $tags->next_tag( array( 'class_name' => 'wp-group-block' ) ) ) {
 *         $tags->set_attribute( 'title', 'This groups the contained content.' );
 *         $tags->remove_attribute( 'data-test-id' );
 *     }
 *
 * If `set_attribute()` is called for an existing attribute it will
 * overwrite the existing value. Similarly, calling `remove_attribute()`
 * for a non-existing attribute has no effect on the document. Both
 * of these methods are safe to call without knowing if a given attribute
 * exists beforehand.
 *
 * ### Modifying CSS classes for a found tag
 *
 * The tag processor treats the `class` attribute as a special case.
 * Because it's a common operation to add or remove CSS classes, this
 * interface adds helper methods to make that easier.
 *
 * As with attribute values, adding or removing CSS classes is a safe
 * operation that doesn't require checking if the attribute or class
 * exists before making changes. If removing the only class then the
 * entire `class` attribute will be removed.
 *
 * Example:
 *
 *     // from `<span>Yippee!</span>`
 *     //   to `<span class="is-active">Yippee!</span>`
 *     $tags->add_class( 'is-active' );
 *
 *     // from `<span class="excited">Yippee!</span>`
 *     //   to `<span class="excited is-active">Yippee!</span>`
 *     $tags->add_class( 'is-active' );
 *
 *     // from `<span class="is-active heavy-accent">Yippee!</span>`
 *     //   to `<span class="is-active heavy-accent">Yippee!</span>`
 *     $tags->add_class( 'is-active' );
 *
 *     // from `<input type="text" class="is-active rugby not-disabled" length="24">`
 *     //   to `<input type="text" class="is-active not-disabled" length="24">`
 *     $tags->remove_class( 'rugby' );
 *
 *     // from `<input type="text" class="rugby" length="24">`
 *     //   to `<input type="text" length="24">`
 *     $tags->remove_class( 'rugby' );
 *
 *     // from `<input type="text" length="24">`
 *     //   to `<input type="text" length="24">`
 *     $tags->remove_class( 'rugby' );
 *
 * When class changes are enqueued but a direct change to `class` is made via
 * `set_attribute` then the changes made via `set_attribute` (or `remove_attribute`)
 * will take precedence over those made through `add_class` and `remove_class`.
 *
 * ### Bookmarks
 *
 * While scanning through the input HTML document it's possible to set
 * a named bookmark when a particular tag is found. Later on, after
 * continuing to scan other tags, it's possible to `seek` to one of
 * the set bookmarks and then proceed again from that point forward.
 *
 * Because bookmarks create processing overhead one should avoid
 * creating too many of them. As a rule, create only bookmarks
 * of known string literal names; avoid creating "mark_{$index}"
 * and so on. It's fine from a performance standpoint to create a
 * bookmark and update it frequently, such as within a loop.
 *
 *     $total_todos = 0;
 *     while ( $p->next_tag( array( 'tag_name' => 'UL', 'class_name' => 'todo' ) ) ) {
 *         $p->set_bookmark( 'list-start' );
 *         while ( $p->next_tag( array( 'tag_closers' => 'visit' ) ) ) {
 *             if ( 'UL' === $p->get_tag() && $p->is_tag_closer() ) {
 *                 $p->set_bookmark( 'list-end' );
 *                 $p->seek( 'list-start' );
 *                 $p->set_attribute( 'data-contained-todos', (string) $total_todos );
 *                 $total_todos = 0;
 *                 $p->seek( 'list-end' );
 *                 break;
 *             }
 *
 *             if ( 'LI' === $p->get_tag() && ! $p->is_tag_closer() ) {
 *                 $total_todos++;
 *             }
 *         }
 *     }
 *
 * ## Tokens and finer-grained processing.
 *
 * It's possible to scan through every lexical token in the
 * HTML document using the `next_token()` function. This
 * alternative form takes no argument and provides no built-in
 * query syntax.
 *
 * Example:
 *
 *     $title = '(untitled)';
 *     $text  = '';
 *     while ( $processor->next_token() ) {
 *         switch ( $processor->get_token_name() ) {
 *             case '#text':
 *                 $text .= $processor->get_modifiable_text();
 *                 break;
 *
 *             case 'BR':
 *                 $text .= "\n";
 *                 break;
 *
 *             case 'TITLE':
 *                 $title = $processor->get_modifiable_text();
 *                 break;
 *         }
 *     }
 *     return trim( "# {$title}\n\n{$text}" );
 *
 * ### Tokens and _modifiable text_.
 *
 * #### Special "atomic" HTML elements.
 *
 * Not all HTML elements are able to contain other elements inside of them.
 * For instance, the contents inside a TITLE element are plaintext (except
 * that character references like `&amp;` will be decoded). This means that
 * if the string `<img>` appears inside a TITLE element, then it's not an
 * image tag, but rather it's text describing an image tag.
 * Likewise, the contents of a SCRIPT or STYLE element are handled entirely
 * separately in a browser from the contents of other elements because they
 * represent a different language than HTML.
 *
 * For these elements the Tag Processor treats the entire sequence as one,
 * from the opening tag, including its contents, through its closing tag.
 * This means that it's not possible to match the closing tag for a
 * SCRIPT element unless it's unexpected; the Tag Processor already matched
 * it when it found the opening tag.
 *
 * The inner contents of these elements are that element's _modifiable text_.
 *
 * The special elements are:
 *  - `SCRIPT` whose contents are treated as raw plaintext but supports a legacy
 *    style of including JavaScript inside of HTML comments to avoid accidentally
 *    closing the SCRIPT from inside a JavaScript string. E.g. `console.log( '</script>' )`.
 *  - `TITLE` and `TEXTAREA` whose contents are treated as plaintext and then any
 *    character references are decoded. E.g. `1 &lt; 2 &lt; 3` becomes `1 < 2 < 3`.
 *  - `IFRAME`, `NOEMBED`, `NOFRAMES`, `STYLE` whose contents are treated as
 *    raw plaintext and left as-is. E.g. `1 &lt; 2 &lt; 3` remains `1 &lt; 2 &lt; 3`.
 *
 * #### Other tokens with modifiable text.
 *
 * There are also non-elements which are void/self-closing in nature and contain
 * modifiable text that is part of that individual syntax token itself.
 *
 *  - `#text` nodes, whose entire token _is_ the modifiable text.
 *  - HTML comments and tokens that become comments due to some syntax error. The
 *    text for these tokens is the portion of the comment inside of the syntax.
 *    E.g. for `<!-- comment -->` the text is `" comment "` (note the spaces are included).
 *  - `CDATA` sections, whose text is the content inside of the section itself. E.g. for
 *    `<![CDATA[some content]]>` the text is `"some content"` (with restrictions [1]).
 *  - "Funky comments," which are a special case of invalid closing tags whose name is
 *    invalid. The text for these nodes is the text that a browser would transform into
 *    an HTML comment when parsing. E.g. for `</%post_author>` the text is `%post_author`.
 *  - `DOCTYPE` declarations like `<!DOCTYPE html>` which have no closing tag.
 *  - XML Processing instruction nodes like `<?wp __( "Like" ); ?>` (with restrictions [2]).
 *  - The empty end tag `</>` which is ignored in the browser and DOM.
 *
 * [1]: There are no CDATA sections in HTML. When encountering `<![CDATA[`, everything
 *      until the next `>` becomes a bogus HTML comment, meaning there can be no CDATA
 *      section in an HTML document containing `>`. The Tag Processor will first find
 *      all valid and bogus HTML comments, and then if the comment _would_ have been a
 *      CDATA section _were they to exist_, it will indicate this as the type of comment.
 *
 * [2]: XML allows a broader range of characters in a processing instruction's target name
 *      and disallows "xml" as a name, since it's special. The Tag Processor only recognizes
 *      target names with an ASCII-representable subset of characters. It also exhibits the
 *      same constraint as with CDATA sections, in that `>` cannot exist within the token
 *      since Processing Instructions do not exist within HTML and their syntax transforms
 *      into a bogus comment in the DOM.
 *
 * ## Design and limitations
 *
 * The Tag Processor is designed to linearly scan HTML documents and tokenize
 * HTML tags and their attributes. It's designed to do this as efficiently as
 * possible without compromising parsing integrity.
 * Therefore it will be slower than some methods of modifying HTML, such as
 * those incorporating over-simplified PCRE patterns, but will not introduce
 * the defects and failures that those methods bring in, which lead to broken
 * page renders and often to security vulnerabilities. On the other hand, it
 * will be faster than full-blown HTML parsers such as DOMDocument and use
 * considerably less memory. It requires a negligible memory overhead, enough
 * to consider it a zero-overhead system.
 *
 * The performance characteristics are maintained by avoiding tree construction
 * and semantic cleanups which are specified in HTML5. Because of this, for
 * example, it's not possible for the Tag Processor to associate any given
 * opening tag with its corresponding closing tag, or to return the inner markup
 * inside an element. Systems may be built on top of the Tag Processor to do
 * this, but the Tag Processor is and should be constrained so it can remain an
 * efficient, low-level, and reliable HTML scanner.
 *
 * The Tag Processor's design incorporates a "garbage-in-garbage-out" philosophy.
 * HTML5 specifies that certain invalid content be transformed into different forms
 * for display, such as removing null bytes from an input document and replacing
 * invalid characters with the Unicode replacement character `U+FFFD` (visually "�").
 * Where errors or transformations exist within the HTML5 specification, the Tag Processor
 * leaves those invalid inputs untouched, passing them through to the final browser
 * to handle. While this implies that certain operations will be non-spec-compliant,
 * such as reading the value of an attribute with invalid content, it also preserves a
 * simplicity and efficiency for handling those error cases.
 *
 * Most operations within the Tag Processor are designed to minimize the difference
 * between an input and output document for any given change. For example, the
 * `add_class` and `remove_class` methods preserve whitespace and the class ordering
 * within the `class` attribute; and when encountering tags with duplicated attributes,
 * the Tag Processor will leave those invalid duplicate attributes where they are but
 * update the proper attribute which the browser will read for parsing its value. An
 * exception to this rule is that all attribute updates store their values as
 * double-quoted strings, meaning that attributes on input with single-quoted or
 * unquoted values will appear in the output with double-quotes.
 *
 * ### Scripting Flag
 *
 * The Tag Processor parses HTML with the "scripting flag" disabled. This means
 * that it doesn't run any scripts while parsing the page. In a browser with
 * JavaScript enabled, for example, the script can change the parse of the
 * document as it loads. On the server, however, evaluating JavaScript is not
 * only impractical, but also unwanted.
 *
 * Practically this means that the Tag Processor will descend into NOSCRIPT
 * elements and process its child tags. Were the scripting flag enabled, such
 * as in a typical browser, the contents of NOSCRIPT are skipped entirely.
 *
 * This allows the HTML API to process the content that will be presented in
 * a browser when scripting is disabled, but it offers a different view of a
 * page than most browser sessions will experience. E.g. the tags inside the
 * NOSCRIPT disappear.
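 *
 * Example (an illustrative sketch rather than part of the original documentation;
 * the markup and the `fallback.png` file name are assumptions):
 *
 *     // NOSCRIPT is not one of the special elements, so the IMG inside it is visited.
 *     $processor = new WP_HTML_Tag_Processor( '<noscript><img src="fallback.png"></noscript>' );
 *     true === $processor->next_tag( 'IMG' );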
 *
 * ### Text Encoding
 *
 * The Tag Processor assumes that the input HTML document is encoded with a
 * text encoding compatible with 7-bit ASCII's '<', '>', '&', ';', '/', '=',
 * "'", '"', 'a' - 'z', 'A' - 'Z', and the whitespace characters ' ', tab,
 * carriage-return, newline, and form-feed.
 *
 * In practice, this includes almost every single-byte encoding as well as
 * UTF-8. Notably, however, it does not include UTF-16. If providing input
 * that's incompatible, then convert the encoding beforehand.
 *
 * @since 6.2.0
 * @since 6.2.1 Fix: Support for various invalid comments; attribute updates are case-insensitive.
 * @since 6.3.2 Fix: Skip HTML-like content inside rawtext elements such as STYLE.
 * @since 6.5.0 Pauses processor when input ends in an incomplete syntax token.
 *              Introduces "special" elements which act like void elements, e.g. TITLE, STYLE.
 *              Allows scanning through all tokens and processing modifiable text, where applicable.
 */
class WP_HTML_Tag_Processor {
	/**
	 * The maximum number of bookmarks allowed to exist at
	 * any given time.
	 *
	 * @since 6.2.0
	 * @var int
	 *
	 * @see WP_HTML_Tag_Processor::set_bookmark()
	 */
	const MAX_BOOKMARKS = 10;

	/**
	 * Maximum number of times seek() can be called.
	 * Prevents accidental infinite loops.
	 *
	 * @since 6.2.0
	 * @var int
	 *
	 * @see WP_HTML_Tag_Processor::seek()
	 */
	const MAX_SEEK_OPS = 1000;

	/**
	 * The HTML document to parse.
	 *
	 * @since 6.2.0
	 * @var string
	 */
	protected $html;

	/**
	 * The last query passed to next_tag().
	 *
	 * @since 6.2.0
	 * @var array|null
	 */
	private $last_query;

	/**
	 * The tag name this processor currently scans for.
	 *
	 * @since 6.2.0
	 * @var string|null
	 */
	private $sought_tag_name;

	/**
	 * The CSS class name this processor currently scans for.
	 *
	 * @since 6.2.0
	 * @var string|null
	 */
	private $sought_class_name;

	/**
	 * The match offset this processor currently scans for.
	 *
	 * @since 6.2.0
	 * @var int|null
	 */
	private $sought_match_offset;

	/**
	 * Whether to visit tag closers, e.g. </div>, when walking an input document.
	 *
	 * @since 6.2.0
	 * @var bool
	 */
	private $stop_on_tag_closers;

	/**
	 * Specifies mode of operation of the parser at any given time.
	 *
	 * | State           | Meaning                                                              |
	 * | ----------------|----------------------------------------------------------------------|
	 * | *Ready*         | The parser is ready to run.                                          |
	 * | *Complete*      | There is nothing left to parse.                                      |
	 * | *Incomplete*    | The HTML ended in the middle of a token; nothing more can be parsed. |
	 * | *Matched tag*   | Found an HTML tag; it's possible to modify its attributes.           |
	 * | *Text node*     | Found a #text node; this is plaintext and modifiable.                |
	 * | *CDATA node*    | Found a CDATA section; this is modifiable.                           |
	 * | *Comment*       | Found a comment or bogus comment; this is modifiable.                |
	 * | *Presumptuous*  | Found an empty tag closer: `</>`.                                    |
	 * | *Funky comment* | Found a tag closer with an invalid tag name; this is modifiable.     |
	 *
	 * @since 6.5.0
	 *
	 * @see WP_HTML_Tag_Processor::STATE_READY
	 * @see WP_HTML_Tag_Processor::STATE_COMPLETE
	 * @see WP_HTML_Tag_Processor::STATE_INCOMPLETE_INPUT
	 * @see WP_HTML_Tag_Processor::STATE_MATCHED_TAG
	 * @see WP_HTML_Tag_Processor::STATE_TEXT_NODE
	 * @see WP_HTML_Tag_Processor::STATE_CDATA_NODE
	 * @see WP_HTML_Tag_Processor::STATE_COMMENT
	 * @see WP_HTML_Tag_Processor::STATE_DOCTYPE
	 * @see WP_HTML_Tag_Processor::STATE_PRESUMPTUOUS_TAG
	 * @see WP_HTML_Tag_Processor::STATE_FUNKY_COMMENT
	 *
	 * @var string
	 */
	protected $parser_state = self::STATE_READY;

	/**
	 * Indicates if the document is in quirks mode or no-quirks mode.
	 *
	 * Impact on HTML parsing:
	 *
	 *  - In `NO_QUIRKS_MODE` (also known as "standard mode"):
	 *      - CSS class and ID selectors match byte-for-byte (case-sensitively).
	 *      - A TABLE start tag `<table>` implicitly closes any open `P` element.
	 *
	 *  - In `QUIRKS_MODE`:
	 *      - CSS class and ID selectors match in an ASCII case-insensitive manner.
	 *      - A TABLE start tag `<table>` opens a `TABLE` element as a child of a `P`
	 *        element if one is open.
	 *
	 * Quirks and no-quirks mode are thus mostly about styling, but have an impact when
	 * tables are found inside paragraph elements.
	 *
	 * @see self::QUIRKS_MODE
	 * @see self::NO_QUIRKS_MODE
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	protected $compat_mode = self::NO_QUIRKS_MODE;

	/**
	 * Indicates whether the parser is inside foreign content,
	 * e.g. inside an SVG or MathML element.
	 *
	 * One of 'html', 'svg', or 'math'.
	 *
	 * Several parsing rules change based on whether the parser
	 * is inside foreign content, including whether CDATA sections
	 * are allowed and whether a self-closing flag indicates that
	 * an element has no content.
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	private $parsing_namespace = 'html';

	/**
	 * What kind of syntax token became an HTML comment.
	 *
	 * Since there are many ways in which HTML syntax can create an HTML comment,
	 * this indicates which of those caused it. This allows the Tag Processor to
	 * represent more from the original input document than would appear in the DOM.
	 *
	 * @since 6.5.0
	 *
	 * @var string|null
	 */
	protected $comment_type = null;

	/**
	 * What kind of text the matched text node represents, if it was subdivided.
	 *
	 * @see self::TEXT_IS_NULL_SEQUENCE
	 * @see self::TEXT_IS_WHITESPACE
	 * @see self::TEXT_IS_GENERIC
	 * @see self::subdivide_text_appropriately
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	protected $text_node_classification = self::TEXT_IS_GENERIC;

	/**
	 * How many bytes from the original HTML document have been read and parsed.
	 *
	 * This value points to the latest byte offset in the input document which
	 * has been already parsed. It is the internal cursor for the Tag Processor
	 * and updates while scanning through the HTML tokens.
	 *
	 * @since 6.2.0
	 * @var int
	 */
	private $bytes_already_parsed = 0;

	/**
	 * Byte offset in input document where current token starts.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     01234
	 *     - token starts at 0
	 *
	 * @since 6.5.0
	 *
	 * @var int|null
	 */
	private $token_starts_at;

	/**
	 * Byte length of current token.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     0123456789012345
	 *     - token length is 15 - 0 = 15
	 *
	 *     a <!-- comment --> is a token.
	 *     0123456789 123456789 123456789
	 *       - token length is 18 - 2 = 16
	 *
	 * @since 6.5.0
	 *
	 * @var int|null
	 */
	private $token_length;

	/**
	 * Whether the current tag token has the self-closing flag.
	 *
	 * @since 7.1.0
	 *
	 * @var bool
	 */
	private $has_self_closing_flag = false;

	/**
	 * Byte offset in input document where current tag name starts.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     01234
	 *      - tag name starts at 1
	 *
	 * @since 6.2.0
	 *
	 * @var int|null
	 */
	private $tag_name_starts_at;

	/**
	 * Byte length of current tag name.
	 *
	 * Example:
	 *
	 *     <div id="test">...
	 *     01234
	 *      --- tag name length is 3
	 *
	 * @since 6.2.0
	 *
	 * @var int|null
	 */
	private $tag_name_length;

	/**
	 * Byte offset into input document where current modifiable text starts.
	 *
	 * @since 6.5.0
	 *
	 * @var int
	 */
	private $text_starts_at;

	/**
	 * Byte length of modifiable text.
	 *
	 * @since 6.5.0
	 *
	 * @var int
	 */
	private $text_length;

	/**
	 * Whether the current tag is an opening tag, e.g. <div>, or a closing tag, e.g. </div>.
	 *
	 * @var bool
	 */
	private $is_closing_tag;

	/**
	 * Lazily-built index of attributes found within an HTML tag, keyed by the attribute name.
	 *
	 * Example:
	 *
	 *     // Supposing the parser is working through this content
	 *     // and stops after recognizing the `id` attribute.
	 *     // <div id="test-4" class=outline title="data:text/plain;base64=asdk3nk1j3fo8">
	 *     //                 ^ parsing will continue from this point.
	 *     $this->attributes = array(
	 *         'id' => new WP_HTML_Attribute_Token( 'id', 9, 6, 5, 11, false )
	 *     );
	 *
	 *     // When picking up parsing again, or when asking to find the
	 *     // `class` attribute we will continue and add to this array.
	 *     $this->attributes = array(
	 *         'id'    => new WP_HTML_Attribute_Token( 'id', 9, 6, 5, 11, false ),
	 *         'class' => new WP_HTML_Attribute_Token( 'class', 23, 7, 17, 13, false )
	 *     );
	 *
	 *     // Note that only the `class` attribute value is stored in the index.
	 *     // That's because it is the only value used by this class at the moment.
	 *
	 * @since 6.2.0
	 * @var WP_HTML_Attribute_Token[]
	 */
	private $attributes = array();

	/**
	 * Tracks spans of duplicate attributes on a given tag, used for removing
	 * all copies of an attribute when calling `remove_attribute()`.
	 *
	 * @since 6.3.2
	 *
	 * @var (WP_HTML_Span[])[]|null
	 */
	private $duplicate_attributes = null;

	/**
	 * Which class names to add or remove from a tag.
	 *
	 * These are tracked separately from attribute updates because they are
	 * semantically distinct, whereas this interface exists for the common
	 * case of adding and removing class names while other attributes are
	 * generally modified as with DOM `setAttribute` calls.
	 *
	 * When modifying an HTML document these will eventually be collapsed
	 * into a single `set_attribute( 'class', $changes )` call.
	 *
	 * Example:
	 *
	 *     // Add the `wp-block-group` class, remove the `wp-group` class.
	 *     $classname_updates = array(
	 *         // Indexed by a comparable class name.
	 *         'wp-block-group' => WP_HTML_Tag_Processor::ADD_CLASS,
	 *         'wp-group'       => WP_HTML_Tag_Processor::REMOVE_CLASS
	 *     );
	 *
	 * @since 6.2.0
	 * @var bool[]
	 */
	private $classname_updates = array();

	/**
	 * Tracks a semantic location in the original HTML which
	 * shifts with updates as they are applied to the document.
	 *
	 * @since 6.2.0
	 * @var WP_HTML_Span[]
	 */
	protected $bookmarks = array();

	const ADD_CLASS    = true;
	const REMOVE_CLASS = false;
	const SKIP_CLASS   = null;

	/**
	 * Lexical replacements to apply to input HTML document.
	 *
	 * "Lexical" in this class refers to the part of this class which
	 * operates on pure text _as text_ and not as HTML. There's a line
	 * between the public interface, with HTML-semantic methods like
	 * `set_attribute` and `add_class`, and an internal state that tracks
	 * text offsets in the input document.
	 *
	 * When higher-level HTML methods are called, those have to transform their
	 * operations (such as setting an attribute's value) into text diffing
	 * operations (such as replacing the sub-string from indices A to B with
	 * some given new string).
	 * These text-diffing operations are the lexical updates.
	 *
	 * As new higher-level methods are added they need to collapse their
	 * operations into these lower-level lexical updates since that's the
	 * Tag Processor's internal language of change. Any code which creates
	 * these lexical updates must ensure that they do not cross HTML syntax
	 * boundaries, however, so these should never be exposed outside of this
	 * class or any classes which intentionally expand its functionality.
	 *
	 * These are enqueued while editing the document instead of being immediately
	 * applied to avoid processing overhead, string allocations, and string
	 * copies when applying many updates to a single document.
	 *
	 * Example:
	 *
	 *     // Replace an attribute stored with a new value, indices
	 *     // sourced from the lazily-parsed HTML recognizer.
	 *     $start  = $attributes['src']->start;
	 *     $length = $attributes['src']->length;
	 *     $modifications[] = new WP_HTML_Text_Replacement( $start, $length, $new_value );
	 *
	 *     // Correspondingly, something like this will appear in this array.
	 *     $lexical_updates = array(
	 *         WP_HTML_Text_Replacement( 14, 28, 'https://my-site.my-domain/wp-content/uploads/2014/08/kittens.jpg' )
	 *     );
	 *
	 * @since 6.2.0
	 * @var WP_HTML_Text_Replacement[]
	 */
	protected $lexical_updates = array();

	/**
	 * Tracks and limits `seek()` calls to prevent accidental infinite loops.
	 *
	 * @since 6.2.0
	 * @var int
	 *
	 * @see WP_HTML_Tag_Processor::seek()
	 */
	protected $seek_count = 0;

	/**
	 * Whether the parser should skip over an immediately-following linefeed
	 * character, as is the case with LISTING, PRE, and TEXTAREA.
	 *
	 * > If the next token is a U+000A LINE FEED (LF) character token, then
	 * > ignore that token and move on to the next one. (Newlines at the start
	 * > of [these] elements are ignored as an authoring convenience.)
	 *
	 * @since 6.7.0
	 *
	 * @var int|null
	 */
	private $skip_newline_at = null;

	/**
	 * Constructor.
	 *
	 * @since 6.2.0
	 *
	 * @param string $html HTML to process.
	 */
	public function __construct( $html ) {
		if ( ! is_string( $html ) ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'The HTML parameter must be a string.' ),
				'6.9.0'
			);
			$html = '';
		}
		$this->html = $html;
	}

	/**
	 * Switches parsing mode into a new namespace, such as when
	 * encountering an SVG tag and entering foreign content.
	 *
	 * @since 6.7.0
	 *
	 * @param string $new_namespace One of 'html', 'svg', or 'math' indicating into what
	 *                              namespace the next tokens will be processed.
	 * @return bool Whether the namespace was valid and changed.
	 */
	public function change_parsing_namespace( string $new_namespace ): bool {
		if ( ! in_array( $new_namespace, array( 'html', 'math', 'svg' ), true ) ) {
			return false;
		}

		$this->parsing_namespace = $new_namespace;
		return true;
	}

	/**
	 * Finds the next tag matching the $query.
	 *
	 * @since 6.2.0
	 * @since 6.5.0 No longer processes incomplete tokens at end of document; pauses the processor at start of token.
	 *
	 * @param array|string|null $query {
	 *     Optional. Which tag name to find, having which class, etc. Default is to find any tag.
	 *
	 *     @type string|null $tag_name     Which tag to find, or `null` for "any tag."
	 *     @type int|null    $match_offset Find the Nth tag matching all search criteria.
	 *                                     1 for "first" tag, 3 for "third," etc.
	 *                                     Defaults to first tag.
	 *     @type string|null $class_name   Tag must contain this whole class name to match.
	 *     @type string|null $tag_closers  "visit" or "skip": whether to stop on tag closers, e.g. </div>.
	 * }
	 * @return bool Whether a tag was matched.
	 *
	 * @phpstan-impure
	 */
	public function next_tag( $query = null ): bool {
		$this->parse_query( $query );
		$already_found = 0;

		do {
			if ( false === $this->next_token() ) {
				return false;
			}

			if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
				continue;
			}

			if ( $this->matches() ) {
				++$already_found;
			}
		} while ( $already_found < $this->sought_match_offset );

		return true;
	}

	/**
	 * Finds the next token in the HTML document.
	 *
	 * An HTML document can be viewed as a stream of tokens,
	 * where tokens are things like HTML tags, HTML comments,
	 * text nodes, etc. This method finds the next token in
	 * the HTML document and returns whether it found one.
	 *
	 * If it starts parsing a token and reaches the end of the
	 * document then it will seek to the start of the last
	 * token and pause, returning `false` to indicate that it
	 * failed to find a complete token.
	 *
	 * Possible token types, based on the HTML specification:
	 *
	 *  - an HTML tag, whether opening, closing, or void.
	 *  - a text node - the plaintext inside tags.
	 *  - an HTML comment.
	 *  - a DOCTYPE declaration.
	 *  - a processing instruction, e.g. `<?xml version="1.0" ?>`.
	 *
	 * @since 6.5.0
	 * @since 6.7.0 Recognizes CDATA sections within foreign content.
	 *
	 * @return bool Whether a token was parsed.
	 */
	public function next_token(): bool {
		return $this->base_class_next_token();
	}

	/**
	 * Internal method which finds the next token in the HTML document.
	 *
	 * This method is a private internal function which implements the logic for
	 * finding the next token in a document. It exists so that the parser can update
	 * its state without affecting the location of the cursor in the document and
	 * without triggering subclass methods for things like `next_token()`, e.g.
	 * when applying patches before searching for the next token.
	 *
	 * @since 6.5.0
	 * @ignore
	 *
	 * @return bool Whether a token was parsed.
	 */
	private function base_class_next_token(): bool {
		$was_at = $this->bytes_already_parsed;
		$this->after_tag();

		// Don't proceed if there's nothing more to scan.
		if (
			self::STATE_COMPLETE === $this->parser_state ||
			self::STATE_INCOMPLETE_INPUT === $this->parser_state
		) {
			return false;
		}

		/*
		 * The next step in the parsing loop determines the parsing state;
		 * clear it so that state doesn't linger from the previous step.
		 */
		$this->parser_state = self::STATE_READY;

		if ( $this->bytes_already_parsed >= strlen( $this->html ) ) {
			$this->parser_state = self::STATE_COMPLETE;
			return false;
		}

		// Find the next tag if it exists.
		if ( false === $this->parse_next_tag() ) {
			if ( self::STATE_INCOMPLETE_INPUT === $this->parser_state ) {
				$this->bytes_already_parsed = $was_at;
			}

			return false;
		}

		/*
		 * For legacy reasons the rest of this function handles tags and their
		 * attributes. If the processor has reached the end of the document
		 * or if it matched any other token then it should return here to avoid
		 * attempting to process tag-specific syntax.
		 */
		if (
			self::STATE_INCOMPLETE_INPUT !== $this->parser_state &&
			self::STATE_COMPLETE !== $this->parser_state &&
			self::STATE_MATCHED_TAG !== $this->parser_state
		) {
			return true;
		}

		// Parse all of its attributes.
		while ( $this->parse_next_attribute() ) {
			continue;
		}

		// Ensure that the tag closes before the end of the document.
		if (
			self::STATE_INCOMPLETE_INPUT === $this->parser_state ||
			$this->bytes_already_parsed >= strlen( $this->html )
		) {
			// Does this appropriately clear state (parsed attributes)?
			$this->parser_state         = self::STATE_INCOMPLETE_INPUT;
			$this->bytes_already_parsed = $was_at;

			return false;
		}

		$tag_ends_at = strpos( $this->html, '>', $this->bytes_already_parsed );
		if ( false === $tag_ends_at ) {
			$this->parser_state         = self::STATE_INCOMPLETE_INPUT;
			$this->bytes_already_parsed = $was_at;

			return false;
		}
		$this->parser_state         = self::STATE_MATCHED_TAG;
		$this->bytes_already_parsed = $tag_ends_at + 1;
		$this->token_length         = $this->bytes_already_parsed - $this->token_starts_at;

		/*
		 * Certain tags require additional processing. The first-letter pre-check
		 * avoids unnecessary string allocation when comparing the tag names.
		 *
		 *  - IFRAME
		 *  - LISTING (deprecated)
		 *  - NOEMBED (deprecated)
		 *  - NOFRAMES (deprecated)
		 *  - PRE
		 *  - SCRIPT
		 *  - STYLE
		 *  - TEXTAREA
		 *  - TITLE
		 *  - XMP (deprecated)
		 */
		if (
			$this->is_closing_tag ||
			'html' !== $this->parsing_namespace ||
			1 !== strspn( $this->html, 'iIlLnNpPsStTxX', $this->tag_name_starts_at, 1 )
		) {
			return true;
		}

		$tag_name = $this->get_tag();

		/*
		 * For LISTING, PRE, and TEXTAREA, the first linefeed of an immediately-following
		 * text node is ignored as an authoring convenience.
		 *
		 * @see static::skip_newline_at
		 */
		if ( 'LISTING' === $tag_name || 'PRE' === $tag_name ) {
			$this->skip_newline_at = $this->bytes_already_parsed;
			return true;
		}

		/*
		 * There are certain elements whose children are not DATA but are instead
		 * RCDATA or RAWTEXT. These cannot contain other elements, and the contents
		 * are parsed as plaintext, with character references decoded in RCDATA but
		 * not in RAWTEXT.
		 *
		 * These elements are described here as "self-contained" or special atomic
		 * elements whose end tag is consumed with the opening tag, and they will
		 * contain modifiable text inside of them.
		 *
		 * Preserve the opening tag pointers, as these will be overwritten
		 * when finding the closing tag. They will be reset after finding
		 * the closing tag to point to the opening of the special atomic
		 * tag sequence.
		 */
		$tag_name_starts_at    = $this->tag_name_starts_at;
		$tag_name_length       = $this->tag_name_length;
		$tag_ends_at           = $this->token_starts_at + $this->token_length;
		$has_self_closing_flag = $this->has_self_closing_flag;
		$attributes            = $this->attributes;
		$duplicate_attributes  = $this->duplicate_attributes;

		// Find the closing tag if necessary.
		switch ( $tag_name ) {
			case 'SCRIPT':
				$found_closer = $this->skip_script_data();
				break;

			case 'TEXTAREA':
			case 'TITLE':
				$found_closer = $this->skip_rcdata( $tag_name );
				break;

			/*
			 * In the browser this list would include the NOSCRIPT element,
			 * but the Tag Processor is an environment with the scripting
			 * flag disabled, meaning that it needs to descend into the
			 * NOSCRIPT element to be able to properly process what will be
			 * sent to a browser.
			 *
			 * Note that this rule makes HTML5 syntax incompatible with XML,
			 * because the parsing of this token depends on client application.
			 * The NOSCRIPT element cannot be represented in the XHTML syntax.
			 */
			case 'IFRAME':
			case 'NOEMBED':
			case 'NOFRAMES':
			case 'STYLE':
			case 'XMP':
				$found_closer = $this->skip_rawtext( $tag_name );
				break;

			// No other tags should be treated in their entirety here.
			default:
				return true;
		}

		if ( !
			$found_closer ) {
			$this->parser_state         = self::STATE_INCOMPLETE_INPUT;
			$this->bytes_already_parsed = $was_at;
			return false;
		}

		/*
		 * The values here look like they reference the opening tag but they reference
		 * the closing tag instead. This is why the opening tag values were stored
		 * above in a variable. It reads confusingly here, but that's because the
		 * functions that skip the contents have moved all the internal cursors past
		 * the inner content of the tag.
		 */
		$this->token_starts_at       = $was_at;
		$this->token_length          = $this->bytes_already_parsed - $this->token_starts_at;
		$this->text_starts_at        = $tag_ends_at;
		$this->text_length           = $this->tag_name_starts_at - $this->text_starts_at;
		$this->tag_name_starts_at    = $tag_name_starts_at;
		$this->tag_name_length       = $tag_name_length;
		$this->has_self_closing_flag = $has_self_closing_flag;
		$this->attributes            = $attributes;
		$this->duplicate_attributes  = $duplicate_attributes;

		return true;
	}

	/**
	 * Whether the processor paused because the input HTML document ended
	 * in the middle of a syntax element, such as in the middle of a tag.
	 *
	 * Example:
	 *
	 *     $processor = new WP_HTML_Tag_Processor( '<input type="text" value="Th' );
	 *     false      === $processor->next_tag();
	 *     true       === $processor->paused_at_incomplete_token();
	 *
	 * @since 6.5.0
	 *
	 * @return bool Whether the parse paused at the start of an incomplete token.
	 */
	public function paused_at_incomplete_token(): bool {
		return self::STATE_INCOMPLETE_INPUT === $this->parser_state;
	}

	/**
	 * Generator for a foreach loop to step through each class name for the matched tag.
	 *
	 * This generator function is designed to be used inside a "foreach" loop.
	 *
	 * Example:
	 *
	 *     $p = new WP_HTML_Tag_Processor( "<div class='free &lt;egg&gt;\tlang-en'>" );
	 *     $p->next_tag();
	 *     foreach ( $p->class_list() as $class_name ) {
	 *         echo "{$class_name} ";
	 *     }
	 *     // Outputs: "free <egg> lang-en "
	 *
	 * @since 6.4.0
	 *
	 * @return Generator<int, non-empty-string>
	 */
	public function class_list() {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return;
		}

		/** @var string $class contains the string value of the class attribute, with character references decoded. */
		$class = $this->get_attribute( 'class' );

		if ( ! is_string( $class ) ) {
			return;
		}

		$seen = array();

		$is_quirks = self::QUIRKS_MODE === $this->compat_mode;

		$at = 0;
		while ( $at < strlen( $class ) ) {
			// Skip past any initial boundary characters.
			$at += strspn( $class, " \t\f\r\n", $at );
			if ( $at >= strlen( $class ) ) {
				return;
			}

			// Find the byte length until the next boundary.
			$length = strcspn( $class, " \t\f\r\n", $at );
			if ( 0 === $length ) {
				return;
			}

			$name = str_replace( "\x00", "\u{FFFD}", substr( $class, $at, $length ) );
			if ( $is_quirks ) {
				$name = strtolower( $name );
			}
			$at += $length;

			/*
			 * It's expected that the number of class names for a given tag is relatively small.
			 * Given this, it is probably faster overall to scan an array for a value rather
			 * than to use the class name as a key and check if it's a key of $seen.
			 */
			if ( in_array( $name, $seen, true ) ) {
				continue;
			}

			$seen[] = $name;
			yield $name;
		}
	}


	/**
	 * Returns if a matched tag contains the given ASCII case-insensitive class name.
	 *
	 * @since 6.4.0
	 *
	 * @param string $wanted_class Look for this CSS class name, ASCII case-insensitive.
	 * @return bool|null Whether the matched tag contains the given class name, or null if not matched.
	 */
	public function has_class( $wanted_class ): ?bool {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return null;
		}

		$case_insensitive = self::QUIRKS_MODE === $this->compat_mode;

		$wanted_length = strlen( $wanted_class );
		foreach ( $this->class_list() as $class_name ) {
			if (
				strlen( $class_name ) === $wanted_length &&
				0 === substr_compare( $class_name, $wanted_class, 0, strlen( $wanted_class ), $case_insensitive )
			) {
				return true;
			}
		}

		return false;
	}


	/**
	 * Sets a bookmark in the HTML document.
	 *
	 * Bookmarks represent specific places or tokens in the HTML
	 * document, such as a tag opener or closer. When applying
	 * edits to a document, such as setting an attribute, the
	 * text offsets of that token may shift; the bookmark is
	 * kept updated with those shifts and remains stable unless
	 * the entire span of text in which the token sits is removed.
	 *
	 * Release bookmarks when they are no longer needed.
	 *
	 * Example:
	 *
	 *     <main><h2>Surprising fact you may not know!</h2></main>
	 *           ^  ^
	 *            \-|-- this `H2` opener bookmark tracks the token
	 *
	 *     <main class="clickbait"><h2>Surprising fact you may no…
	 *                             ^  ^
	 *                              \-|-- it shifts with edits
	 *
	 * Bookmarks provide the ability to seek to a previously-scanned
	 * place in the HTML document. This avoids the need to re-scan
	 * the entire document.
	 *
	 * Example:
	 *
	 *     <ul><li>One</li><li>Two</li><li>Three</li></ul>
	 *                                 ^^^^
	 *                                 want to note this last item
	 *
	 *     $p = new WP_HTML_Tag_Processor( $html );
	 *     $in_list = false;
	 *     while ( $p->next_tag( array( 'tag_closers' => $in_list ?
	 *         'visit' : 'skip' ) ) ) {
	 *         if ( 'UL' === $p->get_tag() ) {
	 *             if ( $p->is_tag_closer() ) {
	 *                 $in_list = false;
	 *                 $p->set_bookmark( 'resume' );
	 *                 if ( $p->seek( 'last-li' ) ) {
	 *                     $p->add_class( 'last-li' );
	 *                 }
	 *                 $p->seek( 'resume' );
	 *                 $p->release_bookmark( 'last-li' );
	 *                 $p->release_bookmark( 'resume' );
	 *             } else {
	 *                 $in_list = true;
	 *             }
	 *         }
	 *
	 *         if ( 'LI' === $p->get_tag() ) {
	 *             $p->set_bookmark( 'last-li' );
	 *         }
	 *     }
	 *
	 * Bookmarks intentionally hide the internal string offsets
	 * to which they refer. They are maintained internally as
	 * updates are applied to the HTML document and therefore
	 * retain their "position" - the location to which they
	 * originally pointed. The inability to use bookmarks with
	 * functions like `substr` is therefore intentional to guard
	 * against accidentally breaking the HTML.
	 *
	 * Because bookmarks allocate memory and require processing
	 * for every applied update, they are limited and require
	 * a name. They should not be created with programmatically-made
	 * names, such as "li_{$index}" with some loop. As a general
	 * rule they should only be created with string-literal names
	 * like "start-of-section" or "last-paragraph".
	 *
	 * Bookmarks are a powerful tool to enable complicated behavior.
	 * Consider double-checking that you need this tool if you are
	 * reaching for it, as inappropriate use could lead to broken
	 * HTML structure or unwanted processing overhead.
	 *
	 * @since 6.2.0
	 *
	 * @param string $name Identifies this particular bookmark.
	 * @return bool Whether the bookmark was successfully created.
	 */
	public function set_bookmark( $name ): bool {
		// It only makes sense to set a bookmark if the parser has paused on a concrete token.
		if (
			self::STATE_COMPLETE === $this->parser_state ||
			self::STATE_INCOMPLETE_INPUT === $this->parser_state
		) {
			return false;
		}

		if ( ! array_key_exists( $name, $this->bookmarks ) && count( $this->bookmarks ) >= static::MAX_BOOKMARKS ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Too many bookmarks: cannot create any more.' ),
				'6.2.0'
			);
			return false;
		}

		$this->bookmarks[ $name ] = new WP_HTML_Span( $this->token_starts_at, $this->token_length );

		return true;
	}


	/**
	 * Removes a bookmark that is no longer needed.
	 *
	 * Releasing a bookmark frees up the small
	 * performance overhead it requires.
	 *
	 * @param string $name Name of the bookmark to remove.
	 * @return bool Whether the bookmark already existed before removal.
	 */
	public function release_bookmark( $name ): bool {
		if ( ! array_key_exists( $name, $this->bookmarks ) ) {
			return false;
		}

		unset( $this->bookmarks[ $name ] );

		return true;
	}

	/**
	 * Skips contents of generic rawtext elements.
	 *
	 * @since 6.3.2
	 * @ignore
	 *
	 * @see https://html.spec.whatwg.org/#generic-raw-text-element-parsing-algorithm
	 *
	 * @param string $tag_name The uppercase tag name which will close the RAWTEXT region.
	 * @return bool Whether an end to the RAWTEXT region was found before the end of the document.
	 */
	private function skip_rawtext( string $tag_name ): bool {
		/*
		 * These two functions distinguish themselves on whether character references are
		 * decoded, and since functionality to read the inner markup isn't supported, it's
		 * not necessary to implement these two functions separately.
		 */
		return $this->skip_rcdata( $tag_name );
	}

	/**
	 * Skips contents of RCDATA elements, namely title and textarea tags.
	 *
	 * @since 6.2.0
	 * @ignore
	 *
	 * @see https://html.spec.whatwg.org/multipage/parsing.html#rcdata-state
	 *
	 * @param string $tag_name The uppercase tag name which will close the RCDATA region.
	 * @return bool Whether an end to the RCDATA region was found before the end of the document.
	 */
	private function skip_rcdata( string $tag_name ): bool {
		$html       = $this->html;
		$doc_length = strlen( $html );
		$tag_length = strlen( $tag_name );

		$at = $this->bytes_already_parsed;

		while ( false !== $at && $at < $doc_length ) {
			$at                       = strpos( $this->html, '</', $at );
			$this->tag_name_starts_at = $at;

			// Fail if there is no possible tag closer.
			if ( false === $at || ( $at + 2 + $tag_length ) >= $doc_length ) {
				return false;
			}

			$at += 2;

			/*
			 * Find a case-insensitive match to the tag name.
			 *
			 * Because tag names are limited to US-ASCII there is no
			 * need to perform any kind of Unicode normalization when
			 * comparing; any character which could be impacted by such
			 * normalization could not be part of a tag name.
			 */
			for ( $i = 0; $i < $tag_length; $i++ ) {
				$tag_char  = $tag_name[ $i ];
				$html_char = $html[ $at + $i ];

				if ( $html_char !== $tag_char && strtoupper( $html_char ) !== $tag_char ) {
					$at += $i;
					continue 2;
				}
			}

			$at                        += $tag_length;
			$this->bytes_already_parsed = $at;

			if ( $at >= strlen( $html ) ) {
				return false;
			}

			/*
			 * Ensure that the tag name terminates to avoid matching on
			 * substrings of a longer tag name. For example, the sequence
			 * "</textarearug" should not match for "</textarea" even
			 * though "textarea" is found within the text.
			 */
			$c = $html[ $at ];
			if ( ' ' !== $c && "\t" !== $c && "\r" !== $c && "\n" !== $c && '/' !== $c && '>' !== $c ) {
				continue;
			}

			while ( $this->parse_next_attribute() ) {
				continue;
			}

			$at = $this->bytes_already_parsed;
			if ( $at >= strlen( $this->html ) ) {
				return false;
			}

			if ( '>' === $html[ $at ] ) {
				$this->bytes_already_parsed = $at + 1;
				return true;
			}

			if ( $at + 1 >= strlen( $this->html ) ) {
				return false;
			}

			if ( '/' === $html[ $at ] && '>' === $html[ $at + 1 ] ) {
				$this->bytes_already_parsed = $at + 2;
				return true;
			}
		}

		return false;
	}

	/**
	 * Skips contents of script tags.
	 *
	 * @since 6.2.0
	 * @ignore
	 *
	 * @return bool Whether the script tag was closed before the end of the document.
	 */
	private function skip_script_data(): bool {
		$state      = 'unescaped';
		$html       = $this->html;
		$doc_length = strlen( $html );
		$at         = $this->bytes_already_parsed;

		while ( false !== $at && $at < $doc_length ) {
			$at += strcspn( $html, '-<', $at );

			/*
			 * Optimization: Terminating a complete script element requires at least eight
			 * additional bytes in the document. Some checks below may cause local escaped
			 * state transitions when processing shorter strings, but those transitions are
			 * irrelevant if the script tag is incomplete and the function must return false.
			 *
			 * This may need updating if those transitions become significant or exported from
			 * this function in some way, such as when building safe methods to embed JavaScript
			 * or data inside a SCRIPT element.
			 *
			 *     $at may be here.
			 *        ↓
			 *     ...</script>
			 *         ╰──┬───╯
			 *     $at + 8 additional bytes are required for a non-false return value.
			 *
			 * This single check eliminates the need to check lengths for the shorter spans:
			 *
			 *           $at may be here.
			 *                  ↓
			 *     <script><!-- --></script>
			 *                   ├╯
			 *             $at + 2 additional characters does not require a length check.
			 *
			 * The transition from "escaped" to "unescaped" is not relevant if the document ends:
			 *
			 *           $at may be here.
			 *                  ↓
			 *     <script><!-- -->[[END-OF-DOCUMENT]]
			 *                   ╰──┬───╯
			 *             $at + 8 additional bytes is not satisfied, return false.
			 */
			if ( $at + 8 >= $doc_length ) {
				return false;
			}

			/*
			 * For all script states a "-->" transitions
			 * back into the normal unescaped script mode,
			 * even if that's the current state.
			 */
			if (
				'-' === $html[ $at ] &&
				'-' === $html[ $at + 1 ] &&
				'>' === $html[ $at + 2 ]
			) {
				$at   += 3;
				$state = 'unescaped';
				continue;
			}

			/*
			 * Everything of interest past here starts with "<".
			 * Check this character and advance position regardless.
			 */
			if ( '<' !== $html[ $at++ ] ) {
				continue;
			}

			/*
			 * "<!--" only transitions from _unescaped_ to _escaped_. This byte sequence is only
			 * significant in the _unescaped_ state and is ignored in any other state.
			 */
			if (
				'unescaped' === $state &&
				'!' === $html[ $at ] &&
				'-' === $html[ $at + 1 ] &&
				'-' === $html[ $at + 2 ]
			) {
				$at += 3;

				/*
				 * The parser is ready to enter the _escaped_ state, but may remain in the
				 * _unescaped_ state. This occurs when "<!--" is immediately followed by a
				 * sequence of 0 or more "-" followed by ">". This is similar to abruptly closed
				 * HTML comments like "<!-->" or "<!--->".
				 *
				 * Note that this check may advance the position significantly and requires a
				 * length check to prevent bad offsets on inputs like `<script><!---------`.
				 */
				$at += strspn( $html, '-', $at );
				if ( $at < $doc_length && '>' === $html[ $at ] ) {
					++$at;
					continue;
				}

				$state = 'escaped';
				continue;
			}

			if ( '/' === $html[ $at ] ) {
				$closer_potentially_starts_at = $at - 1;
				$is_closing                   = true;
				++$at;
			} else {
				$is_closing = false;
			}

			/*
			 * At this point the only remaining state-changes occur with the
			 * <script> and </script> tags; unless one of these appears next,
			 * proceed scanning to the next potential token in the text.
			 */
			if ( ! (
				( 's' === $html[ $at ] || 'S' === $html[ $at ] ) &&
				( 'c' === $html[ $at + 1 ] || 'C' === $html[ $at + 1 ] ) &&
				( 'r' === $html[ $at + 2 ] || 'R' === $html[ $at + 2 ] ) &&
				( 'i' === $html[ $at + 3 ] || 'I' === $html[ $at + 3 ] ) &&
				( 'p' === $html[ $at + 4 ] || 'P' === $html[ $at + 4 ] ) &&
				( 't' === $html[ $at + 5 ] || 'T' === $html[ $at + 5 ] )
			) ) {
				continue;
			}

			/*
			 * Ensure that the script tag terminates to avoid matching on
			 * substrings of a non-match. For example, the sequence
			 * "<script123" should not end a script region even though
			 * "<script" is found within the text.
			 */
			$at += 6;
			$c   = $html[ $at ];
			if (
				/**
				 * These characters trigger state transitions of interest:
				 *
				 * - @see {https://html.spec.whatwg.org/multipage/parsing.html#script-data-end-tag-name-state}
				 * - @see {https://html.spec.whatwg.org/multipage/parsing.html#script-data-escaped-end-tag-name-state}
				 * - @see {https://html.spec.whatwg.org/multipage/parsing.html#script-data-double-escape-start-state}
				 * - @see {https://html.spec.whatwg.org/multipage/parsing.html#script-data-double-escape-end-state}
				 *
				 * The "\r" character is not present in the above references. However, "\r" must be
				 * treated the same as "\n".
				 * This is because the HTML Standard requires newline
				 * normalization during preprocessing which applies this replacement.
				 *
				 * - @see https://html.spec.whatwg.org/multipage/parsing.html#preprocessing-the-input-stream
				 * - @see https://infra.spec.whatwg.org/#normalize-newlines
				 */
				'>' !== $c &&
				' ' !== $c &&
				"\n" !== $c &&
				'/' !== $c &&
				"\t" !== $c &&
				"\f" !== $c &&
				"\r" !== $c
			) {
				continue;
			}

			if ( 'escaped' === $state && ! $is_closing ) {
				$state = 'double-escaped';
				continue;
			}

			if ( 'double-escaped' === $state && $is_closing ) {
				$state = 'escaped';
				continue;
			}

			if ( $is_closing ) {
				$this->bytes_already_parsed = $closer_potentially_starts_at;
				$this->tag_name_starts_at   = $closer_potentially_starts_at;
				if ( $this->bytes_already_parsed >= $doc_length ) {
					return false;
				}

				while ( $this->parse_next_attribute() ) {
					continue;
				}

				if ( $this->bytes_already_parsed >= $doc_length ) {
					return false;
				}

				if ( '>' === $html[ $this->bytes_already_parsed ] ) {
					++$this->bytes_already_parsed;
					return true;
				}
			}

			++$at;
		}

		return false;
	}

	/**
	 * Parses the next tag.
	 *
	 * This will find and start parsing the next tag, including
	 * the opening `<`, the potential closer `/`, and the tag
	 * name. It does not parse the attributes or scan to the
	 * closing `>`; these are left for other methods.
	 *
	 * @since 6.2.0
	 * @since 6.2.1 Support abruptly-closed comments, invalid-tag-closer-comments, and empty elements.
	 * @ignore
	 *
	 * @return bool Whether a tag was found before the end of the document.
	 */
	private function parse_next_tag(): bool {
		$this->after_tag();

		$html       = $this->html;
		$doc_length = strlen( $html );
		$was_at     = $this->bytes_already_parsed;
		$at         = $was_at;

		while ( $at < $doc_length ) {
			$at = strpos( $html, '<', $at );
			if ( false === $at ) {
				break;
			}

			if ( $at > $was_at ) {
				/*
				 * A "<" normally starts a new HTML tag or syntax token, but in cases where the
				 * following character can't produce a valid token, the "<" is instead treated
				 * as plaintext and the parser should skip over it. This avoids a problem when
				 * following earlier practices of typing emoji with text, e.g. "<3". This
				 * should be a heart, not a tag. It's supposed to be rendered, not hidden.
				 *
				 * At this point the parser checks if this is one of those cases and if it is
				 * will continue searching for the next "<" in search of a token boundary.
				 *
				 * @see https://html.spec.whatwg.org/#tag-open-state
				 */
				if ( 1 !== strspn( $html, '!/?abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', $at + 1, 1 ) ) {
					++$at;
					continue;
				}

				$this->parser_state         = self::STATE_TEXT_NODE;
				$this->token_starts_at      = $was_at;
				$this->token_length         = $at - $was_at;
				$this->text_starts_at       = $was_at;
				$this->text_length          = $this->token_length;
				$this->bytes_already_parsed = $at;
				return true;
			}

			$this->token_starts_at = $at;

			if ( $at + 1 < $doc_length && '/' === $this->html[ $at + 1 ] ) {
				$this->is_closing_tag = true;
				++$at;
			} else {
				$this->is_closing_tag = false;
			}

			/*
			 * HTML tag names must start with [a-zA-Z] otherwise they are not tags.
			 * For example, "<3" is rendered as text, not a tag opener. If at least
			 * one letter follows the "<" then _it is_ a tag, but if the following
			 * character is anything else it _is not a tag_.
			 *
			 * It's not uncommon to find non-tags starting with `<` in an HTML
			 * document, so it's good for performance to make this pre-check before
			 * continuing to attempt to parse a tag name.
			 *
			 * Reference:
			 * * https://html.spec.whatwg.org/multipage/parsing.html#data-state
			 * * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
			 */
			$tag_name_prefix_length = strspn( $html, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ', $at + 1 );
			if ( $tag_name_prefix_length > 0 ) {
				++$at;
				$this->parser_state         = self::STATE_MATCHED_TAG;
				$this->tag_name_starts_at   = $at;
				$this->tag_name_length      = $tag_name_prefix_length + strcspn( $html, " \t\f\r\n/>", $at + $tag_name_prefix_length );
				$this->bytes_already_parsed = $at + $this->tag_name_length;
				return true;
			}

			/*
			 * Abort if no tag is found before the end of
			 * the document. There is nothing left to parse.
			 */
			if ( $at + 1 >= $doc_length ) {
				$this->parser_state = self::STATE_INCOMPLETE_INPUT;

				return false;
			}

			/*
			 * `<!` transitions to markup declaration open state
			 * https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state
			 */
			if ( ! $this->is_closing_tag && '!' === $html[ $at + 1 ] ) {
				/*
				 * `<!--` transitions to a comment state – apply further comment rules.
				 * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
				 */
				if ( 0 === substr_compare( $html, '--', $at + 2, 2 ) ) {
					$closer_at = $at + 4;
					// If it's not possible to close the comment then there is nothing more to scan.
					if ( $doc_length <= $closer_at ) {
						$this->parser_state = self::STATE_INCOMPLETE_INPUT;

						return false;
					}

					// Abruptly-closed empty comments are a sequence of dashes followed by `>`.
					$span_of_dashes = strspn( $html, '-', $closer_at );
					if ( $doc_length <= $span_of_dashes + $closer_at ) {
						$this->parser_state = self::STATE_INCOMPLETE_INPUT;

						return false;
					}

					if ( '>' === $html[ $closer_at + $span_of_dashes ] ) {
						/*
						 * @todo When implementing `set_modifiable_text()` ensure that updates to this token
						 *       don't break the syntax for short comments, e.g. `<!--->`. Unlike other comment
						 *       and bogus comment syntax, these leave no clear insertion point for text and
						 *       they need to be modified specially in order to contain text. E.g. to store
						 *       `?` as the modifiable text, the `<!--->` needs to become `<!--?-->`, which
						 *       involves inserting an additional `-` into the token after the modifiable text.
						 */
						$this->parser_state = self::STATE_COMMENT;
						$this->comment_type = self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT;
						$this->token_length = $closer_at + $span_of_dashes + 1 - $this->token_starts_at;

						// Only provide modifiable text if the token is long enough to contain it.
						if ( $span_of_dashes >= 2 ) {
							$this->comment_type = self::COMMENT_AS_HTML_COMMENT;
							$this->text_starts_at = $this->token_starts_at + 4;
							$this->text_length = $span_of_dashes - 2;
						}

						$this->bytes_already_parsed = $closer_at + $span_of_dashes + 1;
						return true;
					}

					/*
					 * Comments may be closed by either a --> or an invalid --!>.
					 * The first occurrence closes the comment.
					 *
					 * See https://html.spec.whatwg.org/#parse-error-incorrectly-closed-comment
					 */
					--$closer_at; // Pre-increment inside condition below reduces risk of accidental infinite looping.
					while ( ++$closer_at < $doc_length ) {
						$closer_at = strpos( $html, '--', $closer_at );
						if ( false === $closer_at ) {
							$this->parser_state = self::STATE_INCOMPLETE_INPUT;

							return false;
						}

						if ( $closer_at + 2 < $doc_length && '>' === $html[ $closer_at + 2 ] ) {
							$this->parser_state = self::STATE_COMMENT;
							$this->comment_type = self::COMMENT_AS_HTML_COMMENT;
							$this->token_length = $closer_at + 3 - $this->token_starts_at;
							$this->text_starts_at = $this->token_starts_at + 4;
							$this->text_length = $closer_at - $this->text_starts_at;
							$this->bytes_already_parsed = $closer_at + 3;
							return true;
						}

						if (
							$closer_at + 3 < $doc_length &&
							'!' === $html[ $closer_at + 2 ] &&
							'>' === $html[ $closer_at + 3 ]
						) {
							$this->parser_state = self::STATE_COMMENT;
							$this->comment_type = self::COMMENT_AS_HTML_COMMENT;
							$this->token_length = $closer_at + 4 - $this->token_starts_at;
							$this->text_starts_at = $this->token_starts_at + 4;
							$this->text_length = $closer_at - $this->text_starts_at;
							$this->bytes_already_parsed = $closer_at + 4;
							return true;
						}
					}
				}

				/*
				 * `<!DOCTYPE` transitions to DOCTYPE state – skip to the nearest >
				 * These are ASCII-case-insensitive.
				 * https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
				 */
				if (
					$doc_length > $at + 8 &&
					( 'D' === $html[ $at + 2 ] || 'd' === $html[ $at + 2 ] ) &&
					( 'O' === $html[ $at + 3 ] || 'o' === $html[ $at + 3 ] ) &&
					( 'C' === $html[ $at + 4 ] || 'c' === $html[ $at + 4 ] ) &&
					( 'T' === $html[ $at + 5 ] || 't' === $html[ $at + 5 ] ) &&
					( 'Y' === $html[ $at + 6 ] || 'y' === $html[ $at + 6 ] ) &&
					( 'P' === $html[ $at + 7 ] || 'p' === $html[ $at + 7 ] ) &&
					( 'E' === $html[ $at + 8 ] || 'e' === $html[ $at + 8 ] )
				) {
					$closer_at = strpos( $html, '>', $at + 9 );
					if ( false === $closer_at ) {
						$this->parser_state = self::STATE_INCOMPLETE_INPUT;

						return false;
					}

					$this->parser_state = self::STATE_DOCTYPE;
					$this->token_length = $closer_at + 1 - $this->token_starts_at;
					$this->text_starts_at = $this->token_starts_at + 9;
					$this->text_length = $closer_at - $this->text_starts_at;
					$this->bytes_already_parsed = $closer_at + 1;
					return true;
				}

				if (
					'html' !== $this->parsing_namespace &&
					strlen( $html ) > $at + 8 &&
					'[' === $html[ $at + 2 ] &&
					'C' === $html[ $at + 3 ] &&
					'D' === $html[ $at + 4 ] &&
					'A' === $html[ $at + 5 ] &&
					'T' === $html[ $at + 6 ] &&
					'A' === $html[ $at + 7 ] &&
					'[' === $html[ $at + 8 ]
				) {
					$closer_at = strpos( $html, ']]>', $at + 9 );
					if ( false === $closer_at ) {
						$this->parser_state = self::STATE_INCOMPLETE_INPUT;

						return false;
					}

					$this->parser_state = self::STATE_CDATA_NODE;
					$this->text_starts_at = $at + 9;
					$this->text_length = $closer_at - $this->text_starts_at;
					$this->token_length = $closer_at + 3 - $this->token_starts_at;
					$this->bytes_already_parsed = $closer_at + 3;
					return true;
				}

				/*
				 * Anything else here is an incorrectly-opened comment and transitions
				 * to the bogus comment state -
				 * skip to the nearest >. If no closer is
				 * found then the HTML was truncated inside the markup declaration.
				 */
				$closer_at = strpos( $html, '>', $at + 1 );
				if ( false === $closer_at ) {
					$this->parser_state = self::STATE_INCOMPLETE_INPUT;

					return false;
				}

				$this->parser_state = self::STATE_COMMENT;
				$this->comment_type = self::COMMENT_AS_INVALID_HTML;
				$this->token_length = $closer_at + 1 - $this->token_starts_at;
				$this->text_starts_at = $this->token_starts_at + 2;
				$this->text_length = $closer_at - $this->text_starts_at;
				$this->bytes_already_parsed = $closer_at + 1;

				/*
				 * Identify nodes that would be CDATA if HTML had CDATA sections.
				 *
				 * This section must occur after identifying the bogus comment end
				 * because in an HTML parser it will span to the nearest `>`, even
				 * if there's no `]]>` as would be required in an XML document. It
				 * is therefore not possible to parse a CDATA section containing
				 * a `>` in the HTML syntax.
				 *
				 * Inside foreign elements there is a discrepancy between browsers
				 * and the specification on this.
				 *
				 * @todo Track whether the Tag Processor is inside a foreign element
				 *       and require the proper closing `]]>` in those cases.
				 */
				if (
					$this->token_length >= 10 &&
					'[' === $html[ $this->token_starts_at + 2 ] &&
					'C' === $html[ $this->token_starts_at + 3 ] &&
					'D' === $html[ $this->token_starts_at + 4 ] &&
					'A' === $html[ $this->token_starts_at + 5 ] &&
					'T' === $html[ $this->token_starts_at + 6 ] &&
					'A' === $html[ $this->token_starts_at + 7 ] &&
					'[' === $html[ $this->token_starts_at + 8 ] &&
					']' === $html[ $closer_at - 1 ] &&
					']' === $html[ $closer_at - 2 ]
				) {
					$this->parser_state = self::STATE_COMMENT;
					$this->comment_type = self::COMMENT_AS_CDATA_LOOKALIKE;
					$this->text_starts_at += 7;
					$this->text_length -= 9;
				}

				return true;
			}

			/*
			 * </> is a missing end tag name, which is ignored.
			 *
			 * This was also known as the "presumptuous empty tag"
			 * in early discussions as it was proposed to close
			 * the nearest previous opening tag.
			 *
			 * See https://html.spec.whatwg.org/#parse-error-missing-end-tag-name
			 */
			if ( '>' === $html[ $at + 1 ] ) {
				// `<>` is interpreted as plaintext.
				if ( ! $this->is_closing_tag ) {
					++$at;
					continue;
				}

				$this->parser_state = self::STATE_PRESUMPTUOUS_TAG;
				$this->token_length = $at + 2 - $this->token_starts_at;
				$this->bytes_already_parsed = $at + 2;
				return true;
			}

			/*
			 * `<?` transitions to a bogus comment state – skip to the nearest >
			 * See https://html.spec.whatwg.org/multipage/parsing.html#tag-open-state
			 */
			if ( ! $this->is_closing_tag && '?'
				=== $html[ $at + 1 ] ) {
				$closer_at = strpos( $html, '>', $at + 2 );
				if ( false === $closer_at ) {
					$this->parser_state = self::STATE_INCOMPLETE_INPUT;

					return false;
				}

				$this->parser_state = self::STATE_COMMENT;
				$this->comment_type = self::COMMENT_AS_INVALID_HTML;
				$this->token_length = $closer_at + 1 - $this->token_starts_at;
				$this->text_starts_at = $this->token_starts_at + 2;
				$this->text_length = $closer_at - $this->text_starts_at;
				$this->bytes_already_parsed = $closer_at + 1;

				/*
				 * Identify a Processing Instruction node were HTML to have them.
				 *
				 * This section must occur after identifying the bogus comment end
				 * because in an HTML parser it will span to the nearest `>`, even
				 * if there's no `?>` as would be required in an XML document. It
				 * is therefore not possible to parse a Processing Instruction node
				 * containing a `>` in the HTML syntax.
				 *
				 * XML allows for more target names, but this code only identifies
				 * those with ASCII-representable target names. This means that it
				 * may identify some Processing Instruction nodes as bogus comments,
				 * but it will not misinterpret the HTML structure. By limiting the
				 * identification to these target names the Tag Processor can avoid
				 * the need to start parsing UTF-8 sequences.
				 *
				 * > NameStartChar ::= ":" | [A-Z] | "_" | [a-z] | [#xC0-#xD6] | [#xD8-#xF6] | [#xF8-#x2FF] |
				 *                     [#x370-#x37D] | [#x37F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] |
				 *                     [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] |
				 *                     [#x10000-#xEFFFF]
				 * > NameChar      ::= NameStartChar | "-" | "." | [0-9] | #xB7 | [#x0300-#x036F] | [#x203F-#x2040]
				 *
				 * @todo Processing instruction nodes in SGML may contain any kind of markup. XML defines a
				 *       special case with `<?xml ... ?>` syntax, but the `?` is part of the bogus comment.
				 *
				 * @see https://www.w3.org/TR/2006/REC-xml11-20060816/#NT-PITarget
				 */
				if ( $this->token_length >= 5 && '?' === $html[ $closer_at - 1 ] ) {
					$comment_text = substr( $html, $this->token_starts_at + 2, $this->token_length - 4 );
					$pi_target_length = strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ:_' );

					if ( 0 < $pi_target_length ) {
						$pi_target_length += strspn( $comment_text, 'abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789:_-.', $pi_target_length );

						$this->comment_type = self::COMMENT_AS_PI_NODE_LOOKALIKE;
						$this->tag_name_starts_at = $this->token_starts_at + 2;
						$this->tag_name_length = $pi_target_length;
						$this->text_starts_at += $pi_target_length;
						$this->text_length -= $pi_target_length + 1;
					}
				}

				return true;
			}

			/*
			 * If a non-alpha starts the tag name in a tag closer it's a comment.
			 * Find the first `>`, which closes the comment.
			 *
			 * This parser classifies these particular comments as special "funky comments"
			 * which are made available for further processing.
			 *
			 * See https://html.spec.whatwg.org/#parse-error-invalid-first-character-of-tag-name
			 */
			if ( $this->is_closing_tag ) {
				// No chance of finding a closer.
				if ( $at + 3 > $doc_length ) {
					$this->parser_state = self::STATE_INCOMPLETE_INPUT;

					return false;
				}

				$closer_at = strpos( $html, '>', $at + 2 );
				if ( false === $closer_at ) {
					$this->parser_state = self::STATE_INCOMPLETE_INPUT;

					return false;
				}

				$this->parser_state = self::STATE_FUNKY_COMMENT;
				$this->token_length = $closer_at + 1 - $this->token_starts_at;
				$this->text_starts_at = $this->token_starts_at + 2;
				$this->text_length = $closer_at - $this->text_starts_at;
				$this->bytes_already_parsed = $closer_at + 1;
				return true;
			}

			++$at;
		}

		/*
		 * This does not imply an incomplete parse; it indicates that there
		 * can be nothing left in the document other than a #text node.
		 */
		$this->parser_state = self::STATE_TEXT_NODE;
		$this->token_starts_at = $was_at;
		$this->token_length = $doc_length - $was_at;
		$this->text_starts_at = $was_at;
		$this->text_length = $this->token_length;
		$this->bytes_already_parsed = $doc_length;
		return true;
	}

	/**
	 * Parses the next attribute.
	 *
	 * @since 6.2.0
	 * @ignore
	 *
	 * @return bool Whether an attribute was found before the end of the document.
	 */
	private function parse_next_attribute(): bool {
		$doc_length = strlen( $this->html );

		// Skip whitespace and slashes.
		$skipped_length = strspn( $this->html, " \t\f\r\n/", $this->bytes_already_parsed );
		$this->bytes_already_parsed += $skipped_length;
		if ( $this->bytes_already_parsed >= $doc_length ) {
			$this->parser_state = self::STATE_INCOMPLETE_INPUT;

			return false;
		}

		/**
		 * This block serves two purposes:
		 *
		 * - A fast path for common tag-ending `>`.
		 * - A check for the self-closing flag which must appear as `/>`.
		 *
		 * In a tag like `<g attr=/>`, `/` is the attribute value, not a self-closing
		 * flag.
		 * When it appears in this form, the parser has already consumed the
		 * attribute value, `$skipped_length` is 0, and the checks below correctly
		 * identify whether there is a self-closing flag.
		 *
		 * Note: Both start and end tags may have the self-closing flag.
		 */
		if ( '>' === $this->html[ $this->bytes_already_parsed ] ) {
			if ( $skipped_length > 0 && '/' === $this->html[ $this->bytes_already_parsed - 1 ] ) {
				$this->has_self_closing_flag = true;
			}
			return false;
		}

		/*
		 * Treat the equal sign as a part of the attribute
		 * name if it is the first encountered byte.
		 *
		 * @see https://html.spec.whatwg.org/multipage/parsing.html#before-attribute-name-state
		 */
		$name_length = '=' === $this->html[ $this->bytes_already_parsed ]
			? 1 + strcspn( $this->html, "=/> \t\f\r\n", $this->bytes_already_parsed + 1 )
			: strcspn( $this->html, "=/> \t\f\r\n", $this->bytes_already_parsed );

		// No attribute, just tag closer.
		if ( 0 === $name_length || $this->bytes_already_parsed + $name_length >= $doc_length ) {
			return false;
		}

		$attribute_start = $this->bytes_already_parsed;
		$attribute_name = substr( $this->html, $attribute_start, $name_length );
		$this->bytes_already_parsed += $name_length;
		if ( $this->bytes_already_parsed >= $doc_length ) {
			$this->parser_state = self::STATE_INCOMPLETE_INPUT;

			return false;
		}

		$this->skip_whitespace();
		if ( $this->bytes_already_parsed >= $doc_length ) {
			$this->parser_state = self::STATE_INCOMPLETE_INPUT;

			return false;
		}

		$has_value = '=' === $this->html[ $this->bytes_already_parsed ];
		if ( $has_value ) {
			++$this->bytes_already_parsed;
			$this->skip_whitespace();
			if ( $this->bytes_already_parsed >= $doc_length ) {
				$this->parser_state = self::STATE_INCOMPLETE_INPUT;

				return false;
			}

			switch ( $this->html[ $this->bytes_already_parsed ] ) {
				case "'":
				case '"':
					$quote = $this->html[ $this->bytes_already_parsed ];
					$value_start = $this->bytes_already_parsed + 1;
					$end_quote_at = strpos( $this->html, $quote, $value_start );
					$end_quote_at = false === $end_quote_at ?
						$doc_length : $end_quote_at;
					$value_length = $end_quote_at - $value_start;
					$attribute_end = $end_quote_at + 1;
					$this->bytes_already_parsed = $attribute_end;
					break;

				default:
					$value_start = $this->bytes_already_parsed;
					$value_length = strcspn( $this->html, "> \t\f\r\n", $value_start );
					$attribute_end = $value_start + $value_length;
					$this->bytes_already_parsed = $attribute_end;
			}
		} else {
			$value_start = $this->bytes_already_parsed;
			$value_length = 0;
			$attribute_end = $attribute_start + $name_length;
		}

		if ( $attribute_end >= $doc_length ) {
			$this->parser_state = self::STATE_INCOMPLETE_INPUT;

			return false;
		}

		if ( $this->is_closing_tag ) {
			return true;
		}

		/*
		 * > There must never be two or more attributes on
		 * > the same start tag whose names are an ASCII
		 * > case-insensitive match for each other.
		 *     - HTML 5 spec
		 *
		 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
		 */
		$comparable_name = strtolower( $attribute_name );

		// If an attribute is listed many times, only use the first declaration and ignore the rest.
		if ( ! isset( $this->attributes[ $comparable_name ] ) ) {
			$this->attributes[ $comparable_name ] = new WP_HTML_Attribute_Token(
				$attribute_name,
				$value_start,
				$value_length,
				$attribute_start,
				$attribute_end - $attribute_start,
				! $has_value
			);

			return true;
		}

		/*
		 * Track the duplicate attributes so that if one is removed, all disappear together.
		 *
		 * While `$this->duplicate_attributes` could always be stored as an `array()`,
		 * which would simplify the logic here, storing a `null` and only allocating
		 * an array when encountering duplicates avoids needless allocations in the
		 * normative case of parsing tags with no duplicate attributes.
		 */
		$duplicate_span = new WP_HTML_Span( $attribute_start, $attribute_end - $attribute_start );
		if ( null === $this->duplicate_attributes ) {
			$this->duplicate_attributes = array( $comparable_name => array( $duplicate_span ) );
		} elseif ( ! isset( $this->duplicate_attributes[ $comparable_name ] ) ) {
			$this->duplicate_attributes[ $comparable_name ] = array( $duplicate_span );
		} else {
			$this->duplicate_attributes[ $comparable_name ][] = $duplicate_span;
		}

		return true;
	}

	/**
	 * Move the internal cursor past any immediate successive whitespace.
	 *
	 * @since 6.2.0
	 * @ignore
	 */
	private function skip_whitespace(): void {
		$this->bytes_already_parsed += strspn( $this->html, " \t\f\r\n", $this->bytes_already_parsed );
	}

	/**
	 * Applies attribute updates and cleans up once a tag is fully parsed.
	 *
	 * @since 6.2.0
	 * @ignore
	 */
	private function after_tag(): void {
		/*
		 * There could be lexical updates enqueued for an attribute that
		 * also exists on the next tag. In order to avoid conflating the
		 * attributes across the two tags, lexical updates with names
		 * need to be flushed to raw lexical updates.
		 */
		$this->class_name_updates_to_attributes_updates();

		/*
		 * Purge updates if there are too many. The actual count isn't
		 * scientific, but a few values from 100 to a few thousand were
		 * tested to find a practically-useful limit.
		 *
		 * If the update queue grows too big, then the Tag Processor
		 * will spend more time iterating through them and lose the
		 * efficiency gains of deferring applying them.
		 */
		if ( 1000 < count( $this->lexical_updates ) ) {
			$this->get_updated_html();
		}

		foreach ( $this->lexical_updates as $name => $update ) {
			/*
			 * Any updates appearing after the cursor should be applied
			 * before proceeding, otherwise they may be overlooked.
			 */
			if ( $update->start >= $this->bytes_already_parsed ) {
				$this->get_updated_html();
				break;
			}

			if ( is_int( $name ) ) {
				continue;
			}

			$this->lexical_updates[] = $update;
			unset( $this->lexical_updates[ $name ] );
		}

		$this->token_starts_at = null;
		$this->token_length = null;
		$this->has_self_closing_flag = false;
		$this->tag_name_starts_at = null;
		$this->tag_name_length = null;
		$this->text_starts_at = 0;
		$this->text_length = 0;
		$this->is_closing_tag = null;
		$this->attributes = array();
		$this->comment_type = null;
		$this->text_node_classification = self::TEXT_IS_GENERIC;
		$this->duplicate_attributes = null;
	}

	/**
	 * Converts class name updates into tag attributes updates
	 * (they are accumulated in different data formats for performance).
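	 *
	 * A minimal sketch of the effect, using illustrative markup that does not
	 * come from this file and assuming the default no-quirks mode. The flush
	 * happens internally, for example when the updated HTML is requested:
	 *
	 *     $p = new WP_HTML_Tag_Processor( '<div class="a b">' );
	 *     $p->next_tag();
	 *     $p->remove_class( 'a' );
	 *     $p->add_class( 'c' );
	 *     $p->get_updated_html() === '<div class="b c">';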
	 *
	 * @since 6.2.0
	 * @ignore
	 *
	 * @see WP_HTML_Tag_Processor::$lexical_updates
	 * @see WP_HTML_Tag_Processor::$classname_updates
	 */
	private function class_name_updates_to_attributes_updates(): void {
		if ( count( $this->classname_updates ) === 0 ) {
			return;
		}

		$existing_class = $this->get_enqueued_attribute_value( 'class' );
		if ( null === $existing_class || true === $existing_class ) {
			$existing_class = '';
		}

		if ( false === $existing_class && isset( $this->attributes['class'] ) ) {
			$existing_class = WP_HTML_Decoder::decode_attribute(
				substr(
					$this->html,
					$this->attributes['class']->value_starts_at,
					$this->attributes['class']->value_length
				)
			);
		}

		if ( false === $existing_class ) {
			$existing_class = '';
		}

		/**
		 * Updated "class" attribute value.
		 *
		 * This is incrementally built while scanning through the existing class
		 * attribute, skipping removed classes on the way, and then appending
		 * added classes at the end. Only when finished processing will the
		 * value contain the final new value.
		 *
		 * @var string $class
		 */
		$class = '';

		/**
		 * Tracks the cursor position in the existing
		 * class attribute value while parsing.
		 *
		 * @var int $at
		 */
		$at = 0;

		/**
		 * Indicates if there's any need to modify the existing class attribute.
		 *
		 * If a call to `add_class()` and `remove_class()` wouldn't impact
		 * the `class` attribute value then there's no need to rebuild it.
		 * For example, when adding a class that's already present or
		 * removing one that isn't.
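		 *
		 * A small sketch of the no-op case, with illustrative markup that
		 * is not taken from this file:
		 *
		 *     $p = new WP_HTML_Tag_Processor( '<div class="a">' );
		 *     $p->next_tag();
		 *     $p->add_class( 'a' );    // Already present: nothing to rebuild.
		 *     $p->remove_class( 'z' ); // Not present: nothing to rebuild.
		 *     $p->get_updated_html() === '<div class="a">';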
		 *
		 * This flag enables a performance optimization when none of the enqueued
		 * class updates would impact the `class` attribute; namely, that the
		 * processor can continue without modifying the input document, as if
		 * none of the `add_class()` or `remove_class()` calls had been made.
		 *
		 * This flag is set upon the first change that requires a string update.
		 *
		 * @var bool $modified
		 */
		$modified = false;

		$seen = array();
		$to_remove = array();
		$is_quirks = self::QUIRKS_MODE === $this->compat_mode;
		if ( $is_quirks ) {
			foreach ( $this->classname_updates as $updated_name => $action ) {
				if ( self::REMOVE_CLASS === $action ) {
					$to_remove[] = strtolower( $updated_name );
				}
			}
		} else {
			foreach ( $this->classname_updates as $updated_name => $action ) {
				if ( self::REMOVE_CLASS === $action ) {
					$to_remove[] = $updated_name;
				}
			}
		}

		// Remove unwanted classes by only copying the new ones.
		$existing_class_length = strlen( $existing_class );
		while ( $at < $existing_class_length ) {
			// Skip to the first non-whitespace character.
			$ws_at = $at;
			$ws_length = strspn( $existing_class, " \t\f\r\n", $ws_at );
			$at += $ws_length;

			// Capture the class name – it's everything until the next whitespace.
			$name_length = strcspn( $existing_class, " \t\f\r\n", $at );
			if ( 0 === $name_length ) {
				// If no more class names are found then that's the end.
				break;
			}

			$name = substr( $existing_class, $at, $name_length );
			$comparable_class_name = $is_quirks ? strtolower( $name ) : $name;
			$at += $name_length;

			// If this class is marked for removal, remove it and move on to the next one.
			if ( in_array( $comparable_class_name, $to_remove, true ) ) {
				$modified = true;
				continue;
			}

			// If a class has already been seen then skip it; it should not be added twice.
			if ( in_array( $comparable_class_name, $seen, true ) ) {
				continue;
			}

			$seen[] = $comparable_class_name;

			/*
			 * Otherwise, append it to the new "class" attribute value.
			 *
			 * There are options for handling whitespace between class names.
			 * Preserving the existing whitespace produces fewer changes
			 * to the HTML content and should clarify the before/after
			 * content when debugging the modified output.
			 *
			 * This approach contrasts with normalizing the inter-class
			 * whitespace to a single space, which might appear cleaner
			 * in the output HTML but produce a noisier change.
			 */
			if ( '' !== $class ) {
				$class .= substr( $existing_class, $ws_at, $ws_length );
			}
			$class .= $name;
		}

		// Add new classes by appending those which haven't already been seen.
		foreach ( $this->classname_updates as $name => $operation ) {
			$comparable_name = $is_quirks ? strtolower( $name ) : $name;
			if ( self::ADD_CLASS === $operation && ! in_array( $comparable_name, $seen, true ) ) {
				$modified = true;

				$class .= strlen( $class ) > 0 ? ' ' : '';
				$class .= $name;
			}
		}

		$this->classname_updates = array();
		if ( ! $modified ) {
			return;
		}

		if ( strlen( $class ) > 0 ) {
			$this->set_attribute( 'class', $class );
		} else {
			$this->remove_attribute( 'class' );
		}
	}

	/**
	 * Applies attribute updates to HTML document.
	 *
	 * @since 6.2.0
	 * @since 6.2.1 Accumulates shift for internal cursor and passed pointer.
	 * @since 6.3.0 Invalidate any bookmarks whose targets are overwritten.
	 * @ignore
	 *
	 * @param int $shift_this_point Accumulate and return shift for this position.
	 * @return int How many bytes the given pointer moved in response to the updates.
	 */
	private function apply_attributes_updates( int $shift_this_point ): int {
		if ( ! count( $this->lexical_updates ) ) {
			return 0;
		}

		$accumulated_shift_for_given_point = 0;

		/*
		 * Attribute updates can be enqueued in any order but updates
		 * to the document must occur in lexical order; that is, each
		 * replacement must be made before all others which follow it
		 * at later string indices in the input document.
		 *
		 * Sorting avoids making out-of-order replacements which
		 * can lead to mangled output, partially-duplicated
		 * attributes, and overwritten attributes.
		 */
		usort( $this->lexical_updates, array( self::class, 'sort_start_ascending' ) );

		$bytes_already_copied = 0;
		$output_buffer = '';
		foreach ( $this->lexical_updates as $diff ) {
			$shift = strlen( $diff->text ) - $diff->length;

			// Adjust the cursor position by however much an update affects it.
			if ( $diff->start < $this->bytes_already_parsed ) {
				$this->bytes_already_parsed += $shift;
			}

			// Accumulate shift of the given pointer within this function call.
			if ( $diff->start < $shift_this_point ) {
				$accumulated_shift_for_given_point += $shift;
			}

			$output_buffer .= substr( $this->html, $bytes_already_copied, $diff->start - $bytes_already_copied );
			$output_buffer .= $diff->text;
			$bytes_already_copied = $diff->start + $diff->length;
		}

		$this->html = $output_buffer . substr( $this->html, $bytes_already_copied );

		/*
		 * Adjust bookmark locations to account for how the text
		 * replacements adjust offsets in the input document.
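		 *
		 * As a worked example with made-up offsets: a bookmark spanning bytes
		 * 20..25 that follows an update replacing 3 bytes at offset 10 with
		 * 7 bytes of text sees a delta of 7 - 3 = 4 at both endpoints, so it
		 * moves to start at byte 24 and keeps its length of 5.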
		 */
		foreach ( $this->bookmarks as $bookmark_name => $bookmark ) {
			$bookmark_end = $bookmark->start + $bookmark->length;

			/*
			 * Each lexical update which appears before the bookmark's endpoints
			 * might shift the offsets for those endpoints. Loop through each change
			 * and accumulate the total shift for each bookmark, then apply that
			 * shift after tallying the full delta.
			 */
			$head_delta = 0;
			$tail_delta = 0;

			foreach ( $this->lexical_updates as $diff ) {
				$diff_end = $diff->start + $diff->length;

				if ( $bookmark->start < $diff->start && $bookmark_end < $diff->start ) {
					break;
				}

				if ( $bookmark->start >= $diff->start && $bookmark_end < $diff_end ) {
					$this->release_bookmark( $bookmark_name );
					continue 2;
				}

				$delta = strlen( $diff->text ) - $diff->length;

				if ( $bookmark->start >= $diff->start ) {
					$head_delta += $delta;
				}

				if ( $bookmark_end >= $diff_end ) {
					$tail_delta += $delta;
				}
			}

			$bookmark->start += $head_delta;
			$bookmark->length += $tail_delta - $head_delta;
		}

		$this->lexical_updates = array();

		return $accumulated_shift_for_given_point;
	}

	/**
	 * Checks whether a bookmark with the given name exists.
	 *
	 * @since 6.3.0
	 *
	 * @param string $bookmark_name Name to identify a bookmark that potentially exists.
	 * @return bool Whether that bookmark exists.
	 */
	public function has_bookmark( $bookmark_name ): bool {
		return array_key_exists( $bookmark_name, $this->bookmarks );
	}

	/**
	 * Move the internal cursor in the Tag Processor to a given bookmark's location.
	 *
	 * In order to prevent accidental infinite loops, there's a
	 * maximum limit on the number of times seek() can be called.
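	 *
	 * A minimal usage sketch; the markup and the bookmark name are
	 * illustrative only and do not come from this file:
	 *
	 *     $p = new WP_HTML_Tag_Processor( '<ul><li>One</li><li>Two</li></ul>' );
	 *     $p->next_tag( 'ul' );
	 *     $p->set_bookmark( 'list' );
	 *     $p->next_tag( 'li' );
	 *     $p->seek( 'list' ) === true;
	 *     $p->get_tag() === 'UL';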
	 *
	 * @since 6.2.0
	 *
	 * @param string $bookmark_name Jump to the place in the document identified by this bookmark name.
	 * @return bool Whether the internal cursor was successfully moved to the bookmark's location.
	 */
	public function seek( $bookmark_name ): bool {
		if ( ! array_key_exists( $bookmark_name, $this->bookmarks ) ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Unknown bookmark name.' ),
				'6.2.0'
			);
			return false;
		}

		$existing_bookmark = $this->bookmarks[ $bookmark_name ];

		if (
			$this->token_starts_at === $existing_bookmark->start &&
			$this->token_length === $existing_bookmark->length
		) {
			return true;
		}

		if ( ++$this->seek_count > static::MAX_SEEK_OPS ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Too many calls to seek() - this can lead to performance issues.' ),
				'6.2.0'
			);
			return false;
		}

		// Flush out any pending updates to the document.
		$this->get_updated_html();

		// Point this tag processor before the sought tag opener and consume it.
		$this->bytes_already_parsed = $this->bookmarks[ $bookmark_name ]->start;
		$this->parser_state = self::STATE_READY;
		return $this->next_token();
	}

	/**
	 * Compare two WP_HTML_Text_Replacement objects.
	 *
	 * @since 6.2.0
	 * @ignore
	 *
	 * @param WP_HTML_Text_Replacement $a First attribute update.
	 * @param WP_HTML_Text_Replacement $b Second attribute update.
	 * @return int Comparison value for string order.
	 */
	private static function sort_start_ascending( WP_HTML_Text_Replacement $a, WP_HTML_Text_Replacement $b ): int {
		$by_start = $a->start - $b->start;
		if ( 0 !== $by_start ) {
			return $by_start;
		}

		$by_text = isset( $a->text, $b->text ) ?
			strcmp( $a->text, $b->text ) : 0;
		if ( 0 !== $by_text ) {
			return $by_text;
		}

		/*
		 * This code should be unreachable, because it implies the two replacements
		 * start at the same location and contain the same text.
		 */
		return $a->length - $b->length;
	}

	/**
	 * Return the enqueued value for a given attribute, if one exists.
	 *
	 * Enqueued updates can take different data types:
	 *  - If an update is enqueued and is boolean, the return will be `true`
	 *  - If an update is otherwise enqueued, the return will be the string value of that update.
	 *  - If an attribute is enqueued to be removed, the return will be `null` to indicate that.
	 *  - If no updates are enqueued, the return will be `false` to differentiate from "removed."
	 *
	 * @since 6.2.0
	 * @ignore
	 *
	 * @param string $comparable_name The attribute name in its comparable form.
	 * @return string|boolean|null Value of enqueued update if present, otherwise false.
	 */
	private function get_enqueued_attribute_value( string $comparable_name ) {
		if ( self::STATE_MATCHED_TAG !== $this->parser_state ) {
			return false;
		}

		if ( ! isset( $this->lexical_updates[ $comparable_name ] ) ) {
			return false;
		}

		$enqueued_text = $this->lexical_updates[ $comparable_name ]->text;

		// Removed attributes erase the entire span.
		if ( '' === $enqueued_text ) {
			return null;
		}

		/*
		 * Boolean attribute updates are just the attribute name without a corresponding value.
		 *
		 * This value might differ from the given comparable name in that there could be leading
		 * or trailing whitespace, and that the casing follows the name given in `set_attribute`.
2758 * 2759 * Example: 2760 * 2761 * $p->set_attribute( 'data-TEST-id', 'update' ); 2762 * 'update' === $p->get_enqueued_attribute_value( 'data-test-id' ); 2763 * 2764 * Detect this difference based on the absence of the `=`, which _must_ exist in any 2765 * attribute containing a value, e.g. `<input type="text" enabled />`. 2766 * ¹ ² 2767 * 1. Attribute with a string value. 2768 * 2. Boolean attribute whose value is `true`. 2769 */ 2770 $equals_at = strpos( $enqueued_text, '=' ); 2771 if ( false === $equals_at ) { 2772 return true; 2773 } 2774 2775 /* 2776 * Finally, a normal update's value will appear after the `=` and 2777 * be double-quoted, as performed incidentally by `set_attribute`. 2778 * 2779 * e.g. `type="text"` 2780 * ¹² ³ 2781 * 1. Equals is here. 2782 * 2. Double-quoting starts one after the equals sign. 2783 * 3. Double-quoting ends at the last character in the update. 2784 */ 2785 $enqueued_value = substr( $enqueued_text, $equals_at + 2, -1 ); 2786 return WP_HTML_Decoder::decode_attribute( $enqueued_value ); 2787 } 2788 2789 /** 2790 * Returns the value of a requested attribute from a matched tag opener if that attribute exists. 2791 * 2792 * Example: 2793 * 2794 * $p = new WP_HTML_Tag_Processor( '<div enabled class="test" data-test-id="14">Test</div>' ); 2795 * $p->next_tag( array( 'class_name' => 'test' ) ) === true; 2796 * $p->get_attribute( 'data-test-id' ) === '14'; 2797 * $p->get_attribute( 'enabled' ) === true; 2798 * $p->get_attribute( 'aria-label' ) === null; 2799 * 2800 * $p->next_tag() === false; 2801 * $p->get_attribute( 'class' ) === null; 2802 * 2803 * @since 6.2.0 2804 * 2805 * @param string $name Name of attribute whose value is requested. 2806 * @return string|true|null Value of attribute or `null` if not available. Boolean attributes return `true`. 
2807 */ 2808 public function get_attribute( $name ) { 2809 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { 2810 return null; 2811 } 2812 2813 $comparable = strtolower( $name ); 2814 2815 /* 2816 * For every attribute other than `class` it's possible to perform a quick check if 2817 * there's an enqueued lexical update whose value takes priority over what's found in 2818 * the input document. 2819 * 2820 * The `class` attribute is special though because of the exposed helpers `add_class` 2821 * and `remove_class`. These form a builder for the `class` attribute, so an additional 2822 * check for enqueued class changes is required in addition to the check for any enqueued 2823 * attribute values. If any exist, those enqueued class changes must first be flushed out 2824 * into an attribute value update. 2825 */ 2826 if ( 'class' === $name ) { 2827 $this->class_name_updates_to_attributes_updates(); 2828 } 2829 2830 // Return any enqueued attribute value updates if they exist. 2831 $enqueued_value = $this->get_enqueued_attribute_value( $comparable ); 2832 if ( false !== $enqueued_value ) { 2833 return $enqueued_value; 2834 } 2835 2836 if ( ! isset( $this->attributes[ $comparable ] ) ) { 2837 return null; 2838 } 2839 2840 $attribute = $this->attributes[ $comparable ]; 2841 2842 /* 2843 * This flag distinguishes an attribute with no value 2844 * from an attribute with an empty string value. For 2845 * unquoted attributes this could look very similar. 2846 * It refers to whether an `=` follows the name. 2847 * 2848 * e.g. <div boolean-attribute empty-attribute=></div> 2849 * ¹ ² 2850 * 1. Attribute `boolean-attribute` is `true`. 2851 * 2. Attribute `empty-attribute` is `""`. 
2852 */ 2853 if ( true === $attribute->is_true ) { 2854 return true; 2855 } 2856 2857 $raw_value = substr( $this->html, $attribute->value_starts_at, $attribute->value_length ); 2858 2859 return WP_HTML_Decoder::decode_attribute( $raw_value ); 2860 } 2861 2862 /** 2863 * Gets lowercase names of all attributes matching a given prefix in the current tag. 2864 * 2865 * Note that matching is case-insensitive. This is in accordance with the spec: 2866 * 2867 * > There must never be two or more attributes on 2868 * > the same start tag whose names are an ASCII 2869 * > case-insensitive match for each other. 2870 * - HTML 5 spec 2871 * 2872 * Example: 2873 * 2874 * $p = new WP_HTML_Tag_Processor( '<div data-ENABLED class="test" DATA-test-id="14">Test</div>' ); 2875 * $p->next_tag( array( 'class_name' => 'test' ) ) === true; 2876 * $p->get_attribute_names_with_prefix( 'data-' ) === array( 'data-enabled', 'data-test-id' ); 2877 * 2878 * $p->next_tag() === false; 2879 * $p->get_attribute_names_with_prefix( 'data-' ) === null; 2880 * 2881 * @since 6.2.0 2882 * 2883 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive 2884 * 2885 * @param string $prefix Prefix of requested attribute names. 2886 * @return array|null List of attribute names, or `null` when no tag opener is matched. 2887 */ 2888 public function get_attribute_names_with_prefix( $prefix ): ?array { 2889 if ( 2890 self::STATE_MATCHED_TAG !== $this->parser_state || 2891 $this->is_closing_tag 2892 ) { 2893 return null; 2894 } 2895 2896 $comparable = strtolower( $prefix ); 2897 2898 $matches = array(); 2899 foreach ( array_keys( $this->attributes ) as $attr_name ) { 2900 if ( str_starts_with( $attr_name, $comparable ) ) { 2901 $matches[] = $attr_name; 2902 } 2903 } 2904 return $matches; 2905 } 2906 2907 /** 2908 * Returns the namespace of the matched token. 2909 * 2910 * @since 6.7.0 2911 * 2912 * @return string One of 'html', 'math', or 'svg'. 
2913 */ 2914 public function get_namespace(): string { 2915 return $this->parsing_namespace; 2916 } 2917 2918 /** 2919 * Returns the uppercase name of the matched tag. 2920 * 2921 * Example: 2922 * 2923 * $p = new WP_HTML_Tag_Processor( '<div class="test">Test</div>' ); 2924 * $p->next_tag() === true; 2925 * $p->get_tag() === 'DIV'; 2926 * 2927 * $p->next_tag() === false; 2928 * $p->get_tag() === null; 2929 * 2930 * @since 6.2.0 2931 * 2932 * @return string|null Name of currently matched tag in input HTML, or `null` if none found. 2933 */ 2934 public function get_tag(): ?string { 2935 if ( null === $this->tag_name_starts_at ) { 2936 return null; 2937 } 2938 2939 $tag_name = substr( $this->html, $this->tag_name_starts_at, $this->tag_name_length ); 2940 2941 if ( self::STATE_MATCHED_TAG === $this->parser_state ) { 2942 return strtoupper( $tag_name ); 2943 } 2944 2945 if ( 2946 self::STATE_COMMENT === $this->parser_state && 2947 self::COMMENT_AS_PI_NODE_LOOKALIKE === $this->get_comment_type() 2948 ) { 2949 return $tag_name; 2950 } 2951 2952 return null; 2953 } 2954 2955 /** 2956 * Returns the adjusted tag name for a given token, taking into 2957 * account the current parsing context, whether HTML, SVG, or MathML. 2958 * 2959 * @since 6.7.0 2960 * 2961 * @return string|null Name of current tag name. 
2962 */ 2963 public function get_qualified_tag_name(): ?string { 2964 $tag_name = $this->get_tag(); 2965 if ( null === $tag_name ) { 2966 return null; 2967 } 2968 2969 if ( 'html' === $this->get_namespace() ) { 2970 return $tag_name; 2971 } 2972 2973 $lower_tag_name = strtolower( $tag_name ); 2974 if ( 'math' === $this->get_namespace() ) { 2975 return $lower_tag_name; 2976 } 2977 2978 if ( 'svg' === $this->get_namespace() ) { 2979 switch ( $lower_tag_name ) { 2980 case 'altglyph': 2981 return 'altGlyph'; 2982 2983 case 'altglyphdef': 2984 return 'altGlyphDef'; 2985 2986 case 'altglyphitem': 2987 return 'altGlyphItem'; 2988 2989 case 'animatecolor': 2990 return 'animateColor'; 2991 2992 case 'animatemotion': 2993 return 'animateMotion'; 2994 2995 case 'animatetransform': 2996 return 'animateTransform'; 2997 2998 case 'clippath': 2999 return 'clipPath'; 3000 3001 case 'feblend': 3002 return 'feBlend'; 3003 3004 case 'fecolormatrix': 3005 return 'feColorMatrix'; 3006 3007 case 'fecomponenttransfer': 3008 return 'feComponentTransfer'; 3009 3010 case 'fecomposite': 3011 return 'feComposite'; 3012 3013 case 'feconvolvematrix': 3014 return 'feConvolveMatrix'; 3015 3016 case 'fediffuselighting': 3017 return 'feDiffuseLighting'; 3018 3019 case 'fedisplacementmap': 3020 return 'feDisplacementMap'; 3021 3022 case 'fedistantlight': 3023 return 'feDistantLight'; 3024 3025 case 'fedropshadow': 3026 return 'feDropShadow'; 3027 3028 case 'feflood': 3029 return 'feFlood'; 3030 3031 case 'fefunca': 3032 return 'feFuncA'; 3033 3034 case 'fefuncb': 3035 return 'feFuncB'; 3036 3037 case 'fefuncg': 3038 return 'feFuncG'; 3039 3040 case 'fefuncr': 3041 return 'feFuncR'; 3042 3043 case 'fegaussianblur': 3044 return 'feGaussianBlur'; 3045 3046 case 'feimage': 3047 return 'feImage'; 3048 3049 case 'femerge': 3050 return 'feMerge'; 3051 3052 case 'femergenode': 3053 return 'feMergeNode'; 3054 3055 case 'femorphology': 3056 return 'feMorphology'; 3057 3058 case 'feoffset': 3059 return 
'feOffset'; 3060 3061 case 'fepointlight': 3062 return 'fePointLight'; 3063 3064 case 'fespecularlighting': 3065 return 'feSpecularLighting'; 3066 3067 case 'fespotlight': 3068 return 'feSpotLight'; 3069 3070 case 'fetile': 3071 return 'feTile'; 3072 3073 case 'feturbulence': 3074 return 'feTurbulence'; 3075 3076 case 'foreignobject': 3077 return 'foreignObject'; 3078 3079 case 'glyphref': 3080 return 'glyphRef'; 3081 3082 case 'lineargradient': 3083 return 'linearGradient'; 3084 3085 case 'radialgradient': 3086 return 'radialGradient'; 3087 3088 case 'textpath': 3089 return 'textPath'; 3090 3091 default: 3092 return $lower_tag_name; 3093 } 3094 } 3095 3096 // This unnecessary return prevents tools from inaccurately reporting type errors. 3097 return $tag_name; 3098 } 3099 3100 /** 3101 * Returns the adjusted attribute name for a given attribute, taking into 3102 * account the current parsing context, whether HTML, SVG, or MathML. 3103 * 3104 * In SVG and MathML contexts, adjusted foreign attributes with a namespace 3105 * prefix use a space between the prefix and local name. For example, 3106 * `xlink:href` is returned as `xlink href`, while the unprefixed `xmlns` 3107 * attribute is returned as `xmlns`. Non-adjusted attributes with a colon in 3108 * their name, such as `foo:bar`, are returned unchanged. 3109 * 3110 * @since 6.7.0 3111 * 3112 * @param string $attribute_name Which attribute to adjust. 
3113 * 3114 * @return string|null 3115 */ 3116 public function get_qualified_attribute_name( $attribute_name ): ?string { 3117 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { 3118 return null; 3119 } 3120 3121 $namespace = $this->get_namespace(); 3122 $lower_name = strtolower( $attribute_name ); 3123 3124 if ( 'math' === $namespace && 'definitionurl' === $lower_name ) { 3125 return 'definitionURL'; 3126 } 3127 3128 if ( 'svg' === $this->get_namespace() ) { 3129 switch ( $lower_name ) { 3130 case 'attributename': 3131 return 'attributeName'; 3132 3133 case 'attributetype': 3134 return 'attributeType'; 3135 3136 case 'basefrequency': 3137 return 'baseFrequency'; 3138 3139 case 'baseprofile': 3140 return 'baseProfile'; 3141 3142 case 'calcmode': 3143 return 'calcMode'; 3144 3145 case 'clippathunits': 3146 return 'clipPathUnits'; 3147 3148 case 'diffuseconstant': 3149 return 'diffuseConstant'; 3150 3151 case 'edgemode': 3152 return 'edgeMode'; 3153 3154 case 'filterunits': 3155 return 'filterUnits'; 3156 3157 case 'glyphref': 3158 return 'glyphRef'; 3159 3160 case 'gradienttransform': 3161 return 'gradientTransform'; 3162 3163 case 'gradientunits': 3164 return 'gradientUnits'; 3165 3166 case 'kernelmatrix': 3167 return 'kernelMatrix'; 3168 3169 case 'kernelunitlength': 3170 return 'kernelUnitLength'; 3171 3172 case 'keypoints': 3173 return 'keyPoints'; 3174 3175 case 'keysplines': 3176 return 'keySplines'; 3177 3178 case 'keytimes': 3179 return 'keyTimes'; 3180 3181 case 'lengthadjust': 3182 return 'lengthAdjust'; 3183 3184 case 'limitingconeangle': 3185 return 'limitingConeAngle'; 3186 3187 case 'markerheight': 3188 return 'markerHeight'; 3189 3190 case 'markerunits': 3191 return 'markerUnits'; 3192 3193 case 'markerwidth': 3194 return 'markerWidth'; 3195 3196 case 'maskcontentunits': 3197 return 'maskContentUnits'; 3198 3199 case 'maskunits': 3200 return 'maskUnits'; 3201 3202 case 'numoctaves': 3203 return 'numOctaves'; 3204 3205 case 'pathlength': 3206 
return 'pathLength'; 3207 3208 case 'patterncontentunits': 3209 return 'patternContentUnits'; 3210 3211 case 'patterntransform': 3212 return 'patternTransform'; 3213 3214 case 'patternunits': 3215 return 'patternUnits'; 3216 3217 case 'pointsatx': 3218 return 'pointsAtX'; 3219 3220 case 'pointsaty': 3221 return 'pointsAtY'; 3222 3223 case 'pointsatz': 3224 return 'pointsAtZ'; 3225 3226 case 'preservealpha': 3227 return 'preserveAlpha'; 3228 3229 case 'preserveaspectratio': 3230 return 'preserveAspectRatio'; 3231 3232 case 'primitiveunits': 3233 return 'primitiveUnits'; 3234 3235 case 'refx': 3236 return 'refX'; 3237 3238 case 'refy': 3239 return 'refY'; 3240 3241 case 'repeatcount': 3242 return 'repeatCount'; 3243 3244 case 'repeatdur': 3245 return 'repeatDur'; 3246 3247 case 'requiredextensions': 3248 return 'requiredExtensions'; 3249 3250 case 'requiredfeatures': 3251 return 'requiredFeatures'; 3252 3253 case 'specularconstant': 3254 return 'specularConstant'; 3255 3256 case 'specularexponent': 3257 return 'specularExponent'; 3258 3259 case 'spreadmethod': 3260 return 'spreadMethod'; 3261 3262 case 'startoffset': 3263 return 'startOffset'; 3264 3265 case 'stddeviation': 3266 return 'stdDeviation'; 3267 3268 case 'stitchtiles': 3269 return 'stitchTiles'; 3270 3271 case 'surfacescale': 3272 return 'surfaceScale'; 3273 3274 case 'systemlanguage': 3275 return 'systemLanguage'; 3276 3277 case 'tablevalues': 3278 return 'tableValues'; 3279 3280 case 'targetx': 3281 return 'targetX'; 3282 3283 case 'targety': 3284 return 'targetY'; 3285 3286 case 'textlength': 3287 return 'textLength'; 3288 3289 case 'viewbox': 3290 return 'viewBox'; 3291 3292 case 'viewtarget': 3293 return 'viewTarget'; 3294 3295 case 'xchannelselector': 3296 return 'xChannelSelector'; 3297 3298 case 'ychannelselector': 3299 return 'yChannelSelector'; 3300 3301 case 'zoomandpan': 3302 return 'zoomAndPan'; 3303 } 3304 } 3305 3306 if ( 'html' !== $namespace ) { 3307 switch ( $lower_name ) { 3308 case 
'xlink:actuate': 3309 return 'xlink actuate'; 3310 3311 case 'xlink:arcrole': 3312 return 'xlink arcrole'; 3313 3314 case 'xlink:href': 3315 return 'xlink href'; 3316 3317 case 'xlink:role': 3318 return 'xlink role'; 3319 3320 case 'xlink:show': 3321 return 'xlink show'; 3322 3323 case 'xlink:title': 3324 return 'xlink title'; 3325 3326 case 'xlink:type': 3327 return 'xlink type'; 3328 3329 case 'xml:lang': 3330 return 'xml lang'; 3331 3332 case 'xml:space': 3333 return 'xml space'; 3334 3335 case 'xmlns': 3336 return 'xmlns'; 3337 3338 case 'xmlns:xlink': 3339 return 'xmlns xlink'; 3340 } 3341 } 3342 3343 return $attribute_name; 3344 } 3345 3346 /** 3347 * Indicates if the currently matched tag contains the self-closing flag. 3348 * 3349 * No HTML elements ought to have the self-closing flag and for those, the self-closing 3350 * flag will be ignored. For void elements this is benign because they "self close" 3351 * automatically. For non-void HTML elements though problems will appear if someone 3352 * intends to use a self-closing element in place of that element with an empty body. 3353 * For HTML foreign elements and custom elements the self-closing flag determines if 3354 * they self-close or not. 3355 * 3356 * This function does not determine if a tag is self-closing, 3357 * but only if the self-closing flag is present in the syntax. 3358 * 3359 * @since 6.3.0 3360 * 3361 * @return bool Whether the currently matched tag contains the self-closing flag. 3362 */ 3363 public function has_self_closing_flag(): bool { 3364 if ( self::STATE_MATCHED_TAG !== $this->parser_state ) { 3365 return false; 3366 } 3367 3368 return $this->has_self_closing_flag; 3369 } 3370 3371 /** 3372 * Indicates if the current tag token is a tag closer. 
3373 * 3374 * Example: 3375 * 3376 * $p = new WP_HTML_Tag_Processor( '<div></div>' ); 3377 * $p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) ); 3378 * $p->is_tag_closer() === false; 3379 * 3380 * $p->next_tag( array( 'tag_name' => 'div', 'tag_closers' => 'visit' ) ); 3381 * $p->is_tag_closer() === true; 3382 * 3383 * @since 6.2.0 3384 * @since 6.7.0 Reports all BR tags as opening tags. 3385 * 3386 * @return bool Whether the current tag is a tag closer. 3387 */ 3388 public function is_tag_closer(): bool { 3389 return ( 3390 self::STATE_MATCHED_TAG === $this->parser_state && 3391 $this->is_closing_tag && 3392 3393 /* 3394 * The BR tag can only exist as an opening tag. If something like `</br>` 3395 * appears then the HTML parser will treat it as an opening tag with no 3396 * attributes. The BR tag is unique in this way. 3397 * 3398 * @see https://html.spec.whatwg.org/#parsing-main-inbody 3399 */ 3400 'BR' !== $this->get_tag() 3401 ); 3402 } 3403 3404 /** 3405 * Indicates the kind of matched token, if any. 3406 * 3407 * This differs from `get_token_name()` in that it always 3408 * returns a static string indicating the type, whereas 3409 * `get_token_name()` may return values derived from the 3410 * token itself, such as a tag name or processing 3411 * instruction tag. 3412 * 3413 * Possible values: 3414 * - `#tag` when matched on a tag. 3415 * - `#text` when matched on a text node. 3416 * - `#cdata-section` when matched on a CDATA node. 3417 * - `#comment` when matched on a comment. 3418 * - `#doctype` when matched on a DOCTYPE declaration. 3419 * - `#presumptuous-tag` when matched on an empty tag closer. 3420 * - `#funky-comment` when matched on a funky comment. 3421 * 3422 * @since 6.5.0 3423 * 3424 * @return string|null What kind of token is matched, or null. 
3425 */ 3426 public function get_token_type(): ?string { 3427 switch ( $this->parser_state ) { 3428 case self::STATE_MATCHED_TAG: 3429 return '#tag'; 3430 3431 case self::STATE_DOCTYPE: 3432 return '#doctype'; 3433 3434 default: 3435 return $this->get_token_name(); 3436 } 3437 } 3438 3439 /** 3440 * Returns the node name represented by the token. 3441 * 3442 * This matches the DOM API value `nodeName`. Some values 3443 * are static, such as `#text` for a text node, while others 3444 * are dynamically generated from the token itself. 3445 * 3446 * Dynamic names: 3447 * - Uppercase tag name for tag matches. 3448 * - `html` for DOCTYPE declarations. 3449 * 3450 * Note that if the Tag Processor is not matched on a token 3451 * then this function will return `null`, either because it 3452 * hasn't yet found a token or because it reached the end 3453 * of the document without matching a token. 3454 * 3455 * @since 6.5.0 3456 * 3457 * @return string|null Name of the matched token. 3458 */ 3459 public function get_token_name(): ?string { 3460 switch ( $this->parser_state ) { 3461 case self::STATE_MATCHED_TAG: 3462 return $this->get_tag(); 3463 3464 case self::STATE_TEXT_NODE: 3465 return '#text'; 3466 3467 case self::STATE_CDATA_NODE: 3468 return '#cdata-section'; 3469 3470 case self::STATE_COMMENT: 3471 return '#comment'; 3472 3473 case self::STATE_DOCTYPE: 3474 return 'html'; 3475 3476 case self::STATE_PRESUMPTUOUS_TAG: 3477 return '#presumptuous-tag'; 3478 3479 case self::STATE_FUNKY_COMMENT: 3480 return '#funky-comment'; 3481 } 3482 3483 return null; 3484 } 3485 3486 /** 3487 * Indicates what kind of comment produced the comment node. 3488 * 3489 * Because there are different kinds of HTML syntax which produce 3490 * comments, the Tag Processor tracks and exposes this as a type 3491 * for the comment. Nominally only regular HTML comments exist as 3492 * they are commonly known, but a number of unrelated syntax errors 3493 * also produce comments. 
3494 * 3495 * @see self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT 3496 * @see self::COMMENT_AS_CDATA_LOOKALIKE 3497 * @see self::COMMENT_AS_INVALID_HTML 3498 * @see self::COMMENT_AS_HTML_COMMENT 3499 * @see self::COMMENT_AS_PI_NODE_LOOKALIKE 3500 * 3501 * @since 6.5.0 3502 * 3503 * @return string|null 3504 */ 3505 public function get_comment_type(): ?string { 3506 if ( self::STATE_COMMENT !== $this->parser_state ) { 3507 return null; 3508 } 3509 3510 return $this->comment_type; 3511 } 3512 3513 /** 3514 * Returns the text of a matched comment or null if not on a comment type node. 3515 * 3516 * This method returns the entire text content of a comment node as it 3517 * would appear in the browser. 3518 * 3519 * This differs from {@see ::get_modifiable_text()} in that certain comment 3520 * types in the HTML API cannot allow their entire comment text content to 3521 * be modified. Namely, "bogus comments" of the form `<?not allowed in html>` 3522 * will create a comment whose text content starts with `?`. Note that if 3523 * that character were modified, it would be possible to change the node 3524 * type. 3525 * 3526 * @since 6.7.0 3527 * 3528 * @return string|null The comment text as it would appear in the browser or null 3529 * if not on a comment type node. 
3530 */ 3531 public function get_full_comment_text(): ?string { 3532 if ( self::STATE_FUNKY_COMMENT === $this->parser_state ) { 3533 return $this->get_modifiable_text(); 3534 } 3535 3536 if ( self::STATE_COMMENT !== $this->parser_state ) { 3537 return null; 3538 } 3539 3540 switch ( $this->get_comment_type() ) { 3541 case self::COMMENT_AS_HTML_COMMENT: 3542 case self::COMMENT_AS_ABRUPTLY_CLOSED_COMMENT: 3543 return $this->get_modifiable_text(); 3544 3545 case self::COMMENT_AS_CDATA_LOOKALIKE: 3546 return "[CDATA[{$this->get_modifiable_text()}]]"; 3547 3548 case self::COMMENT_AS_PI_NODE_LOOKALIKE: 3549 return "?{$this->get_tag()}{$this->get_modifiable_text()}?"; 3550 3551 /* 3552 * This represents "bogus comments state" from HTML tokenization. 3553 * This can be entered by `<?` or `<!`, where `?` is included in 3554 * the comment text but `!` is not. 3555 */ 3556 case self::COMMENT_AS_INVALID_HTML: 3557 $preceding_character = $this->html[ $this->text_starts_at - 1 ]; 3558 $comment_start = '?' === $preceding_character ? '?' : ''; 3559 return "{$comment_start}{$this->get_modifiable_text()}"; 3560 } 3561 3562 return null; 3563 } 3564 3565 /** 3566 * Subdivides a matched text node, splitting NULL byte sequences and decoded whitespace as 3567 * distinct nodes prefixes. 3568 * 3569 * Note that once anything that's neither a NULL byte nor decoded whitespace is 3570 * encountered, then the remainder of the text node is left intact as generic text. 3571 * 3572 * - The HTML Processor uses this to apply distinct rules for different kinds of text. 3573 * - Inter-element whitespace can be detected and skipped with this method. 3574 * 3575 * Text nodes aren't eagerly subdivided because there's no need to split them unless 3576 * decisions are being made on NULL byte sequences or whitespace-only text. 3577 * 3578 * Example: 3579 * 3580 * $processor = new WP_HTML_Tag_Processor( "\x00Apples & Oranges" ); 3581 * true === $processor->next_token(); // Text is "Apples & Oranges". 
3582 * true === $processor->subdivide_text_appropriately(); // Text is "". 3583 * true === $processor->next_token(); // Text is "Apples & Oranges". 3584 * false === $processor->subdivide_text_appropriately(); 3585 * 3586 * $processor = new WP_HTML_Tag_Processor( "&#xD; \r\n\tMore" ); 3587 * true === $processor->next_token(); // Text is "␍ ␊␉More". 3588 * true === $processor->subdivide_text_appropriately(); // Text is "␍ ␊␉". 3589 * true === $processor->next_token(); // Text is "More". 3590 * false === $processor->subdivide_text_appropriately(); 3591 * 3592 * @since 6.7.0 3593 * 3594 * @return bool Whether the text node was subdivided. 3595 */ 3596 public function subdivide_text_appropriately(): bool { 3597 if ( self::STATE_TEXT_NODE !== $this->parser_state ) { 3598 return false; 3599 } 3600 3601 $this->text_node_classification = self::TEXT_IS_GENERIC; 3602 3603 /* 3604 * NULL bytes are treated categorically different than numeric character 3605 * references whose number is zero. `&#x00;` is not the same as `"\x00"`. 3606 */ 3607 $leading_nulls = strspn( $this->html, "\x00", $this->text_starts_at, $this->text_length ); 3608 if ( $leading_nulls > 0 ) { 3609 $this->token_length = $leading_nulls; 3610 $this->text_length = $leading_nulls; 3611 $this->bytes_already_parsed = $this->token_starts_at + $leading_nulls; 3612 $this->text_node_classification = self::TEXT_IS_NULL_SEQUENCE; 3613 return true; 3614 } 3615 3616 /* 3617 * Start a decoding loop to determine the point at which the 3618 * text subdivides. This entails raw whitespace bytes and any 3619 * character reference that decodes to the same.
3620 */ 3621 $at = $this->text_starts_at; 3622 $end = $this->text_starts_at + $this->text_length; 3623 while ( $at < $end ) { 3624 $skipped = strspn( $this->html, " \t\f\r\n", $at, $end - $at ); 3625 $at += $skipped; 3626 3627 if ( $at < $end && '&' === $this->html[ $at ] ) { 3628 $matched_byte_length = null; 3629 $replacement = WP_HTML_Decoder::read_character_reference( 'data', $this->html, $at, $matched_byte_length ); 3630 if ( isset( $replacement ) && 1 === strspn( $replacement, " \t\f\r\n" ) ) { 3631 $at += $matched_byte_length; 3632 continue; 3633 } 3634 } 3635 3636 break; 3637 } 3638 3639 if ( $at > $this->text_starts_at ) { 3640 $new_length = $at - $this->text_starts_at; 3641 $this->text_length = $new_length; 3642 $this->token_length = $new_length; 3643 $this->bytes_already_parsed = $at; 3644 $this->text_node_classification = self::TEXT_IS_WHITESPACE; 3645 return true; 3646 } 3647 3648 return false; 3649 } 3650 3651 /** 3652 * Returns the modifiable text for a matched token, or an empty string. 3653 * 3654 * Modifiable text is text content that may be read and changed without 3655 * changing the HTML structure of the document around it. This includes 3656 * the contents of `#text` nodes in the HTML as well as the inner 3657 * contents of HTML comments, Processing Instructions, and others, even 3658 * though these nodes aren't part of a parsed DOM tree. They also contain 3659 * the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any 3660 * other section in an HTML document which cannot contain HTML markup (DATA). 3661 * 3662 * If a token has no modifiable text then an empty string is returned to 3663 * avoid needless crashing or type errors. An empty string does not mean 3664 * that a token has modifiable text, and a token with modifiable text may 3665 * have an empty string (e.g. a comment with no contents). 
3666 * 3667 * Limitations: 3668 * 3669 * - This function will not strip the leading newline appropriately 3670 * after seeking into a LISTING or PRE element. To ensure that the 3671 * newline is treated properly, seek to the LISTING or PRE opening 3672 * tag instead of to the first text node inside the element. 3673 * 3674 * @since 6.5.0 3675 * @since 6.7.0 Replaces NULL bytes (U+0000) and newlines appropriately. 3676 * 3677 * @return string 3678 */ 3679 public function get_modifiable_text(): string { 3680 $has_enqueued_update = isset( $this->lexical_updates['modifiable text'] ); 3681 3682 if ( ! $has_enqueued_update && ( null === $this->text_starts_at || 0 === $this->text_length ) ) { 3683 return ''; 3684 } 3685 3686 $text = $has_enqueued_update 3687 ? $this->lexical_updates['modifiable text']->text 3688 : substr( $this->html, $this->text_starts_at, $this->text_length ); 3689 3690 /* 3691 * Pre-processing the input stream would normally happen before 3692 * any parsing is done, but deferring it means it's possible to 3693 * skip in most cases. When getting the modifiable text, however 3694 * it's important to apply the pre-processing steps, which is 3695 * normalizing newlines. 3696 * 3697 * @see https://html.spec.whatwg.org/#preprocessing-the-input-stream 3698 * @see https://infra.spec.whatwg.org/#normalize-newlines 3699 */ 3700 $text = str_replace( "\r\n", "\n", $text ); 3701 $text = str_replace( "\r", "\n", $text ); 3702 3703 // Comment data is not decoded. 3704 if ( 3705 self::STATE_CDATA_NODE === $this->parser_state || 3706 self::STATE_COMMENT === $this->parser_state || 3707 self::STATE_DOCTYPE === $this->parser_state || 3708 self::STATE_FUNKY_COMMENT === $this->parser_state 3709 ) { 3710 return str_replace( "\x00", "\u{FFFD}", $text ); 3711 } 3712 3713 $tag_name = $this->get_token_name(); 3714 if ( 3715 // Script data is not decoded. 3716 'SCRIPT' === $tag_name || 3717 3718 // RAWTEXT data is not decoded. 
3719 'IFRAME' === $tag_name || 3720 'NOEMBED' === $tag_name || 3721 'NOFRAMES' === $tag_name || 3722 'STYLE' === $tag_name || 3723 'XMP' === $tag_name 3724 ) { 3725 return str_replace( "\x00", "\u{FFFD}", $text ); 3726 } 3727 3728 $decoded = WP_HTML_Decoder::decode_text_node( $text ); 3729 3730 /* 3731 * Skip the first line feed after LISTING, PRE, and TEXTAREA opening tags. 3732 * 3733 * Note that this first newline may come in the form of a character 3734 * reference, such as `&#x0a;`, and so it's important to perform 3735 * this transformation only after decoding the raw text content. 3736 */ 3737 if ( 3738 ( "\n" === ( $decoded[0] ?? '' ) ) && 3739 ( ( $this->skip_newline_at === $this->token_starts_at && '#text' === $tag_name ) || 'TEXTAREA' === $tag_name ) 3740 ) { 3741 $decoded = substr( $decoded, 1 ); 3742 } 3743 3744 /* 3745 * Only in normative text nodes does the NULL byte (U+0000) get removed. 3746 * In all other contexts it's replaced by the replacement character (U+FFFD) 3747 * for security reasons (to avoid joining together strings that were safe 3748 * when separated, but not when joined). 3749 * 3750 * @todo Inside HTML integration points and MathML integration points, the 3751 * text is processed according to the insertion mode, not according 3752 * to the foreign content rules. This should strip the NULL bytes. 3753 */ 3754 return ( '#text' === $tag_name && 'html' === $this->get_namespace() ) 3755 ? str_replace( "\x00", '', $decoded ) 3756 : str_replace( "\x00", "\u{FFFD}", $decoded ); 3757 } 3758 3759 /** 3760 * Sets the modifiable text for the matched token, if matched. 3761 * 3762 * Modifiable text is text content that may be read and changed without 3763 * changing the HTML structure of the document around it. This includes 3764 * the contents of `#text` nodes in the HTML as well as the inner 3765 * contents of HTML comments, Processing Instructions, and others, even 3766 * though these nodes aren't part of a parsed DOM tree. They also contain 3767 * the contents of SCRIPT and STYLE tags, of TEXTAREA tags, and of any 3768 * other section in an HTML document which cannot contain HTML markup (DATA). 3769 * 3770 * Not all modifiable text may be set by this method, and not all content 3771 * may be set as modifiable text. In the case that this fails it will return 3772 * `false` indicating as much.
For instance, if the contents of a SCRIPT 3773 * element are neither JavaScript nor JSON, it’s not possible to guarantee 3774 * that escaping strings like `</script>` won’t break the script; in these 3775 * cases, updates will be rejected and it’s up to calling code to perform 3776 * language-specific escaping or workarounds. Similarly, it will not allow 3777 * setting content into a comment which would prematurely terminate the comment. 3778 * 3779 * Example: 3780 * 3781 * // Add a preface to all STYLE contents. 3782 * while ( $processor->next_tag( 'STYLE' ) ) { 3783 * $style = $processor->get_modifiable_text(); 3784 * $processor->set_modifiable_text( "// Made with love on the World Wide Web\n{$style}" ); 3785 * } 3786 * 3787 * // Replace smiley text with Emoji smilies. 3788 * while ( $processor->next_token() ) { 3789 * if ( '#text' !== $processor->get_token_name() ) { 3790 * continue; 3791 * } 3792 * 3793 * $chunk = $processor->get_modifiable_text(); 3794 * if ( ! str_contains( $chunk, ':)' ) ) { 3795 * continue; 3796 * } 3797 * 3798 * $processor->set_modifiable_text( str_replace( ':)', '🙂', $chunk ) ); 3799 * } 3800 * 3801 * This function handles all necessary HTML encoding. Provide normal, unescaped string values. 3802 * The HTML API will encode the strings appropriately so that the browser will interpret them 3803 * as the intended value. 3804 * 3805 * Example: 3806 * 3807 * // Renders as “Eggs & Milk” in a browser, encoded as `<p>Eggs &amp; Milk</p>`. 3808 * $processor->set_modifiable_text( 'Eggs & Milk' ); 3809 * 3810 * // Renders as “Eggs &amp; Milk” in a browser, encoded as `<p>Eggs &amp;amp; Milk</p>`. 3811 * $processor->set_modifiable_text( 'Eggs &amp; Milk' ); 3812 * 3813 * @since 6.7.0 3814 * @since 6.9.0 Escapes all character references instead of trying to avoid double-escaping. 3815 * 3816 * @param string $plaintext_content New text content to represent in the matched token. 3817 * @return bool Whether the text was able to update.
	 */
	public function set_modifiable_text( string $plaintext_content ): bool {
		if ( self::STATE_TEXT_NODE === $this->parser_state ) {
			$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
				$this->text_starts_at,
				$this->text_length,
				strtr(
					$plaintext_content,
					array(
						'<' => '&lt;',
						'>' => '&gt;',
						'&' => '&amp;',
						'"' => '&quot;',
						"'" => '&apos;',
					)
				)
			);

			return true;
		}

		// Comment data is not encoded.
		if (
			self::STATE_COMMENT === $this->parser_state &&
			self::COMMENT_AS_HTML_COMMENT === $this->comment_type
		) {
			// Check if the text could close the comment.
			if ( 1 === preg_match( '/--!?>/', $plaintext_content ) ) {
				return false;
			}

			$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
				$this->text_starts_at,
				$this->text_length,
				$plaintext_content
			);

			return true;
		}

		/*
		 * The rest of this function handles modifiable text for special "atomic" HTML elements.
		 * Only tags in the HTML namespace should be processed.
		 */
		if (
			self::STATE_MATCHED_TAG !== $this->parser_state ||
			'html' !== $this->get_namespace()
		) {
			return false;
		}

		switch ( $this->get_tag() ) {
			case 'SCRIPT':
				$script_content_type = $this->get_script_content_type();

				switch ( $script_content_type ) {
					case 'javascript':
					case 'json':
						$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
							$this->text_starts_at,
							$this->text_length,
							self::escape_javascript_script_contents( $plaintext_content )
						);
						return true;
				}

				/*
				 * If the script’s content type isn’t recognized and understandable then it’s
				 * impossible to guarantee that escaping the content won’t cause runtime breakage.
				 * For instance, if the script content type were PHP code then escaping with
				 * `\u0073` would not be met by unescaping; rather, it could result in corrupted
				 * data or even syntax errors.
				 *
				 * Because of this, content which could potentially modify the SCRIPT tag’s
				 * HTML structure is rejected here. It’s the responsibility of calling code to
				 * perform whatever semantic escaping is necessary to avoid problematic strings.
				 */
				if (
					false !== stripos( $plaintext_content, '<script' ) ||
					false !== stripos( $plaintext_content, '</script' )
				) {
					return false;
				}
				$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
					$this->text_starts_at,
					$this->text_length,
					$plaintext_content
				);
				return true;

			case 'STYLE':
				// Replace the "</" of any closing STYLE tag with CSS hex escapes.
				$plaintext_content = preg_replace_callback(
					'~</(?P<TAG_NAME>style)~i',
					static function ( $tag_match ) {
						return "\\3c\\2f{$tag_match['TAG_NAME']}";
					},
					$plaintext_content
				);

				$this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement(
					$this->text_starts_at,
					$this->text_length,
					$plaintext_content
				);

				return true;

			case 'TEXTAREA':
			case 'TITLE':
				// Encode the "<" of any closing tag that would end this element early.
				$plaintext_content = preg_replace_callback(
					"~</(?P<TAG_NAME>{$this->get_tag()})~i",
					static function ( $tag_match ) {
						return "&lt;/{$tag_match['TAG_NAME']}";
					},
					$plaintext_content
				);

				/*
				 * HTML ignores a single leading newline in this context. If a leading newline
				 * is intended, preserve it by adding an extra newline.
3938 */ 3939 if ( 3940 'TEXTAREA' === $this->get_tag() && 3941 1 === strspn( $plaintext_content, "\n\r", 0, 1 ) 3942 ) { 3943 $plaintext_content = "\n{$plaintext_content}"; 3944 } 3945 3946 /* 3947 * These don't _need_ to be escaped, but since they are decoded it's 3948 * safe to leave them escaped and this can prevent other code from 3949 * naively detecting tags within the contents. 3950 * 3951 * @todo It would be useful to prefix a multiline replacement text 3952 * with a newline, but not necessary. This is for aesthetics. 3953 */ 3954 $this->lexical_updates['modifiable text'] = new WP_HTML_Text_Replacement( 3955 $this->text_starts_at, 3956 $this->text_length, 3957 $plaintext_content 3958 ); 3959 3960 return true; 3961 } 3962 3963 return false; 3964 } 3965 3966 /** 3967 * Returns the content type of the currently-matched HTML SCRIPT tag, if matched and 3968 * recognized, otherwise returns `null` to indicate an unrecognized content type. 3969 * 3970 * An HTML SCRIPT tag is a normal SCRIPT tag, but there can be SCRIPT elements inside 3971 * SVG and MathML elements as well, and these have different parsing rules than those 3972 * in general HTML. For this reason, no content-type inference is performed on those. 3973 * 3974 * Note! This concept is related but distinct from the MIME type of the script. 3975 * Parsing MUST match the specific algorithm in the HTML specification, which 3976 * relies on exact string comparison in some cases. MIME type decoding may be 3977 * performed on SVG or MathML SCRIPT tags. 3978 * 3979 * Only 'javascript' and 'json' content types are currently recognized. 3980 * 3981 * @see https://html.spec.whatwg.org/multipage/scripting.html#prepare-the-script-element 3982 * 3983 * @since 7.0.0 3984 * @ignore 3985 * 3986 * @return 'javascript'|'json'|null Type of script element content if matched and recognized. 3987 */ 3988 private function get_script_content_type(): ?string { 3989 // SVG and MathML SCRIPT elements are not recognized. 
3990 if ( 'SCRIPT' !== $this->get_tag() || $this->get_namespace() !== 'html' ) { 3991 return null; 3992 } 3993 3994 /* 3995 * > If any of the following are true: 3996 * > - el has a type attribute whose value is the empty string; 3997 * > - el has no type attribute but it has a language attribute and that attribute's 3998 * > value is the empty string; or 3999 * > - el has neither a type attribute nor a language attribute, 4000 * > then let the script block's type string for this script element be "text/javascript". 4001 */ 4002 $type = $this->get_attribute( 'type' ); 4003 $lang = $this->get_attribute( 'language' ); 4004 4005 if ( true === $type || '' === $type ) { 4006 return 'javascript'; 4007 } 4008 4009 if ( null === $type && ( null === $lang || true === $lang || '' === $lang ) ) { 4010 return 'javascript'; 4011 } 4012 4013 /* 4014 * > Otherwise, if el has a type attribute, then let the script block's type string be 4015 * > the value of that attribute with leading and trailing ASCII whitespace stripped. 4016 * > Otherwise, el has a non-empty language attribute; let the script block's type string 4017 * > be the concatenation of "text/" and the value of el's language attribute. 4018 */ 4019 $type_string = is_string( $type ) ? trim( $type, " \t\f\r\n" ) : "text/{$lang}"; 4020 4021 // All matches are ASCII case-insensitive; eagerly lower-case for comparison. 4022 $type_string = strtolower( $type_string ); 4023 4024 /* 4025 * > If the script block's type string is a JavaScript MIME type essence match, then 4026 * > set el's type to "classic". 4027 * 4028 * > A string is a JavaScript MIME type essence match if it is an ASCII case-insensitive 4029 * > match for one of the JavaScript MIME type essence strings. 
4030 * 4031 * > A JavaScript MIME type is any MIME type whose essence is one of the following: 4032 * > 4033 * > - application/ecmascript 4034 * > - application/javascript 4035 * > - application/x-ecmascript 4036 * > - application/x-javascript 4037 * > - text/ecmascript 4038 * > - text/javascript 4039 * > - text/javascript1.0 4040 * > - text/javascript1.1 4041 * > - text/javascript1.2 4042 * > - text/javascript1.3 4043 * > - text/javascript1.4 4044 * > - text/javascript1.5 4045 * > - text/jscript 4046 * > - text/livescript 4047 * > - text/x-ecmascript 4048 * > - text/x-javascript 4049 * 4050 * @see https://mimesniff.spec.whatwg.org/#javascript-mime-type-essence-match 4051 * @see https://mimesniff.spec.whatwg.org/#javascript-mime-type 4052 */ 4053 switch ( $type_string ) { 4054 case 'application/ecmascript': 4055 case 'application/javascript': 4056 case 'application/x-ecmascript': 4057 case 'application/x-javascript': 4058 case 'text/ecmascript': 4059 case 'text/javascript': 4060 case 'text/javascript1.0': 4061 case 'text/javascript1.1': 4062 case 'text/javascript1.2': 4063 case 'text/javascript1.3': 4064 case 'text/javascript1.4': 4065 case 'text/javascript1.5': 4066 case 'text/jscript': 4067 case 'text/livescript': 4068 case 'text/x-ecmascript': 4069 case 'text/x-javascript': 4070 return 'javascript'; 4071 4072 /* 4073 * > Otherwise, if the script block's type string is an ASCII case-insensitive match for 4074 * > the string "module", then set el's type to "module". 4075 * 4076 * A module is evaluated as JavaScript. 4077 */ 4078 case 'module': 4079 return 'javascript'; 4080 4081 /* 4082 * > Otherwise, if the script block's type string is an ASCII case-insensitive match for the string "importmap", then set el's type to "importmap". 4083 * > Otherwise, if the script block's type string is an ASCII case-insensitive match for the string "speculationrules", then set el's type to "speculationrules". 4084 * 4085 * These conditions indicate JSON content. 
4086 */ 4087 case 'importmap': 4088 case 'speculationrules': 4089 return 'json'; 4090 4091 /** @todo Rely on a full MIME parser for determining JSON content. */ 4092 case 'application/json': 4093 case 'text/json': 4094 return 'json'; 4095 } 4096 4097 /* 4098 * > Otherwise, return. (No script is executed, and el's type is left as null.) 4099 */ 4100 return null; 4101 } 4102 4103 /** 4104 * Escape JavaScript and JSON script tag contents. 4105 * 4106 * Ensure that the script contents cannot modify the HTML structure or break out 4107 * of its containing SCRIPT element. JavaScript and JSON may both be escaped with 4108 * the same rules, even though there are additional escaping measures available 4109 * to JavaScript source code which aren’t applicable to serialized JSON data. 4110 * 4111 * A simple method safely escapes all content except for a few extremely rare and 4112 * unlikely exceptions: prevent the appearance of `<script` and `</script` within 4113 * the contents by replacing the first letter of the tag name with a Unicode escape. 4114 * 4115 * Example: 4116 * 4117 * $plaintext = '<script>document.write( "A </script> closes a script." );</script>'; 4118 * $escaped = '<script>document.write( "A </\u0073cript> closes a script." );</script>'; 4119 * 4120 * This works because of how parsing changes after encountering an opening SCRIPT 4121 * tag. The actual parsing comprises a complicated state machine, the result of 4122 * legacy behaviors and diverse browser support. However, without these two strings 4123 * in the script contents, two key things are ensured: `</script>` cannot appear to 4124 * prematurely close the tag, and the problematic double-escaped state becomes 4125 * unreachable. A JavaScript engine or JSON decoder will then decode the Unicode 4126 * escape (`\u0073`) back into its original plaintext value, but only after having 4127 * been safely extracted from the HTML. 
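	 *
	 * As a minimal sketch of that final decoding step (an illustrative addition which
	 * assumes PHP's built-in `json_decode()` stands in for the JSON decoder), an escaped
	 * JSON string decodes back to its original plaintext value:
	 *
	 *     // Single quotes keep the backslash literal, so the JSON text contains `\u0073`.
	 *     $decoded = json_decode( '"</\u0073cript>"' );
	 *     // Here `'</script>' === $decoded` holds.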
4128 * 4129 * While it may seem tempting to replace the `<` character instead, doing so would 4130 * break JavaScript syntax. The `<` character is used in comparison operators and 4131 * other JavaScript syntax; replacing it would break valid JavaScript. Replacing 4132 * only the `s` in `<script` and `</script` avoids modifying JavaScript syntax. 4133 * 4134 * ### Exceptions 4135 * 4136 * This _should_ work everywhere, but there are some extreme exceptions. 4137 * 4138 * - Comments. 4139 * - Tagged templates, such as `String.raw()`, which provide access to “raw” strings. 4140 * - The `source` property of a RegExp object. 4141 * 4142 * Each of these exceptions appear at the source code level, not at the semantic or 4143 * evaluation level. Normal JavaScript will remain semantically equivalent after escaping, 4144 * but any JavaScript which analyzes the raw source code will see potentially-different 4145 * values. 4146 * 4147 * #### Comments 4148 * 4149 * Comments are never unescaped because they aren’t parsed by the JavaScript engine. 4150 * When viewing the source in a browser’s developer tools, the comments will retain 4151 * their escaped text. 4152 * 4153 * Example: 4154 * 4155 * // A comment: "</script>" 4156 * …becomes… 4157 * // A comment: "</\u0073cript>" 4158 * 4159 * #### Tagged templates. 4160 * 4161 * Tagged templates “enable the embedding of arbitrary string content, where escape 4162 * sequences may follow a different syntax.” For example, they can aid representing 4163 * a RegExp pattern or LaTex snippet within a JavaScript string, where the string 4164 * escape characters might get noisy and distracting. 4165 * 4166 * Example: 4167 * 4168 * console.log( 'A \notin B' ); // Prints a newline because of the "\n". 4169 * console.log( 'A \\notin B' ); // Prints "A \notin B". 4170 * console.log( String.raw`A \notin B` ); // Prints "A \notin B". 
4171 * 4172 * This means that if `<script` transforms into `<\u0073cript` _inside_ a raw string 4173 * or tagged template literal which relies on its `.raw` property, the output of the 4174 * code will be different after escaping. 4175 * 4176 * Example: 4177 * 4178 * console.log( String.raw`</script>` ); // Prematurely closes the SCRIPT element. 4179 * console.log( String.raw`</\u0073cript>` ); // Prints "</\u0073cript". 4180 * 4181 * #### RegExp sources. 4182 * 4183 * The RegExp object exposes its raw source in a similar way to how tagged templates and raw 4184 * strings do. Thankfully, because escape sequences are decoded when compiling the pattern, 4185 * escaped RegExp patterns will match the same way as the plaintext sequences would. 4186 * 4187 * Example: 4188 * 4189 * true === /<script>/.test( '<script>' ); 4190 * true === /<\u0073cript>/.test( '<script>' ); 4191 * 4192 * However, as with raw strings, any code which reads the source will see the escaped value 4193 * instead of the decoded one. 4194 * 4195 * Example: 4196 * 4197 * console.log( /<script>/.source ); // Prints "<script>". 4198 * console.log( /<\u0073cript>/.source ); // Prints "<\u0073cript>". 4199 * 4200 * #### Unsupported escaping. 4201 * 4202 * It is not possible to properly represent every possible JavaScript source file 4203 * inside a SCRIPT element. As with CSS stylesheets, SVG images, and MathML, the 4204 * only 100% reliable way to represent all possible inputs is to link to external 4205 * files of the given content-type. 4206 * 4207 * In some cases it’s possible to manually prevent escaping issues. These are not 4208 * automatically handled by this function because doing so would require a full 4209 * JavaScript tokenizer. Consider the following example listing various ways to 4210 * manually escape a closing script tag. 4211 * 4212 * Example: 4213 * 4214 * console.log( String.raw`</script>` ); // !!UNSAFE!! Will be escaped. 
4215 * console.log( String.raw`</\u0073cript>` ); // "</\u0073cript>" 4216 * console.log( String.raw`</scr` + String.raw`ipt>` ); // "</script>" 4217 * console.log( String.raw`</${"script"}>` ); // "</script>" 4218 * console.log( '</scr' + 'ipt>' ); // "</script>" 4219 * console.log( "\x3C/script>" ); // "</script>" 4220 * console.log( "<\/script>" ); // "</script>" 4221 * 4222 * The following graph is a simplified interpretation of how HTML interprets the contents 4223 * of a SCRIPT tag and identifies the closing tag. It is useful to understand what text 4224 * is dangerous inside of a SCRIPT tag and why different approaches to escaping work. 4225 * 4226 * Open script 4227 * │ 4228 * ▼ 4229 * ╔═════════════════════════════════════════╗ <!--(…)> 4230 * ║ ║ (all dashes) 4231 * ║ script ╟────────────────╮ 4232 * ║ data ║ │ 4233 * ╭───────────╢ ║ ◀──────────────╯ 4234 * │ ╚═╤═══════════════════════════════════════╝ 4235 * │ │ ▲ ▲ 4236 * │ │ <!-- │ --> ╰─────╮ 4237 * │ ▼ │ │ 4238 * │ ┌─────────────────┴───────────────────────┐ │ 4239 * │ </script¹ │ escaped │ │ 4240 * │ └─┬─────────────────────────────┬─────────┘ │ 4241 * │ │ ▲ │ │ --> 4242 * │ │ </script¹ │ </script¹ │ <script¹ │ 4243 * │ ▼ │ ▼ │ 4244 * │ ╔══════════════╗ │ ┌───────────┐ │ 4245 * │ ║ Close script ║ │ │ double │ │ 4246 * ╰──────────▶║ ║ ╰───────────┤ escaped ├──╯ 4247 * ╚══════════════╝ └───────────┘ 4248 * 4249 * ¹ = Case insensitive 'script' followed by one of ' \t\f\r\n/>', known 4250 * as “tag-name-terminating characters.” This sequence forms the start 4251 * of what could be a SCRIPT opening or closing tag. 
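	 *
	 * A few illustrative inputs, traced by hand against the loop in this method
	 * (a sketch for orientation, not an exhaustive specification):
	 *
	 *     <script>    becomes  <\u0073cript>    (the ">" terminates the tag name)
	 *     </SCRIPT >  becomes  </\u0053CRIPT >  (an upper-case "S" is replaced with \u0053)
	 *     <scripted>  is left unchanged         ("e" is not a tag-name-terminating character)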
4252 * 4253 * @see https://html.spec.whatwg.org/#restrictions-for-contents-of-script-elements 4254 * @see https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Template_literals#specifications 4255 * @see wp_html_api_script_element_escaping_diagram_source() 4256 * 4257 * @since 7.0.0 4258 * @ignore 4259 * 4260 * @param string $sourcecode Raw contents intended to be serialized into an HTML SCRIPT element. 4261 * @return string Escaped form of input contents which will not lead to premature closing of the containing SCRIPT element. 4262 */ 4263 private static function escape_javascript_script_contents( string $sourcecode ): string { 4264 $at = 0; 4265 $was_at = 0; 4266 $end = strlen( $sourcecode ); 4267 $escaped = ''; 4268 4269 /* 4270 * Replace all instances of the ASCII case-insensitive match of "<script" 4271 * and "</script", when followed by whitespace or "/" or ">", by using a 4272 * character replacement for the "s" (or the "S"). 4273 */ 4274 while ( $at < $end ) { 4275 $tag_at = strpos( $sourcecode, '<', $at ); 4276 if ( false === $tag_at ) { 4277 break; 4278 } 4279 4280 $tag_name_at = $tag_at + 1; 4281 $has_closing_slash = $tag_name_at < $end && '/' === $sourcecode[ $tag_name_at ]; 4282 $tag_name_at += $has_closing_slash ? 1 : 0; 4283 4284 if ( 0 !== substr_compare( $sourcecode, 'script', $tag_name_at, 6, true ) ) { 4285 $at = $tag_at + 1; 4286 continue; 4287 } 4288 4289 if ( 1 !== strspn( $sourcecode, " \t\f\r\n/>", $tag_name_at + 6, 1 ) ) { 4290 $at = $tag_name_at + 6; 4291 continue; 4292 } 4293 4294 $escaped .= substr( $sourcecode, $was_at, $tag_name_at - $was_at ); 4295 $escaped .= 's' === $sourcecode[ $tag_name_at ] ? 
'\u0073' : '\u0053';
			$was_at   = $tag_name_at + 1;
			$at       = $tag_name_at + 7;
		}

		if ( '' === $escaped ) {
			return $sourcecode;
		}

		if ( $was_at < $end ) {
			$escaped .= substr( $sourcecode, $was_at );
		}

		return $escaped;
	}

	/**
	 * Updates or creates a new attribute on the currently matched tag with the passed value.
	 *
	 * This function handles all necessary HTML encoding. Provide normal, unescaped string values.
	 * The HTML API will encode the strings appropriately so that the browser will interpret them
	 * as the intended value.
	 *
	 * Example:
	 *
	 *     // Renders “Eggs & Milk” in a browser, encoded as `<abbr title="Eggs &amp; Milk">`.
	 *     $processor->set_attribute( 'title', 'Eggs & Milk' );
	 *
	 *     // Renders “Eggs &amp; Milk” in a browser, encoded as `<abbr title="Eggs &amp;amp; Milk">`.
	 *     $processor->set_attribute( 'title', 'Eggs &amp; Milk' );
	 *
	 *     // Renders `true` as `<abbr title>`.
	 *     $processor->set_attribute( 'title', true );
	 *
	 *     // Renders without the attribute for `false` as `<abbr>`.
	 *     $processor->set_attribute( 'title', false );
	 *
	 * Special handling is provided for boolean attribute values:
	 *  - When `true` is passed as the value, then only the attribute name is added to the tag.
	 *  - When `false` is passed, the attribute gets removed if it existed before.
	 *
	 * @since 6.2.0
	 * @since 6.2.1 Fix: Only create a single update for multiple calls with case-variant attribute names.
	 * @since 6.9.0 Escapes all character references instead of trying to avoid double-escaping.
	 *
	 * @param string      $name  The attribute name to target.
	 * @param string|bool $value The new attribute value.
	 * @return bool Whether an attribute value was set.
4343 */ 4344 public function set_attribute( $name, $value ): bool { 4345 if ( 4346 self::STATE_MATCHED_TAG !== $this->parser_state || 4347 $this->is_closing_tag 4348 ) { 4349 return false; 4350 } 4351 4352 $name_length = strlen( $name ); 4353 4354 /** 4355 * WordPress rejects more characters than are strictly forbidden 4356 * in HTML5. This is to prevent additional security risks deeper 4357 * in the WordPress and plugin stack. Specifically the following 4358 * are not allowed to be set as part of an HTML attribute name: 4359 * 4360 * - greater-than “>” 4361 * - ampersand “&” 4362 * 4363 * @see https://html.spec.whatwg.org/#attributes-2 4364 */ 4365 if ( 4366 0 === $name_length || 4367 // Syntax-like characters. 4368 strcspn( $name, '"\'>&</ =' ) !== $name_length || 4369 // Control characters. 4370 strcspn( 4371 $name, 4372 "\x00\x01\x02\x03\x04\x05\x06\x07\x08\x09\x0A\x0B\x0C\x0D\x0E\x0F" . 4373 "\x10\x11\x12\x13\x14\x15\x16\x17\x18\x19\x1A\x1B\x1C\x1D\x1E\x1F" 4374 ) !== $name_length || 4375 // Unicode noncharacters. 4376 wp_has_noncharacters( $name ) 4377 ) { 4378 _doing_it_wrong( 4379 __METHOD__, 4380 __( 'Invalid attribute name.' ), 4381 '6.2.0' 4382 ); 4383 4384 return false; 4385 } 4386 4387 /* 4388 * > The values "true" and "false" are not allowed on boolean attributes. 4389 * > To represent a false value, the attribute has to be omitted altogether. 4390 * - HTML5 spec, https://html.spec.whatwg.org/#boolean-attributes 4391 */ 4392 if ( false === $value ) { 4393 return $this->remove_attribute( $name ); 4394 } 4395 4396 if ( true === $value ) { 4397 $updated_attribute = $name; 4398 } else { 4399 $comparable_name = strtolower( $name ); 4400 4401 /** 4402 * Escape attribute values appropriately. 4403 * 4404 * @see https://html.spec.whatwg.org/#attributes-3 4405 */ 4406 $escaped_new_value = in_array( $comparable_name, wp_kses_uri_attributes(), true ) 4407 ? 
esc_url( $value )
				: strtr(
					$value,
					array(
						'<' => '&lt;',
						'>' => '&gt;',
						'&' => '&amp;',
						'"' => '&quot;',
						"'" => '&apos;',
					)
				);

			// If the escaping functions wiped out the update, reject it and indicate it was rejected.
			if ( '' === $escaped_new_value && '' !== $value ) {
				return false;
			}

			$updated_attribute = "{$name}=\"{$escaped_new_value}\"";
		}

		/*
		 * > There must never be two or more attributes on
		 * > the same start tag whose names are an ASCII
		 * > case-insensitive match for each other.
		 *     - HTML 5 spec
		 *
		 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive
		 */
		$comparable_name = strtolower( $name );

		if ( isset( $this->attributes[ $comparable_name ] ) ) {
			/*
			 * Update an existing attribute.
			 *
			 * Example – set attribute id to "new" in <div id="initial_id" />:
			 *
			 *     <div id="initial_id"/>
			 *          ^-------------^
			 *          start         end
			 *     replacement: `id="new"`
			 *
			 *     Result: <div id="new"/>
			 */
			$existing_attribute                        = $this->attributes[ $comparable_name ];
			$this->lexical_updates[ $comparable_name ] = new WP_HTML_Text_Replacement(
				$existing_attribute->start,
				$existing_attribute->length,
				$updated_attribute
			);
		} else {
			/*
			 * Create a new attribute at the tag's name end.
			 *
			 * Example – add attribute id="new" to <div />:
			 *
			 *     <div/>
			 *         ^
			 *         start and end
			 *     replacement: ` id="new"`
			 *
			 *     Result: <div id="new"/>
			 */
			$this->lexical_updates[ $comparable_name ] = new WP_HTML_Text_Replacement(
				$this->tag_name_starts_at + $this->tag_name_length,
				0,
				' ' . $updated_attribute
			);
		}

		/*
		 * Any calls to update the `class` attribute directly should wipe out any
		 * enqueued class changes from `add_class` and `remove_class`.
4479 */ 4480 if ( 'class' === $comparable_name && ! empty( $this->classname_updates ) ) { 4481 $this->classname_updates = array(); 4482 } 4483 4484 return true; 4485 } 4486 4487 /** 4488 * Remove an attribute from the currently-matched tag. 4489 * 4490 * @since 6.2.0 4491 * 4492 * @param string $name The attribute name to remove. 4493 * @return bool Whether an attribute was removed. 4494 */ 4495 public function remove_attribute( $name ): bool { 4496 if ( 4497 self::STATE_MATCHED_TAG !== $this->parser_state || 4498 $this->is_closing_tag 4499 ) { 4500 return false; 4501 } 4502 4503 /* 4504 * > There must never be two or more attributes on 4505 * > the same start tag whose names are an ASCII 4506 * > case-insensitive match for each other. 4507 * - HTML 5 spec 4508 * 4509 * @see https://html.spec.whatwg.org/multipage/syntax.html#attributes-2:ascii-case-insensitive 4510 */ 4511 $name = strtolower( $name ); 4512 4513 /* 4514 * Any calls to update the `class` attribute directly should wipe out any 4515 * enqueued class changes from `add_class` and `remove_class`. 4516 */ 4517 if ( 'class' === $name && count( $this->classname_updates ) !== 0 ) { 4518 $this->classname_updates = array(); 4519 } 4520 4521 /* 4522 * If updating an attribute that didn't exist in the input 4523 * document, then remove the enqueued update and move on. 4524 * 4525 * For example, this might occur when calling `remove_attribute()` 4526 * after calling `set_attribute()` for the same attribute 4527 * and when that attribute wasn't originally present. 4528 */ 4529 if ( ! isset( $this->attributes[ $name ] ) ) { 4530 if ( isset( $this->lexical_updates[ $name ] ) ) { 4531 unset( $this->lexical_updates[ $name ] ); 4532 } 4533 return false; 4534 } 4535 4536 /* 4537 * Removes an existing tag attribute. 
		 *
		 * Example – remove the attribute id from <div id="initial_id"/>:
		 *
		 *     <div id="initial_id"/>
		 *          ^-------------^
		 *          start         end
		 *     replacement: ``
		 *
		 *     Result: <div />
		 */
		$this->lexical_updates[ $name ] = new WP_HTML_Text_Replacement(
			$this->attributes[ $name ]->start,
			$this->attributes[ $name ]->length,
			''
		);

		// Removes any duplicated attributes if they were also present.
		foreach ( $this->duplicate_attributes[ $name ] ?? array() as $attribute_token ) {
			$this->lexical_updates[] = new WP_HTML_Text_Replacement(
				$attribute_token->start,
				$attribute_token->length,
				''
			);
		}

		return true;
	}

	/**
	 * Adds a new class name to the currently matched tag.
	 *
	 * @since 6.2.0
	 *
	 * @param string $class_name The class name to add.
	 * @return bool Whether the class was set to be added.
	 */
	public function add_class( $class_name ): bool {
		if (
			self::STATE_MATCHED_TAG !== $this->parser_state ||
			$this->is_closing_tag
		) {
			return false;
		}

		if ( self::QUIRKS_MODE !== $this->compat_mode ) {
			$this->classname_updates[ $class_name ] = self::ADD_CLASS;
			return true;
		}

		/*
		 * Because class names are matched ASCII-case-insensitively in quirks mode,
		 * this needs to see if a case variant of the given class name is already
		 * enqueued and update that existing entry, if so. This picks the casing of
		 * the first-provided class name for all lexical variations.
4591 */ 4592 $class_name_length = strlen( $class_name ); 4593 foreach ( $this->classname_updates as $updated_name => $action ) { 4594 if ( 4595 strlen( $updated_name ) === $class_name_length && 4596 0 === substr_compare( $updated_name, $class_name, 0, $class_name_length, true ) 4597 ) { 4598 $this->classname_updates[ $updated_name ] = self::ADD_CLASS; 4599 return true; 4600 } 4601 } 4602 4603 $this->classname_updates[ $class_name ] = self::ADD_CLASS; 4604 return true; 4605 } 4606 4607 /** 4608 * Removes a class name from the currently matched tag. 4609 * 4610 * @since 6.2.0 4611 * 4612 * @param string $class_name The class name to remove. 4613 * @return bool Whether the class was set to be removed. 4614 */ 4615 public function remove_class( $class_name ): bool { 4616 if ( 4617 self::STATE_MATCHED_TAG !== $this->parser_state || 4618 $this->is_closing_tag 4619 ) { 4620 return false; 4621 } 4622 4623 if ( self::QUIRKS_MODE !== $this->compat_mode ) { 4624 $this->classname_updates[ $class_name ] = self::REMOVE_CLASS; 4625 return true; 4626 } 4627 4628 /* 4629 * Because class names are matched ASCII-case-insensitively in quirks mode, 4630 * this needs to see if a case variant of the given class name is already 4631 * enqueued and update that existing entry, if so. This picks the casing of 4632 * the first-provided class name for all lexical variations. 4633 */ 4634 $class_name_length = strlen( $class_name ); 4635 foreach ( $this->classname_updates as $updated_name => $action ) { 4636 if ( 4637 strlen( $updated_name ) === $class_name_length && 4638 0 === substr_compare( $updated_name, $class_name, 0, $class_name_length, true ) 4639 ) { 4640 $this->classname_updates[ $updated_name ] = self::REMOVE_CLASS; 4641 return true; 4642 } 4643 } 4644 4645 $this->classname_updates[ $class_name ] = self::REMOVE_CLASS; 4646 return true; 4647 } 4648 4649 /** 4650 * Returns the string representation of the HTML Tag Processor. 
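	 *
	 * A minimal usage sketch (the variable names and markup are illustrative): casting
	 * the processor to a string yields the same result as calling `get_updated_html()`.
	 *
	 *     $processor = new WP_HTML_Tag_Processor( '<img src="a.png">' );
	 *     if ( $processor->next_tag( 'img' ) ) {
	 *         $processor->set_attribute( 'loading', 'lazy' );
	 *     }
	 *     $html = (string) $processor; // Same value as $processor->get_updated_html().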
4651 * 4652 * @since 6.2.0 4653 * 4654 * @see WP_HTML_Tag_Processor::get_updated_html() 4655 * 4656 * @return string The processed HTML. 4657 */ 4658 public function __toString(): string { 4659 return $this->get_updated_html(); 4660 } 4661 4662 /** 4663 * Returns the string representation of the HTML Tag Processor. 4664 * 4665 * @since 6.2.0 4666 * @since 6.2.1 Shifts the internal cursor corresponding to the applied updates. 4667 * @since 6.4.0 No longer calls subclass method `next_tag()` after updating HTML. 4668 * 4669 * @return string The processed HTML. 4670 */ 4671 public function get_updated_html(): string { 4672 $requires_no_updating = 0 === count( $this->classname_updates ) && 0 === count( $this->lexical_updates ); 4673 4674 /* 4675 * When there is nothing more to update and nothing has already been 4676 * updated, return the original document and avoid a string copy. 4677 */ 4678 if ( $requires_no_updating ) { 4679 return $this->html; 4680 } 4681 4682 /* 4683 * Keep track of the position right before the current tag. This will 4684 * be necessary for reparsing the current tag after updating the HTML. 4685 */ 4686 $before_current_tag = $this->token_starts_at ?? 0; 4687 4688 /* 4689 * 1. Apply the enqueued edits and update all the pointers to reflect those changes. 4690 */ 4691 $this->class_name_updates_to_attributes_updates(); 4692 $before_current_tag += $this->apply_attributes_updates( $before_current_tag ); 4693 4694 /* 4695 * 2. Rewind to before the current tag and reparse to get updated attributes. 4696 * 4697 * At this point the internal cursor points to the end of the tag name. 
4698 * Rewind before the tag name starts so that it's as if the cursor didn't 4699 * move; a call to `next_tag()` will reparse the recently-updated attributes 4700 * and additional calls to modify the attributes will apply at this same 4701 * location, but in order to avoid issues with subclasses that might add 4702 * behaviors to `next_tag()`, the internal methods should be called here 4703 * instead. 4704 * 4705 * It's important to note that in this specific place there will be no change 4706 * because the processor was already at a tag when this was called and it's 4707 * rewinding only to the beginning of this very tag before reprocessing it 4708 * and its attributes. 4709 * 4710 * <p>Previous HTML<em>More HTML</em></p> 4711 * ↑ │ back up by the length of the tag name plus the opening < 4712 * └←─┘ back up by strlen("em") + 1 ==> 3 4713 */ 4714 $this->bytes_already_parsed = $before_current_tag; 4715 $this->base_class_next_token(); 4716 4717 return $this->html; 4718 } 4719 4720 /** 4721 * Parses tag query input into internal search criteria. 4722 * 4723 * @since 6.2.0 4724 * @ignore 4725 * 4726 * @param array|string|null $query { 4727 * Optional. Which tag name to find, having which class, etc. Default is to find any tag. 4728 * 4729 * @type string|null $tag_name Which tag to find, or `null` for "any tag." 4730 * @type int|null $match_offset Find the Nth tag matching all search criteria. 4731 * 1 for "first" tag, 3 for "third," etc. 4732 * Defaults to first tag. 4733 * @type string|null $class_name Tag must contain this class name to match. 4734 * @type string $tag_closers "visit" or "skip": whether to stop on tag closers, e.g. </div>. 
	 * }
	 */
	private function parse_query( $query ) {
		if ( null !== $query && $query === $this->last_query ) {
			return;
		}

		$this->last_query          = $query;
		$this->sought_tag_name     = null;
		$this->sought_class_name   = null;
		$this->sought_match_offset = 1;
		$this->stop_on_tag_closers = false;

		// A single string value means "find the tag of this name".
		if ( is_string( $query ) ) {
			$this->sought_tag_name = $query;
			return;
		}

		// An empty query parameter applies no restrictions on the search.
		if ( null === $query ) {
			return;
		}

		// If not using the string interface, an associative array is required.
		if ( ! is_array( $query ) ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'The query argument must be an array or a tag name.' ),
				'6.2.0'
			);
			return;
		}

		if ( isset( $query['tag_name'] ) && is_string( $query['tag_name'] ) ) {
			$this->sought_tag_name = $query['tag_name'];
		}

		if ( isset( $query['class_name'] ) && is_string( $query['class_name'] ) ) {
			$this->sought_class_name = $query['class_name'];
		}

		if ( isset( $query['match_offset'] ) && is_int( $query['match_offset'] ) && 0 < $query['match_offset'] ) {
			$this->sought_match_offset = $query['match_offset'];
		}

		if ( isset( $query['tag_closers'] ) ) {
			$this->stop_on_tag_closers = 'visit' === $query['tag_closers'];
		}
	}


	/**
	 * Checks whether a given tag and its attributes match the search criteria.
	 *
	 * @since 6.2.0
	 * @ignore
	 *
	 * @return bool Whether the given tag and its attributes match the search criteria.
	 */
	private function matches(): bool {
		if ( $this->is_closing_tag && ! $this->stop_on_tag_closers ) {
			return false;
		}

		// Does the tag name match the requested tag name in a case-insensitive manner?
		if (
			isset( $this->sought_tag_name ) &&
			(
				strlen( $this->sought_tag_name ) !== $this->tag_name_length ||
				0 !== substr_compare( $this->html, $this->sought_tag_name, $this->tag_name_starts_at, $this->tag_name_length, true )
			)
		) {
			return false;
		}

		if ( null !== $this->sought_class_name && ! $this->has_class( $this->sought_class_name ) ) {
			return false;
		}

		return true;
	}

	/**
	 * Gets DOCTYPE declaration info from a DOCTYPE token.
	 *
	 * DOCTYPE tokens may appear in many places in an HTML document. In most places, they are
	 * simply ignored. The main parsing functions find the basic shape of DOCTYPE tokens but
	 * do not perform detailed parsing.
	 *
	 * This method can be called to perform a full parse of the DOCTYPE token and retrieve
	 * its information.
	 *
	 * @return WP_HTML_Doctype_Info|null The DOCTYPE declaration information or `null` if not
	 *                                   currently at a DOCTYPE node.
	 */
	public function get_doctype_info(): ?WP_HTML_Doctype_Info {
		if ( self::STATE_DOCTYPE !== $this->parser_state ) {
			return null;
		}

		return WP_HTML_Doctype_Info::from_doctype_token( substr( $this->html, $this->token_starts_at, $this->token_length ) );
	}

	/**
	 * Parser Ready State.
	 *
	 * Indicates that the parser is ready to run and waiting for a state transition.
	 * It may not have started yet, or it may have just finished parsing a token and
	 * is ready to find the next one.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_READY = 'STATE_READY';

	/**
	 * Parser Complete State.
	 *
	 * Indicates that the parser has reached the end of the document and there is
	 * nothing left to scan. It finished parsing the last token completely.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_COMPLETE = 'STATE_COMPLETE';

	/**
	 * Parser Incomplete Input State.
	 *
	 * Indicates that the parser has reached the end of the document before finishing
	 * a token. It started parsing a token but there is a possibility that the input
	 * HTML document was truncated in the middle of a token.
	 *
	 * The parser is reset at the start of the incomplete token and has paused. There
	 * is nothing more that can be scanned unless provided a more complete document.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_INCOMPLETE_INPUT = 'STATE_INCOMPLETE_INPUT';

	/**
	 * Parser Matched Tag State.
	 *
	 * Indicates that the parser has found an HTML tag and it's possible to get
	 * the tag name and read or modify its attributes (if it's not a closing tag).
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_MATCHED_TAG = 'STATE_MATCHED_TAG';

	/**
	 * Parser Text Node State.
	 *
	 * Indicates that the parser has found a text node and it's possible
	 * to read and modify that text.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_TEXT_NODE = 'STATE_TEXT_NODE';

	/**
	 * Parser CDATA Node State.
	 *
	 * Indicates that the parser has found a CDATA node and it's possible
	 * to read and modify its modifiable text. Note that in HTML there are
	 * no CDATA nodes outside of foreign content (SVG and MathML). Outside
	 * of foreign content, they are treated as HTML comments.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_CDATA_NODE = 'STATE_CDATA_NODE';

	/**
	 * Indicates that the parser has found an HTML comment and it's
	 * possible to read and modify its modifiable text.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_COMMENT = 'STATE_COMMENT';

	/**
	 * Indicates that the parser has found a DOCTYPE node and it's
	 * possible to read its DOCTYPE information via `get_doctype_info()`.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_DOCTYPE = 'STATE_DOCTYPE';

	/**
	 * Indicates that the parser has found an empty tag closer `</>`.
	 *
	 * Note that in HTML there are no empty tag closers, and they
	 * are ignored. Nonetheless, the Tag Processor still
	 * recognizes them as they appear in the HTML stream.
	 *
	 * These were historically discussed as a "presumptuous tag
	 * closer," which would close the nearest open tag, but were
	 * dismissed in favor of explicitly-closing tags.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_PRESUMPTUOUS_TAG = 'STATE_PRESUMPTUOUS_TAG';

	/**
	 * Indicates that the parser has found a "funky comment"
	 * and it's possible to read and modify its modifiable text.
	 *
	 * Example:
	 *
	 *     </%url>
	 *     </{"wp-bit":"query/post-author"}>
	 *     </2>
	 *
	 * Funky comments are tag closers with invalid tag names. Note
	 * that in HTML these are turned into bogus comments. Nonetheless,
	 * the Tag Processor recognizes them in a stream of HTML and
	 * exposes them for inspection and modification.
	 *
	 * @since 6.5.0
	 *
	 * @access private
	 */
	const STATE_FUNKY_COMMENT = 'STATE_WP_FUNKY';

	/**
	 * Indicates that a comment was created when encountering an abruptly-closed HTML comment.
	 *
	 * Example:
	 *
	 *     <!-->
	 *     <!--->
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_ABRUPTLY_CLOSED_COMMENT = 'COMMENT_AS_ABRUPTLY_CLOSED_COMMENT';

	/**
	 * Indicates that a comment would be parsed as a CDATA node,
	 * were HTML to allow CDATA nodes outside of foreign content.
	 *
	 * Example:
	 *
	 *     <![CDATA[This is a CDATA node.]]>
	 *
	 * This is an HTML comment, but it looks like a CDATA node.
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_CDATA_LOOKALIKE = 'COMMENT_AS_CDATA_LOOKALIKE';

	/**
	 * Indicates that a comment was created when encountering
	 * normative HTML comment syntax.
	 *
	 * Example:
	 *
	 *     <!-- this is a comment -->
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_HTML_COMMENT = 'COMMENT_AS_HTML_COMMENT';

	/**
	 * Indicates that a comment would be parsed as a Processing
	 * Instruction node, were they to exist within HTML.
	 *
	 * Example:
	 *
	 *     <?wp __( 'Like' ) ?>
	 *
	 * This is an HTML comment, but it looks like a Processing Instruction node.
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_PI_NODE_LOOKALIKE = 'COMMENT_AS_PI_NODE_LOOKALIKE';

	/**
	 * Indicates that a comment was created when encountering invalid
	 * HTML input, a so-called "bogus comment."
	 *
	 * Example:
	 *
	 *     <?nothing special>
	 *     <!{nothing special}>
	 *
	 * @since 6.5.0
	 */
	const COMMENT_AS_INVALID_HTML = 'COMMENT_AS_INVALID_HTML';

	/**
	 * No-quirks mode document compatibility mode.
	 *
	 * > In no-quirks mode, the behavior is (hopefully) the desired behavior
	 * > described by the modern HTML and CSS specifications.
	 *
	 * @see self::$compat_mode
	 * @see https://developer.mozilla.org/en-US/docs/Web/HTML/Quirks_Mode_and_Standards_Mode
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	const NO_QUIRKS_MODE = 'no-quirks-mode';

	/**
	 * Quirks mode document compatibility mode.
	 *
	 * > In quirks mode, layout emulates behavior in Navigator 4 and Internet
	 * > Explorer 5. This is essential in order to support websites that were
	 * > built before the widespread adoption of web standards.
	 *
	 * @see self::$compat_mode
	 * @see https://developer.mozilla.org/en-US/docs/Web/HTML/Quirks_Mode_and_Standards_Mode
	 *
	 * @since 6.7.0
	 *
	 * @var string
	 */
	const QUIRKS_MODE = 'quirks-mode';

	/**
	 * Indicates that a span of text may contain any combination of significant
	 * kinds of characters: NULL bytes, whitespace, and others.
	 *
	 * @see self::$text_node_classification
	 * @see self::subdivide_text_appropriately
	 *
	 * @since 6.7.0
	 */
	const TEXT_IS_GENERIC = 'TEXT_IS_GENERIC';

	/**
	 * Indicates that a span of text comprises a sequence only of NULL bytes.
	 *
	 * @see self::$text_node_classification
	 * @see self::subdivide_text_appropriately
	 *
	 * @since 6.7.0
	 */
	const TEXT_IS_NULL_SEQUENCE = 'TEXT_IS_NULL_SEQUENCE';

	/**
	 * Indicates that a span of decoded text comprises only whitespace.
	 *
	 * @see self::$text_node_classification
	 * @see self::subdivide_text_appropriately
	 *
	 * @since 6.7.0
	 */
	const TEXT_IS_WHITESPACE = 'TEXT_IS_WHITESPACE';

	/**
	 * Wakeup magic method.
	 *
	 * @since 6.9.2
	 */
	public function __wakeup() {
		throw new \LogicException( __CLASS__ . ' should never be unserialized' );
	}
}
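
The query handling in `parse_query()` and the tag-name check in `matches()` can be illustrated outside of WordPress. The following is a minimal standalone sketch, not the class's own code: the function names `sketch_normalize_query()` and `sketch_tag_name_matches()` and the `visit_closers` key are hypothetical, and the sketch assumes it is acceptable to silently ignore an invalid query where the real method calls `_doing_it_wrong()`.

```php
<?php
/**
 * Hypothetical helper: mirrors how parse_query() reduces its argument
 * to search criteria. Not part of WordPress.
 */
function sketch_normalize_query( $query ): array {
	$criteria = array(
		'tag_name'      => null,
		'class_name'    => null,
		'match_offset'  => 1,
		'visit_closers' => false,
	);

	// A bare string is shorthand for "find the tag of this name".
	if ( is_string( $query ) ) {
		$criteria['tag_name'] = $query;
		return $criteria;
	}

	// null applies no restrictions; other non-arrays are ignored in this sketch.
	if ( ! is_array( $query ) ) {
		return $criteria;
	}

	if ( isset( $query['tag_name'] ) && is_string( $query['tag_name'] ) ) {
		$criteria['tag_name'] = $query['tag_name'];
	}
	if ( isset( $query['class_name'] ) && is_string( $query['class_name'] ) ) {
		$criteria['class_name'] = $query['class_name'];
	}
	// Only positive integers are accepted; anything else keeps the default of 1.
	if ( isset( $query['match_offset'] ) && is_int( $query['match_offset'] ) && 0 < $query['match_offset'] ) {
		$criteria['match_offset'] = $query['match_offset'];
	}
	if ( isset( $query['tag_closers'] ) ) {
		$criteria['visit_closers'] = 'visit' === $query['tag_closers'];
	}

	return $criteria;
}

/**
 * Hypothetical helper: the same length check plus case-insensitive
 * substr_compare() that matches() applies to the tag name.
 */
function sketch_tag_name_matches( string $html, string $sought, int $name_starts_at, int $name_length ): bool {
	return strlen( $sought ) === $name_length
		&& 0 === substr_compare( $html, $sought, $name_starts_at, $name_length, true );
}

var_dump( sketch_tag_name_matches( '<DIV class="a">', 'div', 1, 3 ) ); // bool(true)
var_dump( sketch_tag_name_matches( '<DIV class="a">', 'di', 1, 3 ) );  // bool(false)
```

The length comparison runs first so that a sought name which is only a prefix of the actual tag name (for example `di` against `DIV`) is rejected without comparing any bytes.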
| Generated : Sat Jul 4 08:20:12 2026 | Cross-referenced by PHPXref |