PHP Cross Reference of WordPress Trunk (Updated Daily)
<?php
/**
 * HTML API: WP_HTML_Processor class
 *
 * @package WordPress
 * @subpackage HTML-API
 * @since 6.4.0
 */

/**
 * Core class used to safely parse and modify an HTML document.
 *
 * The HTML Processor class properly parses and modifies HTML5 documents.
 *
 * It supports a subset of the HTML5 specification, and when it encounters
 * unsupported markup, it aborts early to avoid unintentionally breaking
 * the document. The HTML Processor should never break an HTML document.
 *
 * While the `WP_HTML_Tag_Processor` is a valuable tool for modifying
 * attributes on individual HTML tags, the HTML Processor is more capable
 * and useful for the following operations:
 *
 *  - Querying based on nested HTML structure.
 *
 * Eventually the HTML Processor will also support:
 *  - Wrapping a tag in surrounding HTML.
 *  - Unwrapping a tag by removing its parent.
 *  - Inserting and removing nodes.
 *  - Reading and changing inner content.
 *  - Navigating up or around HTML structure.
 *
 * ## Usage
 *
 * Use of this class requires three steps:
 *
 *  1. Call a static creator method with your input HTML document.
 *  2. Find the location in the document you are looking for.
 *  3. Request changes to the document at that location.
 *
 * Example:
 *
 *     $processor = WP_HTML_Processor::create_fragment( $html );
 *     if ( $processor->next_tag( array( 'breadcrumbs' => array( 'DIV', 'FIGURE', 'IMG' ) ) ) ) {
 *         $processor->add_class( 'responsive-image' );
 *     }
 *
 * #### Breadcrumbs
 *
 * Breadcrumbs represent the stack of open elements from the root
 * of the document or fragment down to the currently-matched node,
 * if one is currently selected. Call WP_HTML_Processor::get_breadcrumbs()
 * to inspect the breadcrumbs for a matched tag.
 *
 * Breadcrumbs can specify nested HTML structure and are equivalent
 * to a CSS selector comprising tag names separated by the child
 * combinator, such as "DIV > FIGURE > IMG".
 *
 * Since all elements find themselves inside a full HTML document
 * when parsed, the return value from `get_breadcrumbs()` will always
 * contain any implicit outermost elements. For example, when parsing
 * with `create_fragment()` in the `BODY` context (the default), any
 * tag in the given HTML document will contain `array( 'HTML', 'BODY', … )`
 * in its breadcrumbs.
 *
 * Despite containing the implied outermost elements in their breadcrumbs,
 * tags may be found with the shortest-matching breadcrumb query. That is,
 * `array( 'IMG' )` matches all IMG elements and `array( 'P', 'IMG' )`
 * matches all IMG elements directly inside a P element. To ensure that no
 * partial matches erroneously match, it's possible to specify in a query
 * the full breadcrumb match all the way down from the root HTML element.
 *
 * Example:
 *
 *     $html = '<figure><img><figcaption>A <em>lovely</em> day outside</figcaption></figure>';
 *     //               ----- Matches here.
 *     $processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'IMG' ) ) );
 *
 *     $html = '<figure><img><figcaption>A <em>lovely</em> day outside</figcaption></figure>';
 *     //                                  ---- Matches here.
 *     $processor->next_tag( array( 'breadcrumbs' => array( 'FIGURE', 'FIGCAPTION', 'EM' ) ) );
 *
 *     $html = '<div><img></div><img>';
 *     //                       ----- Matches here, because IMG must be a direct child of the implicit BODY.
 *     $processor->next_tag( array( 'breadcrumbs' => array( 'BODY', 'IMG' ) ) );
 *
 * ## HTML Support
 *
 * This class implements a small part of the HTML5 specification.
 * It's designed to operate within its support and abort early whenever
 * encountering circumstances it can't properly handle. This is
 * the principal way in which this class remains as simple as possible
 * without cutting corners and breaking compliance.
 *
 * ### Supported elements
 *
 * If any unsupported element appears in the HTML input the HTML Processor
 * will abort early and stop all processing. This draconian measure ensures
 * that the HTML Processor won't break any HTML it doesn't fully understand.
 *
 * The following list specifies the HTML tags that _are_ supported:
 *
 *  - Containers: ADDRESS, BLOCKQUOTE, DETAILS, DIALOG, DIV, FOOTER, HEADER, MAIN, MENU, SPAN, SUMMARY.
 *  - Custom elements: All custom elements are supported. :)
 *  - Form elements: BUTTON, DATALIST, FIELDSET, INPUT, LABEL, LEGEND, METER, PROGRESS, SEARCH.
 *  - Formatting elements: B, BIG, CODE, EM, FONT, I, PRE, SMALL, STRIKE, STRONG, TT, U, WBR.
 *  - Heading elements: H1, H2, H3, H4, H5, H6, HGROUP.
 *  - Links: A.
 *  - Lists: DD, DL, DT, LI, OL, UL.
 *  - Media elements: AUDIO, CANVAS, EMBED, FIGCAPTION, FIGURE, IMG, MAP, PICTURE, SOURCE, TRACK, VIDEO.
 *  - Paragraph: BR, P.
 *  - Phrasing elements: ABBR, AREA, BDI, BDO, CITE, DATA, DEL, DFN, INS, MARK, OUTPUT, Q, SAMP, SUB, SUP, TIME, VAR.
 *  - Sectioning elements: ARTICLE, ASIDE, HR, NAV, SECTION.
 *  - Templating elements: SLOT.
 *  - Text decoration: RUBY.
 *  - Deprecated elements: ACRONYM, BLINK, CENTER, DIR, ISINDEX, KEYGEN, LISTING, MULTICOL, NEXTID, PARAM, SPACER.
 *
 * ### Supported markup
 *
 * Some kinds of non-normative HTML involve reconstruction of formatting elements and
 * re-parenting of mis-nested elements. For example, a DIV tag found inside a TABLE
 * may in fact belong _before_ the table in the DOM. If the HTML Processor encounters
 * such a case it will stop processing.
 *
 * The following list specifies HTML markup that _is_ supported:
 *
 *  - Markup involving only those tags listed above.
 *  - Fully-balanced and non-overlapping tags.
 *  - HTML with unexpected tag closers.
 *  - Some unbalanced or overlapping tags.
 *  - P tags after unclosed P tags.
 *  - BUTTON tags after unclosed BUTTON tags.
 *  - A tags after unclosed A tags that don't involve any active formatting elements.
 *
 * @since 6.4.0
 *
 * @see WP_HTML_Tag_Processor
 * @see https://html.spec.whatwg.org/
 */
class WP_HTML_Processor extends WP_HTML_Tag_Processor {
	/**
	 * The maximum number of bookmarks allowed to exist at any given time.
	 *
	 * HTML processing requires more bookmarks than basic tag processing,
	 * so this class constant from the Tag Processor is overwritten.
	 *
	 * @since 6.4.0
	 *
	 * @var int
	 */
	const MAX_BOOKMARKS = 100;

	/**
	 * Holds the working state of the parser, including the stack of
	 * open elements and the stack of active formatting elements.
	 *
	 * Initialized in the constructor.
	 *
	 * @since 6.4.0
	 *
	 * @var WP_HTML_Processor_State
	 */
	private $state = null;

	/**
	 * Used to create unique bookmark names.
	 *
	 * This class sets a bookmark for every tag in the HTML document that it encounters.
	 * The bookmark name is auto-generated and increments, starting with `1`. These are
	 * internal bookmarks and are automatically released when the referring WP_HTML_Token
	 * goes out of scope and is garbage-collected.
	 *
	 * @since 6.4.0
	 *
	 * @see WP_HTML_Processor::$release_internal_bookmark_on_destruct
	 *
	 * @var int
	 */
	private $bookmark_counter = 0;

	/**
	 * Stores an explanation for why something failed, if it did.
	 *
	 * @see self::get_last_error
	 *
	 * @since 6.4.0
	 *
	 * @var string|null
	 */
	private $last_error = null;

	/**
	 * Releases a bookmark when PHP garbage-collects its wrapping WP_HTML_Token instance.
	 *
	 * This function is created inside the class constructor so that it can be passed to
	 * the stack of open elements and the stack of active formatting elements without
	 * exposing it as a public method on the class.
	 *
	 * @since 6.4.0
	 *
	 * @var Closure
	 */
	private $release_internal_bookmark_on_destruct = null;

	/*
	 * Public Interface Functions
	 */

	/**
	 * Creates an HTML processor in the fragment parsing mode.
	 *
	 * Use this for cases where you are processing chunks of HTML that
	 * will be found within a bigger HTML document, such as rendered
	 * block output that exists within a post, or `the_content` inside a
	 * rendered site layout.
	 *
	 * Fragment parsing occurs within a context, which is an HTML element
	 * that the document will eventually be placed in. It becomes important
	 * when special elements have different rules than others, such as inside
	 * a TEXTAREA or a TITLE tag where things that look like tags are text,
	 * or inside a SCRIPT tag where things that look like HTML syntax are JS.
	 *
	 * The context value should be a representation of the tag in which the
	 * HTML is found. For most cases this will be the body element. The HTML
	 * form is provided because a context element may have attributes that
	 * impact the parse, such as with a SCRIPT tag and its `type` attribute.
	 *
	 * ## Current HTML Support
	 *
	 *  - The only supported context is `<body>`, which is the default value.
	 *  - The only supported document encoding is `UTF-8`, which is the default value.
	 *
	 * @since 6.4.0
	 *
	 * @param string $html     Input HTML fragment to process.
	 * @param string $context  Context element for the fragment, must be default of `<body>`.
	 * @param string $encoding Text encoding of the document; must be default of 'UTF-8'.
	 * @return WP_HTML_Processor|null The created processor if successful, otherwise null.
	 */
	public static function create_fragment( $html, $context = '<body>', $encoding = 'UTF-8' ) {
		if ( '<body>' !== $context || 'UTF-8' !== $encoding ) {
			return null;
		}

		$processor                        = new self( $html, self::CONSTRUCTOR_UNLOCK_CODE );
		$processor->state->context_node   = array( 'BODY', array() );
		$processor->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_IN_BODY;

		// @todo Create "fake" bookmarks for non-existent but implied nodes.
		$processor->bookmarks['root-node']    = new WP_HTML_Span( 0, 0 );
		$processor->bookmarks['context-node'] = new WP_HTML_Span( 0, 0 );

		$processor->state->stack_of_open_elements->push(
			new WP_HTML_Token(
				'root-node',
				'HTML',
				false
			)
		);

		$processor->state->stack_of_open_elements->push(
			new WP_HTML_Token(
				'context-node',
				$processor->state->context_node[0],
				false
			)
		);

		return $processor;
	}

	/**
	 * Constructor.
	 *
	 * Do not use this method. Use the static creator methods instead.
	 *
	 * @access private
	 *
	 * @since 6.4.0
	 *
	 * @see WP_HTML_Processor::create_fragment()
	 *
	 * @param string      $html                                  HTML to process.
	 * @param string|null $use_the_static_create_methods_instead This constructor should not be called manually.
	 */
	public function __construct( $html, $use_the_static_create_methods_instead = null ) {
		parent::__construct( $html );

		if ( self::CONSTRUCTOR_UNLOCK_CODE !== $use_the_static_create_methods_instead ) {
			_doing_it_wrong(
				__METHOD__,
				sprintf(
					/* translators: %s: WP_HTML_Processor::create_fragment(). */
					__( 'Call %s to create an HTML Processor instead of calling the constructor directly.' ),
					'<code>WP_HTML_Processor::create_fragment()</code>'
				),
				'6.4.0'
			);
		}

		$this->state = new WP_HTML_Processor_State();

		/*
		 * Create this wrapper so that it's possible to pass
		 * a private method into WP_HTML_Token classes without
		 * exposing it to any public API.
		 */
		$this->release_internal_bookmark_on_destruct = function ( $name ) {
			parent::release_bookmark( $name );
		};
	}

	/**
	 * Returns the last error, if any.
	 *
	 * Various situations lead to parsing failure but this class will
	 * return `false` in all those cases. To determine why something
	 * failed it's possible to request the last error. This can be
	 * helpful to know to distinguish whether a given tag couldn't
	 * be found or if content in the document caused the processor
	 * to give up and abort processing.
	 *
	 * Example
	 *
	 *     $processor = WP_HTML_Processor::create_fragment( '<template><strong><button><em><p><em>' );
	 *     false === $processor->next_tag();
	 *     WP_HTML_Processor::ERROR_UNSUPPORTED === $processor->get_last_error();
	 *
	 * @since 6.4.0
	 *
	 * @see self::ERROR_UNSUPPORTED
	 * @see self::ERROR_EXCEEDED_MAX_BOOKMARKS
	 *
	 * @return string|null The last error, if one exists, otherwise null.
	 */
	public function get_last_error() {
		return $this->last_error;
	}

	/**
	 * Finds the next tag matching the $query.
	 *
	 * @todo Support matching the class name and tag name.
	 *
	 * @since 6.4.0
	 *
	 * @throws Exception When unable to allocate a bookmark for the next token in the input HTML document.
	 *
	 * @param array|string|null $query {
	 *     Optional. Which tag name to find, having which class, etc. Default is to find any tag.
	 *
	 *     @type string|null $tag_name     Which tag to find, or `null` for "any tag."
	 *     @type int|null    $match_offset Find the Nth tag matching all search criteria.
	 *                                     1 for "first" tag, 3 for "third," etc.
	 *                                     Defaults to first tag.
	 *     @type string|null $class_name   Tag must contain this whole class name to match.
	 *     @type string[]    $breadcrumbs  DOM sub-path at which element is found, e.g. `array( 'FIGURE', 'IMG' )`.
	 *                                     May also contain the wildcard `*` which matches a single element, e.g. `array( 'SECTION', '*' )`.
	 * }
	 * @return bool Whether a tag was matched.
	 */
	public function next_tag( $query = null ) {
		if ( null === $query ) {
			while ( $this->step() ) {
				if ( '#tag' !== $this->get_token_type() ) {
					continue;
				}

				if ( ! $this->is_tag_closer() ) {
					return true;
				}
			}

			return false;
		}

		if ( is_string( $query ) ) {
			$query = array( 'breadcrumbs' => array( $query ) );
		}

		if ( ! is_array( $query ) ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Please pass a query array to this function.' ),
				'6.4.0'
			);
			return false;
		}

		if ( ! ( array_key_exists( 'breadcrumbs', $query ) && is_array( $query['breadcrumbs'] ) ) ) {
			while ( $this->step() ) {
				if ( '#tag' !== $this->get_token_type() ) {
					continue;
				}

				if ( ! $this->is_tag_closer() ) {
					return true;
				}
			}

			return false;
		}

		if ( isset( $query['tag_closers'] ) && 'visit' === $query['tag_closers'] ) {
			_doing_it_wrong(
				__METHOD__,
				__( 'Cannot visit tag closers in HTML Processor.' ),
				'6.4.0'
			);
			return false;
		}

		$breadcrumbs  = $query['breadcrumbs'];
		$match_offset = isset( $query['match_offset'] ) ? (int) $query['match_offset'] : 1;

		while ( $match_offset > 0 && $this->step() ) {
			if ( '#tag' !== $this->get_token_type() ) {
				continue;
			}

			if ( $this->matches_breadcrumbs( $breadcrumbs ) && 0 === --$match_offset ) {
				return true;
			}
		}

		return false;
	}

	/**
	 * Ensures internal accounting is maintained for HTML semantic rules while
	 * the underlying Tag Processor class is seeking to a bookmark.
	 *
	 * This doesn't currently have a way to represent non-tags and doesn't process
	 * semantic rules for text nodes. For access to the raw tokens consider using
	 * WP_HTML_Tag_Processor instead.
	 *
	 * @since 6.5.0 Added for internal support; do not use.
	 *
	 * @access private
	 *
	 * @return bool
	 */
	public function next_token() {
		return $this->step();
	}

	/**
	 * Indicates if the currently-matched tag matches the given breadcrumbs.
	 *
	 * A "*" represents a single tag wildcard, where any tag matches, but not no tags.
	 *
	 * At some point this function _may_ support a `**` syntax for matching any number
	 * of unspecified tags in the breadcrumb stack. This has been intentionally left
	 * out, however, to keep this function simple and to avoid introducing backtracking,
	 * which could open up surprising performance breakdowns.
	 *
	 * Example:
	 *
	 *     $processor = WP_HTML_Processor::create_fragment( '<div><span><figure><img></figure></span></div>' );
	 *     $processor->next_tag( 'img' );
	 *     true  === $processor->matches_breadcrumbs( array( 'figure', 'img' ) );
	 *     true  === $processor->matches_breadcrumbs( array( 'span', 'figure', 'img' ) );
	 *     false === $processor->matches_breadcrumbs( array( 'span', 'img' ) );
	 *     true  === $processor->matches_breadcrumbs( array( 'span', '*', 'img' ) );
	 *
	 * @since 6.4.0
	 *
	 * @param string[] $breadcrumbs DOM sub-path at which element is found, e.g. `array( 'FIGURE', 'IMG' )`.
	 *                              May also contain the wildcard `*` which matches a single element, e.g. `array( 'SECTION', '*' )`.
	 * @return bool Whether the currently-matched tag is found at the given nested structure.
	 */
	public function matches_breadcrumbs( $breadcrumbs ) {
		// Everything matches when there are zero constraints.
		if ( 0 === count( $breadcrumbs ) ) {
			return true;
		}

		// Start at the last crumb.
		$crumb = end( $breadcrumbs );

		if ( '*' !== $crumb && $this->get_tag() !== strtoupper( $crumb ) ) {
			return false;
		}

		foreach ( $this->state->stack_of_open_elements->walk_up() as $node ) {
			$crumb = strtoupper( current( $breadcrumbs ) );

			if ( '*' !== $crumb && $node->node_name !== $crumb ) {
				return false;
			}

			if ( false === prev( $breadcrumbs ) ) {
				return true;
			}
		}

		return false;
	}

	/**
	 * Steps through the HTML document and stops at the next tag, if any.
	 *
	 * @since 6.4.0
	 *
	 * @throws Exception When unable to allocate a bookmark for the next token in the input HTML document.
	 *
	 * @see self::PROCESS_NEXT_NODE
	 * @see self::REPROCESS_CURRENT_NODE
	 *
	 * @param string $node_to_process Whether to parse the next node or reprocess the current node.
	 * @return bool Whether a tag was matched.
	 */
	public function step( $node_to_process = self::PROCESS_NEXT_NODE ) {
		// Refuse to proceed if there was a previous error.
		if ( null !== $this->last_error ) {
			return false;
		}

		if ( self::REPROCESS_CURRENT_NODE !== $node_to_process ) {
			/*
			 * Void elements still hop onto the stack of open elements even though
			 * there's no corresponding closing tag. This is important for managing
			 * stack-based operations such as "navigate to parent node" or checking
			 * on an element's breadcrumbs.
			 *
			 * When moving on to the next node, therefore, if the bottom-most element
			 * on the stack is a void element, it must be closed.
			 *
			 * @todo Once self-closing foreign elements and BGSOUND are supported,
			 *        they must also be implicitly closed here too. BGSOUND is
			 *        special since it's only self-closing if the self-closing flag
			 *        is provided in the opening tag, otherwise it expects a tag closer.
			 */
			$top_node = $this->state->stack_of_open_elements->current_node();
			if (
				$top_node && (
					// Void elements.
					self::is_void( $top_node->node_name ) ||
					// Comments, text nodes, and other atomic tokens.
					'#' === $top_node->node_name[0] ||
					// Doctype declarations.
					'html' === $top_node->node_name
				)
			) {
				$this->state->stack_of_open_elements->pop();
			}
		}

		if ( self::PROCESS_NEXT_NODE === $node_to_process ) {
			parent::next_token();
		}

		// Finish stepping when there are no more tokens in the document.
		if (
			WP_HTML_Tag_Processor::STATE_INCOMPLETE_INPUT === $this->parser_state ||
			WP_HTML_Tag_Processor::STATE_COMPLETE === $this->parser_state
		) {
			return false;
		}

		$this->state->current_token = new WP_HTML_Token(
			$this->bookmark_token(),
			$this->get_token_name(),
			$this->has_self_closing_flag(),
			$this->release_internal_bookmark_on_destruct
		);

		try {
			switch ( $this->state->insertion_mode ) {
				case WP_HTML_Processor_State::INSERTION_MODE_IN_BODY:
					return $this->step_in_body();

				default:
					$this->last_error = self::ERROR_UNSUPPORTED;
					throw new WP_HTML_Unsupported_Exception( "No support for parsing in the '{$this->state->insertion_mode}' state." );
			}
		} catch ( WP_HTML_Unsupported_Exception $e ) {
			/*
			 * Exceptions are used in this class to escape deep call stacks that
			 * otherwise might involve messier calling and return conventions.
			 */
			return false;
		}
	}

	/**
	 * Computes the HTML breadcrumbs for the currently-matched node, if matched.
	 *
	 * Breadcrumbs start at the outermost parent and descend toward the matched element.
	 * They always include the entire path from the root HTML node to the matched element.
	 *
	 * @todo It could be more efficient to expose a generator-based version of this function
	 *       to avoid creating the array copy on tag iteration. If this is done, it would likely
	 *       be more useful to walk up the stack when yielding instead of starting at the top.
	 *
	 * Example
	 *
	 *     $processor = WP_HTML_Processor::create_fragment( '<p><strong><em><img></em></strong></p>' );
	 *     $processor->next_tag( 'IMG' );
	 *     $processor->get_breadcrumbs() === array( 'HTML', 'BODY', 'P', 'STRONG', 'EM', 'IMG' );
	 *
	 * @since 6.4.0
	 *
	 * @return string[]|null Array of tag names representing path to matched node, if matched, otherwise NULL.
	 */
	public function get_breadcrumbs() {
		$breadcrumbs = array();
		foreach ( $this->state->stack_of_open_elements->walk_down() as $stack_item ) {
			$breadcrumbs[] = $stack_item->node_name;
		}

		return $breadcrumbs;
	}

	/**
	 * Parses next element in the 'in body' insertion mode.
	 *
	 * This internal function performs the 'in body' insertion mode
	 * logic for the generalized WP_HTML_Processor::step() function.
	 *
	 * @since 6.4.0
	 *
	 * @throws WP_HTML_Unsupported_Exception When encountering unsupported HTML input.
	 *
	 * @see https://html.spec.whatwg.org/#parsing-main-inbody
	 * @see WP_HTML_Processor::step
	 *
	 * @return bool Whether an element was found.
	 */
	private function step_in_body() {
		$token_name = $this->get_token_name();
		$token_type = $this->get_token_type();
		$op_sigil   = '#tag' === $token_type ? ( $this->is_tag_closer() ? '-' : '+' ) : '';
		$op         = "{$op_sigil}{$token_name}";

		switch ( $op ) {
			case '#comment':
			case '#funky-comment':
			case '#presumptuous-tag':
				$this->insert_html_element( $this->state->current_token );
				return true;

			case '#text':
				$this->reconstruct_active_formatting_elements();

				$current_token = $this->bookmarks[ $this->state->current_token->bookmark_name ];

				/*
				 * > A character token that is U+0000 NULL
				 *
				 * Any successive sequence of NULL bytes is ignored and won't
				 * trigger active format reconstruction. Therefore, if the text
				 * only comprises NULL bytes then the token should be ignored
				 * here, but if there are any other characters in the stream
				 * the active formats should be reconstructed.
				 */
				if (
					1 <= $current_token->length &&
					"\x00" === $this->html[ $current_token->start ] &&
					strspn( $this->html, "\x00", $current_token->start, $current_token->length ) === $current_token->length
				) {
					// Parse error: ignore the token.
					return $this->step();
				}

				/*
				 * Whitespace-only text does not affect the frameset-ok flag.
				 * It is probably inter-element whitespace, but it may also
				 * contain character references which decode only to whitespace.
				 */
				$text = $this->get_modifiable_text();
				if ( strlen( $text ) !== strspn( $text, " \t\n\f\r" ) ) {
					$this->state->frameset_ok = false;
				}

				$this->insert_html_element( $this->state->current_token );
				return true;

			case 'html':
				/*
				 * > A DOCTYPE token
				 * > Parse error. Ignore the token.
				 */
				return $this->step();

			/*
			 * > A start tag whose tag name is "button"
			 */
			case '+BUTTON':
				if ( $this->state->stack_of_open_elements->has_element_in_scope( 'BUTTON' ) ) {
					// @todo Indicate a parse error once it's possible. This error does not impact the logic here.
					$this->generate_implied_end_tags();
					$this->state->stack_of_open_elements->pop_until( 'BUTTON' );
				}

				$this->reconstruct_active_formatting_elements();
				$this->insert_html_element( $this->state->current_token );
				$this->state->frameset_ok = false;

				return true;

			/*
			 * > A start tag whose tag name is one of: "address", "article", "aside",
			 * > "blockquote", "center", "details", "dialog", "dir", "div", "dl",
			 * > "fieldset", "figcaption", "figure", "footer", "header", "hgroup",
			 * > "main", "menu", "nav", "ol", "p", "search", "section", "summary", "ul"
			 */
			case '+ADDRESS':
			case '+ARTICLE':
			case '+ASIDE':
			case '+BLOCKQUOTE':
			case '+CENTER':
			case '+DETAILS':
			case '+DIALOG':
			case '+DIR':
			case '+DIV':
			case '+DL':
			case '+FIELDSET':
			case '+FIGCAPTION':
			case '+FIGURE':
			case '+FOOTER':
			case '+HEADER':
			case '+HGROUP':
			case '+MAIN':
			case '+MENU':
			case '+NAV':
			case '+OL':
			case '+P':
			case '+SEARCH':
			case '+SECTION':
			case '+SUMMARY':
			case '+UL':
				if ( $this->state->stack_of_open_elements->has_p_in_button_scope() ) {
					$this->close_a_p_element();
				}

				$this->insert_html_element( $this->state->current_token );
				return true;

			/*
			 * > An end tag whose tag name is one of: "address", "article", "aside", "blockquote",
			 * > "button", "center", "details", "dialog", "dir", "div", "dl", "fieldset",
			 * > "figcaption", "figure", "footer", "header", "hgroup", "listing", "main",
			 * > "menu", "nav", "ol", "pre", "search", "section", "summary", "ul"
			 */
			case '-ADDRESS':
			case '-ARTICLE':
			case '-ASIDE':
			case '-BLOCKQUOTE':
			case '-BUTTON':
			case '-CENTER':
			case '-DETAILS':
			case '-DIALOG':
			case '-DIR':
			case '-DIV':
			case '-DL':
			case '-FIELDSET':
			case '-FIGCAPTION':
			case '-FIGURE':
			case '-FOOTER':
			case '-HEADER':
			case '-HGROUP':
			case '-LISTING':
			case '-MAIN':
			case '-MENU':
			case '-NAV':
			case '-OL':
			case '-PRE':
			case '-SEARCH':
			case '-SECTION':
			case '-SUMMARY':
			case '-UL':
				if ( ! $this->state->stack_of_open_elements->has_element_in_scope( $token_name ) ) {
					// @todo Report parse error.
					// Ignore the token.
					return $this->step();
				}

				$this->generate_implied_end_tags();
				if ( $this->state->stack_of_open_elements->current_node()->node_name !== $token_name ) {
					// @todo Record parse error: this error doesn't impact parsing.
				}
				$this->state->stack_of_open_elements->pop_until( $token_name );
				return true;

			/*
			 * > A start tag whose tag name is one of: "h1", "h2", "h3", "h4", "h5", "h6"
			 */
			case '+H1':
			case '+H2':
			case '+H3':
			case '+H4':
			case '+H5':
			case '+H6':
				if ( $this->state->stack_of_open_elements->has_p_in_button_scope() ) {
					$this->close_a_p_element();
				}

				if (
					in_array(
						$this->state->stack_of_open_elements->current_node()->node_name,
						array( 'H1', 'H2', 'H3', 'H4', 'H5', 'H6' ),
						true
					)
				) {
					// @todo Indicate a parse error once it's possible.
					$this->state->stack_of_open_elements->pop();
				}

				$this->insert_html_element( $this->state->current_token );
				return true;

			/*
			 * > A start tag whose tag name is one of: "pre", "listing"
			 */
			case '+PRE':
			case '+LISTING':
				if ( $this->state->stack_of_open_elements->has_p_in_button_scope() ) {
					$this->close_a_p_element();
				}
				$this->insert_html_element( $this->state->current_token );
				$this->state->frameset_ok = false;
				return true;

			/*
			 * > An end tag whose tag name is one of: "h1", "h2", "h3", "h4", "h5", "h6"
			 */
			case '-H1':
			case '-H2':
			case '-H3':
			case '-H4':
			case '-H5':
			case '-H6':
				if ( ! $this->state->stack_of_open_elements->has_element_in_scope( '(internal: H1 through H6 - do not use)' ) ) {
					/*
					 * This is a parse error; ignore the token.
					 *
					 * @todo Indicate a parse error once it's possible.
					 */
					return $this->step();
				}

				$this->generate_implied_end_tags();

				if ( $this->state->stack_of_open_elements->current_node()->node_name !== $token_name ) {
					// @todo Record parse error: this error doesn't impact parsing.
				}

				$this->state->stack_of_open_elements->pop_until( '(internal: H1 through H6 - do not use)' );
				return true;

			/*
			 * > A start tag whose tag name is "li"
			 * > A start tag whose tag name is one of: "dd", "dt"
			 */
			case '+DD':
			case '+DT':
			case '+LI':
				$this->state->frameset_ok = false;
				$node                     = $this->state->stack_of_open_elements->current_node();
				$is_li                    = 'LI' === $token_name;

				in_body_list_loop:
				/*
				 * The logic for LI and DT/DD is the same except for one point: LI elements _only_
				 * close other LI elements, but a DT or DD element closes _any_ open DT or DD element.
				 */
				if ( $is_li ? 'LI' === $node->node_name : ( 'DD' === $node->node_name || 'DT' === $node->node_name ) ) {
					$node_name = $is_li ? 'LI' : $node->node_name;
					$this->generate_implied_end_tags( $node_name );
					if ( $node_name !== $this->state->stack_of_open_elements->current_node()->node_name ) {
						// @todo Indicate a parse error once it's possible. This error does not impact the logic here.
					}

					$this->state->stack_of_open_elements->pop_until( $node_name );
					goto in_body_list_done;
				}

				if (
					'ADDRESS' !== $node->node_name &&
					'DIV' !== $node->node_name &&
					'P' !== $node->node_name &&
					$this->is_special( $node->node_name )
				) {
					/*
					 * > If node is in the special category, but is not an address, div,
					 * > or p element, then jump to the step labeled done below.
					 */
					goto in_body_list_done;
				} else {
					/*
					 * > Otherwise, set node to the previous entry in the stack of open elements
					 * > and return to the step labeled loop.
					 */
					foreach ( $this->state->stack_of_open_elements->walk_up( $node ) as $item ) {
						$node = $item;
						break;
					}
					goto in_body_list_loop;
				}

				in_body_list_done:
				if ( $this->state->stack_of_open_elements->has_p_in_button_scope() ) {
					$this->close_a_p_element();
				}

				$this->insert_html_element( $this->state->current_token );
				return true;

			/*
			 * > An end tag whose tag name is "li"
			 * > An end tag whose tag name is one of: "dd", "dt"
			 */
			case '-DD':
			case '-DT':
			case '-LI':
				if (
					/*
					 * An end tag whose tag name is "li":
					 * If the stack of open elements does not have an li element in list item scope,
					 * then this is a parse error; ignore the token.
					 */
					(
						'LI' === $token_name &&
						! $this->state->stack_of_open_elements->has_element_in_list_item_scope( 'LI' )
					) ||
					/*
					 * An end tag whose tag name is one of: "dd", "dt":
					 * If the stack of open elements does not have an element in scope that is an
					 * HTML element with the same tag name as that of the token, then this is a
					 * parse error; ignore the token.
					 */
					(
						'LI' !== $token_name &&
						! $this->state->stack_of_open_elements->has_element_in_scope( $token_name )
					)
				) {
					/*
					 * This is a parse error, ignore the token.
					 *
					 * @todo Indicate a parse error once it's possible.
					 */
					return $this->step();
				}

				$this->generate_implied_end_tags( $token_name );

				if ( $token_name !== $this->state->stack_of_open_elements->current_node()->node_name ) {
					// @todo Indicate a parse error once it's possible. This error does not impact the logic here.
949 } 950 951 $this->state->stack_of_open_elements->pop_until( $token_name ); 952 return true; 953 954 /* 955 * > An end tag whose tag name is "p" 956 */ 957 case '-P': 958 if ( ! $this->state->stack_of_open_elements->has_p_in_button_scope() ) { 959 $this->insert_html_element( $this->state->current_token ); 960 } 961 962 $this->close_a_p_element(); 963 return true; 964 965 // > A start tag whose tag name is "a" 966 case '+A': 967 foreach ( $this->state->active_formatting_elements->walk_up() as $item ) { 968 switch ( $item->node_name ) { 969 case 'marker': 970 break; 971 972 case 'A': 973 $this->run_adoption_agency_algorithm(); 974 $this->state->active_formatting_elements->remove_node( $item ); 975 $this->state->stack_of_open_elements->remove_node( $item ); 976 break; 977 } 978 } 979 980 $this->reconstruct_active_formatting_elements(); 981 $this->insert_html_element( $this->state->current_token ); 982 $this->state->active_formatting_elements->push( $this->state->current_token ); 983 return true; 984 985 /* 986 * > A start tag whose tag name is one of: "b", "big", "code", "em", "font", "i", 987 * > "s", "small", "strike", "strong", "tt", "u" 988 */ 989 case '+B': 990 case '+BIG': 991 case '+CODE': 992 case '+EM': 993 case '+FONT': 994 case '+I': 995 case '+S': 996 case '+SMALL': 997 case '+STRIKE': 998 case '+STRONG': 999 case '+TT': 1000 case '+U': 1001 $this->reconstruct_active_formatting_elements(); 1002 $this->insert_html_element( $this->state->current_token ); 1003 $this->state->active_formatting_elements->push( $this->state->current_token ); 1004 return true; 1005 1006 /* 1007 * > An end tag whose tag name is one of: "a", "b", "big", "code", "em", "font", "i", 1008 * > "nobr", "s", "small", "strike", "strong", "tt", "u" 1009 */ 1010 case '-A': 1011 case '-B': 1012 case '-BIG': 1013 case '-CODE': 1014 case '-EM': 1015 case '-FONT': 1016 case '-I': 1017 case '-S': 1018 case '-SMALL': 1019 case '-STRIKE': 1020 case '-STRONG': 1021 case '-TT': 1022 case '-U': 1023 
$this->run_adoption_agency_algorithm(); 1024 return true; 1025 1026 /* 1027 * > An end tag whose tag name is "br" 1028 * > Parse error. Drop the attributes from the token, and act as described in the next 1029 * > entry; i.e. act as if this was a "br" start tag token with no attributes, rather 1030 * > than the end tag token that it actually is. 1031 */ 1032 case '-BR': 1033 $this->last_error = self::ERROR_UNSUPPORTED; 1034 throw new WP_HTML_Unsupported_Exception( 'Closing BR tags require unimplemented special handling.' ); 1035 1036 /* 1037 * > A start tag whose tag name is one of: "area", "br", "embed", "img", "keygen", "wbr" 1038 */ 1039 case '+AREA': 1040 case '+BR': 1041 case '+EMBED': 1042 case '+IMG': 1043 case '+KEYGEN': 1044 case '+WBR': 1045 $this->reconstruct_active_formatting_elements(); 1046 $this->insert_html_element( $this->state->current_token ); 1047 $this->state->frameset_ok = false; 1048 return true; 1049 1050 /* 1051 * > A start tag whose tag name is "input" 1052 */ 1053 case '+INPUT': 1054 $this->reconstruct_active_formatting_elements(); 1055 $this->insert_html_element( $this->state->current_token ); 1056 $type_attribute = $this->get_attribute( 'type' ); 1057 /* 1058 * > If the token does not have an attribute with the name "type", or if it does, 1059 * > but that attribute's value is not an ASCII case-insensitive match for the 1060 * > string "hidden", then: set the frameset-ok flag to "not ok". 1061 */ 1062 if ( ! 
is_string( $type_attribute ) || 'hidden' !== strtolower( $type_attribute ) ) { 1063 $this->state->frameset_ok = false; 1064 } 1065 return true; 1066 1067 /* 1068 * > A start tag whose tag name is "hr" 1069 */ 1070 case '+HR': 1071 if ( $this->state->stack_of_open_elements->has_p_in_button_scope() ) { 1072 $this->close_a_p_element(); 1073 } 1074 $this->insert_html_element( $this->state->current_token ); 1075 $this->state->frameset_ok = false; 1076 return true; 1077 1078 /* 1079 * > A start tag whose tag name is one of: "param", "source", "track" 1080 */ 1081 case '+PARAM': 1082 case '+SOURCE': 1083 case '+TRACK': 1084 $this->insert_html_element( $this->state->current_token ); 1085 return true; 1086 } 1087 1088 /* 1089 * These tags require special handling in the 'in body' insertion mode 1090 * but that handling hasn't yet been implemented. 1091 * 1092 * As the rules for each tag are implemented, the corresponding tag 1093 * name should be removed from this list. An accompanying test should 1094 * help ensure this list is maintained. 1095 * 1096 * @see Tests_HtmlApi_WpHtmlProcessor::test_step_in_body_fails_on_unsupported_tags 1097 * 1098 * Since this switch structure throws a WP_HTML_Unsupported_Exception, it's 1099 * possible to handle "any other start tag" and "any other end tag" below, 1100 * as that guarantees execution doesn't proceed for the unimplemented tags. 
1101 * 1102 * @see https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inbody 1103 */ 1104 switch ( $token_name ) { 1105 case 'APPLET': 1106 case 'BASE': 1107 case 'BASEFONT': 1108 case 'BGSOUND': 1109 case 'BODY': 1110 case 'CAPTION': 1111 case 'COL': 1112 case 'COLGROUP': 1113 case 'FORM': 1114 case 'FRAME': 1115 case 'FRAMESET': 1116 case 'HEAD': 1117 case 'HTML': 1118 case 'IFRAME': 1119 case 'LINK': 1120 case 'MARQUEE': 1121 case 'MATH': 1122 case 'META': 1123 case 'NOBR': 1124 case 'NOEMBED': 1125 case 'NOFRAMES': 1126 case 'NOSCRIPT': 1127 case 'OBJECT': 1128 case 'OPTGROUP': 1129 case 'OPTION': 1130 case 'PLAINTEXT': 1131 case 'RB': 1132 case 'RP': 1133 case 'RT': 1134 case 'RTC': 1135 case 'SARCASM': 1136 case 'SCRIPT': 1137 case 'SELECT': 1138 case 'STYLE': 1139 case 'SVG': 1140 case 'TABLE': 1141 case 'TBODY': 1142 case 'TD': 1143 case 'TEMPLATE': 1144 case 'TEXTAREA': 1145 case 'TFOOT': 1146 case 'TH': 1147 case 'THEAD': 1148 case 'TITLE': 1149 case 'TR': 1150 case 'XMP': 1151 $this->last_error = self::ERROR_UNSUPPORTED; 1152 throw new WP_HTML_Unsupported_Exception( "Cannot process {$token_name} element." ); 1153 } 1154 1155 if ( ! $this->is_tag_closer() ) { 1156 /* 1157 * > Any other start tag 1158 */ 1159 $this->reconstruct_active_formatting_elements(); 1160 $this->insert_html_element( $this->state->current_token ); 1161 return true; 1162 } else { 1163 /* 1164 * > Any other end tag 1165 */ 1166 1167 /* 1168 * Find the corresponding tag opener in the stack of open elements, if 1169 * it exists before reaching a special element, which provides a kind 1170 * of boundary in the stack. For example, a `</custom-tag>` should not 1171 * close anything beyond its containing `P` or `DIV` element. 1172 */ 1173 foreach ( $this->state->stack_of_open_elements->walk_up() as $node ) { 1174 if ( $token_name === $node->node_name ) { 1175 break; 1176 } 1177 1178 if ( self::is_special( $node->node_name ) ) { 1179 // This is a parse error, ignore the token. 
1180 return $this->step(); 1181 } 1182 } 1183 1184 $this->generate_implied_end_tags( $token_name ); 1185 if ( $node !== $this->state->stack_of_open_elements->current_node() ) { 1186 // @todo Record parse error: this error doesn't impact parsing. 1187 } 1188 1189 foreach ( $this->state->stack_of_open_elements->walk_up() as $item ) { 1190 $this->state->stack_of_open_elements->pop(); 1191 if ( $node === $item ) { 1192 return true; 1193 } 1194 } 1195 } 1196 } 1197 1198 /* 1199 * Internal helpers 1200 */ 1201 1202 /** 1203 * Creates a new bookmark for the currently-matched token and returns the generated name. 1204 * 1205 * @since 6.4.0 1206 * @since 6.5.0 Renamed from bookmark_tag() to bookmark_token(). 1207 * 1208 * @throws Exception When unable to allocate requested bookmark. 1209 * 1210 * @return string|false Name of created bookmark, or false if unable to create. 1211 */ 1212 private function bookmark_token() { 1213 if ( ! parent::set_bookmark( ++$this->bookmark_counter ) ) { 1214 $this->last_error = self::ERROR_EXCEEDED_MAX_BOOKMARKS; 1215 throw new Exception( 'could not allocate bookmark' ); 1216 } 1217 1218 return "{$this->bookmark_counter}"; 1219 } 1220 1221 /* 1222 * HTML semantic overrides for Tag Processor 1223 */ 1224 1225 /** 1226 * Returns the uppercase name of the matched tag. 1227 * 1228 * The semantic rules for HTML specify that certain tags be reprocessed 1229 * with a different tag name. Because of this, the tag name presented 1230 * by the HTML Processor may differ from the one reported by the HTML 1231 * Tag Processor, which doesn't apply these semantic rules. 
1232 * 1233 * Example: 1234 * 1235 * $processor = new WP_HTML_Tag_Processor( '<div class="test">Test</div>' ); 1236 * $processor->next_tag() === true; 1237 * $processor->get_tag() === 'DIV'; 1238 * 1239 * $processor->next_tag() === false; 1240 * $processor->get_tag() === null; 1241 * 1242 * @since 6.4.0 1243 * 1244 * @return string|null Name of currently matched tag in input HTML, or `null` if none found. 1245 */ 1246 public function get_tag() { 1247 if ( null !== $this->last_error ) { 1248 return null; 1249 } 1250 1251 $tag_name = parent::get_tag(); 1252 1253 switch ( $tag_name ) { 1254 case 'IMAGE': 1255 /* 1256 * > A start tag whose tag name is "image" 1257 * > Change the token's tag name to "img" and reprocess it. (Don't ask.) 1258 */ 1259 return 'IMG'; 1260 1261 default: 1262 return $tag_name; 1263 } 1264 } 1265 1266 /** 1267 * Removes a bookmark that is no longer needed. 1268 * 1269 * Releasing a bookmark frees up the small 1270 * performance overhead it requires. 1271 * 1272 * @since 6.4.0 1273 * 1274 * @param string $bookmark_name Name of the bookmark to remove. 1275 * @return bool Whether the bookmark already existed before removal. 1276 */ 1277 public function release_bookmark( $bookmark_name ) { 1278 return parent::release_bookmark( "_{$bookmark_name}" ); 1279 } 1280 1281 /** 1282 * Moves the internal cursor in the HTML Processor to a given bookmark's location. 1283 * 1284 * Be careful! Seeking backwards to a previous location resets the parser to the 1285 * start of the document and reparses the entire contents up until it finds the 1286 * sought-after bookmarked location. 1287 * 1288 * In order to prevent accidental infinite loops, there's a 1289 * maximum limit on the number of times seek() can be called. 1290 * 1291 * @throws Exception When unable to allocate a bookmark for the next token in the input HTML document. 
1292 * 1293 * @since 6.4.0 1294 * 1295 * @param string $bookmark_name Jump to the place in the document identified by this bookmark name. 1296 * @return bool Whether the internal cursor was successfully moved to the bookmark's location. 1297 */ 1298 public function seek( $bookmark_name ) { 1299 // Flush any pending updates to the document before beginning. 1300 $this->get_updated_html(); 1301 1302 $actual_bookmark_name = "_{$bookmark_name}"; 1303 $processor_started_at = $this->state->current_token 1304 ? $this->bookmarks[ $this->state->current_token->bookmark_name ]->start 1305 : 0; 1306 $bookmark_starts_at = $this->bookmarks[ $actual_bookmark_name ]->start; 1307 $direction = $bookmark_starts_at > $processor_started_at ? 'forward' : 'backward'; 1308 1309 /* 1310 * If seeking backwards, it's possible that the sought-after bookmark exists within an element 1311 * which has been closed before the current cursor; in other words, it has already been removed 1312 * from the stack of open elements. This means that it's insufficient to simply pop off elements 1313 * from the stack of open elements which appear after the bookmarked location and then jump to 1314 * that location, as the elements which were open before won't be re-opened. 1315 * 1316 * In order to maintain consistency, the HTML Processor rewinds to the start of the document 1317 * and reparses everything until it finds the sought-after bookmark. 1318 * 1319 * There are potentially better ways to do this: cache the parser state for each bookmark and 1320 * restore it when seeking; store an immutable and idempotent register of where elements open 1321 * and close. 1322 * 1323 * If caching the parser state it will be essential to properly maintain the cached stack of 1324 * open elements and active formatting elements when modifying the document. This could be a 1325 * tedious and time-consuming process as well, and so for now will not be performed. 
1326 * 1327 * It may be possible to track bookmarks for where elements open and close, and in doing so 1328 * be able to quickly recalculate breadcrumbs for any element in the document. It may even 1329 * be possible to remove the stack of open elements and compute it on the fly this way. 1330 * If doing this, the parser would need to track the opening and closing locations for all 1331 * tokens in the breadcrumb path for any and all bookmarks. By utilizing bookmarks themselves 1332 * this list could be automatically maintained while modifying the document. Finding the 1333 * breadcrumbs would then amount to traversing that list from the start until the token 1334 * being inspected. Once an element closes, if there are no bookmarks pointing to locations 1335 * within that element, then all of these locations may be forgotten to save on memory use 1336 * and computation time. 1337 */ 1338 if ( 'backward' === $direction ) { 1339 /* 1340 * Instead of clearing the parser state and starting fresh, calling the stack methods 1341 * maintains the proper flags in the parser. 1342 */ 1343 foreach ( $this->state->stack_of_open_elements->walk_up() as $item ) { 1344 if ( 'context-node' === $item->bookmark_name ) { 1345 break; 1346 } 1347 1348 $this->state->stack_of_open_elements->remove_node( $item ); 1349 } 1350 1351 foreach ( $this->state->active_formatting_elements->walk_up() as $item ) { 1352 if ( 'context-node' === $item->bookmark_name ) { 1353 break; 1354 } 1355 1356 $this->state->active_formatting_elements->remove_node( $item ); 1357 } 1358 1359 parent::seek( 'context-node' ); 1360 $this->state->insertion_mode = WP_HTML_Processor_State::INSERTION_MODE_IN_BODY; 1361 $this->state->frameset_ok = true; 1362 } 1363 1364 // When moving forwards, reparse the document until reaching the same location as the original bookmark. 
1365 if ( $bookmark_starts_at === $this->bookmarks[ $this->state->current_token->bookmark_name ]->start ) { 1366 return true; 1367 } 1368 1369 while ( $this->step() ) { 1370 if ( $bookmark_starts_at === $this->bookmarks[ $this->state->current_token->bookmark_name ]->start ) { 1371 return true; 1372 } 1373 } 1374 1375 return false; 1376 } 1377 1378 /** 1379 * Sets a bookmark in the HTML document. 1380 * 1381 * Bookmarks represent specific places or tokens in the HTML 1382 * document, such as a tag opener or closer. When applying 1383 * edits to a document, such as setting an attribute, the 1384 * text offsets of that token may shift; the bookmark is 1385 * kept updated with those shifts and remains stable unless 1386 * the entire span of text in which the token sits is removed. 1387 * 1388 * Release bookmarks when they are no longer needed. 1389 * 1390 * Example: 1391 * 1392 * <main><h2>Surprising fact you may not know!</h2></main> 1393 * ^ ^ 1394 * \-|-- this `H2` opener bookmark tracks the token 1395 * 1396 * <main class="clickbait"><h2>Surprising fact you may no… 1397 * ^ ^ 1398 * \-|-- it shifts with edits 1399 * 1400 * Bookmarks provide the ability to seek to a previously-scanned 1401 * place in the HTML document. This avoids the need to re-scan 1402 * the entire document. 1403 * 1404 * Example: 1405 * 1406 * <ul><li>One</li><li>Two</li><li>Three</li></ul> 1407 * ^^^^ 1408 * want to note this last item 1409 * 1410 * $p = new WP_HTML_Tag_Processor( $html ); 1411 * $in_list = false; 1412 * while ( $p->next_tag( array( 'tag_closers' => $in_list ? 
'visit' : 'skip' ) ) ) { 1413 * if ( 'UL' === $p->get_tag() ) { 1414 * if ( $p->is_tag_closer() ) { 1415 * $in_list = false; 1416 * $p->set_bookmark( 'resume' ); 1417 * if ( $p->seek( 'last-li' ) ) { 1418 * $p->add_class( 'last-li' ); 1419 * } 1420 * $p->seek( 'resume' ); 1421 * $p->release_bookmark( 'last-li' ); 1422 * $p->release_bookmark( 'resume' ); 1423 * } else { 1424 * $in_list = true; 1425 * } 1426 * } 1427 * 1428 * if ( 'LI' === $p->get_tag() ) { 1429 * $p->set_bookmark( 'last-li' ); 1430 * } 1431 * } 1432 * 1433 * Bookmarks intentionally hide the internal string offsets 1434 * to which they refer. They are maintained internally as 1435 * updates are applied to the HTML document and therefore 1436 * retain their "position" - the location to which they 1437 * originally pointed. The inability to use bookmarks with 1438 * functions like `substr` is therefore intentional to guard 1439 * against accidentally breaking the HTML. 1440 * 1441 * Because bookmarks allocate memory and require processing 1442 * for every applied update, they are limited and require 1443 * a name. They should not be created with programmatically-made 1444 * names, such as "li_{$index}" with some loop. As a general 1445 * rule they should only be created with string-literal names 1446 * like "start-of-section" or "last-paragraph". 1447 * 1448 * Bookmarks are a powerful tool to enable complicated behavior. 1449 * Consider double-checking that you need this tool if you are 1450 * reaching for it, as inappropriate use could lead to broken 1451 * HTML structure or unwanted processing overhead. 1452 * 1453 * @since 6.4.0 1454 * 1455 * @param string $bookmark_name Identifies this particular bookmark. 1456 * @return bool Whether the bookmark was successfully created. 1457 */ 1458 public function set_bookmark( $bookmark_name ) { 1459 return parent::set_bookmark( "_{$bookmark_name}" ); 1460 } 1461 1462 /** 1463 * Checks whether a bookmark with the given name exists. 
1464 * 1465 * @since 6.5.0 1466 * 1467 * @param string $bookmark_name Name to identify a bookmark that potentially exists. 1468 * @return bool Whether that bookmark exists. 1469 */ 1470 public function has_bookmark( $bookmark_name ) { 1471 return parent::has_bookmark( "_{$bookmark_name}" ); 1472 } 1473 1474 /* 1475 * HTML Parsing Algorithms 1476 */ 1477 1478 /** 1479 * Closes a P element. 1480 * 1481 * @since 6.4.0 1482 * 1483 * @throws WP_HTML_Unsupported_Exception When encountering unsupported HTML input. 1484 * 1485 * @see https://html.spec.whatwg.org/#close-a-p-element 1486 */ 1487 private function close_a_p_element() { 1488 $this->generate_implied_end_tags( 'P' ); 1489 $this->state->stack_of_open_elements->pop_until( 'P' ); 1490 } 1491 1492 /** 1493 * Closes elements that have implied end tags. 1494 * 1495 * @since 6.4.0 1496 * 1497 * @see https://html.spec.whatwg.org/#generate-implied-end-tags 1498 * 1499 * @param string|null $except_for_this_element Perform as if this element doesn't exist in the stack of open elements. 1500 */ 1501 private function generate_implied_end_tags( $except_for_this_element = null ) { 1502 $elements_with_implied_end_tags = array( 1503 'DD', 1504 'DT', 1505 'LI', 1506 'P', 1507 ); 1508 1509 $current_node = $this->state->stack_of_open_elements->current_node(); 1510 while ( 1511 $current_node && $current_node->node_name !== $except_for_this_element && 1512 in_array( $current_node->node_name, $elements_with_implied_end_tags, true ) 1513 ) { 1514 $this->state->stack_of_open_elements->pop(); $current_node = $this->state->stack_of_open_elements->current_node(); 1515 } 1516 } 1517 1518 /** 1519 * Closes elements that have implied end tags, thoroughly. 1520 * 1521 * See the HTML specification for an explanation why this is 1522 * different from generating end tags in the normal sense.
1523 * 1524 * @since 6.4.0 1525 * 1526 * @see WP_HTML_Processor::generate_implied_end_tags 1527 * @see https://html.spec.whatwg.org/#generate-implied-end-tags 1528 */ 1529 private function generate_implied_end_tags_thoroughly() { 1530 $elements_with_implied_end_tags = array( 1531 'DD', 1532 'DT', 1533 'LI', 1534 'P', 1535 ); 1536 $current_node = $this->state->stack_of_open_elements->current_node(); 1537 while ( $current_node && in_array( $current_node->node_name, $elements_with_implied_end_tags, true ) ) { 1538 $this->state->stack_of_open_elements->pop(); $current_node = $this->state->stack_of_open_elements->current_node(); 1539 } 1540 } 1541 1542 /** 1543 * Reconstructs the active formatting elements. 1544 * 1545 * > This has the effect of reopening all the formatting elements that were opened 1546 * > in the current body, cell, or caption (whichever is youngest) that haven't 1547 * > been explicitly closed. 1548 * 1549 * @since 6.4.0 1550 * 1551 * @throws WP_HTML_Unsupported_Exception When encountering unsupported HTML input. 1552 * 1553 * @see https://html.spec.whatwg.org/#reconstruct-the-active-formatting-elements 1554 * 1555 * @return bool Whether any formatting elements needed to be reconstructed. 1556 */ 1557 private function reconstruct_active_formatting_elements() { 1558 /* 1559 * > If there are no entries in the list of active formatting elements, then there is nothing 1560 * > to reconstruct; stop this algorithm. 1561 */ 1562 if ( 0 === $this->state->active_formatting_elements->count() ) { 1563 return false; 1564 } 1565 1566 $last_entry = $this->state->active_formatting_elements->current_node(); 1567 if ( 1568 1569 /* 1570 * > If the last (most recently added) entry in the list of active formatting elements is a marker; 1571 * > stop this algorithm. 1572 */ 1573 'marker' === $last_entry->node_name || 1574 1575 /* 1576 * > If the last (most recently added) entry in the list of active formatting elements is an 1577 * > element that is in the stack of open elements, then there is nothing to reconstruct; 1578 * > stop this algorithm.
1579 */ 1580 $this->state->stack_of_open_elements->contains_node( $last_entry ) 1581 ) { 1582 return false; 1583 } 1584 1585 $this->last_error = self::ERROR_UNSUPPORTED; 1586 throw new WP_HTML_Unsupported_Exception( 'Cannot reconstruct active formatting elements when advancing and rewinding is required.' ); 1587 } 1588 1589 /** 1590 * Runs the adoption agency algorithm. 1591 * 1592 * @since 6.4.0 1593 * 1594 * @throws WP_HTML_Unsupported_Exception When encountering unsupported HTML input. 1595 * 1596 * @see https://html.spec.whatwg.org/#adoption-agency-algorithm 1597 */ 1598 private function run_adoption_agency_algorithm() { 1599 $budget = 1000; 1600 $subject = $this->get_tag(); 1601 $current_node = $this->state->stack_of_open_elements->current_node(); 1602 1603 if ( 1604 // > If the current node is an HTML element whose tag name is subject 1605 $current_node && $subject === $current_node->node_name && 1606 // > the current node is not in the list of active formatting elements 1607 ! $this->state->active_formatting_elements->contains_node( $current_node ) 1608 ) { 1609 $this->state->stack_of_open_elements->pop(); 1610 return; 1611 } 1612 1613 $outer_loop_counter = 0; 1614 while ( $budget-- > 0 ) { 1615 if ( $outer_loop_counter++ >= 8 ) { 1616 return; 1617 } 1618 1619 /* 1620 * > Let formatting element be the last element in the list of active formatting elements that: 1621 * > - is between the end of the list and the last marker in the list, 1622 * > if any, or the start of the list otherwise, 1623 * > - and has the tag name subject. 1624 */ 1625 $formatting_element = null; 1626 foreach ( $this->state->active_formatting_elements->walk_up() as $item ) { 1627 if ( 'marker' === $item->node_name ) { 1628 break; 1629 } 1630 1631 if ( $subject === $item->node_name ) { 1632 $formatting_element = $item; 1633 break; 1634 } 1635 } 1636 1637 // > If there is no such element, then return and instead act as described in the "any other end tag" entry above. 
1638 if ( null === $formatting_element ) { 1639 $this->last_error = self::ERROR_UNSUPPORTED; 1640 throw new WP_HTML_Unsupported_Exception( 'Cannot run adoption agency when "any other end tag" is required.' ); 1641 } 1642 1643 // > If formatting element is not in the stack of open elements, then this is a parse error; remove the element from the list, and return. 1644 if ( ! $this->state->stack_of_open_elements->contains_node( $formatting_element ) ) { 1645 $this->state->active_formatting_elements->remove_node( $formatting_element ); 1646 return; 1647 } 1648 1649 // > If formatting element is in the stack of open elements, but the element is not in scope, then this is a parse error; return. 1650 if ( ! $this->state->stack_of_open_elements->has_element_in_scope( $formatting_element->node_name ) ) { 1651 return; 1652 } 1653 1654 /* 1655 * > Let furthest block be the topmost node in the stack of open elements that is lower in the stack 1656 * > than formatting element, and is an element in the special category. There might not be one. 1657 */ 1658 $is_above_formatting_element = true; 1659 $furthest_block = null; 1660 foreach ( $this->state->stack_of_open_elements->walk_down() as $item ) { 1661 if ( $is_above_formatting_element && $formatting_element->bookmark_name !== $item->bookmark_name ) { 1662 continue; 1663 } 1664 1665 if ( $is_above_formatting_element ) { 1666 $is_above_formatting_element = false; 1667 continue; 1668 } 1669 1670 if ( self::is_special( $item->node_name ) ) { 1671 $furthest_block = $item; 1672 break; 1673 } 1674 } 1675 1676 /* 1677 * > If there is no furthest block, then the UA must first pop all the nodes from the bottom of the 1678 * > stack of open elements, from the current node up to and including formatting element, then 1679 * > remove formatting element from the list of active formatting elements, and finally return. 
1680 */ 1681 if ( null === $furthest_block ) { 1682 foreach ( $this->state->stack_of_open_elements->walk_up() as $item ) { 1683 $this->state->stack_of_open_elements->pop(); 1684 1685 if ( $formatting_element->bookmark_name === $item->bookmark_name ) { 1686 $this->state->active_formatting_elements->remove_node( $formatting_element ); 1687 return; 1688 } 1689 } 1690 } 1691 1692 $this->last_error = self::ERROR_UNSUPPORTED; 1693 throw new WP_HTML_Unsupported_Exception( 'Cannot extract common ancestor in adoption agency algorithm.' ); 1694 } 1695 1696 $this->last_error = self::ERROR_UNSUPPORTED; 1697 throw new WP_HTML_Unsupported_Exception( 'Cannot run adoption agency when looping required.' ); 1698 } 1699 1700 /** 1701 * Inserts an HTML element on the stack of open elements. 1702 * 1703 * @since 6.4.0 1704 * 1705 * @see https://html.spec.whatwg.org/#insert-a-foreign-element 1706 * 1707 * @param WP_HTML_Token $token Name of bookmark pointing to element in original input HTML. 1708 */ 1709 private function insert_html_element( $token ) { 1710 $this->state->stack_of_open_elements->push( $token ); 1711 } 1712 1713 /* 1714 * HTML Specification Helpers 1715 */ 1716 1717 /** 1718 * Returns whether an element of a given name is in the HTML special category. 1719 * 1720 * @since 6.4.0 1721 * 1722 * @see https://html.spec.whatwg.org/#special 1723 * 1724 * @param string $tag_name Name of element to check. 1725 * @return bool Whether the element of the given name is in the special category. 
1726 */ 1727 public static function is_special( $tag_name ) { 1728 $tag_name = strtoupper( $tag_name ); 1729 1730 return ( 1731 'ADDRESS' === $tag_name || 1732 'APPLET' === $tag_name || 1733 'AREA' === $tag_name || 1734 'ARTICLE' === $tag_name || 1735 'ASIDE' === $tag_name || 1736 'BASE' === $tag_name || 1737 'BASEFONT' === $tag_name || 1738 'BGSOUND' === $tag_name || 1739 'BLOCKQUOTE' === $tag_name || 1740 'BODY' === $tag_name || 1741 'BR' === $tag_name || 1742 'BUTTON' === $tag_name || 1743 'CAPTION' === $tag_name || 1744 'CENTER' === $tag_name || 1745 'COL' === $tag_name || 1746 'COLGROUP' === $tag_name || 1747 'DD' === $tag_name || 1748 'DETAILS' === $tag_name || 1749 'DIR' === $tag_name || 1750 'DIV' === $tag_name || 1751 'DL' === $tag_name || 1752 'DT' === $tag_name || 1753 'EMBED' === $tag_name || 1754 'FIELDSET' === $tag_name || 1755 'FIGCAPTION' === $tag_name || 1756 'FIGURE' === $tag_name || 1757 'FOOTER' === $tag_name || 1758 'FORM' === $tag_name || 1759 'FRAME' === $tag_name || 1760 'FRAMESET' === $tag_name || 1761 'H1' === $tag_name || 1762 'H2' === $tag_name || 1763 'H3' === $tag_name || 1764 'H4' === $tag_name || 1765 'H5' === $tag_name || 1766 'H6' === $tag_name || 1767 'HEAD' === $tag_name || 1768 'HEADER' === $tag_name || 1769 'HGROUP' === $tag_name || 1770 'HR' === $tag_name || 1771 'HTML' === $tag_name || 1772 'IFRAME' === $tag_name || 1773 'IMG' === $tag_name || 1774 'INPUT' === $tag_name || 1775 'KEYGEN' === $tag_name || 1776 'LI' === $tag_name || 1777 'LINK' === $tag_name || 1778 'LISTING' === $tag_name || 1779 'MAIN' === $tag_name || 1780 'MARQUEE' === $tag_name || 1781 'MENU' === $tag_name || 1782 'META' === $tag_name || 1783 'NAV' === $tag_name || 1784 'NOEMBED' === $tag_name || 1785 'NOFRAMES' === $tag_name || 1786 'NOSCRIPT' === $tag_name || 1787 'OBJECT' === $tag_name || 1788 'OL' === $tag_name || 1789 'P' === $tag_name || 1790 'PARAM' === $tag_name || 1791 'PLAINTEXT' === $tag_name || 1792 'PRE' === $tag_name || 1793 'SCRIPT' === 
$tag_name || 1794 'SEARCH' === $tag_name || 1795 'SECTION' === $tag_name || 1796 'SELECT' === $tag_name || 1797 'SOURCE' === $tag_name || 1798 'STYLE' === $tag_name || 1799 'SUMMARY' === $tag_name || 1800 'TABLE' === $tag_name || 1801 'TBODY' === $tag_name || 1802 'TD' === $tag_name || 1803 'TEMPLATE' === $tag_name || 1804 'TEXTAREA' === $tag_name || 1805 'TFOOT' === $tag_name || 1806 'TH' === $tag_name || 1807 'THEAD' === $tag_name || 1808 'TITLE' === $tag_name || 1809 'TR' === $tag_name || 1810 'TRACK' === $tag_name || 1811 'UL' === $tag_name || 1812 'WBR' === $tag_name || 1813 'XMP' === $tag_name || 1814 1815 // MathML. 1816 'MI' === $tag_name || 1817 'MO' === $tag_name || 1818 'MN' === $tag_name || 1819 'MS' === $tag_name || 1820 'MTEXT' === $tag_name || 1821 'ANNOTATION-XML' === $tag_name || 1822 1823 // SVG. 1824 'FOREIGNOBJECT' === $tag_name || 1825 'DESC' === $tag_name || 1826 'TITLE' === $tag_name 1827 ); 1828 } 1829 1830 /** 1831 * Returns whether a given element is an HTML Void Element 1832 * 1833 * > area, base, br, col, embed, hr, img, input, link, meta, source, track, wbr 1834 * 1835 * @since 6.4.0 1836 * 1837 * @see https://html.spec.whatwg.org/#void-elements 1838 * 1839 * @param string $tag_name Name of HTML tag to check. 1840 * @return bool Whether the given tag is an HTML Void Element. 1841 */ 1842 public static function is_void( $tag_name ) { 1843 $tag_name = strtoupper( $tag_name ); 1844 1845 return ( 1846 'AREA' === $tag_name || 1847 'BASE' === $tag_name || 1848 'BASEFONT' === $tag_name || // Obsolete but still treated as void. 1849 'BGSOUND' === $tag_name || // Obsolete but still treated as void. 1850 'BR' === $tag_name || 1851 'COL' === $tag_name || 1852 'EMBED' === $tag_name || 1853 'FRAME' === $tag_name || 1854 'HR' === $tag_name || 1855 'IMG' === $tag_name || 1856 'INPUT' === $tag_name || 1857 'KEYGEN' === $tag_name || // Obsolete but still treated as void. 
1858 'LINK' === $tag_name || 1859 'META' === $tag_name || 1860 'PARAM' === $tag_name || // Obsolete but still treated as void. 1861 'SOURCE' === $tag_name || 1862 'TRACK' === $tag_name || 1863 'WBR' === $tag_name 1864 ); 1865 } 1866 1867 /* 1868 * Constants that would pollute the top of the class if they were found there. 1869 */ 1870 1871 /** 1872 * Indicates that the next HTML token should be parsed and processed. 1873 * 1874 * @since 6.4.0 1875 * 1876 * @var string 1877 */ 1878 const PROCESS_NEXT_NODE = 'process-next-node'; 1879 1880 /** 1881 * Indicates that the current HTML token should be reprocessed in the newly-selected insertion mode. 1882 * 1883 * @since 6.4.0 1884 * 1885 * @var string 1886 */ 1887 const REPROCESS_CURRENT_NODE = 'reprocess-current-node'; 1888 1889 /** 1890 * Indicates that the current HTML token should be processed without advancing the parser. 1891 * 1892 * @since 6.5.0 1893 * 1894 * @var string 1895 */ 1896 const PROCESS_CURRENT_NODE = 'process-current-node'; 1897 1898 /** 1899 * Indicates that the parser encountered unsupported markup and has bailed. 1900 * 1901 * @since 6.4.0 1902 * 1903 * @var string 1904 */ 1905 const ERROR_UNSUPPORTED = 'unsupported'; 1906 1907 /** 1908 * Indicates that the parser encountered more HTML tokens than it 1909 * was able to process and has bailed. 1910 * 1911 * @since 6.4.0 1912 * 1913 * @var string 1914 */ 1915 const ERROR_EXCEEDED_MAX_BOOKMARKS = 'exceeded-max-bookmarks'; 1916 1917 /** 1918 * Unlock code that must be passed into the constructor to create this class. 1919 * 1920 * This class extends the WP_HTML_Tag_Processor, which has a public class 1921 * constructor. Therefore, it's not possible to have a private constructor here. 1922 * 1923 * This unlock code is used to ensure that anyone calling the constructor is 1924 * doing so with a full understanding that it's intended to be a private API. 
1925 * 1926 * @access private 1927 */ 1928 const CONSTRUCTOR_UNLOCK_CODE = 'Use WP_HTML_Processor::create_fragment() instead of calling the class constructor directly.'; 1929 }
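The "generate implied end tags" step that `close_a_p_element()` and the list-item handling in the listing above rely on can be illustrated outside of WordPress. The following is a minimal stand-alone sketch, not the class's actual implementation: it assumes the stack of open elements is a plain array of uppercase tag names, and the function name `sketch_generate_implied_end_tags()` is hypothetical.

```php
<?php
/**
 * Hypothetical stand-alone sketch; not part of WP_HTML_Processor.
 *
 * Pops elements with implied end tags (DD, DT, LI, P) off the end of a
 * simplified stack of open elements, stopping at the first element that
 * is either the excepted tag name or not in the implied-end-tag list.
 *
 * @param string[]    $open_elements Tag names, outermost element first.
 * @param string|null $except_for    Tag name to leave on the stack.
 * @return string[] The stack after implied end tags have been generated.
 */
function sketch_generate_implied_end_tags( array $open_elements, $except_for = null ) {
	$implied = array( 'DD', 'DT', 'LI', 'P' );

	while ( count( $open_elements ) > 0 ) {
		// Re-read the current node on every pass, since popping changes it.
		$current = $open_elements[ count( $open_elements ) - 1 ];

		if ( $current === $except_for || ! in_array( $current, $implied, true ) ) {
			break;
		}

		array_pop( $open_elements );
	}

	return $open_elements;
}

// An open P inside an LI is closed implicitly; the LI itself is excepted.
$stack = sketch_generate_implied_end_tags( array( 'HTML', 'BODY', 'UL', 'LI', 'P' ), 'LI' );
echo implode( ' > ', $stack ), "\n";
```

The sketch compares tag names rather than whole stack entries, and it refreshes the current node after each pop so the loop condition always reflects the top of the stack.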
Generated : Sat Apr 27 08:20:02 2024 | Cross-referenced by PHPXref |