Fix tokenizer to handle < in tag names per HTML5 spec · anil.recoil.org/ocaml-html5rw@6c4eb02

OCaml HTML5 parser/serialiser based on Python's JustHTML

Fix tokenizer to handle < in tag names per HTML5 spec

Per WHATWG spec section 13.2.5.8 (Tag name state), a '<' encountered while
parsing a tag name falls under "anything else" handling: it is appended to
the current tag token's tag name. The tokenizer must not emit the current
tag and switch to the tag open state, and no parse error is reported.

This fixes 3 tree-construction test failures:
- <div<div> now correctly parses as a single element named "div<div"
- <p>Test</p<p>Test2</p> now correctly handles </p<p> as one invalid end tag named "p<p"
- <option><XH<optgroup> now correctly parses XH<optgroup as one element name (lowercased to "xh<optgroup")
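
The rule above can be illustrated with a minimal, self-contained sketch of the tag name state. This is not the html5rw implementation: scan_tag_name is a hypothetical helper that works on a plain string and index instead of the tokenizer's stream, and it ignores attributes, error reporting, and token emission. It only shows which characters end a tag name and that '<' is not one of them.

```ocaml
(* Hypothetical helper, for illustration only (not part of html5rw).
   Reads a tag name from [s] starting at index [i], which should point
   just past "<" or "</". Returns the name and the index of the
   character that ended it (or the string length at end of input). *)
let scan_tag_name (s : string) (i : int) : string * int =
  let buf = Buffer.create 16 in
  let rec go i =
    if i >= String.length s then (Buffer.contents buf, i)
    else
      match s.[i] with
      (* Whitespace, '/' and '>' are the only characters that end the name. *)
      | '\t' | '\n' | '\x0C' | ' ' | '/' | '>' -> (Buffer.contents buf, i)
      (* NULL becomes U+FFFD (UTF-8 encoded); the spec also reports an error. *)
      | '\x00' ->
        Buffer.add_string buf "\xEF\xBF\xBD";
        go (i + 1)
      (* ASCII upper-case letters are lowercased. *)
      | 'A' .. 'Z' as c ->
        Buffer.add_char buf (Char.lowercase_ascii c);
        go (i + 1)
      (* "Anything else", including '<', is appended unchanged. *)
      | c ->
        Buffer.add_char buf c;
        go (i + 1)
  in
  go i

let () =
  (* The three inputs from the commit message; indices point past "<" or "</". *)
  assert (scan_tag_name "<div<div>" 1 = ("div<div", 8));
  assert (scan_tag_name "<p>Test</p<p>Test2</p>" 9 = ("p<p", 12));
  assert (scan_tag_name "<option><XH<optgroup>" 9 = ("xh<optgroup", 20))
```

In this sketch the third input yields "xh<optgroup" rather than "XH<optgroup" because upper-case ASCII letters are lowercased while the name is accumulated.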

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

anil.recoil.org 2 months ago 6c4eb02f 26a9feba

+6 -5

1 changed file

lib/html5rw/tokenizer/tokenizer_impl.ml

@@ -727,11 +727,12 @@
         error t "unexpected-null-character";
         Buffer.add_string t.current_tag_name "\xEF\xBF\xBD"
     | Some '<' ->
-        (* Per HTML5 spec: emit error and reconsume in tag open state *)
-        error t "unexpected-character-in-tag-name";
-        (* Emit current tag as-is before starting new tag *)
-        emit_current_tag ();
-        t.state <- Tokenizer_state.Tag_open
+        (* Per HTML5 spec section 13.2.5.8: '<' is "anything else" - append to tag name.
+           Note: The previous implementation incorrectly emitted the tag and switched
+           to tag open state. The spec says to just append the character to the tag name
+           without emitting an error. *)
+        Tokenizer_stream.advance t.stream;
+        Buffer.add_char t.current_tag_name '<'
     | Some c ->
         Tokenizer_stream.advance t.stream;
         check_control_char c;