Syntax error when using XPath to extract node from HTML when <script> tag is present with '<' character in it #135

@cleanhands

Problem Statement

When XPath is used on an HTML document, xq fails with a syntax error if the document contains JavaScript that uses the less-than symbol ('<').

The documentation says XPath can be used on XML documents. If I understand correctly, HTML5 is a valid XML document. The same failure may also occur with other kinds of XML documents, depending on the data they contain. Fixing this would likely also make XPath work on earlier versions of HTML in the same scenario.

Using CSS selectors on the same document works fine.

Steps to Reproduce

Example HTML5 document:

<!DOCTYPE html>
<html lang="en">

<head>
  <title>HTML5 Example</title>
  <script>
    for (let i=0; i < 3; i++) {
      console.log(i);
    }
  </script>
</head>

<body>
  <h1>Page Title</h1>
  <a href="https://example.com">This is a link</a>
</body>

</html>

Example command:

xq -e '//a/@href' example.html

Actual Result

Error: XML syntax error on line 7: expected element name after <
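
The error is consistent with the document being handed to a strict XML parser, which reads the '<' in `i < 3` as the start of a tag. The sketch below reproduces that behaviour with Python's standard-library XML parser. It is only an illustration: it is not xq's own code, and the helper name `strict_xml_error` is made up for this example.

```python
import xml.etree.ElementTree as ET

# The example document from "Steps to Reproduce", unchanged.
HTML_DOC = """<!DOCTYPE html>
<html lang="en">

<head>
  <title>HTML5 Example</title>
  <script>
    for (let i=0; i < 3; i++) {
      console.log(i);
    }
  </script>
</head>

<body>
  <h1>Page Title</h1>
  <a href="https://example.com">This is a link</a>
</body>

</html>
"""


def strict_xml_error(doc):
    """Return the ParseError a strict XML parser raises for doc, or None.

    Hypothetical helper for this illustration only.
    """
    try:
        ET.fromstring(doc)
    except ET.ParseError as err:
        return err
    return None


error = strict_xml_error(HTML_DOC)
if error is not None:
    # ParseError.position is a (line, column) tuple.
    line, _column = error.position
    print(f"strict XML parse failed on line {line}")
```

The line it reports should be the `i < 3` line inside the script block, the same place xq points at.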

Expected Result

https://example.com
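
For comparison, an HTML-aware parser treats the contents of a script element as raw text, so the '<' does not break parsing and the link can still be found. The sketch below uses Python's standard-library `html.parser` to pull out the same value that `//a/@href` selects. It is an illustration of the expected behaviour, not a proposal for how xq should implement it; `HrefCollector` and `extract_hrefs` are hypothetical names.

```python
from html.parser import HTMLParser

# The example document from "Steps to Reproduce", unchanged.
HTML_DOC = """<!DOCTYPE html>
<html lang="en">

<head>
  <title>HTML5 Example</title>
  <script>
    for (let i=0; i < 3; i++) {
      console.log(i);
    }
  </script>
</head>

<body>
  <h1>Page Title</h1>
  <a href="https://example.com">This is a link</a>
</body>

</html>
"""


class HrefCollector(HTMLParser):
    """Collect the href attribute of every <a> start tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None for bare attributes.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.hrefs.append(value)


def extract_hrefs(doc):
    """Return all <a href> values in doc, roughly what //a/@href selects."""
    collector = HrefCollector()
    collector.feed(doc)
    collector.close()
    return collector.hrefs


for href in extract_hrefs(HTML_DOC):
    print(href)
```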

Labels

bug: Something isn't working