HTML (html)
The html module parses real-world HTML and lets you query it with CSS
selectors. It uses a lenient HTML5 parser (the same parsing rules a browser
follows), so it copes with unclosed tags, missing <html>/<body> wrappers,
and the other irregularities of pages in the wild. Reach for html when you
are scraping or extracting data from a page; use the xml module only for
strict, well-formed XML.
import html;
let doc = html.parse(pageSource);
let heading = doc.selectFirst("h1").text();
for (link in doc.select("a[href]")) {
    io.println(link.attr("href"));
}
Parsing
html.parse(source: string): Node parses a document and returns its root
Node. Parsing is lenient and never throws on malformed markup:
- A fragment is wrapped in an implied <html><body> tree, so html.parse("<li>x</li>") still has a body you can query.
- Unclosed and misnested tags are repaired the way a browser would repair them.
The returned root reports its tag as #document. The document element and
body are reachable through selectors or traversal.
let doc = html.parse("<li>only</li>");
doc.tag(); # "#document"
doc.selectFirst("body").children(); # [<html.Node li>]
Nodes
Every element and the document root are Node values. A node carries these
methods:
| Method | Returns | Description |
|---|---|---|
| select(selector) | list<Node> | Every descendant element matching the CSS selector, in document order. |
| selectFirst(selector) | ?Node | The first matching descendant, or null if none match. |
| text() | string | The concatenated text of this node and all its descendants. Whitespace is preserved as written. |
| attr(name) | ?string | The value of the named attribute, or null if the node has no such attribute. |
| attrs() | dict<string, string> | All attributes of the node as a dict. |
| tag() | string | The element's lowercased tag name (#document for the root, "" for non-elements). |
| html() | string | The node's inner HTML (its children serialized back to markup). |
| children() | list<Node> | The node's direct child elements. Text and comment nodes are skipped. |
| parent() | ?Node | The parent node, or null for the root. |
let doc = html.parse("<article><h1 id=\"t\">Title</h1><p>Body <em>text</em>.</p></article>");
let h1 = doc.selectFirst("h1");
h1.text(); # "Title"
h1.tag(); # "h1"
h1.attr("id"); # "t"
h1.attr("class"); # null
let p = doc.selectFirst("p");
p.text(); # "Body text." (descendant text is included)
p.html(); # "Body <em>text</em>."
p.parent().tag(); # "article"
doc.selectFirst("article").children().length(); # 2 (h1 and p)
select returns every match; selectFirst short-circuits to the first and
returns null when nothing matches, so guard it before use:
let nav = doc.selectFirst("nav");
if (nav != null) {
    io.println(nav.html());
}
CSS selectors
select and selectFirst accept standard CSS selectors. The supported syntax
includes:
- Type, class, id, and universal: div, .title, #main, *.
- Attributes: [href] (present), [type="text"] (exact), and the operators ^= (prefix), $= (suffix), *= (substring), ~= (whitespace-separated word), |= (exact value, or that value followed by a hyphen).
- Combinators: descendant (ul li), child (ul > li), adjacent sibling (h1 + p), and general sibling (h1 ~ p).
- Pseudo-classes such as :first-child, :last-child, :nth-child(n), :not(...), and the other structural pseudo-classes.
- Grouping with a comma: h1, h2, h3 matches any of them.
doc.select("ul > li:nth-child(2)"); # the second list item
doc.select("a[href^=\"https://\"]"); # links whose href starts with https://
doc.select("p:not(.footnote)"); # paragraphs except footnotes
doc.select("h1, h2"); # every h1 and h2 heading
An invalid selector throws a RuntimeError naming the offending selector, so a
typo fails loudly rather than silently matching nothing.
Examples
Extract every link with its text:
import html;
let doc = html.parse(pageSource);
for (a in doc.select("a[href]")) {
    io.println(a.text() + " -> " + (a.attr("href") as string));
}
Pull rows out of a table:
let rows = [];
for (tr in doc.select("table.data tr")) {
    let cells = [];
    for (td in tr.select("td")) {
        cells.push(td.text());
    }
    rows.push(cells);
}
Read the title of the article that a redirect resolved to (using the http module):
import html;
import http;
let resp = http.get("https://en.wikipedia.org/wiki/Special:Random");
let title = html.parse(resp.text()).selectFirst("h1").text();
io.println(title + " (" + resp.url() + ")");
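Because selectFirst returns null when nothing matches, a page without an h1 would leave nothing to call text() on. A more defensive variant of the same example might look like the sketch below. It uses only calls that already appear in this section and assumes, as the example above does, that the io functions are available without an import.

```
import html;
import http;

let resp = http.get("https://en.wikipedia.org/wiki/Special:Random");
let h1 = html.parse(resp.text()).selectFirst("h1");

# Guard the nullable result before calling methods on it.
if (h1 != null) {
    io.println(h1.text() + " (" + resp.url() + ")");
}
```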
Notes
- text() walks the whole subtree, so it returns the visible text of nested elements too; it does not collapse or trim whitespace.
- html() is the node's inner HTML. To recover a child's own markup, select the child and call html() on it.
- children() yields element children only. To reach text content, use text().
- Nodes are ordinary garbage-collected values. Holding any node keeps its document tree alive (parents and siblings remain reachable); once you drop all references to a parsed document, it is collected like any other value.
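As a small illustration of how children(), text(), and html() differ, the sketch below reuses the paragraph from the Nodes example. The values in the comments follow from the behavior described in this section rather than from a recorded run.

```
import html;

let doc = html.parse("<p>Body <em>text</em>.</p>");
let p = doc.selectFirst("p");

p.children().length();       # 1 (only the em element; the text nodes are skipped)
p.text();                    # "Body text." (text of p and of the nested em)
p.html();                    # "Body <em>text</em>." (inner HTML of p)
p.selectFirst("em").html();  # "text" (inner HTML of the selected child)
```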