html - How does Nokogiri parse an HTML-formatted string into a DOM

Question

I have been digging through the Nokogiri source code, but I cannot figure out how Nokogiri parses a string into Elements. The source code can be found here:

https://github.com/sparklemotion/nokogiri/tree/master/lib/nokogiri

For example, I have a string:

raw = "<html> <body> body <div>this is div </div> </body> <html>"

Nokogiri::HTML(raw)
=> 
#(Document:0x4d0c786 {
  name = "document",
  children = [
    #(DTD:0x4d0bc6e { name = "html" }),
    #(Element:0x4cfa46e {
      name = "html",
      children = [
        #(Element:0x4cf9bfe {
          name = "body",
          children = [
            #(Text "body"),
            #(Element:0x4cf9348 {
              name = "div",
              children = [ #(Text "this is div")]
              })]
          })]
      })]
  })

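I looked into nokogiri/lib/nokogiri/xml/sax, but I cannot see how it interprets the HTML string. While reading the source, I also noticed that the output above contains the data type Element, yet I did not find a class Element declaration anywhere in the source code.

In general, can anyone explain how Nokogiri parses a string into the data structure above?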

score 2 · Accepted Answer

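As mentioned, Nokogiri uses libxml2 to do the actual parsing. This is done through a native (read: written in C) Ruby extension, which is why the parsing logic itself does not show up in the Ruby files under lib/nokogiri. Ruby has a well-documented standard interface for building native extensions. Here is a good guide.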
