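With SAX, you have to define a callback method in the parser for each "event", and you have to track state yourself. It is very low-level. For example, to get the presidents' names from the page, you could do this: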
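require 'nokogiri'

# SAX handler that prints the text found while the parser is inside an <li>.
class MyDoc < Nokogiri::XML::SAX::Document
  # Called for every opening tag.
  def start_element(name, attributes = [])
    if name == "li"
      @inside_li = true
    end
  end

  # Called for each run of text; may fire more than once per element.
  def characters(chars)
    if @inside_li
      puts "found an <li> containing the string '#{chars}'"
    end
  end

  # Called for every closing tag.
  def end_element(name)
    if name == "li"
      puts "ending #{name}"
      @inside_li = false
    end
  end
end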
The above can be considered a rough equivalent of the statement:
doc.xpath('//li').map(&:text)
The output begins with:
ending li
found an <li> containing the string 'Grover Cleveland'
ending li
found an <li> containing the string 'William McKinley'
ending li
found an <li> containing the string 'Theodore Roosevelt'
So far so good. However, it also outputs a lot of cruft, ending with:
found an <li> containing the string 'Disclaimers'
ending li
found an <li> containing the string 'Mobile view'
ending li
found an <li> containing the string '
'
found an <li> containing the string '
'
ending li
found an <li> containing the string '
'
found an <li> containing the string '
'
ending li
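So, to make this more precise and avoid picking up li elements you don't care about, you have to keep track of which container elements you are inside by adding more if clauses to start_element, characters, and so on. If elements of the same name can be nested, you have to maintain counters yourself, or implement a stack that you push to and pop from as you see elements. It gets very messy very fast.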
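SAX is best suited to filters where you don't care about the DOM and only need to do some basic transformations.
Instead, consider using a single XPath statement, such as: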
doc.xpath("//table[contains(.//div, 'Presidents of the United States')]//ol/li").map(&:text)
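This says: find the table that contains a div with the words "Presidents of the United States", and return the text of all ordered-list items inside it. The same can be done in SAX, but it takes a lot of messy code.
Output of the above XPath: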
["George Washington", "John Adams", "Thomas Jefferson", "James Madison", "James Monroe", "John Quincy Adams", "Andrew Jackson", "Martin Van Buren", "William Henry Harrison", "John Tyler", "James K. Polk", "Zachary Taylor", "Millard Fillmore", "Franklin Pierce", "James Buchanan", "Abraham Lincoln", "Andrew Johnson", "Ulysses S. Grant", "Rutherford B. Hayes", "James A. Garfield", "Chester A. Arthur", "Grover Cleveland", "Benjamin Harrison", "Grover Cleveland", "William McKinley", "Theodore Roosevelt", "William Howard Taft", "Woodrow Wilson", "Warren G. Harding", "Calvin Coolidge", "Herbert Hoover", "Franklin D. Roosevelt", "Harry S. Truman", "Dwight D. Eisenhower", "John F. Kennedy", "Lyndon B. Johnson", "Richard Nixon", "Gerald Ford", "Jimmy Carter", "Ronald Reagan", "George H. W. Bush", "Bill Clinton", "George W. Bush", "Barack Obama"]