python - Parsing malformed XHTML

Question

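My new project is to extract data from the Naxos Glossary of Musical Terms, a great resource whose text data I want to process and load into a database, for use on another, simpler website that I will create.

My only problem is the awful XHTML formatting. W3C XHTML validation raises 318 errors and 54 warnings. Even an HTML tidier I found cannot fix all of the problems.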

I am using Python 3.6.7, and the page I am parsing is an ASP page. I tested lxml and Python's xml module, but both failed.

Can anyone suggest any other tidiers or modules? Or will I have to resort to some kind of raw text manipulation (yuck!)?

My code:

lxml:

from lxml import etree

file = open("glossary.asp", "r", encoding="ISO-8859-1")
parsed = etree.parse(file)

Error:

  Traceback (most recent call last):
  File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
    parsed = etree.parse(file)
  File "src/lxml/etree.pyx", line 3426, in lxml.etree.parse
  File "src/lxml/parser.pxi", line 1861, in lxml.etree._parseDocument
  File "src/lxml/parser.pxi", line 1881, in lxml.etree._parseFilelikeDocument
  File "src/lxml/parser.pxi", line 1776, in lxml.etree._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 1187, in lxml.etree._BaseParser._parseDocFromFilelike
  File "src/lxml/parser.pxi", line 601, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 711, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 640, in lxml.etree._raiseParseError
  File "/media/skuzzyneon/STORE-1/naxos_dict/glossary.asp", line 25
lxml.etree.XMLSyntaxError: EntityRef: expecting ';', line 25, column 128
>>>

Python xml (using the tidied XHTML):

import xml.etree.ElementTree as ET

file = open("tidy.html", "r", encoding="ISO-8859-1")
root = ET.fromstring(file.read())

# Top-level elements
print(root.findall("."))

Error:

  Traceback (most recent call last):
  File "/media/skuzzyneon/STORE-1/naxos_dict/xslt_test.py", line 4, in <module>
    root = ET.fromstring(file.read())
  File "/usr/lib/python3.6/xml/etree/ElementTree.py", line 1314, in XML
    parser.feed(text)
  File "<string>", line None
xml.etree.ElementTree.ParseError: undefined entity: line 526, column 33

score 1 · Accepted Answer

lxml probably thinks you are giving it XML when you call it that way, so it applies the strict XML parser. Try this instead:

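from lxml import html

# doc.cssselect() needs the cssselect package installed, but nothing has to be imported from it.
with open("glossary.asp", "r", encoding="ISO-8859-1") as file:
    # lxml.html uses the lenient HTML parser rather than the strict XML one.
    doc = html.document_fromstring(file.read())

print(doc.cssselect('title')[0].text_content())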

Also, instead of relying on "HTML tidiers", just open the page in Chrome and copy the HTML from the Elements panel.
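To illustrate why switching parsers helps, here is a minimal sketch that feeds the same broken markup to both parsers. The SAMPLE string and the two helper functions are hypothetical stand-ins, not the real glossary page: the sample simply contains a bare ampersand (which likely triggers the same "EntityRef: expecting ';'" class of error) and an `&nbsp;` entity that plain XML does not define.

```python
from lxml import etree, html

# Hypothetical stand-in for the glossary page (an assumption, not the real file):
# "R&B" contains a bare ampersand and "&nbsp;" is not defined in plain XML.
SAMPLE = (
    "<html><head><title>Glossary</title></head>"
    "<body><p>R&B and soul&nbsp;music<br></body></html>"
)


def parses_as_xml(markup):
    """Return True if the strict XML parser accepts the markup."""
    try:
        etree.fromstring(markup)
        return True
    except etree.XMLSyntaxError:
        return False


def title_via_html_parser(markup):
    """Parse with the lenient HTML parser and return the <title> text."""
    doc = html.document_fromstring(markup)
    return doc.findtext(".//title")


if __name__ == "__main__":
    print("accepted by XML parser:", parses_as_xml(SAMPLE))
    print("title via HTML parser:", title_via_html_parser(SAMPLE))
```

The sketch uses `findtext` instead of `cssselect` only so that it runs without the extra cssselect package; either approach works on the parsed tree.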
