python - Parsing XHTML with lxml in Python

Question

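A small problem that has me really stuck, and I don't understand what's going on. I just want to parse an ordinary XHTML page from the web, nothing special...

Here is the error: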

 File "class/page.py", line 85, in xslParse
    doc = lxml.etree.fromstring(self.content)
    File "lxml.etree.pyx", line 2753, in lxml.etree.fromstring (src/lxml/lxml.etree.c:54647)
    File "parser.pxi", line 1578, in lxml.etree._parseMemoryDocument (src/lxml/lxml.etree.c:82764)
    File "parser.pxi", line 1457, in lxml.etree._parseDoc (src/lxml/lxml.etree.c:81562)
    File "parser.pxi", line 965, in lxml.etree._BaseParser._parseDoc (src/lxml/lxml.etree.c:78232)
    File "parser.pxi", line 569, in lxml.etree._ParserContext._handleParseResultDoc (src/lxml/lxml.etree.c:74488)
    File "parser.pxi", line 650, in lxml.etree._handleParseResult (src/lxml/lxml.etree.c:75379)
    File "parser.pxi", line 590, in lxml.etree._raiseParseError (src/lxml/lxml.etree.c:74712)
    XMLSyntaxError: StartTag: invalid element name, line 1, column 2

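self.content is a plain string taken straight from the HTTP response: no cleaning, no replacements, nothing, just what the server returned. So what does this error mean?

The HTML starts with: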

<!doctype html>
<!-- paulirish.com/2008/conditional-stylesheets-vs-css-hacks-answer-neither/ -->
<!--[if lt IE 7 ]> <html lang="fr" class="no-js ie6" itemscope itemtype="http://schema.org/Product"> <![endif]-->
<!--[if IE 7 ]>    <html lang="fr" class="no-js ie7" itemscope itemtype="http://schema.org/Product"> <![endif]-->
<!--[if IE 8 ]>    <html lang="fr" class="no-js ie8" itemscope itemtype="http://schema.org/Product"> <![endif]-->
<!--[if IE 9 ]>    <html lang="fr" class="no-js ie9" itemscope itemtype="http://schema.org/Product"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html lang="en" class="no-js" itemscope itemtype="http://schema.org/Product"> <!--<![endif]-->
<head>......

It's an ordinary web page, so why can't lxml parse an ordinary, well-formed document?

score 14 · Accepted Answer

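<!doctype html> indicates that this is an HTML5 document that uses the HTML syntax. You should therefore use an HTML parser rather than an XML parser. That is also why the XML parser fails at line 1, column 2: a lowercase <!doctype is not valid XML. For comparison, an XML document may start with <?xml version="1.0" encoding="UTF-8"?>.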

You can use lxml.html.fromstring(), as @unutbu suggested in the comments.
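A minimal sketch of the difference, assuming lxml is installed. The content string below is a shortened, made-up stand-in for the page in the question (the title and body are hypothetical), not the real response:

```python
import lxml.etree
import lxml.html

# Hypothetical, shortened stand-in for self.content from the question.
content = """<!doctype html>
<!--[if lt IE 7 ]> <html lang="fr" class="no-js ie6"> <![endif]-->
<!--[if (gt IE 9)|!(IE)]><!--> <html lang="en" class="no-js"> <!--<![endif]-->
<head><title>Example</title></head>
<body><p>Hello</p></body>
</html>"""

# The XML parser rejects the document, as in the traceback above.
try:
    lxml.etree.fromstring(content)
    xml_parse_ok = True
except lxml.etree.XMLSyntaxError:
    xml_parse_ok = False

# The HTML parser accepts the same string and returns the root element.
doc = lxml.html.fromstring(content)
title = doc.findtext(".//title")
```

In the code from the question, this would mean replacing the lxml.etree.fromstring(self.content) call with lxml.html.fromstring(self.content).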

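If you receive the page over HTTP, an HTML5 document that uses the XML syntax should be served with an XML media type such as application/xhtml+xml or application/xml, rather than text/html, which is used for the HTML syntax.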
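One way to act on the media type is to look at the Content-Type header before picking a parser. The sketch below is only an illustration: the function name parser_kind and the set XML_MEDIA_TYPES are hypothetical, and the set holds just the two XML media types mentioned above.

```python
# Hypothetical helper: decide which kind of parser to use from the
# value of the HTTP Content-Type header.
XML_MEDIA_TYPES = {"application/xhtml+xml", "application/xml"}


def parser_kind(content_type):
    """Return "xml" for an XML media type, otherwise "html"."""
    # Drop parameters such as "; charset=utf-8" and normalise case.
    media_type = content_type.split(";")[0].strip().lower()
    return "xml" if media_type in XML_MEDIA_TYPES else "html"


kind_for_html = parser_kind("text/html; charset=utf-8")
kind_for_xhtml = parser_kind("application/xhtml+xml")
```

A result of "xml" would then go to lxml.etree.fromstring(), and "html" to lxml.html.fromstring(). Servers do not always send an accurate header, so this is a heuristic rather than a guarantee.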
