python - Parsing '<' symbols with lxml

Question

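I'm currently facing a problem with MathJax equations that contain the '<' symbol. If I parse these strings with lxml, they get cut off.

Is there a way to tell the parser not to remove unknown tags (I guess that's the problem) but to leave them as they are?

For example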

s="<div> This is a text with mathjax like $1<2$, let's see if this works till here $2>1$! </div>"
from lxml import html
tree=html.fragment_fromstring(s)
html.tostring(tree)

gives:

'<div> This is a text with mathjax like $11$! </div>'

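It would be fine if the '<' were simply not cut off.

I'm fully aware that this is not valid XML. Unfortunately, I can't replace the '<' symbols with the proper HTML escapes in the source, because I'm actually trying to parse Markdown files that contain HTML tags, and '<' is a perfectly good symbol there.

Thanks!

Jakob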

score 4 · Accepted Answer

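If you're using an XML parser to parse something that is not valid XML, then you're not using the right tool for the job.

The options are to either write a custom parser, or to first pass your Markdown content to a Markdown engine (cf. https://github.com/trentm/python-markdown2 or https://pypi.python.org/pypi/Markdown) to turn it into proper HTML, and then parse that HTML with lxml's HTML parser (or any other HTML parser, FWIW).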
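As a rough illustration of the "custom" route, here is a minimal pre-processing sketch rather than a full parser. It assumes that inline math is always delimited by single `$` signs and that the text contains no other literal `$`; the helper name `escape_math` is made up for this example. It escapes `&`, `<` and `>` only inside the math spans, so real HTML tags are left alone.

```python
import html
import re

# Assumption: inline math is delimited by single dollar signs and the text
# contains no other literal '$' characters.
MATH_SPAN = re.compile(r"\$[^$]*\$")


def escape_math(text):
    """Escape &, < and > inside $...$ spans, leaving real tags untouched."""
    return MATH_SPAN.sub(lambda m: html.escape(m.group(0), quote=False), text)


s = "<div> This is a text with mathjax like $1<2$, let's see if this works till here $2>1$! </div>"
escaped = escape_math(s)
print(escaped)
```

The escaped string can then be handed to html.fragment_fromstring as in the question. The '<' survives as text, and it is written back out as an entity when the tree is serialized.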

score 0 · Accepted Answer

lxml on its own doesn't do the trick here, but combined with BeautifulSoup it works fine!

s1="This is a text with mathjax like $1<2$, let's see if this works till here $2>1$!"
import lxml.html.soupparser as sp
from lxml import html  
soup1 = sp.fromstring(s1)
print(sp.unescape(html.tostring(soup1, encoding='unicode')))

gives

<html>This is a text with mathjax like $1<2$, let's see if this works till here $2>1$!</html>
