python - Difference between attributes and style tags in lxml

Question

After using BeautifulSoup, I am trying to learn lxml. I am not a strong programmer in general, though.

I have the following code in some source HTML:

<p style="font-family:times;text-align:justify"><font size="2"><b><i> The reasons to eat pickles include:  </i></b></font></p>

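Because the text is bold, I want to pull that text out. I can't seem to detect that this particular line is bold.

When I started on this tonight, I was working with a document that had bold specified in the style attribute, like this: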

<p style="font-style:italic;font-weight:bold;margin:0pt 0pt 6.0pt;text-indent:0pt;"><b><i><font size="2" face="Times New Roman" style="font-size:10.0pt;">The reason I like tomatoes include:</font></i></b></p>
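To make the difference between the two snippets concrete: the first one is bold only because of the `<b>` tag, while the second also declares `font-weight:bold` in its `style` attribute. Below is a minimal sketch of how an inline style string could be checked for boldness. The helper names `parse_style` and `style_is_bold` are hypothetical (they are not part of lxml), and treating numeric weights of 700 or more as bold is an assumption based on CSS defining `bold` as equivalent to 700.

```python
def parse_style(style):
    """Split an inline style string into a {property: value} dict (hypothetical helper)."""
    declarations = {}
    for chunk in (style or '').split(';'):
        if ':' in chunk:
            name, value = chunk.split(':', 1)
            declarations[name.strip().lower()] = value.strip().lower()
    return declarations


def style_is_bold(style):
    """Return True if the style string declares a bold font-weight (hypothetical helper)."""
    weight = parse_style(style).get('font-weight', '')
    # Assumption: keyword 'bold'/'bolder' or a numeric weight >= 700 counts as bold.
    return weight in ('bold', 'bolder') or (weight.isdigit() and int(weight) >= 700)


# The two style attributes from the snippets in the question.
first_style = "font-family:times;text-align:justify"
second_style = "font-style:italic;font-weight:bold;margin:0pt 0pt 6.0pt;text-indent:0pt;"

print(style_is_bold(first_style))
print(style_is_bold(second_style))
```

With lxml, the style string for an element would come from `element.get('style')`, which returns `None` when the attribute is absent; the `style or ''` guard above is there for that case.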

I should say that the document I am working with is a fragment that I read in as lines, join together, and then pass to the html.fromstring function:

from lxml import html

txtFile=open(r'c:\myfile.htm','r').readlines()
strHTM=''.join(txtFile)
newHTM=html.fromstring(strHTM)

So my first line of HTML code above is newHTM[19].

Hmm, this seems to get me closer:

newHTM.cssselect('b')

I don't fully understand it yet, but here is the solution:

for each in newHTM:
    if each.cssselect('b'):
        print(each.text_content())

score 0 · Accepted Answer

Using the CSS API is really not the right approach here. If you want to find all b elements, do:

from lxml import html

strHTM = open(r'c:\myfile.htm', 'r').read()  # no need to split it into lines first
newHTM = html.fromstring(strHTM)
bElements = newHTM.findall('.//b')  # './/b' searches all descendants, not only direct children
for b in bElements:
    print(b.text_content())
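If the goal is the whole paragraph rather than the individual b elements, one possible variation is an XPath predicate that keeps only the `<p>` elements with a `<b>` somewhere beneath them. This is a sketch under assumptions, not the asker's actual file: the `SNIPPET` string, its wrapping `<div>`, and the second non-bold paragraph are made up for illustration so the example runs on its own.

```python
from lxml import html

# Illustrative input: the first <p> is taken from the question,
# the second (non-bold) <p> is invented for contrast.
SNIPPET = """
<div>
<p style="font-family:times;text-align:justify"><font size="2"><b><i> The reasons to eat pickles include:  </i></b></font></p>
<p style="font-family:times"><font size="2">Pickles are crunchy.</font></p>
</div>
"""

doc = html.fromstring(SNIPPET)

# './/p[.//b]' selects every <p> descendant that contains a <b> at any depth.
bold_paragraphs = doc.xpath('.//p[.//b]')

# text_content() concatenates all text inside the element; strip() trims the padding.
headings = [p.text_content().strip() for p in bold_paragraphs]
print(headings)
```

This only detects bold expressed through the `<b>` tag; a paragraph that is bold solely through its `style` attribute would need a separate check on `p.get('style')`.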
