python - A robust way to avoid run-together words when using text_content() on an html element

Question

We are parsing web pages. One goal is to find all the words and their frequencies. We have been using lxml:

from lxml import html

my_string = open(some_file_path).read()

tree = html.fromstring(my_string)

text_no_markup = tree.text_content()

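However, we see things like a_wordconcatenated_to_another

when we expect a_word concatenated_to_another

Looking closer, this seems to happen when a_word is followed by a closing tag and then more html tags, with no whitespace or newline in between, and concatenated_to_another is contained in one of those tags.

The only way I can think of to fix this is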

my_modified_string = open(some_file_path).read().replace('>','> ')

So I replace every '>' character with '>' followed by a space.

Is there a more robust way to achieve this?

score 2 · Accepted Answer

Use itertext(). It yields each text node separately, so you can choose the separator yourself:

>>> from lxml import html
>>> my_string = '''
... <div>
...     <b>hello</b>world
... </div>
... '''
>>>
>>> root = html.fromstring(my_string)
>>> print(root.text_content())

    helloworld

>>> for text in root.itertext():
...     text = text.strip()
...     if text:  # skip empty (or whitespace-only) strings
...         print(text)
...
hello
world
>>> print(' '.join(root.itertext()))

     hello world
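
Since the stated goal is word frequencies, the same idea can be carried one step further. The following is only a minimal sketch, not part of the original answer: the helper name word_frequencies and the definition of a "word" (a case-insensitive run of letters, digits, or underscores) are assumptions made for illustration.

```python
import re
from collections import Counter

from lxml import html


def word_frequencies(markup):
    """Count words in an HTML string (hypothetical helper, for illustration)."""
    root = html.fromstring(markup)
    # Join the individual text nodes with a space so that text on either
    # side of a tag boundary is never glued together.
    text = ' '.join(root.itertext())
    # Assumption: a "word" is a run of letters, digits, or underscores,
    # compared case-insensitively.
    words = re.findall(r'\w+', text.lower())
    return Counter(words)


counts = word_frequencies('<div><b>hello</b>world <i>hello</i></div>')
print(counts.most_common())  # [('hello', 2), ('world', 1)]
```

One caveat: joining with a space also splits a word that has inline markup in the middle of it, so '<b>he</b>llo' would be counted as 'he' and 'llo'. Whether that matters depends on the pages being parsed.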
