python - How to find a string in a HTML document, ignoring whitespace?

Question

I am trying to find a string "USB 2 ports" in a number of HTML pages. The problem is that the strings have a large amount of white space before them - sometimes 4, 20 or even 50 white space characters.

The following works with a single white space character preceding my string:

soup.find(text=' USB 2 ports')

Note the single space before the USB.

How can I tell Beautiful Soup's find() to find my string while ignoring all preceding white space?

score 3 · Accepted Answer

You can define a regular expression that matches the text regardless of leading and trailing whitespace:

import re
pattern = re.compile(r'\s*%s\s*' % 'USB 2 ports')
result = soup.find(text=pattern)

For example:

>>> soup = BeautifulSoup("""
... <html>
...   <body>
...     <ul>
...       <li>
...         USB 2 ports
...       </li>
...       <li>
...         Firewire ports
...       </li>
...       <li>
...         HDMI ports
...       </li>
...     </ul>
...   </body>
... </html>
... """)
>>> import re
>>> pattern = re.compile(r'\s*%s\s*' % 'USB 2 ports')
>>> soup.find(text=pattern)
u'\n        USB 2 ports\n      '
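The pattern can also be checked on its own, without Beautiful Soup. The following is a minimal sketch using only the re module. The build_pattern helper, the re.escape call and the ^/$ anchors are my additions for illustration, not part of the code above. Escaping only matters if the search text contains regex metacharacters, and the anchors make the pattern match the whole string rather than any substring (as far as I know, Beautiful Soup applies a compiled pattern with search(), so the unanchored version would also match text that merely contains the phrase).

```python
import re

def build_pattern(text):
    # Hypothetical helper: escape the literal text so characters such as
    # '.' or '+' are matched literally, then allow any amount of
    # whitespace before and after it. The anchors force a whole-string match.
    return re.compile(r'^\s*%s\s*$' % re.escape(text))

pattern = build_pattern('USB 2 ports')

samples = [
    ' USB 2 ports',                     # one leading space
    '\n        USB 2 ports\n      ',    # newlines and indentation
    'USB 2 ports',                      # no surrounding whitespace
    'Firewire ports',                   # different text
]

# True where the sample is the target text surrounded only by whitespace.
matches = [bool(pattern.search(s)) for s in samples]
print(matches)
```

A pattern built this way can be passed to soup.find(text=pattern) in the same way as the one shown above.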

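Edit: I've changed the code above to explicitly assign the result of soup.find() to a variable, which I hope makes it clearer what is going on. I originally modeled the code in my answer on your example code for clarity, but I now suspect you may be somewhat confused about what that code actually does.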
