How can I use Python to extract 34980 and 100329 from the following snippet:
<tr id="product_34980" class="even">
<tr id="variant_100329" class="variantRow">
Using filter and str.isdigit, the following code extracts the digits from each line.
>>> lines = '''<tr id="product_34980" class="even">
... <tr id="variant_100329" class="variantRow">
... '''
>>> # ''.join is needed on Python 3, where filter returns an iterator, not a str
>>> [''.join(filter(str.isdigit, line)) for line in lines.splitlines()]
['34980', '100329']
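One caveat with this approach is that it keeps every digit on the line, not only those in the id. The sketch below illustrates that under an assumed input: the helper name digits_in and the class value "row2" are hypothetical and do not come from the question.

```python
def digits_in(line):
    """Return every digit character found in line, concatenated in order."""
    return ''.join(filter(str.isdigit, line))

# The row from the question: only the id contains digits.
print(digits_in('<tr id="product_34980" class="even">'))  # 34980

# Hypothetical row whose class also contains a digit: it leaks into the result.
print(digits_in('<tr id="product_34980" class="row2">'))  # 349802
```

If other attributes may contain digits, it is probably safer to look at the id attribute only.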
Update: using lxml:
import lxml.html
html_string = '''
<tr id="product_34980" class="even">
<tr id="variant_100329" class="variantRow">
'''
root = lxml.html.fromstring(html_string)
# Note: cssselect() needs the separate cssselect package to be installed
for tr in root.cssselect('tr.even, tr.variantRow'):
    print(tr.get('id'))  # => product_34980
    print(tr.get('id').rsplit('_', 1)[-1])  # => 34980
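If installing lxml is not an option, roughly the same idea can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not the lxml API used above, and the class name RowIdCollector is a hypothetical name chosen for illustration.

```python
from html.parser import HTMLParser


class RowIdCollector(HTMLParser):
    """Collect the part after the last underscore of each <tr> id."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag != 'tr':
            return
        # attrs is a list of (name, value) pairs
        row_id = dict(attrs).get('id')
        if row_id and '_' in row_id:
            self.ids.append(row_id.rsplit('_', 1)[-1])


html_string = '''
<tr id="product_34980" class="even">
<tr id="variant_100329" class="variantRow">
'''

collector = RowIdCollector()
collector.feed(html_string)
print(collector.ids)  # ['34980', '100329']
```

Unlike the CSS selector version, this sketch does not filter on the class attribute; it takes every tr that has an id containing an underscore.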
Not the most general solution, but it works for the snippet from the question:
import re
html = """
<tr id="product_34980" class="even">
<tr id="variant_100329" class="variantRow">
"""
ids = re.findall(r'id="\w+_(\d+)"', html)
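For reuse, the same pattern could be wrapped in a small function that also converts the matches to integers. This is only a sketch: the names ID_NUMBER and extract_ids are hypothetical, and it assumes ids are always double-quoted and end in an underscore followed by digits.

```python
import re

# Same pattern as above, compiled once: capture the digits after the last underscore.
ID_NUMBER = re.compile(r'id="\w+_(\d+)"')


def extract_ids(html):
    """Return the numeric suffix of every matching id attribute as an int."""
    return [int(number) for number in ID_NUMBER.findall(html)]


html = """
<tr id="product_34980" class="even">
<tr id="variant_100329" class="variantRow">
"""

print(extract_ids(html))  # [34980, 100329]
```

A regular expression like this does not understand HTML structure, so it would also match an id on any other element or inside a comment.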