python - Python: parse web page content for lines containing specific characters and store the result in a file

Question

I am new to Python. I have a web page with the following content:

<Response>
<Value type="ABC">107544</Value>
<Value type="EFG">10544</Value>
<Value type="ABC">77544</Value>

I want to parse the lines that contain ABC and store only the numbers in a temporary text file. How can I do this?

This is what I have so far:

import urllib2

htmlpage = urllib2.urlopen(<URL>)
result = htmlpage.read()

score 1 · Accepted Answer

Pass your result to BeautifulSoup and you will be able to extract any data very easily, without regular expressions.

Update:

from bs4 import BeautifulSoup

result = '''<div class="test">
             <a href="example">Result 1</a>
            </div>
            
            <div class="test">
             <a href="example2">Result 2</a>
            </div>'''

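# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(result, 'html.parser')

# Print the link text of every <div class="test">
for div in soup.find_all('div', attrs={'class': 'test'}):
    print(div.find('a').text)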

Result 1
Result 2
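
Applied to the sample from the question, the same approach might look like the sketch below. It assumes Python 3 with the bs4 package installed and uses the built-in html.parser, which lowercases tag names, so the lookup is for value rather than Value. The sample string is only a stand-in for the downloaded page.

```python
from bs4 import BeautifulSoup

# Stand-in for the text returned by htmlpage.read() in the question.
sample = '''<Response>
<Value type="ABC">107544</Value>
<Value type="EFG">10544</Value>
<Value type="ABC">77544</Value>
</Response>'''

soup = BeautifulSoup(sample, 'html.parser')

# html.parser lowercases tag names, so match 'value', not 'Value'.
# Attribute values keep their case, so 'ABC' is matched as written.
numbers = [tag.text for tag in soup.find_all('value', attrs={'type': 'ABC'})]
print(numbers)
```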

score 1 · Accepted Answer

I second the suggestion to parse the HTML with BeautifulSoup, but if you insist on using a regular expression, you can try the following:

re.findall(r'(?<=type="ABC">).+?(?=</)', text, re.S)
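
To cover the "store in a temporary text file" part of the question, here is a minimal sketch that combines this pattern with the standard tempfile module. It assumes Python 3. The text variable is a stand-in for the downloaded page, and writing one number per line is an assumption about the desired file layout.

```python
import re
import tempfile

# Stand-in for the page content from the question.
text = '''<Response>
<Value type="ABC">107544</Value>
<Value type="EFG">10544</Value>
<Value type="ABC">77544</Value>
'''

# Capture whatever sits between type="ABC"> and the next closing tag.
numbers = re.findall(r'(?<=type="ABC">).+?(?=</)', text, re.S)

# delete=False keeps the file on disk after the with-block closes it.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('\n'.join(numbers) + '\n')
    tmp_path = tmp.name

print(tmp_path)  # location of the temporary file
```

Because the file is created with delete=False, it stays on disk until you remove it yourself, for example with os.remove(tmp_path).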

score 1 · Accepted Answer

Or use lxml and XPath:

>>> from lxml import html

>>> result = html.fromstring('''<Response>
... <Value type="ABC">107544</Value>
... <Value type="EFG">10544</Value>
... <Value type="ABC">77544</Value></Response>''')

>>> result.xpath('//value[@type="ABC"]/text()')
['107544', '77544']
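
If installing lxml is not an option, a similar lookup can probably be done with the standard library's xml.etree.ElementTree. This swaps in a different parser from the one used above. ElementTree only supports a subset of XPath, and it is a strict, case-sensitive XML parser, so the tag must be written as Value and the input must be well-formed (the closing </Response> tag is required). A rough sketch, assuming Python 3:

```python
import xml.etree.ElementTree as ET

# The input must be well-formed XML, including the closing </Response>.
root = ET.fromstring('''<Response>
<Value type="ABC">107544</Value>
<Value type="EFG">10544</Value>
<Value type="ABC">77544</Value></Response>''')

# Tag names are case-sensitive here, unlike with lxml.html.
numbers = [el.text for el in root.findall(".//Value[@type='ABC']")]
print(numbers)
```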
