python - Extracting specific strings from an HTML document

Question

I need to sample and extract only specific strings from an offline HTML document and write that information cleanly to a *.txt file.

For example, suppose this is part of the HTML file:

    <span id="dataView01">001.00 SPL</span>
    <span id="dataView02">543.00 SPL</span>
    <span id="dataView03">056.00 SPL</span>
    <span id="dataView04">228.00 SPL</span>

I need to get this result:

   001.00 SPL
   543.00 SPL
   056.00 SPL
   228.00 SPL

Can you help me solve this? Thanks.

score 2 · Accepted Answer

Use an HTML parser such as BeautifulSoup.
Example:

from bs4 import BeautifulSoup as bs
import re

markup = '''<span id="dataView01">001.00 SPL</span>
    <span id="dataView02">543.00 SPL</span>
    <span id="dataView03">056.00 SPL</span>
    <span id="dataView04">228.00 SPL</span>'''

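# Name the parser explicitly; 'html.parser' ships with Python.
soup = bs(markup, 'html.parser')
# Match ids of the form dataView01, dataView02, ...
tags = soup.find_all('span', id=re.compile(r'^dataView\d+$'))
for t in tags:
    print(t.text)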

Result:

001.00 SPL
543.00 SPL
056.00 SPL
228.00 SPL

Next step: write to a .txt file:

import csv

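# On Python 3 the csv module expects a text-mode file opened with newline=''.
with open('output.txt', 'w', newline='') as fou:
    csv_writer = csv.writer(fou)
    for tag in tags:
        # '001.00 SPL' -> ['001.00', 'SPL'], written as the row 001.00,SPL
        split_on_whitespace = tag.text.split()
        csv_writer.writerow(split_on_whitespace)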
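
Note that `csv.writer` joins the split fields with commas, so each row comes out as `001.00,SPL`. If the file should instead match the output in the question exactly (one `001.00 SPL` per line), a minimal sketch could write the strings directly. The helper `write_lines`, the `values` list (standing in for `[t.text for t in tags]`), and the temp-directory path are assumptions for illustration, not part of the original answer.

```python
import os
import tempfile


def write_lines(values, path):
    """Write each string in `values` on its own line, trimmed of outer whitespace."""
    with open(path, "w", encoding="utf-8") as out:
        for value in values:
            out.write(value.strip() + "\n")


# Stand-in for [t.text for t in tags]; hypothetical data copied from the question.
values = ["001.00 SPL", "543.00 SPL", "056.00 SPL", "228.00 SPL"]

# Hypothetical output location; any writable path would do.
out_path = os.path.join(tempfile.gettempdir(), "output.txt")
write_lines(values, out_path)
```

Stripping each value guards against stray whitespace or newlines inside the tag text.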

score 1 · Accepted Answer

Use BeautifulSoup.

Answered 2012-04-04T22:13:55.587

score 0 · Accepted Answer

 import re
 s='001.00 SPL 543.00 SPL 056.00 SPL 228.00 SPL'
 print(re.search(r'(\d{3}\.\d{2}\sSPL\s\d{3}\.\d{2}\sSPL\s\d{3}\.\d{2}\sSPL\s\d{3}\.\d{2}\sSPL)',s).group())

I don't know the surrounding text in the HTML document, but this might work.
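
The pattern above hard-codes exactly four values in a row. As a sketch of a more flexible variant, assuming every value has the `ddd.dd SPL` shape seen in the question, `re.findall` returns each match separately however many there are. The names `pattern` and `values` are illustrative only.

```python
import re

# Sample input; the same pattern can also be run over the raw HTML text.
s = '001.00 SPL 543.00 SPL 056.00 SPL 228.00 SPL'

# Three digits, a dot, two digits, one whitespace character, then "SPL".
pattern = re.compile(r'\d{3}\.\d{2}\sSPL')

# findall returns a list with one string per non-overlapping match.
values = pattern.findall(s)
for value in values:
    print(value)
```

This relies on the values never appearing elsewhere in the page in the same shape; an HTML parser is the safer choice when that cannot be guaranteed.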

I saw your edit; I'll update mine.

Actually, go with jldupont's answer.
