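Edit: With the help of the answer below, I have come to realize that parsing HTML with regular expressions is generally a bad idea. For what it's worth, in case someone else comes across this post with the same question, here are two similar questions on the topic, with more debate and explanation that you may find useful: "Using regular expressions to parse HTML: why not?" and "RegEx match open tags except XHTML self-contained tags".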

Specs: Python 3.3.1

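What I am trying to do: I am writing a web scraper to pull weather data from a website. For my project, three parts matter: the temperature "Right Now", "Earlier Today", and "Tonight". I only want to grab these three numbers and ignore all the other text. In the code below, I use the specific HTML elements that appear right before each temperature as patterns to help me get the numbers themselves.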

All the data I need is in this HTML excerpt (i.e. 89, 96, 80):

<div class="wx-timepart-title">
Earlier Today
</div>
<div class="wx-timepart-title">Tonight</div>
<div class="wx-data-part wx-first">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/29.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/30.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/29.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part wx-first">
<div class="wx-temperature"><span itemprop="temperature-fahrenheit">89</span><span class="wx-degrees">&deg;<span class="wx-unit">F</span></span></div>
<div class="wx-temperature-label">FEELS LIKE
<span itemprop="feels-like-temperature-fahrenheit">94</span>&deg;</div>
</div>
<div class="wx-data-part">
<div class="wx-temperature">96<span class="wx-degrees">&deg;</span></div>
<div class="wx-temperature-label">HIGH AT 4:45 PM</div>
</div>
<div class="wx-data-part">
<div class="wx-temperature">80<span class="wx-degrees">&deg;</span></div>
<div class="wx-temperature-label">LOW</div>
</div>  

The solution I came up with:

import urllib.request
import re

# open the webpage and read the html code into a string; 
base = urllib.request.urlopen('http://www.weather.com/weather/today/Washington+DC+USDC0001:1:US')
f = base.readlines()
f = str(f)


# temperature "Right Now" 
match1 = re.search(r'<div class="wx-temperature"><span itemprop="temperature-fahrenheit">\w\w',f)

if match1:
    result1 = match1.group()
    right_now = result1[68:]
    print(right_now)


# temperature "Earlier Today"
match2 = re.search(r'<div class="wx-temperature">\w\w',f)

if match2:
    result2 = match2.group()
    ealier_today = result2[28:]
    print(ealier_today)


# temperature "Tonight"
match3 = re.search(r'<div class="wx-temperature">\w\w',f)

if match3:
    result3 = match3.group()
    tonight = result3[28:]
    print(tonight)

The three print statements are only there to test whether the data was grabbed correctly.

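My problem: the third regular expression (match3) does not work; it prints the same result as match2. I assume that is because it uses the same regex pattern as the second one? So my question is: how do you search for multiple results with the same regex pattern? Or can you only grab the first occurrence of a pattern? I am very new to Python, and these are my first few days with regular expressions. I would appreciate any general advice on my solution or on my overall approach to this project. Thanks!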

1 Answer

Maybe I am misunderstanding your question, but are you just looking for findall?

match3 = re.findall(r'<div class="wx-temperature">\w\w',f)
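To illustrate the difference, here is a minimal sketch on a trimmed copy of the excerpt from the question. The `html` string, the capture group, and the variable names are my own assumptions for illustration. `re.search` always stops at the first occurrence, which is why `match3` repeats `match2`. `re.findall` returns every non-overlapping match, and with a capture group it returns just the captured digits, so no slicing is needed.

```python
import re

# Trimmed copy of the HTML excerpt from the question (illustrative only).
html = (
    '<div class="wx-temperature">'
    '<span itemprop="temperature-fahrenheit">89</span></div>'
    '<div class="wx-temperature">96<span class="wx-degrees">&deg;</span></div>'
    '<div class="wx-temperature">80<span class="wx-degrees">&deg;</span></div>'
)

# The group (\d+) captures only the digits that follow the opening tag.
pattern = r'<div class="wx-temperature">(\d+)'

# re.search returns only the first occurrence, however often it is called.
first = re.search(pattern, html)
print(first.group(1))

# re.findall returns all non-overlapping matches; with one capture group
# the result is a list of the captured strings.
temps = re.findall(pattern, html)
print(temps)

# Unpack the two values (this assumes exactly two matches were found).
earlier_today, tonight = temps
```

The "Right Now" value does not show up in `temps` because its div is followed by a `<span>` rather than by digits, so it still needs the separate pattern from the question.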

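Also, you may find it easier to use BeautifulSoup or something similar. Parsing HTML with regular expressions is hellish. Besides, you are better off not reinventing the wheel, since Python has hundreds of well-built modules that already do much of the work for you. Once you have installed bs4, you can do the following: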

>>> from bs4 import BeautifulSoup
>>> html = '''<div class="wx-timepart-title">
Earlier Today
</div>
<div class="wx-timepart-title">Tonight</div>
<div class="wx-data-part wx-first">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/29.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/30.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part">
<img src="http://s.imwx.com/v.20120328.084208/img/wxicon/120/29.png" height="120" width="120" alt="Partly Cloudy" class="wx-weather-icon">
</div>
<div class="wx-data-part wx-first">
<div class="wx-temperature"><span itemprop="temperature-fahrenheit">89</span><span class="wx-degrees">&deg;<span class="wx-unit">F</span></span></div>
<div class="wx-temperature-label">FEELS LIKE
<span itemprop="feels-like-temperature-fahrenheit">94</span>&deg;</div>
</div>
<div class="wx-data-part">
<div class="wx-temperature">96<span class="wx-degrees">&deg;</span></div>
<div class="wx-temperature-label">HIGH AT 4:45 PM</div>
</div>
<div class="wx-data-part">
<div class="wx-temperature">80<span class="wx-degrees">&deg;</span></div>
<div class="wx-temperature-label">LOW</div>
</div>  '''
>>> soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid the "no parser specified" warning
>>> for temp in soup.find_all(class_="wx-temperature"):
    print(temp.text)       # or add these to a list or make a list comprehension


89°F
96°
80°

If you only want the number (which may be negative), you can do this:

>>> import re
>>> for temp in soup.find_all(class_="wx-temperature"):
    print(re.match(r'-?\d+', temp.text).group())


89
96
80

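This approach gives you some flexibility if the temperature drops to a single digit or rises to three digits. I added -?, which means zero or one occurrence of the character -, in case you run into negative numbers.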
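As a quick check of that pattern, here is a small sketch that runs it against a few made-up strings of the kind `temp.text` might return. The helper name `leading_int` and the sample values are hypothetical, not part of the page or of bs4. Note that `re.match` returns `None` when the text does not start with a number, so it is worth guarding against that before calling `.group()`.

```python
import re

def leading_int(text):
    """Return the integer at the start of text, or None if there is none."""
    m = re.match(r'-?\d+', text)
    # re.match only matches at the beginning of the string and
    # returns None when the pattern does not match there.
    return int(m.group()) if m else None

# Hypothetical values: one, two and three digits, plus a negative reading.
for sample in ["7°", "89°F", "102°F", "-5°", "LOW"]:
    print(repr(sample), leading_int(sample))
```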

answered 2013-07-16T02:31:53.547