I have a URL that I want to parse, specifically to pull out the widgetid:
<a href="http://www.somesite.com/process.asp?widgetid=4530">Widgets Rock!</a>
I've written this Python (I'm a bit of a newbie at Python; the version is 2.7):
import re
from bs4 import BeautifulSoup
doc = open(r'c:\Python27\some_xml_file.txt')  # raw string keeps the backslashes literal
soup = BeautifulSoup(doc)
links = soup.findAll('a')
# debugging statements
print type(links[7])
# output: <class 'bs4.element.Tag'>
print links[7]
# output: <a href="http://www.somesite.com/process.asp?widgetid=4530">Widgets Rock!</a>
theURL = links[7].attrs['href']
print theURL
# output: http://www.somesite.com/process.asp?widgetid=4530
print type(theURL)
# output: <type 'unicode'>
is_widget_url = re.compile('[0-9]')
print is_widget_url.match(theURL)
# output: None (I know this isn't the correct regex but I'd think it
# would match if there's any number in there!)
I think I'm missing something about regular expressions (or about how I'm supposed to use them), but I can't figure out what.
Thanks for your help!
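
For reference, here is a minimal sketch of what appears to be going on. It relies on the fact that `match` only tries to match at the very start of the string, while `search` scans the whole string. The names `the_url`, `widget_re`, and the result variables are illustrative and not from the code above. The URL is hard-coded rather than read from the file, and the `try`/`except` import is only there so the sketch runs on both Python 2.7 and Python 3.

```python
import re

try:
    from urllib.parse import urlparse, parse_qs  # Python 3
except ImportError:
    from urlparse import urlparse, parse_qs      # Python 2.7

# Hard-coded copy of the href from the question (illustrative).
the_url = 'http://www.somesite.com/process.asp?widgetid=4530'

digit = re.compile('[0-9]')

# match() anchors at position 0; the URL starts with 'h', so this is None.
match_result = digit.match(the_url)

# search() scans the whole string and stops at the first digit it finds.
search_result = digit.search(the_url)

# A more specific pattern that captures the digits after 'widgetid='.
widget_re = re.compile(r'widgetid=(\d+)')
found = widget_re.search(the_url)
widget_id_from_regex = found.group(1) if found else None

# Alternative without a regex: parse the query string directly.
query = parse_qs(urlparse(the_url).query)
widget_id_from_query = query.get('widgetid', [None])[0]

print(match_result)
print(search_result.group(0))
print(widget_id_from_regex)
print(widget_id_from_query)
```

If only the query parameter is needed, the `parse_qs` route is probably the less fragile of the two, since it does not depend on where `widgetid` appears in the URL.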