python - How do I run an attribute value through a regular expression after extracting it with BeautifulSoup?

Question

I have a URL that I want to parse, specifically the widgetid:

<a href="http://www.somesite.com/process.asp?widgetid=4530">Widgets Rock!</a>

I've written this Python (I'm a bit of a Python newbie; the version is 2.7):

import re
from bs4 import BeautifulSoup

doc = open('c:\Python27\some_xml_file.txt')
soup = BeautifulSoup(doc)


links = soup.findAll('a')

# debugging statements

print type(links[7])
# output: <class 'bs4.element.Tag'>

print links[7]
# output: <a href="http://www.somesite.com/process.asp?widgetid=4530">Widgets Rock!</a>

theURL = links[7].attrs['href']
print theURL
# output: http://www.somesite.com/process.asp?widgetid=4530

print type(theURL)
# output: <type 'unicode'>

is_widget_url = re.compile('[0-9]')
print is_widget_url.match(theURL)
# output: None (I know this isn't the correct regex but I'd think it
#         would match if there's any number in there!)

I think I'm missing something about regular expressions (or about how to use them), but I can't figure out what.

Thanks for your help!

score 5 · Accepted Answer

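This question has nothing to do with BeautifulSoup.

The problem is that, as the documentation explains, match only matches at the start of the string. Since the digits you are looking for are at the end of the string, it returns None.

To match digits anywhere in the string, use search instead. You will probably also want to use the \d character class for digits.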

matches = re.search(r'\d+', theURL)
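To make the difference concrete, here is a minimal sketch using the URL from the question. The stricter pattern `widgetid=(\d+)` is an illustration that is not in the original post, and it assumes the ID is always a run of digits.

```python
import re

the_url = 'http://www.somesite.com/process.asp?widgetid=4530'

# match() is anchored at position 0; the URL starts with 'h', so nothing matches.
print(re.match(r'\d+', the_url))          # None

# search() scans the whole string and returns the first run of digits it finds.
first_digits = re.search(r'\d+', the_url)
print(first_digits.group())               # 4530

# Hypothetical stricter pattern: capture only the digits that follow 'widgetid='.
widget = re.search(r'widgetid=(\d+)', the_url)
widget_id = widget.group(1) if widget else None
print(widget_id)                          # 4530
```

The bare `\d+` pattern would also pick up digits elsewhere in a URL (a port number, for instance), which is why anchoring on the parameter name may be the safer choice.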

score 4 · Answer

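I don't think you want re here; you probably want:

from urlparse import urlparse, parse_qs  # Python 2; in Python 3 these live in urllib.parse
s = 'http://www.somesite.com/process.asp?widgetid=4530'
qs = parse_qs(urlparse(s).query)
if 'widgetid' in qs:
    # it's got a widget, a widget it has got...
    print qs['widgetid'][0]  # parse_qs maps each key to a list of values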

score 2 · Answer

Use urlparse:

from urlparse import urlparse, parse_qs
o = urlparse("http://www.somesite.com/process.asp?widgetid=4530")
if "widgetid" in parse_qs(o.query):  # keys are case-sensitive, so match the URL's spelling
    # this is a 'widget URL'
    print "widget URL"
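If the ID itself is needed rather than a yes/no test, a sketch along these lines could work. The helper name `get_widget_id` is hypothetical, and the import fallback is there only so the snippet runs on both Python 2.7 and Python 3. Note that `parse_qs` keys are case-sensitive, so the lookup has to use the same spelling as the URL (`widgetid`, not `widgetId`).

```python
try:
    from urlparse import urlparse, parse_qs        # Python 2.7
except ImportError:
    from urllib.parse import urlparse, parse_qs    # Python 3


def get_widget_id(url):
    """Return the first widgetid value in the query string, or None."""
    query = parse_qs(urlparse(url).query)
    # parse_qs maps each key to a list of values, since a key may repeat.
    values = query.get('widgetid')
    return values[0] if values else None


print(get_widget_id('http://www.somesite.com/process.asp?widgetid=4530'))  # 4530
print(get_widget_id('http://www.somesite.com/process.asp'))                # None
```

Returning `None` for a missing parameter lets the caller filter links with a simple truthiness check.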
