python - Python web crawler: print only links whose path contains a specific word - Mechanize, Beautiful Soup, etc.

Question

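I have a web crawler that prints all the links on a given site without repeating any link. My code (including some libraries that are imported but not yet used) looks like this: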

import urllib
import re
import mechanize
from bs4 import BeautifulSoup
import urlparse
import cookielib
from urlparse import urlsplit
from publicsuffix import PublicSuffixList

url = "http://www.zahnarztpraxis-uwe-krause.de"

br = mechanize.Browser()
cj = cookielib.LWPCookieJar()
br.set_cookiejar(cj)
br.set_handle_robots(False)
br.set_handle_equiv(False)
br.set_handle_redirect(True)
br.set_handle_refresh(mechanize._http.HTTPRefreshProcessor(), max_time=1)
br.addheaders = [('User-agent', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.0.1) Gecko/2008071615 Fedora/3.0.1-1.fc9 Firefox/3.0.1')]
page = br.open(url, timeout=5)

htmlcontent = page.read()
soup = BeautifulSoup(htmlcontent)

newurlArray = []

for link in br.links(text_regex=re.compile('^((?!IMG).)*$')):
    newurl = urlparse.urljoin(link.base_url, link.url)
    if newurl not in newurlArray:
        newurlArray.append(newurl)
        print newurl

It gives me results like this:

.....

http://www.zahnarztpraxis-uwe-krause.de/pages/amalgamentfernung.html
http://www.zahnarztpraxis-uwe-krause.de/pages/homoeopathie.html
http://www.zahnarztpraxis-uwe-krause.de/pages/veneers.html
http://www.zahnarztpraxis-uwe-krause.de/pages/prophylaxe.html
http://www.zahnarztpraxis-uwe-krause.de/pages/bleaching/bleaching-zahnschmuck.html
http://www.zahnarztpraxis-uwe-krause.de/pages/dental_wellness_care.html
http://www.zahnarztpraxis-uwe-krause.de/pages/digitales-roentgen.html
http://www.zahnarztpraxis-uwe-krause.de/pages/anfahrt.html
http://www.zahnarztpraxis-uwe-krause.de/pages/kontakt.html
http://www.zahnarztpraxis-uwe-krause.de/pages/impressum.html

etc....

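Now my question is: how do I tell my program to print only the links that contain the word kontakt?

Should I use a regular expression for this, or something else?

I have never done this before, so I don't know what to use to get only this result: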

http://www.zahnarztpraxis-uwe-krause.de/pages/kontakt.html

Any suggestions?

score 4 · Accepted Answer


Why not simply do

# inside the loop from the question, test the joined link (newurl), not the site root (url)
if 'kontakt' in newurl:
    print newurl
else:
    continue
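
For reference, here is a minimal sketch of how that check might sit alongside the de-duplication from the question. It is written for Python 3 (`urllib.parse` rather than `urlparse`) and works on a plain list of href strings instead of a live mechanize browser; the function name `filter_links` and the sample `hrefs` list are assumptions made for illustration only.

```python
from urllib.parse import urljoin


def filter_links(base_url, hrefs, word):
    """Return unique absolute URLs, in first-seen order, that contain `word`."""
    seen = set()      # every absolute URL encountered so far
    matches = []      # the subset that contains `word`
    for href in hrefs:
        absolute = urljoin(base_url, href)
        if absolute in seen:
            continue  # skip duplicates, as the question's loop does
        seen.add(absolute)
        if word in absolute:  # plain substring test, no regex needed
            matches.append(absolute)
    return matches


# Hypothetical sample data standing in for the links mechanize would return.
base = "http://www.zahnarztpraxis-uwe-krause.de"
hrefs = [
    "pages/anfahrt.html",
    "pages/kontakt.html",
    "pages/kontakt.html",
    "pages/impressum.html",
]

for found in filter_links(base, hrefs, "kontakt"):
    print(found)
```

A set is used for the "already seen" check because membership tests on a list get slower as the list grows; for a small site either would do.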

answered 2013-08-13T07:45:49.807

score 1 · Accepted Answer

Yes, it's as simple as using a regex or the plain Python string find() method on link.url. (Edit: you can also use 'kontakt' in link.url, as shshank does.)

for link in br.links(text_regex=re.compile('^((?!IMG).)*$')):
    if link.url.find('kontakt') >= 0:
        pass  # do stuff on URLs containing 'kontakt'
    # or:
    if link.url.find('kontakt') < 0:
        continue  # skip URLs without it

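Obviously both of these (the string find() method or the in operator) match anywhere in the string, which is a bit sloppy. What you really want here is to match only within the URL tail. You can check just the tail by calling find() on link.url.split('/')[-1]

Or else on link.url.rsplit('/', 1)[-1], which splits only once, at the last slash.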
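
As a rough illustration of the tail-only check, the sketch below (Python 3) first isolates the path with `urlsplit`, so that a query string or the host name cannot produce a match, and then looks only at the last path segment. The helper name `tail_contains` and the `example.com` URLs are hypothetical and are not part of the original answer.

```python
from urllib.parse import urlsplit


def tail_contains(url, word):
    """Return True if the last path segment of `url` contains `word`."""
    path = urlsplit(url).path        # drops scheme, host, query and fragment
    tail = path.rsplit("/", 1)[-1]   # text after the final slash
    return word in tail


print(tail_contains("http://www.zahnarztpraxis-uwe-krause.de/pages/kontakt.html", "kontakt"))
print(tail_contains("http://example.com/kontakt/anfahrt.html", "kontakt"))
print(tail_contains("http://example.com/pages/index.html?ref=kontakt", "kontakt"))
```

Only the first call matches; in the other two the word appears in a directory name or in the query string, which a plain `in` test on the whole URL would also have accepted.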
