python - AttributeError in a web crawler

Question

When I run the following code:

import urllib
import re
from urllib import request
import webbrowser

#email pattern
r'[\w._(),:;<>]+@[\w._(),:;<>][.]\w+'

# url pattern
r'\w\w\w[.]\w+[.]\w+'

html = urllib.request.urlopen('somelinkthatistoolongforstackoverflow')

#find all websites

websites = re.findall(r'http://www[.]\w+[.]\w+',str(html.read()))
print(websites)

#find all emails

emails = re.findall(r'[\w._(),:;<>]+@[\w._(),:;<>][.]\w+',str(html.read()))
print(emails)

#sort through websites and find other links

for i in websites:
    y = urllib.request.urlopen(i)
    x = re.findall(r'http://www[.]\w+[.]\w+',str(y.read()))
    websites.append(x)

I get this error:

AttributeError: 'list' object has no attribute 'timeout'

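Note the AttributeError. What can I do about it? I am using the urllib module and the re (regular expression) module, on Python 3.3.0. This is a web crawler that is meant to find as many links and email addresses as it can. Can anyone help me with this? If you can, please post below. Thanks to everyone who can help.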

score 0 · Accepted Answer

You want to extend websites:

websites.extend(x)

because x is itself a list.

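Right now you append the whole list of matched websites as a single element, so at some point the for loop passes that list as i to urllib.request.urlopen(). Since i is definitely not a string, urlopen() tries to treat it as a Request object (the other valid option), which produces the AttributeError.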
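
To make the difference concrete, here is a minimal sketch that runs without any network access. The URL pattern is the one from the question; the page text and the starting URL are made-up placeholders standing in for str(y.read()) and the initial contents of websites.

```python
import re

# Same URL pattern as in the question.
URL_PATTERN = r'http://www[.]\w+[.]\w+'

# Hypothetical page text standing in for str(y.read()).
page = 'see http://www.example.com and http://www.example.org'

# re.findall returns a list of matched strings.
found = re.findall(URL_PATTERN, page)

# append adds the whole list as ONE nested element.
appended = ['http://www.start.com']
appended.append(found)

# extend adds each matched string as its own element.
extended = ['http://www.start.com']
extended.extend(found)

print(appended)
print(extended)
```

With append, the last element of the list is itself a list, and that nested list is what eventually reaches urlopen(). With extend, every element stays a string.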
