I am trying to collect articles from the database at infoweb.newsbank.com for research I am doing at my university. Here is my code so far:
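from bs4 import BeautifulSoup
import requests
import urllib.request  # import the submodule explicitly; "import urllib" alone does not guarantee it
from requests import session
import http.cookiejar

mainLink = "http://infoweb.newsbank.com.proxy.lib.uiowa.edu/iw-search/we/InfoWeb?p_product=AWNB&p_theme=aggregated5&p_action=doc&p_docid=14D12E120CD13C18&p_docnum=2&p_queryname=4"

def articleCrawler(mainUrl):
    # Fetch the page without any cookie handling
    response = urllib.request.urlopen(mainUrl)
    # Name the parser explicitly to avoid the bs4 "no parser specified" warning
    soup = BeautifulSoup(response, "html.parser")
    linkList = []
    # Print every anchor tag found on the page
    for link in soup.find_all('a'):
        print(link)

articleCrawler(mainLink)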
Unfortunately, this is the response I get:
<html>
<head>
<title>Cookie Required</title>
</head>
<body>
This is cookie.htm from the doc subdirectory.
<p>
<hr>
<p>
Licensing agreements for these databases require that access be extended
only to authorized users. Once you have been validated by this system,
a "cookie" is sent to your browser as an ongoing indication of your authorization to
access these databases. It will only need to be set once during login.
<p>
As you access databases, they may also use cookies. Your ability to use those databases
may depend on whether or not you allow those cookies to be set.
<p>
To login again, click <a href="login">here</a>.
</p></p></p></hr></p></body>
</html>
<a href="login">here</a>
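I have tried using http.cookiejar, but I am not familiar with the library. I am using Python 3. Does anyone know how to accept the cookie and access the articles? Thank you.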
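For context, here is a minimal sketch of how http.cookiejar is usually combined with urllib.request in Python 3 so that cookies set by one response are sent back on later requests. It assumes the proxy sets its authorization cookie through ordinary Set-Cookie headers. The names build_cookie_opener and fetch are hypothetical helpers, not part of any library, and the login step the university proxy requires is not shown because its form fields are unknown.

```python
import http.cookiejar
import urllib.request


def build_cookie_opener():
    """Return (opener, jar); the opener stores and resends cookies via jar."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(opener, url):
    """Fetch url through the cookie-aware opener and return the raw bytes."""
    with opener.open(url) as response:
        return response.read()


# Hypothetical usage (not run here, since it needs network access and a
# valid proxy login):
#   opener, jar = build_cookie_opener()
#   html = fetch(opener, mainLink)
#   print([cookie.name for cookie in jar])
```

Reusing the same opener for every request is what keeps the cookie around; a plain urllib.request.urlopen call starts with no cookies each time, which would be consistent with the "Cookie Required" page above.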