python - How do I make a scraper extract information from relative paths?

Question

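I am trying to build a simple scraper that extracts the links from the "See also" section of https://en.wikipedia.org/wiki/Web_scraping. There are 19 links in total, and I have managed to extract them with Beautiful Soup. However, I get them as a list of relative links, and I also need to turn them into absolute links. The expected result looks like this:

Then I want to use the same 19 links and extract more information from them, for example the first paragraph of each of the 19 pages. So far, I have this: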

import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
from urllib.request import urlopen

url = 'https://en.wikipedia.org/wiki/Web_scraping'
data = requests.get('https://en.wikipedia.org/wiki/Web_scraping').text

soup = BeautifulSoup(data, 'html.parser')

links = soup.find('div', {'class':'div-col'})
test = links.find_all('a', href=True)

data = []
for link in links.find_all('a'):
    data.append(link.get('href'))
#print(data)

soupNew = BeautifulSoup(''.join(data), 'html.parser')
print(soupNew.find_all('p')[0].text)

#test if there is any <p> tag, which returns empty, so I have not looped correctly.
x = soupNew.findAll('p')
if x is not None and len(x) > 0:
    section = x[0]
print(x)

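My main problem is that I cannot find a way to iterate over the 19 links and look up the information I need. I am trying to learn Beautiful Soup and Python, so for now I would rather stick with them, even if there are better tools for the job. I just need some help, or preferably a simple example, that explains the process described above. Thanks!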

score 0 · Accepted Answer

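You should split your code the same way you split the problem.

Your first problem is getting the list of links, so you can write a function called get_urls: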

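 import requests
 from bs4 import BeautifulSoup

 def get_urls():
     # Fetch the article and parse it.
     url = 'https://en.wikipedia.org/wiki/Web_scraping'
     html = requests.get(url).text
     soup = BeautifulSoup(html, 'html.parser')
     # Same selector as in the question: the first <div class="div-col">.
     links = soup.find('div', {'class':'div-col'})
     # Prefix each relative href with the site root to make it absolute.
     urls = []
     for link in links.find_all('a', href=True):
         urls.append("https://en.wikipedia.org" + link.get('href'))
     return urls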
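
Prefixing the site root works as long as every href starts with "/". As a slightly more general alternative, here is a minimal sketch using urljoin from the standard library, which the question already imports. The helper name to_absolute and the BASE_URL constant are illustrative, not part of the original answer.

```python
from urllib.parse import urljoin

# Assumption: links are resolved against the page they were found on.
BASE_URL = "https://en.wikipedia.org/wiki/Web_scraping"

def to_absolute(hrefs, base_url=BASE_URL):
    """Resolve each href against base_url.

    Relative paths get the scheme and host of base_url; hrefs that are
    already absolute are returned unchanged. Empty or missing hrefs are skipped.
    """
    return [urljoin(base_url, href) for href in hrefs if href]
```

Inside get_urls, this would replace the manual concatenation: collect the raw href values first, then pass the list through to_absolute.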

Next, you want the first paragraph of each URL. With a little research, I came up with this:

 def get_first_paragraph(url):
     data = requests.get(url).text
     soup = BeautifulSoup(data, 'html.parser')
     return soup.p.text
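
One caveat: soup.p is None when a page has no <p> tag at all, and the first <p> on a page may be empty (I am assuming this can happen on some Wikipedia articles; I have not checked all 19 pages). A hedged sketch of a more defensive variant follows. The function name first_nonempty_paragraph is illustrative, and it takes the HTML as a string so it can be tried without a network request.

```python
from bs4 import BeautifulSoup

def first_nonempty_paragraph(html):
    """Return the text of the first <p> that contains visible text, or None."""
    soup = BeautifulSoup(html, "html.parser")
    for paragraph in soup.find_all("p"):
        # get_text() joins all nested text; strip() drops surrounding whitespace.
        text = paragraph.get_text().strip()
        if text:
            return text
    return None
```

It could be used as first_nonempty_paragraph(requests.get(url).text) in place of the soup.p.text lookup.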

Now everything has to be wired together:

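 def iterate_through_urls(urls):
     # Fetch each page in turn and print its first paragraph.
     for url in urls:
         print(get_first_paragraph(url))


 def run():
     urls = get_urls()
     iterate_through_urls(urls)

 # Entry point: without this call, nothing above is executed.
 if __name__ == '__main__':
     run()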
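
If you would rather collect the paragraphs than print them, and keep going when one page fails to load, the loop could be sketched roughly as below. This is one possible arrangement, not the only one: fetch_html and first_paragraphs are illustrative names, the 10-second timeout is an arbitrary choice, and the fetch parameter exists only so the loop can be tried with canned HTML instead of live requests.

```python
import requests
from bs4 import BeautifulSoup

def fetch_html(url, timeout=10):
    """Download a page and raise requests.HTTPError on a 4xx/5xx status."""
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    return response.text

def first_paragraphs(urls, fetch=fetch_html):
    """Map each URL to the text of its first <p>.

    The value is None when the request fails or the page has no <p> tag.
    """
    results = {}
    for url in urls:
        try:
            soup = BeautifulSoup(fetch(url), "html.parser")
        except requests.RequestException:
            # Network or HTTP error: record the failure and move on.
            results[url] = None
            continue
        paragraph = soup.find("p")
        results[url] = paragraph.get_text().strip() if paragraph else None
    return results
```

Calling first_paragraphs(get_urls()) would then return a dictionary keyed by URL that can be printed or processed further.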
