python - Scraping multiple URLs with BeautifulSoup

Question

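I am trying to scrape a website, but I can't work out how to finish the code so that I can pass in several URLs at once. At the moment the code only works with one URL at a time.

The current code is: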

import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

try:
    html = urlopen("http://google.com")
except HTTPError as e:
    print(e)
except URLError:
    print("error")
else:
    res = BeautifulSoup(html.read(), "html5lib")
    tags = res.findAll("div", {"itemtype": "http://schema.org/LocalBusiness"})
    title = res.title.text
    print(title)
    for tag in tags:
        print(tag)

Could someone help me modify it so that I can pass in something like this?

html = urlopen ("url1, url2, url3")

score 0 · Accepted Answer

Wrap the repeatable part of the code in a function and pass it a list:

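# Relies on the imports from the question (urlopen, HTTPError, URLError, BeautifulSoup)
def urlhelper(x):
    # x is a list of URL strings; each one is fetched and parsed in turn
    for ele in x:
        try:
            html = urlopen(ele)
        except HTTPError as e:
            print(e)
        except URLError:
            print("error")
        else:
            res = BeautifulSoup(html.read(), "html5lib")
            tags = res.find_all("div", {"itemtype": "http://schema.org/LocalBusiness"})
            title = res.title.text
            print(title)
            for tag in tags:
                print(tag)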

Call this function with urlhelper(["url1","url2","etc"]).

The key concept to understand here is "for", which tells Python to go through each element of the list in turn.
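
As a minimal illustration of that idea, with no network access involved, the loop below binds the name url to each string in the list, one after another. The URLs are hypothetical placeholders, and the visited list exists only to show the order in which the loop body runs.

```python
# Hypothetical placeholder URLs, used only for illustration.
urls = ["http://example.com/a", "http://example.com/b", "http://example.com/c"]

visited = []
for url in urls:
    # The loop body runs once per element; `url` is bound to each string in turn.
    visited.append(url)
    print("would fetch:", url)
```

In the answer's function, the body of the loop is the try/except/else block, so every URL in the list goes through the same fetch-and-parse steps.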

I recommend reading up on iterators and lists for more information:

https://www.w3schools.com/python/python_lists.asp

https://www.w3schools.com/python/python_iterators.asp

score 0 · Accepted Answer

You can create a list of URLs and iterate over it with a for loop, like this:

import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd

from urllib.request import urlopen
from urllib.error import HTTPError
from urllib.error import URLError

# Placeholders: replace these with full URLs, including the http:// or https:// scheme
urlList = ["url1", "url2", "url3", "url4"]

for url in urlList:
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
    except URLError:
        print("error")
    else:
        # Runs only when the request succeeded
        res = BeautifulSoup(html.read(), "html5lib")
        tags = res.find_all("div", {"itemtype": "http://schema.org/LocalBusiness"})
        title = res.title.text
        print(title)
        for tag in tags:
            print(tag)
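
If the pages need to be processed later rather than printed straight away, one possible variation is to collect the responses per URL. The sketch below is an assumption-laden illustration, not part of either answer: the function name fetch_all and its opener parameter are hypothetical, and the extra ValueError branch is there because urlopen raises ValueError for strings without a scheme, such as the "url1" placeholder.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_all(urls, opener=urlopen):
    """Fetch every URL; return (pages, errors), both keyed by URL.

    `fetch_all` and `opener` are hypothetical names for this sketch.
    `opener` defaults to urlopen; a stand-in can be passed for offline testing.
    """
    pages = {}
    errors = {}
    for url in urls:
        try:
            with opener(url) as response:
                pages[url] = response.read()
        except HTTPError as e:
            # HTTPError is a subclass of URLError, so it must be caught first.
            errors[url] = "HTTP error %s" % e.code
        except URLError as e:
            errors[url] = "URL error: %s" % e.reason
        except ValueError as e:
            # urlopen raises ValueError for strings without a scheme, e.g. "url1".
            errors[url] = "invalid URL: %s" % e
    return pages, errors
```

Each value in pages is the raw response body, so it can be handed to BeautifulSoup(body, "html5lib") exactly as in the loop above, while errors records which URLs failed and why.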
