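I'm trying to scrape a website, but I can't work out how to finish the code so that I can pass in several URLs at once. At the moment the code only handles one URL at a time.

The current code is: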

import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd

from urllib.request import urlopen
from urllib.error import HTTPError, URLError
try:
    html = urlopen("http://google.com")
except HTTPError as e:
    print(e)
except URLError:
    print("error")
else:
    res = BeautifulSoup(html.read(),"html5lib")
    tags = res.findAll("div", {"itemtype": "http://schema.org/LocalBusiness"})
    title = res.title.text
    print(title)
    for tag in tags:
        print(tag)

Could someone help me modify it so that I can pass in something like this?

html = urlopen ("url1, url2, url3") 

2 Answers

Wrap the repeatable part of the code in a function and pass it a list:

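from urllib.request import urlopen
from urllib.error import HTTPError, URLError
from bs4 import BeautifulSoup

def urlhelper(urls):
    """Fetch each URL in `urls` and print its title and LocalBusiness divs."""
    for url in urls:
        try:
            html = urlopen(url)
        except HTTPError as e:
            # The server responded with an error status.
            print(e)
        except URLError:
            # The server could not be reached.
            print("error")
        else:
            res = BeautifulSoup(html.read(), "html5lib")
            tags = res.find_all("div", {"itemtype": "http://schema.org/LocalBusiness"})
            # res.title is None when the page has no <title> element.
            title = res.title.text if res.title else ""
            print(title)
            for tag in tags:
                print(tag)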

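Call this function with urlhelper(["url1", "url2", "etc"]), where each placeholder stands for a full URL including its scheme (for example http://).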

The key concept to understand here is "for", which tells Python to go through each element of the list in turn.
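
As a minimal sketch of that idea on its own, without any network access (the list contents and the names urls and fetched are placeholders for illustration, not part of the original code):

```python
# Placeholder strings standing in for real URLs.
urls = ["url1", "url2", "url3"]

fetched = []
for url in urls:
    # This body runs once per element, with `url` bound to that element.
    fetched.append(f"would fetch {url}")

print(fetched)
```

In urlhelper, the body of the loop is the try/except/else block instead of the append call, so each URL is opened and parsed in turn.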

I suggest reading up on iterators and lists for more information:

https://www.w3schools.com/python/python_lists.asp

https://www.w3schools.com/python/python_iterators.asp

answered 2020-09-18T14:55:05.890

You can create a list of URLs and iterate over it with a for loop, like this:

import requests
from bs4 import BeautifulSoup
import lxml
import pandas as pd

from urllib.request import urlopen
from urllib.error import HTTPError, URLError

urlList = ["url1", "url2", "url3", "url4"]

for url in urlList:
    try:
        html = urlopen(url)
    except HTTPError as e:
        print(e)
    except URLError:
        print("error")
    else:
        res = BeautifulSoup(html.read(), "html5lib")
        tags = res.find_all("div", {"itemtype": "http://schema.org/LocalBusiness"})
        # res.title is None when the page has no <title> element.
        title = res.title.text if res.title else ""
        print(title)
        for tag in tags:
            print(tag)
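
If you would rather keep the results than print them, one possible variation is to collect the raw pages and the failures into dictionaries. This is only a sketch: the function name fetch_all and its opener parameter are hypothetical names introduced here, not part of the original code. It also catches ValueError, which urlopen raises for a string without a scheme such as "url1".

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch_all(urls, opener=urlopen):
    """Fetch each URL and return (pages, errors).

    pages maps URL -> raw response bytes; errors maps URL -> error message.
    `opener` is a hypothetical hook so a stand-in can replace urlopen.
    """
    pages, errors = {}, {}
    for url in urls:
        try:
            with opener(url) as response:
                pages[url] = response.read()
        except HTTPError as e:
            # The server answered with an error status (404, 500, ...).
            # HTTPError is a subclass of URLError, so it is caught first.
            errors[url] = f"HTTP {e.code}"
        except (URLError, ValueError) as e:
            # URLError: the server could not be reached.
            # ValueError: the string is not a valid URL (no scheme).
            errors[url] = str(e)
    return pages, errors
```

Each entry in pages can then be parsed the same way as in the loop above, for example with BeautifulSoup(pages[url], "html5lib"), while errors records which URLs failed and why.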
answered 2020-09-18T15:01:56.400