python - BeautifulSoup findAll() given multiple classes?

Question

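I want to scrape a list of items from a website and preserve the order in which they are displayed. The items are organized in a table, but each one can have either of two different classes (in random order).

Is there a way to pass multiple classes and have BeautifulSoup4 find all items that have any of the given classes?

I need the same result as the code below, except that the items must keep the order they have in the source: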

items = soup.findAll(True,{'class':'class1'})
items += soup.findAll(True,{'class':'class2'})

score 103 · Accepted Answer

You can do this:

soup.findAll(True, {'class':['class1', 'class2']})

Example:

>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup('<html><body><div class="class1"></div><div class="class2"></div><div class="class3"></div></body></html>', 'html.parser')
>>> soup.findAll(True, {"class":["class1", "class2"]})
[<div class="class1"></div>, <div class="class2"></div>]
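
Since the question is about keeping the source order, here is a minimal sketch suggesting that a single call with a list of classes returns the matches in document order. The table markup and the cell text are made up for illustration; only the class names come from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: rows of two classes interleaved in random order.
html = """
<table>
  <tr class="class2"><td>first</td></tr>
  <tr class="class1"><td>second</td></tr>
  <tr class="class3"><td>skipped</td></tr>
  <tr class="class2"><td>third</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

# One call with a list matches either class while walking the tree once,
# so the results come back in the order they appear in the source.
items = soup.find_all(True, {"class": ["class1", "class2"]})
texts = [item.get_text(strip=True) for item in items]
print(texts)
```

Two separate calls joined with += would instead group all class1 rows ahead of all class2 rows, which is the ordering problem described in the question.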

score 26 · Accepted Answer

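I'm new to Python and BeautifulSoup, but my answer might help you. I ran into the same situation, where I had to find tags that have any of several classes. I simply passed the classes as a list, and it worked for me. Here is the snippet: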

# Search for a single class
soup.find_all("tr", {"class": "abc"})
# Search for any of several classes
soup.find_all("tr", {"class": ["abc", "xyz"]})

score 13 · Accepted Answer

One way to do it is to use a regular expression instead of a class name:

import re
import requests
from bs4 import BeautifulSoup


s = requests.Session()
link = 'https://leaderboards.guildwars2.com/en/na/achievements'
r = s.get(link)


soup = BeautifulSoup(r.text, 'html.parser')
# The pattern is tested against each class name of an element
for item in soup.findAll(True, {"class": re.compile("^(equal|up)$")}):
    if 'achievements' in item.attrs['class'] and 'number' in item.attrs['class']:
        print(item)
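
The same regular-expression idea can be tried without a network request. Below is a small sketch on inline markup; the HTML, the cell values, and the "down" and "upper" class names are invented for illustration and are not taken from the leaderboard page. As far as I understand, BeautifulSoup tests the pattern against each individual class name of a multi-valued class attribute, so the ^ and $ anchors apply per class name.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical markup with multi-valued class attributes.
html = """
<span class="number up">10</span>
<span class="number equal">20</span>
<span class="number down">30</span>
<span class="upper">40</span>
"""

soup = BeautifulSoup(html, "html.parser")

# Matches elements where at least one class name is exactly "equal" or "up".
pattern = re.compile(r"^(equal|up)$")
matches = soup.find_all(True, {"class": pattern})
values = [tag.get_text() for tag in matches]
print(values)
```

Because of the anchors, "upper" is not matched even though it starts with "up".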

score 13 · Accepted Answer

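Or, with a recent version of BeautifulSoup:

find_all('a', class_=['class1', 'class2'])

Because "class" is a reserved word in Python, using it as a keyword argument raises a syntax error, so BeautifulSoup uses "class_" instead.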

score 10 · Accepted Answer

    <html>
        <body>
            <div class="cls1">ok</div>
            <div class="cls2">hi</div>
            <div class="cls1 cls2">both</div>
        </body>
    </html>

Assume the html variable contains the HTML above:

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')
    divs = soup.find_all('div', class_=['cls1', 'cls2'])
    print(divs)

This outputs:

[<div class="cls1">ok</div>, <div class="cls2">hi</div>, <div class="cls1 cls2">both</div>]

This is an "OR" match, not an "AND" match: an element does not need to have both classes.
For an "AND" match, use select('div.cls1.cls2'):

    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, 'html.parser')
    divs = soup.select('div.cls1.cls2')
    print(divs)

This outputs:

[<div class="cls1 cls2">both</div>]
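
For completeness, the "OR" match can also be written as a CSS selector group, where a comma separates the alternatives. This is only a sketch: it reuses the markup from this answer plus one extra non-matching div (the "cls3" element is my addition), and it assumes select() is backed by the soupsieve package that recent BeautifulSoup 4 releases depend on.

```python
from bs4 import BeautifulSoup

# Markup from the answer, plus a hypothetical non-matching div.
html = """
<div class="cls1">ok</div>
<div class="cls2">hi</div>
<div class="cls1 cls2">both</div>
<div class="cls3">other</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Comma-separated selectors: elements matching either alternative ("OR").
either = [d.get_text() for d in soup.select("div.cls1, div.cls2")]
# Chained class selectors: elements carrying both classes ("AND").
both = [d.get_text() for d in soup.select("div.cls1.cls2")]

print(either)
print(both)
```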

score 1 · Accepted Answer

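If you fetch the page from a URL, don't forget to pass headers. I struggled for about an hour to get div elements that have two classes, and it didn't work for me until I noticed I had forgotten to pass these headers.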

import requests
from bs4 import BeautifulSoup

header = {
    "Accept-Language": "es-ES,es;q=0.9",
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.81 Safari/537.36"
}
# Placeholder from the original answer; requests needs the full URL, including the scheme
url = 'something.com'
response = requests.get(url=url, headers=header)
response.raise_for_status()
data = response.text

soup = BeautifulSoup(data, 'html.parser')

# "AND" match: div elements that carry both classes
elements = soup.select('div.fde444d7ef._c445487e2')
