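I am trying to scrape multiple web pages to compare book prices. Because every site has a different layout (and different class names), I want to use a regular expression to find the book title and then work outward to the surrounding elements. A code example is given below.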

from bs4 import BeautifulSoup
import re

html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price>18.45</p>
</div>
"""

html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""

# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')

# find book titles
names1 = soup1.find_all(string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))

# print titles
print('Names1: ', names1)

# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')

# find book titles
names2 = soup2.find_all(string=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))

# print titles
print('Names2: ', names2)

This returns:

Names1:  ['Title Book']
Names2:  ['Title Book']

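Now I want to use this information to find the corresponding price. I know that when an element has been selected by tag and class name, "next_sibling" can be used, but this does not work for an element selected by its text: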

select_title = soup1.find('h2', {"class": "title"})
next_sib = select_title.next_sibling
print(next_sib) # returns <p class='price>18.45

# now try the same thing on element selected by name, this will result in an error
next_sib = names1.next_sibling 

When I find an element by its text, how can I use the same approach to find the price?

A similar question can be found here: Find data within HTML tags using Python. However, it still relies on HTML tags.

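Edit: The problem is that I have many pages with different layouts and class names. I therefore cannot use tag/class/id names to find the elements; I have to find the book titles with a regular expression.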

2 Answers

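To get the price, include the 'h2' tag in find_all() and then use find_next('p'). In the first example the class attribute of the p tag was missing its closing quote, so I changed it to class='price'.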

from bs4 import BeautifulSoup
import re

html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""

html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""


# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')

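# find h2 elements whose text looks like a book title (raw string keeps the regex escapes intact)
names1 = soup1.find_all('h2', string=re.compile(r"[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))

# print the price in the first <p> that follows the first matching title
print('Names1: ', names1[0].find_next('p').text)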


# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')

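# find h2 elements whose text looks like a book title (raw string keeps the regex escapes intact)
names2 = soup2.find_all('h2', string=re.compile(r"[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))

# print the price in the first <p> that follows the first matching title
print('Names2: ', names2[0].find_next('p').text)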

Or drop the tag name and pass the pattern as text instead of string:

from bs4 import BeautifulSoup
import re

html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""

html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""


# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')

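# find text nodes that look like a book title (no tag name needed)
names1 = soup1.find_all(text=re.compile(r"[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))

# print the price in the first <p> that follows the first matching title
print('Names1: ', names1[0].find_next('p').text)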


# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')

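# find text nodes that look like a book title (no tag name needed)
names2 = soup2.find_all(text=re.compile(r"[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))

# print the price in the first <p> that follows the first matching title
print('Names2: ', names2[0].find_next('p').text)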
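Since the question rules out relying on tag names, the same idea can be pushed one step further by also locating the price through its text. The following is only a sketch under assumptions: TITLE_RE, PRICE_RE and title_price_pairs are hypothetical names, and the price pattern assumes prices are written like 18.45 or 18,45 in a text node of their own.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical patterns: tune both to the sites you actually scrape.
TITLE_RE = re.compile(r"[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))")
PRICE_RE = re.compile(r"^\s*\d+[.,]\d{2}\s*$")  # assumes prices like 18.45 or 18,45


def title_price_pairs(html):
    """Return (title, price) tuples found purely through text patterns."""
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    for title in soup.find_all(string=TITLE_RE):
        # find_next searches forward in document order, starting after the title
        price = title.find_next(string=PRICE_RE)
        if price is not None:
            pairs.append((str(title).strip(), str(price).strip()))
    return pairs


html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""

print(title_price_pairs(html_page2))
```

One caveat of this approach: if a title has no price of its own, find_next keeps going and may return the price that belongs to the next book, so the result is worth sanity-checking on real pages.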

EDITED: Use text to find the element without a tag name, and use next_element to get the price value.

from bs4 import BeautifulSoup
import re

html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""

html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""

# turn page into soup
soup1 = BeautifulSoup(html_page1, 'html.parser')
# find text nodes that look like a book title
names1 = soup1.find_all(text=re.compile(r"[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print the title, then step forward three elements to reach the price text
print('Names1: ', names1[0])
print('Price1: ', names1[0].next_element.next_element.next_element)
# turn page into soup
soup2 = BeautifulSoup(html_page2, 'html.parser')
# find text nodes that look like a book title
names2 = soup2.find_all(text=re.compile(r"[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))"))
# print the title, then step forward three elements to reach the price text
print('Names2: ', names2[0])
print('Price2: ', names2[0].next_element.next_element.next_element)

Output:

Names1:  Title Book
Price1:  18.45
Names2:  Title Book
Price2:  18.45
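
To see why exactly three next_element hops are needed, it can help to list what each hop lands on. This is a small illustrative sketch, not part of the original answer; the variable names are made up, and it uses the html_page1 markup with the closing quote fixed.

```python
from itertools import islice
from bs4 import BeautifulSoup

html = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
title = soup.find(string="Title Book")

# next_elements yields everything after the title in document order
steps = list(islice(title.next_elements, 3))
for hop, element in enumerate(steps, start=1):
    print(hop, type(element).__name__, repr(str(element)))
```

The first hop is the newline between the two tags, the second is the p tag itself, and the third is the price text inside it. The hop count therefore depends on the whitespace in the markup: a page without the newline between the h2 and p tags would need only two hops, which makes find_next a more robust choice than a fixed chain.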
answered 2019-11-11T17:44:25.093

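You missed the closing quote of the class attribute on p.price in html_page1.
With names1 = soup1.find_all(text=re.compile("[A-Z]([a-z]+,|\.|[a-z]+)(?:\s{1}[A-Z]([a-z]+,|\.|[a-z]+))")) you get a list of NavigableString objects. Each of those strings is the only child of its h2 tag, so it has no sibling, and that is why you get None from next_sibling.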
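
As a rough illustration of that point (a sketch with made-up variable names, not code from the question): the matched string has no sibling of its own, but its parent tag does, so stepping up with .parent first makes sibling navigation work again.

```python
import re
from bs4 import BeautifulSoup

html = "<div><h2 class='orange-heading'>Title Book</h2>\n<p class='blue-price'>18.45</p></div>"
soup = BeautifulSoup(html, "html.parser")

title = soup.find(string=re.compile(r"Title Book"))
print(type(title).__name__)   # the match is a text node, not a tag
print(title.next_sibling)     # the text is the only child of <h2>

# Step up to the enclosing tag, then look for the next sibling tag
price_tag = title.parent.find_next_sibling("p")
print(price_tag.get_text())
```

Note that this still names the p tag; passing no tag name to find_next_sibling() would return the next sibling tag of any kind instead.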

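You can find a solution that uses the regular expression in @Kunduk's answer.
An alternative, cleaner and simpler solution that works for both html_page1 and html_page2: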

soup = BeautifulSoup(html_page1, 'html.parser')
# or BeautifulSoup(html_page2, 'html.parser')

books = soup.select('div[class*=box]')
for book in books:
    book_title = book.select_one('h2').text
    book_price = book.select_one('p[class*=price]').text
    print(book_title, book_price)

div[class*=box] means a div whose class attribute contains "box".
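
For completeness, here is a self-contained sketch of the selector approach run against both sample pages. The function name extract_books is hypothetical, html_page1 is used with its closing quote fixed, and select() assumes a Beautiful Soup version that ships with CSS selector support (4.7 or later).

```python
from bs4 import BeautifulSoup

html_page1 = """
<div class='product-box'>
<h2 class='title'>Title Book</h2>
<p class='price'>18.45</p>
</div>
"""

html_page2 = """
<div class='page-box'>
<h2 class='orange-heading'>Title Book</h2>
<p class='blue-price'>18.45</p>
</div>
"""


def extract_books(html):
    """Return (title, price) tuples using attribute-substring selectors."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for box in soup.select("div[class*=box]"):
        title = box.select_one("h2")
        price = box.select_one("p[class*=price]")
        # skip boxes that lack either piece instead of raising AttributeError
        if title is not None and price is not None:
            results.append((title.get_text(strip=True), price.get_text(strip=True)))
    return results


for page in (html_page1, html_page2):
    print(extract_books(page))
```

This only works as long as every site happens to use "box" and "price" somewhere in its class names, so it is a convenience for similar layouts rather than a general answer to the question.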

answered 2019-11-11T17:00:01.710