I have the following HTML fragment, which repeats several times with different href links:

<div class="product-list-item  margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">

Now I want to get all the href links in this document that are directly after the div tag with the class "product-list-item". I'm pretty new to BeautifulSoup, and nothing that I came up with worked.

Thanks for your ideas.

EDIT: It does not really have to be BeautifulSoup; if it can be done with regex and the Python HTML parser, that is also OK.

EDIT2: What I tried (I'm pretty new to Python, so what I did might be totally stupid from an advanced viewpoint):

import bs4

# htmlsource is the HTML of the page as a string
soup = bs4.BeautifulSoup(htmlsource, "html.parser")
x = soup.find_all("div")
for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].get("class"))

This prints the class list of every "product-list-item" div, but then I tried something like:

print(x[i].get("class").next_element)

Because I thought next_element or next_sibling should give me the next tag but it just leads to AttributeError: 'list' object has no attribute 'next_element'. So I tried with only the first list element:

print(x[i][0].get("class").next_element)

Which led to this error: return self.attrs[key] KeyError: 0. I also tried .find_all("href") and .get("href"), but these all lead to the same errors.

EDIT3: OK, it seems I found out how to solve it. Now I did:

x = soup.find_all("div")

for i in range(len(x)):
    if x[i].get("class") and "product-list-item" in x[i].get("class"):
        print(x[i].next_element.next_element.get("href"))

This can also be shortened by passing the class name as a second argument to find_all and iterating over the results directly:

x = soup.find_all("div", "product-list-item")
for i in x:
    print(i.next_element.next_element.get("href"))

greetings

1 Answer

"I want to get all the href links in this document that are directly after the div tag with the class "product-list-item""

To find the first <a href> element inside each <div>:

links = []
for div in soup.find_all('div', 'product-list-item'):
    a = div.find('a', href=True) # find <a> anywhere in <div>
    if a is not None:
        links.append(a['href'])

This assumes that the link is inside the <div>. Any elements in the <div> that come before the first <a href> are ignored.
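To try this end to end, here is a minimal, self-contained sketch. The sample markup is hypothetical, modelled on the fragment in the question (the second example URL and the <span> elements are made up for illustration), and "html.parser" is chosen only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup modelled on the fragment in the question.
html = """
<div class="product-list-item  margin-bottom">
<a title="titleexample" href="http://www.urlexample.com/example_1" data-style-id="sp_2866">one</a>
</div>
<div class="product-list-item  margin-bottom">
<span>no link in this one</span>
</div>
<div class="product-list-item  margin-bottom">
<span>label</span>
<a href="http://www.urlexample.com/example_2">two</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

links = []
for div in soup.find_all("div", "product-list-item"):
    # First <a> that has an href attribute, anywhere inside this <div>.
    a = div.find("a", href=True)
    if a is not None:
        links.append(a["href"])

print(links)
```

The <div> without a link is skipped, and in the third <div> the <span> before the link is ignored.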

If you'd like, you can be more strict about it, e.g., take the link only if it is the first child of the <div>:

a = div.contents[0] # take the very first child even if it is not a Tag
if a.name == 'a' and a.has_attr('href'):
    links.append(a['href'])
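One thing to watch for: with markup laid out like the fragment in the question, the very first child of the <div> is the newline text between the tags, not the <a>. A possible variation, sketched below, skips whitespace-only text before applying the first-child test. The helper name first_child_link and the sample markup are hypothetical, and skipping whitespace is an assumption of this sketch rather than part of the snippet above.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup; only the first <div> starts with a link.
html = """
<div class="product-list-item  margin-bottom">
<a href="http://www.urlexample.com/example_1">one</a>
</div>
<div class="product-list-item  margin-bottom">
<span>label</span>
<a href="http://www.urlexample.com/example_2">two</a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")


def first_child_link(div):
    """Return the href if the first non-blank child of div is <a href>, else None."""
    for child in div.contents:
        if isinstance(child, Tag):
            if child.name == "a" and child.has_attr("href"):
                return child["href"]
            return None  # the first real child is some other tag
        if child.strip():
            return None  # the first real child is non-blank text
    return None  # the <div> is empty or holds only whitespace


strict_links = []
for div in soup.find_all("div", "product-list-item"):
    href = first_child_link(div)
    if href is not None:
        strict_links.append(href)

print(strict_links)
```

Here the second <div> is rejected because its first real child is a <span>.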

Or, if the <a> is not inside the <div>:

a = div.find_next('a', href=True) # find <a> that appears after <div>
if a is not None:
    links.append(a['href'])
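A small sketch of that case, assuming hypothetical markup in which each link follows its <div> as a sibling instead of being nested in it:

```python
from bs4 import BeautifulSoup

# Hypothetical markup: each <a> comes after its <div>, not inside it.
html = """
<div class="product-list-item  margin-bottom"></div>
<a href="http://www.urlexample.com/example_1">one</a>
<div class="product-list-item  margin-bottom"></div>
<a href="http://www.urlexample.com/example_2">two</a>
"""

soup = BeautifulSoup(html, "html.parser")

links_after = []
for div in soup.find_all("div", "product-list-item"):
    # Search forward in document order from this <div>.
    a = div.find_next("a", href=True)
    if a is not None:
        links_after.append(a["href"])

print(links_after)
```

Note that find_next keeps walking forward through the rest of the document, so a <div> that has no link of its own would pick up the link that belongs to a later <div>.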

There are many ways to search and navigate in BeautifulSoup.
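As one more illustration, CSS selectors via select() can express "direct child" versus "anywhere inside" in a single line. This is a sketch on hypothetical markup; full selector support assumes a reasonably recent bs4 release.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: one link is a direct child, the other is nested in a <span>.
html = """
<div class="product-list-item  margin-bottom">
<a href="http://www.urlexample.com/example_1">one</a>
</div>
<div class="product-list-item  margin-bottom">
<span><a href="http://www.urlexample.com/example_2">two</a></span>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# ">" matches only <a> elements that are direct children of the <div>.
direct = [a["href"] for a in soup.select("div.product-list-item > a[href]")]

# A space matches <a> elements at any depth inside the <div>.
anywhere = [a["href"] for a in soup.select("div.product-list-item a[href]")]

print(direct)
print(anywhere)
```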

If you search using lxml.html, you can also use xpath and css expressions, if you are familiar with them.
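For comparison, a rough lxml.html sketch using XPath, assuming lxml is installed and using hypothetical sample markup. The concat/normalize-space idiom is there because the class attribute holds several space-separated names, so a plain equality test on @class would not match.

```python
import lxml.html

# Hypothetical sample document modelled on the fragment in the question.
html = """
<html><body>
<div class="product-list-item  margin-bottom">
<a href="http://www.urlexample.com/example_1">one</a>
</div>
<div class="other">
<a href="http://www.urlexample.com/ignored">skip</a>
</div>
<div class="product-list-item  margin-bottom">
<a href="http://www.urlexample.com/example_2">two</a>
</div>
</body></html>
"""

doc = lxml.html.fromstring(html)

# Select the href of every <a> that is a direct child of a <div>
# whose class list contains "product-list-item".
xpath = (
    '//div[contains(concat(" ", normalize-space(@class), " "),'
    ' " product-list-item ")]/a/@href'
)
hrefs = [str(h) for h in doc.xpath(xpath)]

print(hrefs)
```

The CSS route in lxml goes through the cssselect() method, which needs the separate cssselect package, so only XPath is shown here.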

answered 2013-05-31T18:58:17.560