<p>This is the first paragraph with some details</p>
<p><a href = "user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href = "user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user2</p></font>
<!-- this pattern repeats for n users -->

This is the structure of my HTML. My aim is to extract the users and their contents, that is, to print everything between two 'a' tags. This is only an example: in the real HTML there are different types of tags between two 'a' tags. I need a way to iterate over all the tags that follow an 'a' tag until the next 'a' tag is found. I hope that's clear.

The code I tried is:

for i in soup.findAll('a'):
    while(i.nextSibling.name!='a'):
        print i.nextSibling

It runs into an infinite loop. If anyone has an idea how I can solve this, please share it.

The expected output is:

username is : user1

text is : This is opening contents for user1 This is the contents from user1 This is more content from user1

username is : user2

text is : This is opening contents for user2 This is the contents from user2 This is more content from user2

and so on.


2 Answers


One option is to find every <a> tag with find_all() and, for each link, use find_all_next() to collect the <font> tags that hold that user's contents, stopping at the next <a>. The following script extracts each user name and its contents and saves both as a tuple in a list:

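from bs4 import BeautifulSoup

results = []

# Name the parser explicitly so the result does not depend on what is installed.
with open('htmlfile') as f:
    soup = BeautifulSoup(f, 'html.parser')

for link in soup.find_all('a'):
    parts = []
    # Walk every <font> or <a> that follows this link in document order.
    for elem in link.find_all_next(['font', 'a']):
        if elem.name == 'a':
            break  # the next user starts here
        # get_text() also works when a <font> has more than one child.
        parts.append(elem.get_text(strip=True))
    results.append((link.get_text(strip=True), ' '.join(parts)))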

It yields:

[('user1', 'This is opening contents for user1 This is the contents from user1 This is more content from user1'),
 ('user2', 'This is opening contents for user2 This is the contents from user2 This is more content from user2')]
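Since the question mentions that the real HTML has other tags besides <font> between two links, here is a minimal sketch of a variant that does not depend on the tag name: it walks next_elements from each <a> and keeps every non-empty text node until the next <a>. The function name contents_by_user and the HTML constant are made up for this illustration, and the sketch assumes the user name is a plain string directly inside the link.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Sample markup modelled on the question (hypothetical constant name).
HTML = """
<p>This is the first paragraph with some details</p>
<p><a href="user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href="user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user2</p></font>
"""


def contents_by_user(html):
    """Return a list of (username, text) tuples, one per <a> tag."""
    soup = BeautifulSoup(html, "html.parser")
    results = []
    for link in soup.find_all("a"):
        parts = []
        # next_elements yields every tag and string after `link`
        # in document order, regardless of nesting.
        for node in link.next_elements:
            if isinstance(node, Tag) and node.name == "a":
                break  # reached the next user's link
            # Keep text nodes only; skip the link's own text (assumption:
            # the user name is a direct child string of the <a>).
            if isinstance(node, NavigableString) and node.parent is not link:
                text = node.strip()
                if text:
                    parts.append(text)
        results.append((link.get_text(strip=True), " ".join(parts)))
    return results


for username, text in contents_by_user(HTML):
    print("username is :", username)
    print("text is :", text)
```

Collecting text nodes rather than tags avoids counting the same text twice when tags are nested, for example a <p> inside a <font>.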
Answered 2013-09-01T12:09:30.850

Try this:

from bs4 import BeautifulSoup

html="""
<p>This is the first paragraph with some details</p>
<p><a href="user123">user1</a><font>This is opening contents for user1</font></p>
<p><font>This is the contents from user1</font></p>
<font><p>This is more content from user1</p></font>
<p><a href="user234">user2</a><font>This is opening contents for user2</font></p>
<p><font>This is the contents from user2</font></p>
<font><p>This is more content from user2</p></font>
"""

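soup = BeautifulSoup(html, 'html.parser')
for i in soup.find_all('a'):
  print('name:', i.text)
  # Start after the link itself, then continue with the siblings of its parent <p>.
  for s in [i.find_next_sibling(), i.parent.find_next_sibling()]:
    while s is not None:
      # Stop at the block that holds the next user's link.
      if s.name == 'a' or s.find('a') is not None:
        break
      print('contents:', s.text)
      s = s.find_next_sibling()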

(Note: find_all is the BeautifulSoup 4 name for findAll and is not available in BeautifulSoup 3. The same applies to find_next_sibling, which was findNextSibling.)
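As for why the loop in the question never ends: `i` is never reassigned inside the `while`, so `i.nextSibling` is the same node on every pass and the condition never changes. A minimal sketch of the fix is to keep a separate cursor and advance it each time. The helper name following_siblings_text and the sample markup are made up for illustration. It only covers siblings inside the same parent, which is why the code above also walks the siblings of the parent <p>.

```python
from bs4 import BeautifulSoup, Tag


def following_siblings_text(link):
    """Collect the text of the siblings after `link`, up to the next <a>."""
    texts = []
    node = link.next_sibling  # cursor, advanced on every pass
    while node is not None:
        if isinstance(node, Tag):
            if node.name == "a":
                break  # the next link ends this user's block
            text = node.get_text(" ", strip=True)
        else:
            text = str(node).strip()  # plain string between tags
        if text:
            texts.append(text)
        node = node.next_sibling  # without this line the loop never ends
    return texts


# Hypothetical one-paragraph sample with two links in the same parent.
sample = (
    '<p><a href="user123">user1</a>'
    "<font>This is opening contents for user1</font> tail "
    '<a href="x">other</a><font>skipped</font></p>'
)
soup = BeautifulSoup(sample, "html.parser")
print(following_siblings_text(soup.find("a")))
```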

Answered 2013-09-01T11:41:15.227