python - BeautifulSoup: grab a given tag only once

Question

I want to grab a parent tag if it contains a marker, say MARKER. For example, I have:

<a>
 <b>
  <c>
  MARKER
  </c>
 </b>
 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>
 <b>
  <c>
  stuff
  </c>
 </b>
</a>

I want to grab:

 <b>
  <c>
  MARKER
  </c>
 </b>

 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>

My current code is:

for stuff in soup.find_all(text=re.compile("MARKER")):
        post = stuff.find_parent("b")

This works, but it gives me:

 <b>
  <c>
  MARKER
  </c>
 </b>

 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>

 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>

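The reason is obvious: it prints the whole containing tag once for every MARKER found, so a tag that contains two MARKERs is printed twice. However, I don't know how to tell BeautifulSoup to stop searching inside a given tag once it is done with it (I suspect that, specifically, this can't be done?), or how to prevent this some other way, other than perhaps indexing everything into a dictionary and rejecting duplicates.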
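The "reject duplicates" idea mentioned above can be sketched as follows. This is a minimal sketch, not a built-in BeautifulSoup feature: `unique_parents` is a hypothetical helper name. In the toy markup the two MARKERs share a single text node, so the sketch separates them with a `<br/>` (an assumption about what the real page looks like) to make the duplicate appear.

```python
import re

from bs4 import BeautifulSoup

# Toy markup; the <br/> is an assumption that puts the two MARKERs
# in separate text nodes, so the second <b> is reached twice.
html = """
<a>
 <b><c>MARKER</c></b>
 <b><c>MARKER<br/>MARKER</c></b>
 <b><c>stuff</c></b>
</a>
"""
soup = BeautifulSoup(html, "html.parser")


def unique_parents(soup, pattern, parent_name):
    """Return each enclosing parent_name tag once, in document order."""
    seen = set()      # id() of every parent already collected
    parents = []
    for node in soup.find_all(string=pattern):
        parent = node.find_parent(parent_name)
        if parent is None or id(parent) in seen:
            continue  # no such ancestor, or already collected
        seen.add(id(parent))
        parents.append(parent)
    return parents


posts = unique_parents(soup, re.compile("MARKER"), "b")
# Three text nodes match, but they belong to only two distinct <b> tags.
for post in posts:
    print(post)
```

Parents are tracked by `id()` rather than by placing the tags themselves in a set, because bs4 tags compare equal when their markup is identical, so two distinct posts with the same content would otherwise be merged into one.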

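Edit: Here is the specific case I'm working on that is giving me trouble, because for some reason the example above, despite being a stripped-down version, doesn't actually reproduce the problem. (In case anyone is curious, I'm scraping a particular play-by-post forum thread.)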

from bs4 import BeautifulSoup
import urllib.request
import re

url = 'http://forums.spacebattles.com/threads/asukaquest-3-starfish-eater.258271/page-179'
soup = urllib.request.urlopen(url).read()
sbsoup = BeautifulSoup(soup)

for stuff in sbsoup.find_all(text=re.compile(r"\[[Xx]\]")):
        post = stuff.find_parent("li")
        print(post.find("a", class_="username").string)
        print(post.find("blockquote", class_="messageText ugc baseHtml").get_text())

score 0 · Accepted Answer

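I wrote this with bs3; it will probably work with bs4 too, and the concept is the same either way. Basically, the li tags all carry the username in a "data-author" attribute, so you don't need to find a lower-level tag and then look up its parent li.

You seem to be interested only in blockquote tags that contain the marker, so why not specify that directly?

Lambda functions are usually the most general way to query Beautiful Soup.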
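As a small illustration on the toy markup from the question (a sketch under the assumption that the container tag is `b` and the marker is the literal text MARKER), a function passed to `find_all` is called once per tag, so each container can appear in the result at most once no matter how many markers it holds:

```python
from bs4 import BeautifulSoup

# Toy markup from the question.
html = """
<a>
 <b>
  <c>
  MARKER
  </c>
 </b>
 <b>
  <c>
  MARKER
  MARKER
  </c>
 </b>
 <b>
  <c>
  stuff
  </c>
 </b>
</a>
"""
soup = BeautifulSoup(html, "html.parser")

# Keep a tag only if it is a <b> whose combined text mentions MARKER.
matches = soup.find_all(
    lambda tag: tag.name == "b" and "MARKER" in tag.get_text()
)

for tag in matches:
    print(tag)
```

This searches for the containers and tests their text, instead of searching for text nodes and walking up to the container, which is where the duplicates came from.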

import re
import urllib.request

from bs4 import BeautifulSoup

# The URL of the thread page to be searched
url = 'http://forums.spacebattles.com/threads/asukaquest-3-starfish-eater.258271/page-179'

# Attempt to open the URL and read the response
try:
    the_page = urllib.request.urlopen(url).read()
except Exception:
    the_page = b""

# If a response came back, create a BeautifulSoup tree from it
if the_page:
    soup = BeautifulSoup(the_page, "html.parser")

    # Match each post: an <li> whose class list includes "message"
    li_location = lambda x: x.name == "li" and "message" in x.get("class", [])
    # Match each <blockquote> whose text contains the [X] / [x] marker
    x_location = lambda x: x.name == "blockquote" and bool(re.search(r"\[[Xx]\]", x.text))

    # Iterate through all the found lis
    for li in soup.find_all(li_location):
        # Print the author name (None if the attribute is missing)
        print(li.get("data-author"))
        # Iterate through all the found blockquotes containing the marker
        for xs in li.find_all(x_location):
            # Print the text of the found blockquote
            print(xs.text)
        print("")
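The same approach can be tried offline. The following is a sketch only: the markup below is made up to resemble the structure described in this answer (an `li` with class "message" and a "data-author" attribute wrapping a `blockquote`); the author names and post text are hypothetical and not taken from the real thread. Unlike the code above, it reports an author only when at least one of their blockquotes contains the marker.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical forum markup, assumed for illustration only.
html = """
<ol>
 <li class="message" data-author="alice">
  <blockquote class="messageText">[X] Plan A<br/>[X] Plan B</blockquote>
 </li>
 <li class="message" data-author="bob">
  <blockquote class="messageText">Just chatting.</blockquote>
 </li>
</ol>
"""
soup = BeautifulSoup(html, "html.parser")
marker = re.compile(r"\[[Xx]\]")

votes = []
# Visit each post once, starting from the <li> instead of the text nodes.
for li in soup.find_all("li", class_="message"):
    for quote in li.find_all("blockquote"):
        # Test the blockquote's combined text, so two markers in one
        # post still produce a single entry.
        if marker.search(quote.get_text()):
            author = li.get("data-author")
            votes.append((author, quote.get_text(" ", strip=True)))

for author, text in votes:
    print(author)
    print(text)
```

Because the loop is driven by the posts rather than by the marker matches, a post with several markers is reported once.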
