python - Get a string from an element, without tags

Question

I have some HTML like this:

<div class="content">
    <div class="title">
        <a id="hlAdv" class="title" href="./sample.aspx">
            <font size=2>Pretty Beauty Fiesta -1st Avenue Mall!</font>
        </a>
    </div>
    19<sup>th</sup> ~ 21<sup>st</sup> Apr 2013
</div>

I am using Python and trying to extract the date with BeautifulSoup. What I expect is:

19th ~ 21st Apr 2013

I tried:

find("div", {"class":"content"}).text

Output:

Pretty Beauty Fiesta -1st Avenue Mall!19th ~ 21st Apr 2013

And:

find("div", {"class":"content"}).div.nextSibling

Output:

I tried chaining more nextSibling calls to reach the rest of the content, but I still cannot correctly get "st Apr 2013".

How can I get the data I want? Thank you.

score 0 · Accepted Answer

Your problem is that you want the text that follows the nested div.

You want to loop over .next_siblings here:

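# Assumes `soup` is a BeautifulSoup object parsed from the HTML in the question.
content_div = soup.find('div', class_='content')
text = []
# Walk every node that follows the nested title div.
for elem in content_div.div.next_siblings:
    try:
        # Tags (such as <sup>) contribute all of their strings.
        text.extend(elem.strings)
    except AttributeError:
        # Fallback for nodes that have no .strings attribute: append the string itself.
        text.append(elem)
# Join without a separator so that "19" and "th" stay adjacent.
text = ''.join(text).strip()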

.next_siblings is a generator that simply follows the chain of .next_sibling attributes, including NavigableString elements.

The result is:

>>> ''.join(text).strip()
u'19th ~ 21st Apr 2013'
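For anyone who wants to run the idea end to end, here is a minimal self-contained sketch. It assumes the bs4 package is installed and uses the built-in "html.parser" backend; the function name text_after_first_child_div and the isinstance check (used instead of try/except) are choices made for this illustration, not part of the answer above.

```python
from bs4 import BeautifulSoup, Tag

# The HTML from the question, embedded so the sketch runs on its own.
HTML = """
<div class="content">
    <div class="title">
        <a id="hlAdv" class="title" href="./sample.aspx">
            <font size=2>Pretty Beauty Fiesta -1st Avenue Mall!</font>
        </a>
    </div>
    19<sup>th</sup> ~ 21<sup>st</sup> Apr 2013
</div>
"""


def text_after_first_child_div(html):
    """Return the text that follows the first nested div (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    content_div = soup.find("div", class_="content")
    parts = []
    for elem in content_div.div.next_siblings:
        if isinstance(elem, Tag):
            # A tag such as <sup>: take all the text inside it.
            parts.append(elem.get_text())
        else:
            # A bare text node between tags.
            parts.append(str(elem))
    # Join with no separator, then trim the surrounding newlines and indentation.
    return "".join(parts).strip()


result = text_after_first_child_div(HTML)
print(result)
```

The title link is skipped because iteration starts at the siblings after the nested div rather than at the children of the outer div.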

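Whitespace handling can be a little tricky here; stripping afterwards works best for this particular example, but for other cases using elem.stripped_strings and elem.strip() may work as well.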
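To make the whitespace trade-off concrete, here is a small sketch in plain Python. The parts list is an assumption: it mirrors the fragments that walking the siblings yields for the question's HTML. Stripping each fragment individually is roughly what happens when stripped strings are joined with a space.

```python
# Assumed fragments, as produced by walking the siblings after the title div.
parts = ["\n    19", "th", " ~ 21", "st", " Apr 2013\n"]

# Option 1: join first, strip afterwards. Ordinal suffixes stay attached.
joined_then_stripped = "".join(parts).strip()

# Option 2: strip every fragment and join with spaces.
# This separates "19" from "th", which is wrong for this particular input.
stripped_then_spaced = " ".join(p.strip() for p in parts)

# Option 3: join first, then collapse any run of whitespace to one space.
normalized = " ".join("".join(parts).split())

print(repr(joined_then_stripped))
print(repr(stripped_then_spaced))
print(repr(normalized))
```

Option 3 can be useful when the source HTML contains line breaks or indentation in the middle of the text, which a single strip() at the ends would leave in place.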

score 0 · Accepted Answer

How about this? It uses element.nextSiblingGenerator to walk the elements that follow the div you care about, and ignores the trailing None.

d = s.find('div', {'class':'content'}).div

def all_text_after(element):
    for item in element.nextSiblingGenerator():
        if not item:
            continue
        elif hasattr(item, 'contents'):
            for c in item.contents:
                yield c
        else:
            yield item

text_parts = list(all_text_after(d))
# -> [u'\n    19', u'th', u' ~ 21', u'st', u' Apr 2013\n']

print(''.join(text_parts))
# ->     19th ~ 21st Apr 2013
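A possible variation for current bs4 releases, where next_siblings is the modern name for the older nextSiblingGenerator: the sketch below keeps the generator shape but yields every nested string of a tag instead of its direct children, so tags inside tags still produce text. The sample markup and the Python 3 syntax are assumptions made for this illustration.

```python
from bs4 import BeautifulSoup, Tag

# Hypothetical markup with a tag nested inside another tag after the title div.
SAMPLE = '<div class="content"><div class="title">Title</div> a<b>b<i>c</i></b> d</div>'


def all_text_after(element):
    """Yield every text fragment found in the siblings that follow element."""
    for item in element.next_siblings:
        if isinstance(item, Tag):
            # .strings walks all descendants, so <i> inside <b> is covered too.
            for s in item.strings:
                yield str(s)
        else:
            # A bare text node.
            yield str(item)


soup = BeautifulSoup(SAMPLE, "html.parser")
d = soup.find("div", class_="content").div
text_parts = list(all_text_after(d))
print("".join(text_parts).strip())
```

With item.contents, the nested i element would be yielded as a tag object rather than a string, and the final join would fail; iterating over .strings avoids that case.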
