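I am working on a small project that extracts occurrences of political leaders in newspapers. Sometimes a politician is mentioned, but neither a parent nor a child of the matching element carries a link (due, I guess, to semantically poor markup).
So I want to create a function that finds the nearest link and then extracts it. In the case below the search string is Rasmussen, and the link I want is: /307046.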
#-*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import re
tekst = '''
<li>
<div class="views-field-field-webrubrik-value">
<h3>
<a href="/307046">Claus Hjort spiller med mærkede kort</a>
</h3>
</div>
<div class="views-field-field-skribent-uid">
<div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
</div>
<div class="views-field-field-webteaser-value">
<div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
genkomst som statsministe
</div>
</div>
<span class="views-field-view-node">
<span class="actions">
<a href="/307046">Læs mere</a>
|
<a href="/307046/#comments">Kommentarer (4)</a>
</span>
</span>
</li>
'''
to_find = "Rasmussen"
soup = BeautifulSoup(tekst, "html.parser")
contexts = soup.find_all(text=re.compile(to_find))


def find_nearest(element, url, direction="both"):
    """Find the nearest link, relative to a text string.
    When complete it will search up and down (parent, child),
    and only X levels up down. These features are not implemented yet.
    Will then return the link the fewest steps away from the
    original element. Assumes we have already found an element"""
    # Is the nearest link readily available?
    # If so - this works and extracts the link.
    if element.find_parents('a'):
        for artikel_link in element.find_parents('a'):
            link = artikel_link.get('href')
            # sometimes the link is a relative link - sometimes it is not
            if "http" not in link and "www" not in link:
                link = url + link
            return link
    # But if the link is not readily available, we will go up
    # This is (I think) where it goes wrong
    # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
    if not element.find_parents('a'):
        element = element.parent
        # Print for debugging
        print element  # on the 2nd run (i.e. <li>) this finds <a href="/307046">
        # So shouldn't it be caught as readily available above?
        print u"Found: %s" % element.name
        # the recursive call
        find_nearest(element, url)


# run it
if contexts:
    for a in contexts:
        find_nearest(element=a, url="http://information.dk")
The direct call below works:
print contexts[0].parent.parent.parent.a['href'].encode('utf-8')
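To illustrate the difference between the two approaches, here is a minimal sketch. It assumes Python 3 and a recent bs4 (hence `string=` and `print()`), and it uses a trimmed-down copy of the markup above; the variable names are made up for the example. As far as I understand it, `find_parents('a')` only inspects ancestors of the text node, while the `<a>` sits inside a sibling branch of the `<li>`, so it is only reachable by searching downwards from an ancestor:

```python
from bs4 import BeautifulSoup
import re

# Trimmed-down copy of the markup from the question (assumption: same nesting).
html = '''
<li>
<div class="views-field-field-webrubrik-value">
<h3><a href="/307046">Claus Hjort ...</a></h3>
</div>
<div class="views-field-field-webteaser-value">
<div class="webteaser">... en Løkke Rasmussens genkomst ...</div>
</div>
</li>
'''

soup = BeautifulSoup(html, "html.parser")
node = soup.find(string=re.compile("Rasmussen"))

# find_parents() only looks upwards: no ancestor of the text node is an <a>.
ancestors_that_are_links = node.find_parents("a")

# find() looks downwards: the enclosing <li> does contain an <a>.
li = node.find_parent("li")
first_link_inside_li = li.find("a")

print(len(ancestors_that_are_links))   # 0
print(first_link_inside_li["href"])    # /307046
```

So printing the `<li>` shows the link because the link is a descendant of the `<li>`, not because it is one of its parents.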
For reference, the whole sorry code is on bitbucket: https://bitbucket.org/achristoffersen/politikere-i-medierne
(P.S. I am using BeautifulSoup 4.)
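EDIT: SimonSapin asked me to define nearest: by nearest I mean the link that is the fewest nesting levels away from the search term, in either direction. In the text above, the a href generated by the Drupal-based newspaper site is neither a direct parent nor a child of the tag in which the search string is found, so BeautifulSoup cannot find it.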
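One way to make that definition concrete is to count parent/child moves through the tree and return the first link reached. The following is only a rough sketch of that idea, assuming Python 3 and a recent bs4; the function name `nearest_link_by_steps` and the `max_steps` parameter are hypothetical, and the markup is a trimmed copy of the example above:

```python
from collections import deque
from bs4 import BeautifulSoup, Tag
import re


def nearest_link_by_steps(start, max_steps=10):
    """Return (href, steps) for the <a href> reachable in the fewest
    parent/child moves from `start`, or None if none is within max_steps."""
    # Track visited nodes by identity, since bs4 tags compare by content.
    seen = {id(start)}
    queue = deque([(start, 0)])
    while queue:
        node, steps = queue.popleft()
        if isinstance(node, Tag) and node.name == "a" and node.has_attr("href"):
            return node["href"], steps
        if steps == max_steps:
            continue
        neighbours = []
        if node.parent is not None:
            neighbours.append(node.parent)          # one level up
        if isinstance(node, Tag):
            # one level down, element children only
            neighbours.extend(c for c in node.children if isinstance(c, Tag))
        for neighbour in neighbours:
            if id(neighbour) not in seen:
                seen.add(id(neighbour))
                queue.append((neighbour, steps + 1))
    return None


html = '''
<li>
<div class="views-field-field-webrubrik-value">
<h3><a href="/307046">Claus Hjort ...</a></h3>
</div>
<div class="views-field-field-webteaser-value">
<div class="webteaser">... en Løkke Rasmussens genkomst ...</div>
</div>
</li>
'''

soup = BeautifulSoup(html, "html.parser")
start = soup.find(string=re.compile("Rasmussen"))
result = nearest_link_by_steps(start)
print(result)
```

In this trimmed markup the link is three moves up (to the `<li>`) and three moves down (to the `<a>`), so the sketch should report six steps. Ties between equally distant links are broken by document order here, which is an arbitrary choice.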
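I suspect that "fewest characters away" would often work too. In that case a solution could be hacked together with find and rfind, but I would really like to do this through BS. Since this works: contexts[0].parent.parent.parent.a['href'].encode('utf-8'), it must be possible to generalize it into a script.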
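A possible generalization of the `.parent.parent.parent.a` chain is to climb one ancestor at a time and, at each level, ask that ancestor for its first `<a href>`. This is a minimal sketch under a few assumptions: Python 3 (for `urllib.parse.urljoin`), a recent bs4, and a hypothetical function name `nearest_link_upwards` with a hypothetical `max_levels` limit. It only climbs; it does not weigh how deep the link sits below the ancestor.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup
import re


def nearest_link_upwards(element, base_url, max_levels=None):
    """Climb from `element` through its ancestors and return the absolute URL
    of the first <a href> found on or below an ancestor, or None."""
    for level, ancestor in enumerate(element.parents, start=1):
        if max_levels is not None and level > max_levels:
            break
        # The ancestor itself may be the link ...
        if ancestor.name == "a" and ancestor.has_attr("href"):
            return urljoin(base_url, ancestor["href"])
        # ... otherwise look for a link anywhere below it.
        link = ancestor.find("a", href=True)
        if link is not None:
            return urljoin(base_url, link["href"])
    return None


html = '''
<li>
<div class="views-field-field-webrubrik-value">
<h3><a href="/307046">Claus Hjort ...</a></h3>
</div>
<div class="views-field-field-webteaser-value">
<div class="webteaser">... en Løkke Rasmussens genkomst ...</div>
</div>
</li>
'''

soup = BeautifulSoup(html, "html.parser")
hit = soup.find(string=re.compile("Rasmussen"))
found = nearest_link_upwards(hit, "http://information.dk")
print(found)
```

`urljoin` takes care of the relative-versus-absolute question, so no manual check for "http" or "www" is needed in this variant.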
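EDIT: Maybe I should have emphasized that I am looking for a BeautifulSoup solution. I think combining BS with a custom/simple breadth-first search, as suggested by @erik85, would quickly get messy.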