
I'm working on a small project that extracts mentions of political leaders from newspapers. Sometimes a politician is mentioned, but no parent or child element contains a link (due to semantically poor markup, I guess).

So I want to write a function that finds the nearest link and extracts it. In the case below the search string is Rasmussen, and the link I want is /307046:

#-*- coding: utf-8 -*-

from bs4 import BeautifulSoup
import re

tekst = '''
<li>
  <div class="views-field-field-webrubrik-value">
    <h3>
      <a href="/307046">Claus Hjort spiller med mærkede kort</a>
    </h3>
  </div>
  <div class="views-field-field-skribent-uid">
    <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
  </div>
  <div class="views-field-field-webteaser-value">
    <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
      trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
      snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
      genkomst som statsministe
    </div>
  </div>
  <span class="views-field-view-node">
    <span class="actions">
      <a href="/307046">Læs mere</a>
      |
      <a href="/307046/#comments">Kommentarer (4)</a>
    </span>
  </span>
</li>
'''

to_find = "Rasmussen"
soup = BeautifulSoup(tekst)
contexts = soup.find_all(text=re.compile(to_find)) 

def find_nearest(element, url, direction="both"):
    """Find the nearest link, relative to a text string.
    When complete it will search up and down (parent, child),
    and only X levels up down. These features are not implemented yet.
    Will then return the link the fewest steps away from the
    original element. Assumes we have already found an element"""

    # Is the nearest link readily available?
    # If so - this works and extracts the link.
    if element.find_parents('a'):
        for artikel_link in element.find_parents('a'):
            link = artikel_link.get('href')
            # sometimes the link is a relative link - sometimes it is not
            if ("http" or "www") not in link:
                link = url+link
                return link
    # But if the link is not readily available, we will go up
    # This is (I think) where it goes wrong
    # ↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓↓
    if not element.find_parents('a'):
        element =  element.parent
        # Print for debugging
        print element #on the 2nd run (i.e <li> this finds <a href=/307056> 
        # So shouldn't it be caught as readily available above?
        print u"Found: %s" % element.name
        # the recursive call
        find_nearest(element,url)

# run it
if contexts:
    for a in contexts:
        find_nearest( element=a, url="http://information.dk")
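As an aside, the relative-link test in the code above has a latent bug unrelated to the recursion question: `("http" or "www") not in link` evaluates `("http" or "www")` to just `"http"`, so `"www"` is never checked. The standard library's `urljoin` (shown here under Python 3; in Python 2 it lives in the `urlparse` module) handles both absolute and relative hrefs without any substring test:

```python
from urllib.parse import urljoin  # Python 2: from urlparse import urljoin

# urljoin resolves a relative href against the base URL...
print(urljoin("http://information.dk", "/307046"))
# http://information.dk/307046

# ...and leaves an already-absolute href untouched.
print(urljoin("http://information.dk", "http://example.com/x"))
# http://example.com/x
```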

The direct call below does work:

print contexts[0].parent.parent.parent.a['href'].encode('utf-8')

For reference, the whole sorry script is on Bitbucket: https://bitbucket.org/achristoffersen/politikere-i-medierne

(PS: I'm using BeautifulSoup 4.)


EDIT: SimonSapin asked me to define nearest: by nearest I mean the link the fewest nesting levels away from the search term, in either direction. In the text above, the a href generated by the Drupal-based newspaper site is neither a direct parent nor a child of the tag where the search string is found, so BeautifulSoup can't find it.

I suspect "fewest characters away" would often work too. In that case a solution could be hacked together with find and rfind, but I'd really like to do this with BeautifulSoup. Since this works: contexts[0].parent.parent.parent.a['href'].encode('utf-8'), it must be possible to generalize it in a script.
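For what it's worth, the find/rfind character-distance hack mentioned above could look roughly like this. This is only a sketch, not a BeautifulSoup solution, and `nearest_href_by_chars` is my own name for it: scan every `href` attribute with a regex and keep the one whose match lies the fewest characters from the search term.

```python
import re

def nearest_href_by_chars(html, needle):
    """Return the href attribute closest (in characters) to the
    first occurrence of needle, or None if needle is absent."""
    pos = html.find(needle)
    if pos == -1:
        return None
    best_dist, best_href = None, None
    for m in re.finditer(r'href="([^"]+)"', html):
        # distance from either end of the href attribute to the match
        dist = min(abs(m.start() - pos), abs(m.end() - pos))
        if best_dist is None or dist < best_dist:
            best_dist, best_href = dist, m.group(1)
    return best_href
```

Note that character distance and nesting distance can disagree, which is why a tree-based search is the more robust approach.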

EDIT: Maybe I should stress that I'm looking for a BeautifulSoup solution. Combining BS with a custom/homebrewed breadth-first search, as suggested by @erik85, would quickly get messy, I think.


2 Answers


Someone will probably come up with a copy-and-paste solution, and you'll think it solves your problem. Your problem isn't the code, though! It's your strategy. There is a software design principle called divide and conquer that you should apply when redesigning your code: separate the code that interprets the HTML string as a tree/graph from the code that searches for the nearest node (probably a breadth-first search). Not only will you learn to design better software, your problem will probably cease to exist.

I think you're smart enough to solve this yourself, but here is a skeleton:

def parse_html(txt):
    """ reads a string of html and returns a dict/list/tuple presentation"""
    pass

def breadth_first_search(graph, start, end):
    """ finds the shortest way from start to end
    You can probably customize start and end to work well with the input you want
    to provide. For implementation details see the link in the text above.
    """
    pass

def find_nearest_link(html,name):
    """putting it all together"""
    return breadth_first_search(parse_html(html),name,"link")
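For the `breadth_first_search` stub, a minimal sketch over a plain adjacency-dict graph (my own representation here; `parse_html` would have to produce something compatible) might be:

```python
from collections import deque

def breadth_first_search(graph, start, goal):
    """Shortest path in an adjacency-dict graph such as
    {'a': ['b', 'c'], 'b': ['d'], ...}; returns a list of nodes or None."""
    queue = deque([[start]])   # queue of paths, shortest first
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(path + [neighbour])
    return None
```

Because the queue is processed in FIFO order, the first path that reaches the goal is guaranteed to be among the shortest.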

PS: Doing it this way also applies another principle, this one from mathematics: you have a problem you don't know a solution to (finding a link close to a chosen substring) and a set of problems you do know solutions to (graph traversal). Transform your problem so that it matches one from the set you can solve; then you can apply basic solution patterns (probably already implemented in your language/framework of choice) and you're done.

Answered 2012-08-04T12:02:49.643

Here is a solution using lxml. The main idea is to find all preceding and following elements and then iterate over them round-robin:

def find_nearest(elt):
    preceding = elt.xpath('preceding::*/@href')[::-1]
    following = elt.xpath('following::*/@href')
    parent = elt.xpath('parent::*/@href')
    for href in roundrobin(parent, preceding, following):
        return href

A similar solution should also be possible with BeautifulSoup's (bs4) next_elements and previous_elements.
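A hedged sketch of that BeautifulSoup variant (the names `find_nearest_bs4` and this simplified `roundrobin` are mine): `find_all_previous` and `find_all_next` walk away from the element in document order, so interleaving them round-robin returns the href reachable in the fewest steps first.

```python
import re
from bs4 import BeautifulSoup

def roundrobin(*iterables):
    # Yield one item from each iterable in turn, dropping exhausted ones.
    iters = [iter(it) for it in iterables]
    while iters:
        for it in list(iters):
            try:
                yield next(it)
            except StopIteration:
                iters.remove(it)

def find_nearest_bs4(element):
    # hrefs of <a> ancestors, then of <a> tags before/after in document order
    parents = (a['href'] for a in element.find_parents('a', href=True))
    preceding = (a['href'] for a in element.find_all_previous('a', href=True))
    following = (a['href'] for a in element.find_all_next('a', href=True))
    for href in roundrobin(parents, preceding, following):
        return href  # first hit is the nearest candidate

html = '<h3><a href="/307046">headline</a></h3><p>om Rasmussen</p>'
soup = BeautifulSoup(html, 'html.parser')
node = soup.find(string=re.compile('Rasmussen'))
print(find_nearest_bs4(node))  # /307046
```

Like the lxml version, this measures distance in document order rather than in nesting levels, which matches the example but is not exactly the "fewest nesting levels" metric the question defines.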


import lxml.html as LH
import itertools

def find_nearest(elt):
    preceding = elt.xpath('preceding::*/@href')[::-1]
    following = elt.xpath('following::*/@href')
    parent = elt.xpath('parent::*/@href')
    for href in roundrobin(parent, preceding, following):
        return href

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    # http://docs.python.org/library/itertools.html#recipes
    # Author: George Sakkis
    pending = len(iterables)
    nexts = itertools.cycle(iter(it).next for it in iterables)
    while pending:
        try:
            for n in nexts:
                yield n()
        except StopIteration:
            pending -= 1
            nexts = itertools.cycle(itertools.islice(nexts, pending))
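Note that the recipe above is Python 2 (`iter(it).next`). Under Python 3 the bound method is spelled `__next__`; an otherwise-identical port:

```python
import itertools

def roundrobin(*iterables):
    "roundrobin('ABC', 'D', 'EF') --> A D E B F C"
    pending = len(iterables)
    # Python 3: iterators expose __next__ instead of .next
    nexts = itertools.cycle(iter(it).__next__ for it in iterables)
    while pending:
        try:
            for nxt in nexts:
                yield nxt()
        except StopIteration:
            pending -= 1
            nexts = itertools.cycle(itertools.islice(nexts, pending))

print(''.join(roundrobin('ABC', 'D', 'EF')))  # ADEBFC
```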

tekst = '''
<li>
  <div class="views-field-field-webrubrik-value">
    <h3>
      <a href="/307046">Claus Hjort spiller med mærkede kort</a>
    </h3>
  </div>
  <div class="views-field-field-skribent-uid">
    <div class="byline">Af: <span class="authors">Dennis Kristensen</span></div>
  </div>
  <div class="views-field-field-webteaser-value">
    <div class="webteaser">Claus Hjort Frederiksens argumenter for at afvise
      trepartsforhandlinger har ikke hold i virkeligheden. Hans ærinde er nok
      snarere at forberede det ideologiske grundlag for en Løkke Rasmussens
      genkomst som statsministe
    </div>
  </div>
  <span class="views-field-view-node">
    <span class="actions">
      <a href="/307046">Læs mere</a>
      |
      <a href="/307046/#comments">Kommentarer (4)</a>
    </span>
  </span>
</li>
'''

to_find = "Rasmussen"
doc = LH.fromstring(tekst)

for x in doc.xpath('//*[contains(text(),{s!r})]'.format(s = to_find)):
    print(find_nearest(x))

which yields

/307046
Answered 2012-08-07T20:20:28.850