python-2.7 - Python Goose extractor - “KNOWN_ARTICLE_CONTENT_TAGS” flow does not seem to work

Question

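I am using python goose2 with Python 2.7.

KNOWN_ARTICLE_CONTENT_TAGS, where you put the tags, classes, or ids of the article content you want to extract, does not seem to work.

For example, take the default tags in it: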

KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]

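My first question is: what exactly is the expected logic for how these values are used?

Does it treat all the text inside these nodes as the article text by default?
Do they only act as hints for goose to boost the score of text inside these nodes, with no guarantee that all of the content will show up?
Does goose ignore other common generic tags and consider only these?
I see that if it returns true, it skips adding siblings. What does that mean?

After some debugging, however, I found that text inside the listed tags gets no special preference. In fact, not calling the known-article code at all produces exactly the same output, and for some reason image extraction fails on some sources when the known tags are used.

Digging further, I also came across this function: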

def get_known_article_tags(self):
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        nodes = self.parser.getElementsByTag(
            self.article.doc,
            **item)
        if len(nodes):
            return nodes[0]
    return None

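It operates on the article.doc object, which does not seem to have any tags.

On almost all posts it also returns only the element with the article tag, not the element with the attribute itemprop = articleBody, even when the article has one.

I debugged the is_articlebody function, as can be seen from the code below: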

def is_articlebody(self, node):
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        # attribute
        if "attr" in item and "value" in item:
            if self.config.debug:
                print 'for attr and value'
                print self.parser.getAttribute(node, item['attr'])
                print item['value']
                print node
            if self.parser.getAttribute(node, item['attr']) == item['value']:
                if self.config.debug:
                    print 'is article body from attribute'
                return True
        # tag
        if "tag" in item:
            print 'if tag'
            print node.tag
            if node.tag == item['tag']:
                if self.config.debug:
                    print 'is article body from tag'
                return True

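I noticed that this function never returns true, even when the target document contains matching tags/classes.

The line print self.parser.getAttribute(node, item['attr']) always comes back as null (None).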
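The check in is_articlebody can be reproduced outside goose to see when the attribute lookup yields None. The following is only a minimal sketch: it uses the standard library's xml.etree.ElementTree in place of goose's parser, and it assumes that getAttribute reads only the node's own attributes, so a node nested inside a matching element does not match by itself. The trailing return False is an addition for the sketch.

```python
import xml.etree.ElementTree as ET

KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]


def is_articlebody(node):
    """Standalone version of the check; node.get() stands in for getAttribute."""
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        # attribute pattern: compare the node's own attribute with the value
        if 'attr' in item and 'value' in item:
            if node.get(item['attr']) == item['value']:
                return True
        # tag pattern: compare the node's tag name
        if 'tag' in item and node.tag == item['tag']:
            return True
    return False  # added for the sketch: no pattern matched


root = ET.fromstring("<div itemprop='articleBody'><p>some text</p></div>")
paragraph = root.find('p')

# The outer div matches, the nested paragraph does not, and looking up
# 'itemprop' on the paragraph gives None because it has no such attribute.
results = (is_articlebody(root), is_articlebody(paragraph),
           paragraph.get('itemprop'))
```

If the nodes handed to is_articlebody are mostly paragraphs or their parents rather than the element carrying the attribute, a None lookup like this would be expected.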

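How can I make goose take all the text inside the attributes/classes/tags listed in the known list? As in the example above, I want to get all the text inside multiple p tags (they could be tags other than p), regardless of score.

Edit: while debugging further, I realized that the get_known_article_tags function returns only the first tag/attribute match found from the list of dictionaries; note the return nodes[0].

So it returns only that single node of the document, and only that node object is then passed on to find the top node. If that node does not meet the conditions for a good/top node, the result comes back empty, and extraction fails.

How can I combine all the node objects in the nodes list and pass all of them on as the document to parse, so that they are used to find the top node?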

score 0 · Accepted Answer

I managed to solve the problem described in this question.

I changed the scope of the return statement and passed on the whole array, like this:

def get_known_article_tags(self):
    # Collect the matches for every known pattern, not only the last one
    nodes = []
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        nodes.extend(self.parser.getElementsByTag(
            self.article.doc,
            **item))
    if len(nodes):
        return nodes
    return None
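To see the difference between returning the first match and returning every match without running goose, here is a minimal standalone sketch. It assumes the standard library's xml.etree.ElementTree instead of goose's parser; find_elements is a hypothetical stand-in for parser.getElementsByTag, and first_known_node and all_known_nodes are names made up for the comparison.

```python
import xml.etree.ElementTree as ET

KNOWN_ARTICLE_CONTENT_TAGS = [
    {'attr': 'itemprop', 'value': 'articleBody'},
    {'attr': 'class', 'value': 'post-content'},
    {'tag': 'article'},
]


def find_elements(root, tag=None, attr=None, value=None):
    """Hypothetical stand-in for parser.getElementsByTag (not goose's code)."""
    matches = []
    for node in root.iter():
        if tag is not None and node.tag != tag:
            continue
        if attr is not None and node.get(attr) != value:
            continue
        matches.append(node)
    return matches


def first_known_node(root):
    """Original behaviour: stop at the first pattern that matches anything."""
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        nodes = find_elements(root, **item)
        if nodes:
            return nodes[0]
    return None


def all_known_nodes(root):
    """Changed behaviour: gather the matches of every pattern into one list."""
    nodes = []
    for item in KNOWN_ARTICLE_CONTENT_TAGS:
        for node in find_elements(root, **item):
            # skip a node that an earlier pattern already collected
            if not any(node is seen for seen in nodes):
                nodes.append(node)
    return nodes or None


SAMPLE = (
    "<html><body>"
    "<div itemprop='articleBody'><p>first part</p></div>"
    "<div class='post-content'><p>second part</p></div>"
    "<article><p>third part</p></article>"
    "</body></html>"
)
root = ET.fromstring(SAMPLE)
first = first_known_node(root)   # a single element
every = all_known_nodes(root)    # a list of elements
```

With the sample document, first is only the itemprop div, while every holds all three containers, which is the list that later steps then have to iterate over.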

Then I passed the same array of nodes to the cleaner one node at a time (looping over the array), and passed the whole array to calculate_best_node as:

self.article.top_node = self.extractor.calculate_best_node(doc)
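The cleaning step mentioned above (clean each node, then hand over the whole list) might look roughly like the following standalone sketch. It does not use goose's cleaner: strip_scripts and clean_each are hypothetical names, and xml.etree.ElementTree stands in for goose's parser.

```python
import copy
import xml.etree.ElementTree as ET


def strip_scripts(node):
    """Hypothetical stand-in for a cleaner.

    Removes <script> descendants (together with their tail text) in place
    and returns the same node.
    """
    for parent in list(node.iter()):
        for child in list(parent):
            if child.tag == 'script':
                parent.remove(child)
    return node


def clean_each(nodes, clean=strip_scripts):
    """Clean a deep copy of every node and return the cleaned copies as a list."""
    return [clean(copy.deepcopy(node)) for node in nodes]


body = ET.fromstring("<div><p>text</p><script>x()</script></div>")
side = ET.fromstring("<article><p>more</p></article>")

# The cleaned list keeps the order of the input; the originals are untouched
cleaned = clean_each([body, side])
```

Working on deep copies is a design choice of this sketch, so that the original document stays available if another step still needs it.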

Then I added an extra loop to the nodes_to_check function so that it checks all the nodes in the array:

def nodes_to_check(self, docs):
    """\
    returns a list of nodes we want to search
    on like paragraphs and tables
    """
    nodes_to_check = []

    for doc in docs:
        for tag in ['p', 'pre', 'td']:
            items = self.parser.getElementsByTag(doc, tag=tag)
            nodes_to_check += items
    return nodes_to_check
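The same loop can be tried outside goose. This is only a sketch under assumptions: it uses xml.etree.ElementTree in place of goose's parser, and it additionally wraps a single node in a list (my addition, not part of the change above) so that callers still passing one document keep working.

```python
import xml.etree.ElementTree as ET


def nodes_to_check(docs):
    """Collect p, pre and td elements from one root node or a list of them."""
    # Assumption for this sketch: accept a single node as well as a list
    if not isinstance(docs, (list, tuple)):
        docs = [docs]
    found = []
    for doc in docs:
        for tag in ['p', 'pre', 'td']:
            # Element.iter(tag) walks the subtree, including doc itself
            found.extend(doc.iter(tag))
    return found


first = ET.fromstring("<div><p>a</p><pre>b</pre></div>")
second = ET.fromstring(
    "<article><p>c</p><table><tr><td>d</td></tr></table></article>")

texts = [node.text for node in nodes_to_check([first, second])]
single = [node.text for node in nodes_to_check(first)]
```

The results are grouped per root node and, within each root, per tag in the order p, pre, td.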

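This solved the problem of only a single element being returned.

I was able to come up with this by looking at the logic of the Python 3 goose code, which is better maintained, and implementing it in Python 2.7 syntax.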
