
I am using the Goose engine to extract article text from a URL with the following code:

from goose import Goose

g = Goose()
article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")

It looks like this URL is causing a problem, because I get the following error:

'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte
The string that could not be encoded/decoded was: �
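
For reference, the same kind of failure can be reproduced without Goose. The snippet below is only a minimal sketch: the sample bytes are made up for illustration and are not taken from the page above. It shows that a lone 0xa0 byte (a non-breaking space in Latin-1 / cp1252) is rejected by the UTF-8 decoder, while the same bytes decode cleanly as Latin-1.

```python
# Hypothetical sample bytes; 0xa0 is a non-breaking space in Latin-1 / cp1252.
raw = b"a\xa0b"

try:
    raw.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    # The 0xa0 byte at index 1 cannot start a UTF-8 sequence.
    error = exc

# The same bytes are valid Latin-1, where 0xa0 maps to U+00A0.
as_latin1 = raw.decode("latin-1")
```

This suggests the page, or part of it, may not actually be UTF-8 encoded, which is a property of the downloaded bytes rather than of the source file.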

I am correctly specifying utf-8 as my codec at the top of my file, like so:

# -*- coding: utf-8 -*-

How can I fix this?

Edit: stack trace:

Environment:


Request Method: GET
Request URL: http://localhost:3000/scansources/

Django Version: 1.5.1
Python Version: 2.7.2
Installed Applications:
('django.contrib.auth',
 'django.contrib.contenttypes',
 'django.contrib.sessions',
 'django.contrib.sites',
 'django.contrib.messages',
 'django.contrib.staticfiles',
 'summaries',
 'sources_scan')
Installed Middleware:
('django.middleware.common.CommonMiddleware',
 'django.contrib.sessions.middleware.SessionMiddleware',
 'django.middleware.csrf.CsrfViewMiddleware',
 'django.contrib.auth.middleware.AuthenticationMiddleware',
 'django.contrib.messages.middleware.MessageMiddleware')


Traceback:
File "/Library/Python/2.7/site-packages/django/core/handlers/base.py" in get_response
  115.                         response = callback(request, *callback_args, **callback_kwargs)
File "/Users/yonatanoren/Documents/python/summarizer/sources_scan/views.py" in scan_sources
  183.              article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/__init__.py" in extract
  53.         return self.crawl(cc)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/__init__.py" in crawl
  60.         article = crawler.crawl(crawl_candiate)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/crawler.py" in crawl
  90.         article.top_node = extractor.calculate_best_node(article)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/extractors.py" in calculate_best_node
  248.             text_node = self.parser.getText(node)
File "/Library/Python/2.7/site-packages/goose_extractor-1.0.2-py2.7.egg/goose/parsers.py" in getText
  179.         txts = [i for i in node.itertext()]

Exception Type: UnicodeDecodeError at /scansources/
Exception Value: 'utf8' codec can't decode byte 0xa0 in position 1: invalid start byte

Thanks.

Edit: I get the same error in the Python shell with this code:

>>> g = Goose()
>>> article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")

I have also updated all of my files to use the following, but I still get the error.

#encoding=utf-8

I believe this may be a problem with Goose itself, since Goose is what processes the text and returns it. How would I work around it in that case?

Edit: The following makes no difference either:

text = unicode(article.cleaned_text,'utf-8')

4 Answers


You could try raw_html extraction: https://github.com/grangier/python-goose#known-issues

That way you can do the encoding/decoding on the raw HTML yourself before handing it to Goose.
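
The decoding step could look something like the sketch below. The helper name decode_html and the list of fallback encodings are assumptions made for illustration; they are not part of Goose, and the right encoding for a given page may differ.

```python
def decode_html(raw_bytes, encodings=("utf-8", "cp1252")):
    """Decode page bytes with the first encoding that accepts them.

    Hypothetical helper: the candidate encodings are a guess, not
    something detected from the page.
    """
    for encoding in encodings:
        try:
            return raw_bytes.decode(encoding)
        except UnicodeDecodeError:
            continue
    # latin-1 maps every possible byte to a code point, so this cannot fail.
    return raw_bytes.decode("latin-1")


# Made-up sample: valid cp1252, but not valid UTF-8 because of the 0xa0 byte.
sample = b"<p>Week\xa04 exit poll</p>"
text = decode_html(sample)
```

The resulting text could then be passed to Goose as g.extract(raw_html=text), along the lines of the known-issues section linked above. Whether that avoids the error for this particular page is untested here.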

Answered 2013-09-18T07:13:24.313

Although I could not reproduce the error with this URL, I ran into a similar problem with python-goose. Try:

from goose.configuration import Configuration
from goose import Goose


config = Configuration()
config.parser_class = 'soupparser' # this helped me
g = Goose(config)
article = g.extract(url="http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")
Answered 2014-10-13T16:26:53.470

Using unicode for all strings might help: insert from __future__ import unicode_literals as the first line of your Python file and try again...

Answered 2013-09-18T20:34:19.743

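Try adding a little u before the string. I didn't see any strange characters in there, but I usually use Hebrew in my Django code, and the encoding line at the top of the file isn't always enough.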

article = g.extract(url=u"http://www.sportingnews.com/ncaa-football/story/2013-09-17/week-4-exit-poll-johnny-manziel-alabama-oregon-texas-mack-brown-mariota")
Answered 2013-09-18T20:42:17.990