Questions tagged "python-newspaper"

0 votes

1 answer

84 views

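python - Inconsistent results from the Python newspaper library?

I'm using Anaconda3 and have installed newspaper. It looks simple enough, but the results are inconsistent.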

http://newspaper.readthedocs.io/en/latest/

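This simple code sometimes returns all the results and sometimes returns nothing.

Has anyone used this library, or does anyone know a better one for scraping news sites? I'd rather not write my own parser, but if it comes down to that, what should I use?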
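One documented behaviour that could produce this pattern is newspaper's article cache: by default, `newspaper.build` remembers the article URLs it has already seen for a site and leaves them out of later builds. The sketch below assumes the newspaper3k package is installed and turns that cache off with `memoize_articles=False`. The helper name `count_articles` is hypothetical, and whether caching is really the cause in this question is an assumption.

```python
def count_articles(site_url):
    """Build a newspaper source for site_url with caching off; return its article count."""
    # Imported lazily so this file can be loaded without newspaper3k installed.
    import newspaper

    # memoize_articles=False asks newspaper not to skip articles seen in earlier runs.
    source = newspaper.build(site_url, memoize_articles=False)
    return source.size()
```

Calling `count_articles` on the same site twice in a row is a quick way to check whether the counts still swing between runs once the cache is out of the picture.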

python python-newspaper

2017-12-16T02:43:45.820

0 votes

1 answer

394 views

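python - Newspaper3k returns 0 articles from archive.org Wayback Machine pages, while the live page works as expected

When I try to use the Python library newspaper3k on the URL of a page archived at archive.org, it fails to fetch any articles. However, it works fine on the live URL of the same page. See below:

Even the special trick of using id to return the original page does not work:

Any help would be much appreciated, thanks!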
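A possible angle, offered as an assumption rather than a diagnosis: the Wayback Machine rewrites links inside archived pages so that they point back at web.archive.org, which may confuse an extractor that expects links on the original domain. The sketch below shows the two URL shapes involved: a raw-capture URL using the `id_` modifier, and a helper that strips the Wayback prefix from a rewritten link. The function names are hypothetical.

```python
import re

# Matches https://web.archive.org/web/<timestamp>[<modifier>_]/<original URL>
WAYBACK_PREFIX = re.compile(
    r"^https?://web\.archive\.org/web/(\d{1,14})([a-z]{2}_)?/(.+)$"
)

def raw_capture_url(timestamp, original_url):
    """Return the Wayback URL that serves the capture without link rewriting."""
    return f"https://web.archive.org/web/{timestamp}id_/{original_url}"

def strip_wayback_prefix(url):
    """Return the original URL embedded in a Wayback link, or the input unchanged."""
    match = WAYBACK_PREFIX.match(url)
    return match.group(3) if match else url
```

Printing a handful of links from the archived page through `strip_wayback_prefix` would show whether they differ from the live page's links only by the archive prefix.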

python python-newspaper

2017-12-19T12:58:42.933

0 votes

0 answers

430 views

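python - Python, newspaper, unhashable type: 'tzutc' and writing to a dataframe

I have a bunch of URLs whose text I want to download for further analysis. I'm new to Python. I have two problems: (1) I get a very strange TypeError; (2) the results are not written to the dataframe. My code is as follows:

My output includes: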

[] http://100seguro.com.ar/telefonica-pone-en-venta-su-aseguradora-antares-vida/
Traceback (most recent call last):

  File "", line 1, in
    runfile('C:/Users/theiman/Desktop/untitled7.py', wdir='C:/Users/theiman/Desktop')

  File "C:\Users\theiman\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 710, in runfile
    execfile(filename, namespace)

  File "C:\Users\theiman\AppData\Local\Continuum\anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 101, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/theiman/Desktop/untitled7.py", line 57, in
    df.loc[index] = [d, datetime.datetime.now().date(), article.title, article.text, article.keywords, article.url]

  File "C:\Users\theiman\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py", line 179, in __setitem__
    self._setitem_with_indexer(indexer, value)

  File "C:\Users\theiman\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\indexing.py", line 425, in _setitem_with_indexer
    self.obj._data = self.obj.append(value)._data

  File "C:\Users\theiman\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\frame.py", line 4533, in append
    other = other._convert(datetime=True, timedelta=True)

  File "C:\Users\theiman\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\generic.py", line 3472, in _convert
    copy=copy)).__finalize__(self)

  File "C:\Users\theiman\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 3227, in convert
    return self.apply('convert', **kwargs)

  File "C:\Users\theiman\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 3091, in apply
    applied = getattr(b, f)(**kwargs)

  File "C:\Users\theiman\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\internals.py", line 1892, in convert
    values = fn(values.ravel(), **fn_kwargs)

  File "C:\Users\theiman\AppData\Local\Continuum\anaconda3\lib\site-packages\pandas\core\dtypes\cast.py", line 740, in soft_convert_objects
    values = lib.maybe_convert_objects(values, convert_datetime=datetime)

  File "pandas/_libs/src\inference.pyx", line 1204, in pandas._libs.lib.maybe_convert_objects

TypeError: unhashable type: 'tzutc'

Any ideas about what went wrong and how to fix it? Thanks!!
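The traceback ends inside pandas' object conversion while a row is being appended with `df.loc[index] = [...]`, and `tzutc` is the UTC timezone class from dateutil, so a timezone-aware datetime in the row is a plausible trigger. That reading is an assumption, since the asker's code is not shown. A sketch of one workaround: reduce every value to a plain type before it goes into a row, collect the rows in a list, and build the frame once at the end. `to_plain` and `make_row` are hypothetical helpers, and the column names are invented for illustration.

```python
import datetime

def to_plain(value):
    """Convert values that tabular containers handle poorly into plain strings."""
    if isinstance(value, datetime.datetime):
        # Convert aware datetimes to UTC, drop the tzinfo, keep an ISO string.
        if value.tzinfo is not None:
            value = value.astimezone(datetime.timezone.utc).replace(tzinfo=None)
        return value.isoformat()
    if isinstance(value, datetime.date):
        return value.isoformat()
    if isinstance(value, (list, tuple)):
        # For example, a keyword list becomes one comma-separated cell.
        return ", ".join(str(item) for item in value)
    return value

def make_row(published, title, text, keywords, url):
    """Build one flat record (hypothetical column names)."""
    return {
        "published": to_plain(published),
        "title": title,
        "text": text,
        "keywords": to_plain(keywords),
        "url": url,
    }
```

A list of such records can be handed to `pandas.DataFrame(rows)` after the loop, which also avoids growing the frame one `df.loc[index]` assignment at a time.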

python dataframe types python-newspaper

2018-01-12T23:13:52.050

0 votes

1 answer

82 views

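python - Python Flask app returns a different (scraped) string than running Python directly

I've found something strange in a Flask app I'm developing. The Flask API is meant to receive a news article URL, scrape it (using the newspaper library), and predict the category of the scraped text.

When I run the crawler directly in Python (Spyder), it returns the article text as expected.

This works like a charm. But if I run the same piece of code inside the Flask app, it produces some other strings, which belong to the navigation of the crawled URL:

Basically, the first snippet returns the full article text, while the second snippet returns:

Sie befinden sich hier: DevOps > Configuration Management Sie sind noch nicht angelmeldet Register | Newsletter

I hope I've made the problem clear. If not, please let me know.

Any idea what's going on?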
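One way to narrow this down, sketched under the assumption that the two environments might be receiving different HTML (for example because of different request headers or cookies): fingerprint the downloaded page in both places and compare. `html_fingerprint` is a hypothetical helper; it accepts whatever HTML string the scraper downloaded, such as `article.html` after `article.download()`.

```python
import hashlib

def html_fingerprint(html):
    """Return (byte length, SHA-256 hex digest) of an HTML string."""
    data = html.encode("utf-8")
    return len(data), hashlib.sha256(data).hexdigest()
```

If the fingerprints match in Spyder and in Flask, the difference lies in parsing; if they differ, it lies in the download step.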

python flask python-newspaper

2018-01-18T09:32:11.207

0 votes

1 answer

469 views

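python - Which articles does the Python newspaper package return?

My basic question is how the newspaper package in Python decides which URLs/articles it returns. One would think it simply returns every article link contained in the URL you give it, but it doesn't seem to work that way. For example, if you use "http://www.cnn.com" and "https://www.cnn.com/politics", you get exactly the same articles back. I'd expect the latter to return only the articles on the politics page, but that doesn't seem to be the case.

So what is it actually doing? Is it just fetching all the articles from the home page?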
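The library's own selection rules are not described in this question, so the following is only a post-filter and not an explanation of newspaper's behaviour: given a list of article URLs, however they were obtained, keep those that have a section name such as politics as one of their path segments. `filter_by_section` is a hypothetical helper built on the standard library.

```python
from urllib.parse import urlparse

def filter_by_section(urls, section):
    """Keep URLs that contain `section` as a whole path segment."""
    kept = []
    for url in urls:
        # Split the path into non-empty segments, e.g. ['2018', '02', '10', 'politics', ...]
        segments = [part for part in urlparse(url).path.split("/") if part]
        if section in segments:
            kept.append(url)
    return kept
```

Matching on whole segments rather than on a path prefix is a deliberate choice here, since a site may put the date before the section name in its article paths.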

Here is an example I used for testing (I used Python 3.6.2):

python python-3.x python-newspaper

2018-02-10T23:14:03.900

0 votes

0 answers

517 views