I am using the news-please library cloned from https://github.com/fhamborg/news-please. I want to use it to fetch news articles from the Common Crawl news dataset (CC-NEWS), so I followed the instructions there for running the commoncrawl.py example. I used the following command:

python -m newsplease.examples.commoncrawl

When I execute this command, I get the following output and error:
my_local_download_dir_warc=./cc_download_warc/
my_local_download_dir_article=./cc_download_articles/
delete_warc_after_extraction=False
my_number_of_extraction_processes=1
INFO:newsplease.crawler.commoncrawl_crawler:executing: aws s3 ls --recursive s3://commoncrawl/crawl-data/CC-NEWS/ --no-sign-request > .tmpaws.txt && awk '{ print $4 }' .tmpaws.txt && rm .tmpaws.txt
INFO:newsplease.crawler.commoncrawl_crawler:found 2 files at commoncrawl.org
INFO:newsplease.crawler.commoncrawl_crawler:creating extraction process pool with 1 processes
INFO:newsplease.crawler.commoncrawl_extractor:found local file ./cc_download_warc/https%3A%2F%2Fcommoncrawl.s3.amazonaws.com%2F, not downloading again due to configuration
Traceback (most recent call last):
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 236, in _detect_type_load_headers
    rec_headers = self.arc_parser.parse(stream, statusline)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 312, in parse
    raise StatusAndHeadersParserException(msg, parts)
warcio.statusandheaders.StatusAndHeadersParserException: Wrong # of headers, expected arc headers ['uri', 'ip-address', 'archive-date', 'content-type', 'length'], Found ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/usr/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 172, in <module>
    main()
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/examples/commoncrawl.py", line 168, in main
    continue_process=True)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 320, in crawl_from_commoncrawl
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_crawler.py", line 230, in __start_commoncrawl_extractor
    log_pathname_fully_extracted_warcs=__log_pathname_fully_extracted_warcs)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 338, in extract_from_commoncrawl
    self.__run()
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 292, in __run
    self.__process_warc_gz_file(local_path_name)
  File "/home/prateek/.local/lib/python3.6/site-packages/newsplease/crawler/commoncrawl_extractor.py", line 231, in __process_warc_gz_file
    for record in ArchiveIterator(stream):
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 110, in _iterate_records
    self.record = self._next_record(self.next_line)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/archiveiterator.py", line 262, in _next_record
    self.check_digests)
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 88, in parse_record_stream
    known_format))
  File "/home/prateek/.local/lib/python3.6/site-packages/warcio/recordloader.py", line 243, in _detect_type_load_headers
    raise ArchiveLoadFailed(msg + str(se.statusline))
warcio.exceptions.ArchiveLoadFailed: Unknown archive format, first line: ['<?xml', 'version="1.0"', 'encoding="UTF-8"?>']
What is going wrong here, and how can I fix it?

The README at https://github.com/fhamborg/news-please says to adapt the configuration section in newsplease/examples/commoncrawl.py. What exactly does that mean? I copied the configuration from that file and pasted it into config.cfg in the newsplease/config directory. Is that what is intended, or have I made a mistake here?
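For reference, here is a minimal sketch of the settings I believe make up that "configuration section". The variable names are taken from the log output above, not copied from the example file itself, so the actual file may name or group them differently and defines more options than these:

# sketch only: the four settings echoed in my log output above
my_local_download_dir_warc = './cc_download_warc/'         # where downloaded .warc.gz files are kept
my_local_download_dir_article = './cc_download_articles/'  # where extracted articles are written
delete_warc_after_extraction = False                        # keep WARC files after extraction
my_number_of_extraction_processes = 1                       # size of the extraction process pool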
I am using Python 3.6, and it is the only Python installed on my machine.