python - Extracting headers from a WARC.gz file

Question

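I have searched this site a lot but could not really find what I need. I have a web.warc.gz file containing data, and I need to extract the WARC headers. I have installed Tomcat and Wayback (1.6) and tried to export the headers with the ./warc-header script that ships with Wayback, but I keep getting an error message about the format I am using: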

    Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz \r\n\ 
    ~/Desktop/output.csv type \r\n
    USAGE: tgtWarc fieldsSrc id
        tgtWarc is the path to the target WARC.gz
        fieldsSrc is the path to the text of the record
            make sure each line is terminated by \r\n
            and that the file ends with a blank, \r\n terminiated line
        id is the XXX in:
            Content-Description: Made from XXX by org.archive.wayback.util.WARCHeader
            of the header record... header...

or a different kind of error:

    Sergeis-MacBook-Pro:bin sergeipashuev$ ./warc-header ~/Desktop/WEB.WARC.gz 
    ~/Desktop/output.csv Content-Type
    java.io.IOException: End-Of-Stream before \r\n\r\n End-Of-ANVLRecord:
        at org.archive.util.anvl.ANVLRecord.load(ANVLRecord.java:163)
        at org.archive.wayback.util.WARCHeader.writeHeaderRecord(WARCHeader.java:43)
        at org.archive.wayback.util.WARCHeader.main(WARCHeader.java:75)

I'm pretty sure the problem is the format of what I type on the command line, but I still can't get it right. Can anyone help?

score 1 · Accepted Answer

You can get them using the code from the following GitHub project:

https://github.com/Smerity/cc-warc-examples/blob/master/src/org/commoncrawl/examples/S3ReaderTest.java
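Since the question is tagged python, here is a minimal standard-library sketch of the same idea in Python. It is not what the linked Java example or Wayback's warc-header script does. It assumes the file is an ordinary gzip-compressed WARC in which each record starts with a `WARC/x.y` version line, followed by `Name: value` header lines, a blank line, and then `Content-Length` bytes of payload. The names `iter_warc_headers` and `write_headers_csv` are made up for illustration, and folded (continuation) header lines and repeated header names are not handled.

```python
import csv
import gzip


def iter_warc_headers(stream):
    """Yield (version, headers) for each record in an uncompressed WARC byte stream.

    Assumption: `stream` is a binary file-like object positioned at the start
    of a record. `headers` is a plain dict, so a repeated header name keeps
    only its last value.
    """
    while True:
        line = stream.readline()
        if not line:
            return  # end of file
        if not line.strip():
            continue  # blank separator lines between records
        version = line.strip().decode("utf-8", "replace")
        if not version.startswith("WARC/"):
            raise ValueError(f"expected a WARC version line, got {version!r}")

        # Read "Name: value" lines until the blank line that ends the header block.
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"\r\n", b"\n", b""):
                break
            name, _, value = line.decode("utf-8", "replace").partition(":")
            headers[name.strip()] = value.strip()

        # Skip the payload in chunks so large records are not held in memory.
        remaining = int(headers.get("Content-Length", "0"))
        while remaining > 0:
            chunk = stream.read(min(remaining, 65536))
            if not chunk:
                break  # truncated file
            remaining -= len(chunk)

        yield version, headers


def write_headers_csv(warc_gz_path, csv_path, fields):
    """Write one CSV row per WARC record, with one column per requested header name."""
    # gzip.open reads concatenated gzip members (one per record) transparently.
    with gzip.open(warc_gz_path, "rb") as stream, \
            open(csv_path, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(fields)
        for _version, headers in iter_warc_headers(stream):
            writer.writerow([headers.get(field, "") for field in fields])
```

With the paths from the question, a call might look like `write_headers_csv("WEB.WARC.gz", "output.csv", ["WARC-Type", "WARC-Target-URI", "Content-Type"])`. Headers that a record does not carry are written as empty cells.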
