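I am trying to read a .warc.gz file into an RDD using PySpark. I want the record delimiter to be three newlines, so that each WARC record is read as one element of the RDD and I can parse it and use the information. To begin with, I am interested in reading the HTML content of the response records.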

    WARC/1.0
    WARC-Type: request
    WARC-Date: 2014-08-20T06:36:13Z
    WARC-Record-ID: <urn:uuid:0fa7a21c-8de1-44ef-a896-f39aad9fb915>
    Content-Length: 317
    Content-Type: application/http; msgtype=request
    WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
    WARC-IP-Address: 85.214.72.216
    WARC-Target-URI: http://0pointer.de/photos/?gallery=Chorin%202010-10&photo=119&exif_style=&show_thumbs=

    GET /photos/?gallery=Chorin%202010-10&photo=119&exif_style=&show_thumbs= HTTP/1.0
    Host: 0pointer.de
    Accept-Encoding: x-gzip, gzip, deflate
    User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)
    Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8



    WARC/1.0
    WARC-Type: response
    WARC-Date: 2014-08-20T06:36:13Z
    WARC-Record-ID: <urn:uuid:f95806a3-162c-41d5-a7d5-a6af7084409b>
    Content-Length: 7502
    Content-Type: application/http; msgtype=response
    WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
    WARC-Concurrent-To: <urn:uuid:0fa7a21c-8de1-44ef-a896-f39aad9fb915>
    WARC-IP-Address: 85.214.72.216
    WARC-Target-URI: http://0pointer.de/photos/?gallery=Chorin%202010-10&photo=119&exif_style=&show_thumbs=
    WARC-Payload-Digest: sha1:MOKD54JQHY4EWHNOJLT6IXM3ZTACA3CJ
    WARC-Block-Digest: sha1:VEYQQ2LH25SNUWZNVD4KA7EZWRKWK4HG

    HTTP/1.1 200 OK
    Date: Wed, 20 Aug 2014 06:36:13 GMT
    Server: Apache
    X-Powered-By: PHP/5.3.8-1+b1
    Content-Length: 7319
    Connection: close
    Content-Type: text/html; charset=utf-8

    <?xml version="1.0"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
       "http://www.w3.org/TR/2000/REC-xhtml1-20000126/DTD/xhtml1-strict.dtd">
    <html>
    <head>
    <!-- This makes IE6 suck less (a bit) -->
    <!--[if lt IE 7]>
    <script src="inc/styles/ie7/ie7-standard.js" type="text/javascript">
    </script>
    ...
    </html>


    WARC/1.0
    WARC-Type: metadata
    WARC-Date: 2014-08-20T06:36:13Z
    WARC-Record-ID: <urn:uuid:e32aadef-5864-48e5-8829-c1a22223fb86>
    Content-Length: 20
    Content-Type: application/warc-fields
    WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
    WARC-Concurrent-To: <urn:uuid:f95806a3-162c-41d5-a7d5-a6af7084409b>
    WARC-Target-URI: http://0pointer.de/photos/?gallery=Chorin%202010-10&photo=119&exif_style=&show_thumbs=

    fetchTimeMs: 476



    WARC/1.0
    WARC-Type: request
    WARC-Date: 2014-08-20T05:06:10Z
    WARC-Record-ID: <urn:uuid:010961f7-7378-4ab2-b180-f971f10dff7b>
    Content-Length: 316
    Content-Type: application/http; msgtype=request
    WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
    WARC-IP-Address: 85.214.72.216
    WARC-Target-URI: http://0pointer.de/photos/?gallery=Hamburg%20Nature&photo=40&exif_style=&show_thumbs=

    GET /photos/?gallery=Hamburg%20Nature&photo=40&exif_style=&show_thumbs= HTTP/1.0
    Host: 0pointer.de
    Accept-Encoding: x-gzip, gzip, deflate
    User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)
    Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
    Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8



    WARC/1.0
    WARC-Type: response
    WARC-Date: 2014-08-20T05:06:10Z
    WARC-Record-ID: <urn:uuid:7c520a24-46eb-435b-ae08-5d51bbe4ff32>
    Content-Length: 7492
    Content-Type: application/http; msgtype=response
    WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
    WARC-Concurrent-To: <urn:uuid:010961f7-7378-4ab2-b180-f971f10dff7b>
    WARC-IP-Address: 85.214.72.216
    WARC-Target-URI: http://0pointer.de/photos/?gallery=Hamburg%20Nature&photo=40&exif_style=&show_thumbs=
    WARC-Payload-Digest: sha1:Z7OTT2W742LWVRPCNR7DYVSXDT72I3GH
    WARC-Block-Digest: sha1:6CX2E5F3DA6PLY5R5FN7Y4YG73SFMWDI

    HTTP/1.1 200 OK
    Date: Wed, 20 Aug 2014 05:06:10 GMT
    Server: Apache
    X-Powered-By: PHP/5.3.8-1+b1
    Content-Length: 7309
    Connection: close
    Content-Type: text/html; charset=utf-8

    <?xml version="1.0"?>
    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
       "http://www.w3.org/TR/2000/REC-xhtml1-20000126/DTD/xhtml1-strict.dtd">
    <html>
    <head>
    <!-- This makes IE6 suck less (a bit) -->
    <!--[if lt IE 7]>
    <script src="inc/styles/ie7/ie7-standard.js" type="text/javascript">
    </script>
    ...
    </html>


    WARC/1.0
    WARC-Type: metadata
    WARC-Date: 2014-08-20T05:06:10Z
    WARC-Record-ID: <urn:uuid:7c899d9d-1934-4096-9037-f8e8edcbf238>
    Content-Length: 20
    Content-Type: application/warc-fields
    WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
    WARC-Concurrent-To: <urn:uuid:7c520a24-46eb-435b-ae08-5d51bbe4ff32>
    WARC-Target-URI: http://0pointer.de/photos/?gallery=Hamburg%20Nature&photo=40&exif_style=&show_thumbs=

    fetchTimeMs: 502

This is what I tried:

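    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setAppName("wdps phase1").setMaster("local")
    # Intended to split the input on three consecutive newlines
    conf.set("textinputformat.record.delimiter", "\n\n\n")
    sc = SparkContext(conf=conf)

    # `path` points to the .warc.gz file
    data = sc.textFile(path)
    # Keep only the elements that checkResponse accepts
    sample = data.filter(checkResponse)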
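
A possible variant, sketched under assumptions rather than confirmed as the fix: `textinputformat.record.delimiter` is a Hadoop input-format property, so it may need to be passed to the read itself instead of being set on `SparkConf`. The helper name `read_delimited` is hypothetical, and `"\n\n\n"` is simply the delimiter assumed above (real WARC files use CRLF line endings, so the actual byte sequence may differ).

```python
# Hadoop class names for the new-API text input format and its key/value types.
INPUT_FORMAT = "org.apache.hadoop.mapreduce.lib.input.TextInputFormat"
KEY_CLASS = "org.apache.hadoop.io.LongWritable"
VALUE_CLASS = "org.apache.hadoop.io.Text"


def read_delimited(sc, path, delimiter="\n\n\n"):
    """Return an RDD of strings split on `delimiter` (hypothetical helper).

    `sc` is an existing SparkContext; the delimiter default is an assumption.
    """
    # The delimiter is handed to Hadoop through the per-read configuration.
    hadoop_conf = {"textinputformat.record.delimiter": delimiter}
    pairs = sc.newAPIHadoopFile(
        path, INPUT_FORMAT, KEY_CLASS, VALUE_CLASS, conf=hadoop_conf
    )
    # Each element is a (byte offset, record text) pair; keep only the text.
    return pairs.map(lambda kv: kv[1])
```

With this sketch, `data = read_delimited(sc, path)` would stand in for `sc.textFile(path)`.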

checkResponse is a function that parses each RDD element as a WARC record, using a Python library, and extracts some information from it.

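    import warc
    from io import StringIO  # on Python 2: from StringIO import StringIO

    def checkResponse(text):
        """Return True if the element parses as a WARC response record."""
        try:
            record = warc.WARCFile(fileobj=StringIO(text))
            return record['WARC-Type'] == 'response'
        except Exception:
            # Anything that fails to parse is treated as "not a response"
            return False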
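
For comparison, a library-free sketch of the same check. It assumes each RDD element is the text of exactly one record, beginning with the `WARC/1.0` header block as in the sample above; `warc_headers` and `is_response` are hypothetical names, not part of any library.

```python
def warc_headers(record_text):
    """Parse the WARC header block (up to the first blank line) into a dict."""
    headers = {}
    for line in record_text.lstrip().splitlines():
        if not line.strip():
            break  # a blank line ends the WARC header block
        name, sep, value = line.partition(":")
        if sep:  # the "WARC/1.0" version line has no colon and is skipped
            headers[name.strip()] = value.strip()
    return headers


def is_response(record_text):
    """True when the record's WARC-Type header is 'response'."""
    return warc_headers(record_text).get("WARC-Type") == "response"
```

It could be used as `sample = data.filter(is_response)`, which makes it easier to tell a parsing problem apart from a record-splitting problem.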