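I am trying to read .warc.gz files into an RDD using PySpark. I want the record delimiter to be three newlines, so that each WARC record is read as one element of the RDD and I can then parse the records and use the information in them. To start with, I am interested in reading the HTML content of the response records. A sample of the data: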
WARC/1.0
WARC-Type: request
WARC-Date: 2014-08-20T06:36:13Z
WARC-Record-ID: <urn:uuid:0fa7a21c-8de1-44ef-a896-f39aad9fb915>
Content-Length: 317
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
WARC-IP-Address: 85.214.72.216
WARC-Target-URI: http://0pointer.de/photos/?gallery=Chorin%202010-10&photo=119&exif_style=&show_thumbs=
GET /photos/?gallery=Chorin%202010-10&photo=119&exif_style=&show_thumbs= HTTP/1.0
Host: 0pointer.de
Accept-Encoding: x-gzip, gzip, deflate
User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)
Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-20T06:36:13Z
WARC-Record-ID: <urn:uuid:f95806a3-162c-41d5-a7d5-a6af7084409b>
Content-Length: 7502
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
WARC-Concurrent-To: <urn:uuid:0fa7a21c-8de1-44ef-a896-f39aad9fb915>
WARC-IP-Address: 85.214.72.216
WARC-Target-URI: http://0pointer.de/photos/?gallery=Chorin%202010-10&photo=119&exif_style=&show_thumbs=
WARC-Payload-Digest: sha1:MOKD54JQHY4EWHNOJLT6IXM3ZTACA3CJ
WARC-Block-Digest: sha1:VEYQQ2LH25SNUWZNVD4KA7EZWRKWK4HG
HTTP/1.1 200 OK
Date: Wed, 20 Aug 2014 06:36:13 GMT
Server: Apache
X-Powered-By: PHP/5.3.8-1+b1
Content-Length: 7319
Connection: close
Content-Type: text/html; charset=utf-8
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/2000/REC-xhtml1-20000126/DTD/xhtml1-strict.dtd">
<html>
<head>
<!-- This makes IE6 suck less (a bit) -->
<!--[if lt IE 7]>
<script src="inc/styles/ie7/ie7-standard.js" type="text/javascript">
</script>
...
</html>
WARC/1.0
WARC-Type: metadata
WARC-Date: 2014-08-20T06:36:13Z
WARC-Record-ID: <urn:uuid:e32aadef-5864-48e5-8829-c1a22223fb86>
Content-Length: 20
Content-Type: application/warc-fields
WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
WARC-Concurrent-To: <urn:uuid:f95806a3-162c-41d5-a7d5-a6af7084409b>
WARC-Target-URI: http://0pointer.de/photos/?gallery=Chorin%202010-10&photo=119&exif_style=&show_thumbs=
fetchTimeMs: 476
WARC/1.0
WARC-Type: request
WARC-Date: 2014-08-20T05:06:10Z
WARC-Record-ID: <urn:uuid:010961f7-7378-4ab2-b180-f971f10dff7b>
Content-Length: 316
Content-Type: application/http; msgtype=request
WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
WARC-IP-Address: 85.214.72.216
WARC-Target-URI: http://0pointer.de/photos/?gallery=Hamburg%20Nature&photo=40&exif_style=&show_thumbs=
GET /photos/?gallery=Hamburg%20Nature&photo=40&exif_style=&show_thumbs= HTTP/1.0
Host: 0pointer.de
Accept-Encoding: x-gzip, gzip, deflate
User-Agent: CCBot/2.0 (http://commoncrawl.org/faq/)
Accept-Language: en-us,en-gb,en;q=0.7,*;q=0.3
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
WARC/1.0
WARC-Type: response
WARC-Date: 2014-08-20T05:06:10Z
WARC-Record-ID: <urn:uuid:7c520a24-46eb-435b-ae08-5d51bbe4ff32>
Content-Length: 7492
Content-Type: application/http; msgtype=response
WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
WARC-Concurrent-To: <urn:uuid:010961f7-7378-4ab2-b180-f971f10dff7b>
WARC-IP-Address: 85.214.72.216
WARC-Target-URI: http://0pointer.de/photos/?gallery=Hamburg%20Nature&photo=40&exif_style=&show_thumbs=
WARC-Payload-Digest: sha1:Z7OTT2W742LWVRPCNR7DYVSXDT72I3GH
WARC-Block-Digest: sha1:6CX2E5F3DA6PLY5R5FN7Y4YG73SFMWDI
HTTP/1.1 200 OK
Date: Wed, 20 Aug 2014 05:06:10 GMT
Server: Apache
X-Powered-By: PHP/5.3.8-1+b1
Content-Length: 7309
Connection: close
Content-Type: text/html; charset=utf-8
<?xml version="1.0"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/2000/REC-xhtml1-20000126/DTD/xhtml1-strict.dtd">
<html>
<head>
<!-- This makes IE6 suck less (a bit) -->
<!--[if lt IE 7]>
<script src="inc/styles/ie7/ie7-standard.js" type="text/javascript">
</script>
...
</html>
WARC/1.0
WARC-Type: metadata
WARC-Date: 2014-08-20T05:06:10Z
WARC-Record-ID: <urn:uuid:7c899d9d-1934-4096-9037-f8e8edcbf238>
Content-Length: 20
Content-Type: application/warc-fields
WARC-Warcinfo-ID: <urn:uuid:ac993447-4652-47a0-be86-c14c7dc60e5e>
WARC-Concurrent-To: <urn:uuid:7c520a24-46eb-435b-ae08-5d51bbe4ff32>
WARC-Target-URI: http://0pointer.de/photos/?gallery=Hamburg%20Nature&photo=40&exif_style=&show_thumbs=
fetchTimeMs: 502
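To make the goal concrete: in the sample above every record begins with a `WARC/1.0` version line, so "one record per element" can be illustrated in plain Python, outside Spark. The following is only a sketch under that assumption. The function name `split_warc_records` is made up for illustration, and a payload that happens to contain a line consisting solely of `WARC/1.0` would be split incorrectly (a robust reader would rely on `Content-Length` instead).

```python
import re


def split_warc_records(text):
    """Split decompressed WARC text into one string per record.

    Assumption: a record starts wherever a line consists only of a
    version marker such as "WARC/1.0".
    """
    # Offsets of every version line; "\r?" tolerates CRLF line endings.
    starts = [
        m.start()
        for m in re.finditer(r"^WARC/\d+\.\d+\r?$", text, flags=re.MULTILINE)
    ]
    # Pair each start with the next start (or the end of the text).
    bounds = starts + [len(text)]
    return [text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
```

Each returned string then holds the WARC headers, a blank line, and the content block of exactly one record.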
This is what I tried:
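from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("wdps phase1").setMaster("local")
# Intended to make each WARC record one element of the RDD
conf.set("textinputformat.record.delimiter", "\n\n\n")
sc = SparkContext(conf=conf)
data = sc.textFile(path)  # path points to the .warc.gz file
sample = data.filter(checkResponse)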
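One thing worth checking in this attempt: as far as I understand, `textinputformat.record.delimiter` is a Hadoop input-format setting, and keys set on `SparkConf` are only copied into the Hadoop configuration when they carry the `spark.hadoop.` prefix, so the delimiter may never reach `sc.textFile`. A possible alternative, sketched here as an assumption rather than a confirmed fix, is to pass the setting directly to `newAPIHadoopFile`. The helper name `read_records` is hypothetical, and the default delimiter is simply the one from the attempt above.

```python
def read_records(sc, path, delimiter="\n\n\n"):
    """Return an RDD with one string per delimiter-separated chunk of `path`.

    `sc` is an existing SparkContext. The delimiter is handed to Hadoop's
    TextInputFormat through the per-call `conf` dictionary.
    """
    pairs = sc.newAPIHadoopFile(
        path,
        "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
        "org.apache.hadoop.io.LongWritable",
        "org.apache.hadoop.io.Text",
        conf={"textinputformat.record.delimiter": delimiter},
    )
    # Elements are (byte offset, text) pairs; keep only the text.
    return pairs.map(lambda pair: pair[1])
```

With this helper, the filtering step would look like `read_records(sc, path).filter(checkResponse)`. Passing `delimiter="WARC/1.0"` is another option to try, since every record in the sample starts with that line.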
checkResponse is a function that parses each RDD element as a WARC record, using a Python library, and extracts some information from it:
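from io import StringIO  # assumes Python 3; the original imports were not shown

import warc


def checkResponse(raw_record):
    """Return True if the text parses as a WARC record of type 'response'."""
    try:
        # WARCFile is a reader object; read_record() returns its first record
        record = warc.WARCFile(fileobj=StringIO(raw_record)).read_record()
        return record['WARC-Type'] == 'response'
    except Exception:
        # Anything that does not parse as a WARC record is filtered out
        return False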
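Since the end goal is the HTML of the response records, here is a minimal sketch of that step that does not depend on the `warc` library. It assumes each RDD element is the complete text of one record laid out as in the sample (WARC headers, blank line, HTTP headers, blank line, body). The name `extract_html` is hypothetical, and the header match is deliberately simplistic (exact, case-sensitive).

```python
def extract_html(record_text):
    """Return the HTTP body of a WARC response record, or None otherwise."""
    # Normalise CRLF line endings and drop leading blank lines.
    text = record_text.replace("\r\n", "\n").lstrip("\n")
    # Expected layout: WARC headers, blank line, HTTP headers, blank line, body.
    parts = text.split("\n\n", 2)
    if len(parts) < 3:
        return None
    warc_headers, http_headers, body = parts
    # Only response records carry the fetched page.
    if "WARC-Type: response" not in warc_headers.splitlines():
        return None
    return body.strip()
```

On an RDD of record strings this could be applied with something like `records.map(extract_html).filter(lambda html: html is not None)`.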