I am trying to parse the text of ebooks from gutenberg.org to extract information about each book, such as its title.

Every book there contains a line like this:

*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES *** 

I would like to use something like this:

book_name=()
index = 0
for line in finalLines:
    index+=1
    if  "*** START OF THIS PROJECT GUTENBERG EBOOK "%%%"***" in line:
        print(index, line)
        book_name=%%%

But I am obviously not doing it right. Can someone tell me how this is done?

3 Answers

3

Regular expressions are the way to go:

import re

title_regex = re.compile(r'\*{3} START OF THIS PROJECT GUTENBERG EBOOK (.*?) \*{3}')

for index, line in enumerate(finalLines):
    match = title_regex.match(line)

    if match:
        book_name = match.group(1)
        print(index, book_name)
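
As a quick offline check, the pattern above can be run against the sample line from the question. This is only a minimal sketch: `sample_lines` and `find_title` are hypothetical names introduced here for illustration, and the lines are assumed to be `str`, not `bytes`.

```python
import re

title_regex = re.compile(r'\*{3} START OF THIS PROJECT GUTENBERG EBOOK (.*?) \*{3}')

# Hypothetical stand-in for finalLines so the example runs without a download.
sample_lines = [
    'Title: The Adventures of Sherlock Holmes',
    '',
    '*** START OF THIS PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK HOLMES ***',
    'ADVENTURE I. A SCANDAL IN BOHEMIA',
]


def find_title(lines):
    """Return (index, title) for the first marker line, or None if absent."""
    for index, line in enumerate(lines):
        # match() anchors at the start of the line, where the marker begins.
        match = title_regex.match(line)
        if match:
            return index, match.group(1)
    return None


result = find_title(sample_lines)
print(result)
```

Returning `None` when no marker is found lets the caller tell "no title line" apart from an empty title.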

You can also parse the header line by line:

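import urllib.request

url = 'http://www.gutenberg.org/cache/epub/1342/pg1342.txt'
# urlopen() yields bytes, so decode before using str methods.
# UTF-8 is assumed here; utf-8-sig also drops a leading BOM if present.
with urllib.request.urlopen(url) as book:
    lines = book.read().decode('utf-8-sig').splitlines()

reached_start = False
metadata = {}

for index, line in enumerate(lines):
    if line.startswith('***'):
        if not reached_start:
            # First marker: the header (and its metadata) ends here.
            reached_start = True
        else:
            # Second marker: stop reading.
            break

    # Header lines look like "Key: value"; collect them until the first marker.
    if not reached_start and ':' in line:
        key, _, value = line.partition(':')
        metadata[key.strip().lower()] = value.strip()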
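
The header-parsing idea can be tried without network access on a few inline lines. The sketch below is a simplified variant that stops at the first `***` line; `parse_header` and `sample_header` are hypothetical names, and the header lines are made-up sample data rather than the exact contents of a Gutenberg file.

```python
def parse_header(lines):
    """Collect 'Key: value' pairs that appear before the first '***' line."""
    metadata = {}
    for line in lines:
        if line.startswith('***'):
            break  # the marker line ends the header
        if ':' in line:
            # partition() splits on the first colon only, so values may contain colons.
            key, _, value = line.partition(':')
            metadata[key.strip().lower()] = value.strip()
    return metadata


# Illustrative sample data (assumed layout, not copied from a real file).
sample_header = [
    'Title: Pride and Prejudice',
    'Author: Jane Austen',
    'Language: English',
    '',
    '*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***',
    'Note: this line is ignored because it comes after the marker',
]

metadata = parse_header(sample_header)
print(metadata['title'])
```

Stripping both key and value avoids entries such as `' Pride and Prejudice'` with a leading space.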
answered 2013-05-12T02:22:22.787
2

The simplest solution:

sp = line.split()
if sp[:7]+sp[-1:] == '*** START OF THIS PROJECT GUTENBERG EBOOK ***'.split():
    bookname = ' '.join(sp[7:-1])

As already suggested, a better solution would use regular expressions.

If you are working with bytes, either compare against b'*** START OF THIS PROJECT GUTENBERG EBOOK ***', or convert each byte string with bytes.decode(s).
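
To illustrate why the distinction matters: in Python 3 a list of `bytes` words never compares equal to a list of `str` words, so the test fails silently unless one side is converted. The sketch below uses a hypothetical `raw_line` as it might come back from `urlopen()`, and assumes UTF-8 encoding.

```python
# Hypothetical raw line, as bytes, including the trailing line ending.
raw_line = b'*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\r\n'
pattern = '*** START OF THIS PROJECT GUTENBERG EBOOK ***'.split()

# Without decoding: bytes words are compared with str words, which is never equal.
bytes_words = raw_line.split()
matches_without_decode = bytes_words[:7] + bytes_words[-1:] == pattern

# After decoding (UTF-8 assumed): both sides are lists of str.
text_words = raw_line.decode('utf-8').split()
matches_after_decode = text_words[:7] + text_words[-1:] == pattern

bookname = ' '.join(text_words[7:-1]) if matches_after_decode else None
print(matches_without_decode, matches_after_decode, bookname)
```

Decoding the whole line once is usually simpler than decoding every word separately.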

Your snippet (including the urlopen() part) might look like this:

import urllib.request
url = 'http://gutenberg.org/cache/epub/1342/pg1342.txt'
with urllib.request.urlopen(url) as book:
    finalLines = book.readlines()

booktitle_pattern = '*** START OF THIS PROJECT GUTENBERG EBOOK ***'.split()
bookname = None
for index, line in enumerate(finalLines):
    sp = [bytes.decode(word) for word in line.split()]
    if sp[:7]+sp[-1:] == booktitle_pattern :
        bookname = ' '.join(sp[7:-1])
answered 2013-05-12T02:03:25.650
0
import re
import urllib.request

url = 'http://www.gutenberg.org/cache/epub/1342/pg1342.txt'
with urllib.request.urlopen(url) as book:
    lines = book.readlines()  # each line is a bytes object

# Raw bytes pattern: rb'' keeps \* from being read as a string escape.
title_regex = re.compile(rb'\*{3} START OF THIS PROJECT GUTENBERG EBOOK (.*?) \*{3}')

for index, line in enumerate(lines):
    match = title_regex.match(line)

    if match:
        # group(1) is bytes; decode it to get a str (UTF-8 assumed).
        book_name = match.group(1).decode('utf-8')
        print(book_name)
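
The bytes-based regex can also be checked offline. In this minimal sketch, `raw_lines` and `find_title_in_bytes` are hypothetical names, the sample lines are made up, and UTF-8 is an assumed default encoding.

```python
import re

# A bytes pattern only matches bytes input, so it works on readlines() output directly.
title_regex = re.compile(rb'\*{3} START OF THIS PROJECT GUTENBERG EBOOK (.*?) \*{3}')

# Hypothetical stand-in for the lines returned by urlopen(url).readlines().
raw_lines = [
    b'Title: Pride and Prejudice\r\n',
    b'*** START OF THIS PROJECT GUTENBERG EBOOK PRIDE AND PREJUDICE ***\r\n',
]


def find_title_in_bytes(lines, encoding='utf-8'):
    """Return the title as str, or None if no marker line is found."""
    for line in lines:
        match = title_regex.match(line)
        if match:
            # The captured group is bytes; decode it for display or further use.
            return match.group(1).decode(encoding)
    return None


book_name = find_title_in_bytes(raw_lines)
print(book_name)
```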
answered 2013-05-12T02:47:55.057