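You won't find the full text in the Project Gutenberg RDF catalog. It does, however, contain URLs for the text in several formats. Once you have downloaded the catalog zip file and unzipped it, here is how to get the HTML book URL from a particular RDF file.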
from lxml import etree
filename = 'cache/epub/78/pg78.rdf'
# Parse straight from the file so lxml handles the XML encoding declaration itself.
tree = etree.parse(filename)
resource_tag = '{http://www.w3.org/1999/02/22-rdf-syntax-ns#}resource'
hasFormat_tag = './/{http://purl.org/dc/terms/}hasFormat'
# Collect the rdf:resource attribute of every dcterms:hasFormat element.
resources = [el.attrib[resource_tag] for el in tree.findall(hasFormat_tag)]
urls = [url for url in resources if url.endswith('htm')]
# urls[0] is 'http://www.gutenberg.org/files/78/78-h/78-h.htm'
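RDF files in the catalog may not all store the format URL in the same place. The following is a minimal sketch, not a definitive implementation, that accepts either an `rdf:resource` attribute on `dcterms:hasFormat` (the layout the code above expects) or an `rdf:about` attribute on a nested child element (an assumed alternative layout). It uses the standard library's `xml.etree.ElementTree` instead of lxml so that it runs without extra dependencies. The helper name `format_urls` and the inline `SAMPLE` document are hypothetical and exist only for illustration.

```python
import xml.etree.ElementTree as ET

RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'
DCTERMS_NS = 'http://purl.org/dc/terms/'

def format_urls(rdf_bytes):
    """Return the URL of every dcterms:hasFormat entry in an RDF document."""
    root = ET.fromstring(rdf_bytes)
    urls = []
    for el in root.iter('{%s}hasFormat' % DCTERMS_NS):
        # Layout 1: <dcterms:hasFormat rdf:resource="..."/>
        url = el.get('{%s}resource' % RDF_NS)
        if url is None:
            # Layout 2 (assumed): the URL sits on a child element's rdf:about.
            for child in el:
                url = child.get('{%s}about' % RDF_NS)
                if url is not None:
                    break
        if url is not None:
            urls.append(url)
    return urls

# Hypothetical, hand-written sample; real catalog files contain far more metadata.
SAMPLE = b'''<?xml version="1.0" encoding="utf-8"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:pgterms="http://www.gutenberg.org/2009/pgterms/">
  <pgterms:ebook rdf:about="ebooks/78">
    <dcterms:hasFormat rdf:resource="http://www.gutenberg.org/files/78/78-h/78-h.htm"/>
    <dcterms:hasFormat>
      <pgterms:file rdf:about="http://www.gutenberg.org/files/78/78.txt"/>
    </dcterms:hasFormat>
  </pgterms:ebook>
</rdf:RDF>'''

all_urls = format_urls(SAMPLE)
html_urls = [u for u in all_urls if u.endswith(('.htm', '.html'))]
```

To run it against a real catalog entry, read the file in binary mode and pass the bytes in, for example `format_urls(open('cache/epub/78/pg78.rdf', 'rb').read())`.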
Once you have the URL for the HTML version of the book you want, here is how to get the text.
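import requests
from lxml import etree
response = requests.get(urls[0])
html = etree.HTML(response.text)
# el.text is None when a paragraph starts with a child tag, so join all nested text instead.
text = '\n'.join([''.join(el.itertext()) for el in html.findall('.//p')])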
text now contains the full text of Tarzan, minus the Project Gutenberg metadata, table of contents, and chapter headings.
>>> text[:100]
u'\r\nI had this story from one who had no business to tell it to me, or to\r\nany other. I may credit th'
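The output above still carries the source file's hard line breaks (`\r\n`) inside each paragraph. If you want one clean string per paragraph, something like the following sketch may help. It gathers the text of nested tags such as `<i>`, collapses runs of whitespace, and drops paragraphs that contain no text at all. The helper name `paragraph_text` and the `SNIPPET` markup are hypothetical, and the standard library's `xml.etree.ElementTree` stands in for lxml so the example runs on a small well-formed fragment without a network request.

```python
import re
import xml.etree.ElementTree as ET

def paragraph_text(el):
    """Return all text inside el, nested tags included, with whitespace collapsed."""
    return re.sub(r'\s+', ' ', ''.join(el.itertext())).strip()

# Hypothetical fragment: one paragraph with nested markup, one with only an image.
SNIPPET = (
    '<body>'
    '<p>\r\nI had this story from <i>one</i> who had\r\nno business to tell it.</p>'
    '<p><img src="x.png"/></p>'
    '</body>'
)

body = ET.fromstring(SNIPPET)
paragraphs = [paragraph_text(p) for p in body.iter('p')]
paragraphs = [p for p in paragraphs if p]  # drop paragraphs with no text
```

The image-only paragraph is the case where `el.text` is `None`, which is why joining `el.text` values directly can fail on some books.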
Note that Project Gutenberg books are not formatted consistently with one another, so your results may vary.