python - Python library to extract "epub" information

Question

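I'm trying to create an epub uploader for iBook in Python. I need a Python library to extract book information. Before implementing this myself, I wanted to ask whether anyone knows of an existing Python library that does this.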

score 46 · Accepted Answer

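An .epub file is a zip-encoded file containing a META-INF directory. That directory holds a file named container.xml, which points to another file, usually named Content.opf, which in turn indexes all the other files that make up the e-book (summary based on http://www.jedisaber.com/eBooks/tutorial.asp; the full spec is at http://www.idpf.org/2007/opf/opf2.0/download/).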
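The first hop of that chain, from container.xml to the contents file, can be sketched with only the standard library. This is a minimal illustration, not part of the original answer: the function name `find_opf_path` is hypothetical, and the code assumes the archive has a well-formed META-INF/container.xml.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by META-INF/container.xml
CONTAINER_NS = {'n': 'urn:oasis:names:tc:opendocument:xmlns:container'}

def find_opf_path(epub):
    """Return the archive path of the contents file named in container.xml.

    `epub` may be a filename or a file-like object (hypothetical helper).
    """
    with zipfile.ZipFile(epub) as zf:
        root = ET.fromstring(zf.read('META-INF/container.xml'))
    # <container><rootfiles><rootfile full-path="..."/></rootfiles></container>
    rootfile = root.find('n:rootfiles/n:rootfile', CONTAINER_NS)
    if rootfile is None:
        raise ValueError('container.xml has no <rootfile> element')
    return rootfile.get('full-path')
```

Calling `find_opf_path('book.epub')` on a typical book would return something like the path of its .opf file inside the archive; the exact name varies from book to book.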

The following Python code extracts the basic meta-information from an .epub file and returns it as a dict.

import zipfile
from lxml import etree

def get_epub_info(fname):
    ns = {
        'n':'urn:oasis:names:tc:opendocument:xmlns:container',
        'pkg':'http://www.idpf.org/2007/opf',
        'dc':'http://purl.org/dc/elements/1.1/'
    }

    # open the .epub archive; the context manager closes it when done
    with zipfile.ZipFile(fname) as zf:
        # container.xml names the contents metafile
        txt = zf.read('META-INF/container.xml')
        tree = etree.fromstring(txt)
        cfname = tree.xpath('n:rootfiles/n:rootfile/@full-path',namespaces=ns)[0]

        # read the contents metafile itself
        cf = zf.read(cfname)

    # grab the metadata block from the contents metafile
    tree = etree.fromstring(cf)
    p = tree.xpath('/pkg:package/pkg:metadata',namespaces=ns)[0]

    # repackage the data; a field missing from the book maps to None
    res = {}
    for s in ['title','language','creator','date','identifier']:
        found = p.xpath('dc:%s/text()'%(s),namespaces=ns)
        res[s] = found[0] if found else None

    return res

Sample output:

{
    'date': '2009-12-26T17:03:31',
    'identifier': '25f96ff0-7004-4bb0-b1f2-d511ca4b2756',
    'creator': 'John Grisham',
    'language': 'UND',
    'title': 'Ford County'
}
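A book may list several creators or identifiers, or leave a field out altogether. As a hedged variation (not the answer's own code), the sketch below collects every value of each field into a list. It swaps lxml for the standard library's `xml.etree.ElementTree`, and the name `parse_opf_metadata` is hypothetical; it expects the raw bytes of the contents file already read from the archive.

```python
import xml.etree.ElementTree as ET

# Dublin Core namespace, the same one the 'dc' prefix maps to above
DC_NS = 'http://purl.org/dc/elements/1.1/'
FIELDS = ('title', 'language', 'creator', 'date', 'identifier')

def parse_opf_metadata(opf_xml):
    """Map each Dublin Core field to a list of its values (empty if absent)."""
    root = ET.fromstring(opf_xml)
    info = {}
    for field in FIELDS:
        # '{namespace}tag' is ElementTree's notation for a qualified name;
        # iter() visits matching elements anywhere in the document.
        elements = root.iter('{%s}%s' % (DC_NS, field))
        info[field] = [el.text.strip() for el in elements
                       if el.text and el.text.strip()]
    return info
```

Returning lists keeps the result shape the same for every field, so the caller decides whether to take the first value or join them.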

score 3 · Accepted Answer

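Something like epub-tools, for example? But that is mostly about writing the epub format (from various possible sources), as is epubtools (similar spelling, different project). For reading, I would try the companion project threepress, a Django app for displaying epub books in a browser. I haven't looked at that code yet, but I imagine that in order to display a book it must first be able to read it ;-).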

score 1 · Accepted Answer

Take a look at the epub module. It looks like a simple option.

Answered 2012-06-05T12:09:23.640

score 0 · Accepted Answer

I came here looking for something similar and, inspired by Mr. Bothwell's code snippet, started my own project. If anyone is interested... http://epubzilla.odeegan.com/
