python - HTML pages and Python: extracting the body text and splitting it

Source: https://stackoverflow.com/questions/24536796 (2014-07-02T16:52:51.840)

Viewed 358 times

The big picture

I want to improve a Python application that reads EPUB files by adding an option to "remember" where the reader last stopped. The application is hosted on GitHub.

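At the moment I can save the last sentence the user stopped at. I want to search the text for those words and display the text to the reader starting from that point. However, I don't know how to split the text extracted from the body of the HTML file and then pass it to the formatter.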
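The search step described above could be sketched roughly as follows. This is only an illustration, not code from the application: the function name `resume_from` and its parameters are hypothetical. It assumes the text has already been wrapped to a fixed column width, so the saved words may be separated by line breaks rather than single spaces.

```python
import re


def resume_from(text, saved_words):
    """Return text starting at the first occurrence of saved_words.

    Hypothetical helper: falls back to the full text when the saved
    words are empty or cannot be found.
    """
    words = saved_words.split()
    if not words:
        return text
    # Allow any run of whitespace (including a line wrap) between words.
    pattern = r"\s+".join(re.escape(word) for word in words)
    match = re.search(pattern, text)
    if match is None:
        return text
    return text[match.start():]
```

Matching on whitespace runs instead of an exact substring keeps the lookup working even if the wrap width changes between sessions.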

Here is the excerpt where all of this happens:

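# Python 2 only: htmllib and StringIO do not exist in Python 3.
import formatter
import htmllib
import StringIO

# The excerpt starts inside a function; the name and signature here are assumed.
def textify(html_snippet, maxcol=72):
    ''' text dump of html '''
    class Parser(htmllib.HTMLParser):
        def anchor_end(self):
            self.anchor = None
        def handle_image(self, source, alt, ismap, align, width, height):
            # basedir is a module-level global defined elsewhere in the application
            global basedir
            # Emit a text placeholder in place of the image
            self.handle_data(
                '[img="{0}{1}" "{2}"]'.format(basedir, source, alt)
            )

    class Formatter(formatter.AbstractFormatter):
        pass

    class Writer(formatter.DumbWriter):
        def __init__(self, fl, maxcol=72):
            formatter.DumbWriter.__init__(self, fl)
            self.maxcol = maxcol
        def send_label_data(self, data):
            self.send_flowing_data(data)
            self.send_flowing_data(' ')

    # Parse the HTML and write the wrapped plain text into an in-memory buffer
    o = StringIO.StringIO()
    p = Parser(Formatter(Writer(o, maxcol)))
    p.feed(html_snippet)
    p.close()

    return o.getvalue()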

I think this is the line where I have to intervene:

p.feed(html_snippet)

So, could anyone advise me on what I can do from here?

Regards
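For readers on Python 3, where `htmllib` is no longer available, the split-then-format idea from the question might look something like the sketch below. It is not the application's code. It assumes `html.parser` for extraction and `textwrap` in place of the formatter and writer classes, and the names `BodyText`, `body_text` and `text_from` are hypothetical.

```python
import textwrap
from html.parser import HTMLParser


class BodyText(HTMLParser):
    """Collect text found inside <body>, skipping <script> and <style>."""

    SKIP = {"script", "style"}
    BLOCK = {"p", "div", "li", "h1", "h2", "h3", "h4", "h5", "h6"}

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False
        elif tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1
        elif tag in self.BLOCK and self.in_body:
            # Keep adjacent block elements from running together.
            self.chunks.append(" ")

    def handle_data(self, data):
        if self.in_body and not self.skip_depth:
            self.chunks.append(data)


def body_text(html_snippet):
    """Return the body text with all whitespace runs collapsed to one space."""
    parser = BodyText()
    parser.feed(html_snippet)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def text_from(html_snippet, saved_words, maxcol=72):
    """Wrap the body text to maxcol, starting at saved_words when found."""
    text = body_text(html_snippet)
    needle = " ".join(saved_words.split())
    index = text.find(needle) if needle else -1
    remainder = text[index:] if index != -1 else text
    return textwrap.fill(remainder, width=maxcol)
```

Here the split happens on the extracted plain text before wrapping, so the call to `feed` stays untouched and only the text handed to the wrapping step is shortened.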
