python - Best way to extract text from a Word document without using COM/automation?

Question

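Is there a reasonable way to extract plain text from a Word file that doesn't depend on COM automation? (This is a feature for a web app deployed on a non-Windows platform; that is non-negotiable in this case.)

Antiword seems like it might be a reasonable option, but it looks like it may have been abandoned.

A Python solution would be ideal, but doesn't appear to be available.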

score 21 · Accepted Answer

Use the native Python docx module, which I made this week. Here's how to extract all the text from a document:

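# opendocx, getdocumenttext and wordnamespaces are the names used by the
# docx module at the time of this answer
document = opendocx('Hello world.docx')

# This location is where most document content lives
docbody = document.xpath('/w:document/w:body', namespaces=wordnamespaces)[0]

# Extract all text
print(getdocumenttext(document))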

See the Python DocX site.

100% Python, no COM, no .NET, no Java, no parsing serialized XML with regexes.

score 16 · Accepted Answer

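I use catdoc or antiword for this, whichever gives the result that is easiest to parse. I have embedded this in Python functions, so it is easy to use from the parsing system (which is written in Python).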

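import subprocess

def doc_to_text_catdoc(filename):
    # os.popen3 no longer exists in Python 3; subprocess replaces it.
    # Passing a list avoids shell quoting problems with the filename.
    proc = subprocess.Popen(['catdoc', '-w', filename],
                            stdin=subprocess.DEVNULL,
                            stdout=subprocess.PIPE,
                            stderr=subprocess.PIPE)
    # communicate() reads both pipes to the end and waits for catdoc to exit
    retval, erroroutput = proc.communicate()
    if not erroroutput:
        return retval  # raw bytes as emitted by catdoc
    else:
        raise OSError("Executing the command caused an error: %s" % erroroutput)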

# similar doc_to_text_antiword()

The -w switch to catdoc turns off line wrapping, by the way.
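
The `# similar doc_to_text_antiword()` placeholder could be filled in roughly as follows. This is only a sketch, not the answerer's actual function: `run_text_extractor` is a hypothetical helper introduced here, and it assumes that antiword is on the PATH and that `-w 0` (a line width of zero, which antiword documents as putting a whole paragraph on one line) is the wanted wrapping behaviour. Unlike the catdoc function, it treats a non-zero exit code, rather than any stderr output, as the failure signal.

```python
import subprocess


def run_text_extractor(argv):
    """Run an external converter (hypothetical helper) and return its stdout as bytes."""
    proc = subprocess.run(argv,
                          stdin=subprocess.DEVNULL,
                          stdout=subprocess.PIPE,
                          stderr=subprocess.PIPE)
    if proc.returncode != 0:
        # Surface the tool's own error message to the caller
        message = proc.stderr.decode('utf-8', errors='replace')
        raise OSError("Command %r exited with code %d: %s"
                      % (argv, proc.returncode, message))
    return proc.stdout


def doc_to_text_antiword(filename):
    # Assumption: "-w 0" keeps each paragraph on a single line
    return run_text_extractor(['antiword', '-w', '0', filename])
```

The output is returned as bytes because the character set depends on how the tool is configured; decode it with whatever encoding your installation emits.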

score 4 · Accepted Answer

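If all you want to do is extract text from Word files (.docx), it's possible to do it with Python alone. As Guy Starbuck wrote, you just need to unzip the file and then parse the XML. Inspired by python-docx, I have written a simple function to do this: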

"""
Module that extracts text from an MS XML Word document (.docx).
(Inspired by python-docx <https://github.com/mikemaccana/python-docx>)
"""

import zipfile
# cElementTree was removed in Python 3.9; ElementTree is the supported name
from xml.etree.ElementTree import XML

WORD_NAMESPACE = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'
PARA = WORD_NAMESPACE + 'p'
TEXT = WORD_NAMESPACE + 't'


def get_docx_text(path):
    """
    Take the path of a docx file as argument, return the text in unicode.
    """
    # A .docx file is a zip archive; the main body lives in word/document.xml
    with zipfile.ZipFile(path) as document:
        xml_content = document.read('word/document.xml')
    tree = XML(xml_content)

    paragraphs = []
    # Element.iter() replaces getiterator(), which was removed in Python 3.9
    for paragraph in tree.iter(PARA):
        texts = [node.text
                 for node in paragraph.iter(TEXT)
                 if node.text]
        if texts:
            paragraphs.append(''.join(texts))

    return '\n\n'.join(paragraphs)
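
The function above keeps only `w:t` text nodes, so tab characters and manual line breaks inside a paragraph are dropped. The variant below is a sketch of one way to keep them, under the assumption that mapping `w:tab` and `w:br` run children to `\t` and `\n` is the behaviour you want. The names `paragraph_text`, `docx_text_from_file` and `make_sample_docx` are hypothetical, and the sample archive contains only `word/document.xml`, so it is a stand-in for testing rather than a file Word would open.

```python
import io
import zipfile
from xml.etree.ElementTree import XML

W = '{http://schemas.openxmlformats.org/wordprocessingml/2006/main}'


def paragraph_text(paragraph):
    """Join the runs of one w:p element, turning w:tab / w:br into whitespace."""
    parts = []
    # Only look at direct children of runs (w:r), so that tab-stop
    # definitions inside paragraph properties are not mistaken for tabs.
    for run in paragraph.iter(W + 'r'):
        for child in run:
            if child.tag == W + 't' and child.text:
                parts.append(child.text)
            elif child.tag == W + 'tab':
                parts.append('\t')
            elif child.tag == W + 'br':
                parts.append('\n')
    return ''.join(parts)


def docx_text_from_file(fileobj):
    """Accept a path or a binary file-like object and return the body text."""
    with zipfile.ZipFile(fileobj) as archive:
        tree = XML(archive.read('word/document.xml'))
    paragraphs = [paragraph_text(p) for p in tree.iter(W + 'p')]
    return '\n\n'.join(p for p in paragraphs if p)


# Hypothetical minimal document body used only for demonstration
SAMPLE_XML = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/'
    'wordprocessingml/2006/main"><w:body>'
    '<w:p><w:r><w:t>Hello</w:t><w:tab/><w:t>world</w:t></w:r></w:p>'
    '<w:p><w:r><w:t>Second paragraph</w:t></w:r></w:p>'
    '</w:body></w:document>'
)


def make_sample_docx():
    """Build an in-memory zip that holds just word/document.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, 'w') as archive:
        archive.writestr('word/document.xml', SAMPLE_XML)
    buffer.seek(0)
    return buffer
```

Because `zipfile.ZipFile` accepts file-like objects, the same function can be fed an uploaded file held in memory, which may suit the web application case in the question.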

score 3 · Accepted Answer

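Using the OpenOffice API, Python, and Andrew Pitonyak's excellent online macro book, I managed to do this. Section 7.16.4 is the place to start.

One other tip to make it work without needing a screen at all is to use the Hidden property: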

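from com.sun.star.beans import PropertyValue

# PropertyValue(Name, Handle, Value, State); desktop and docpath come from
# the surrounding UNO setup, which is not shown here
RO = PropertyValue('ReadOnly', 0, True, 0)
Hidden = PropertyValue('Hidden', 0, True, 0)
xDoc = desktop.loadComponentFromURL(docpath, "_blank", 0, (RO, Hidden))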

Otherwise the document flicks up on the screen (probably on the web server console) when you open it.

score 2 · Accepted Answer

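tika-python

A Python port of the Apache Tika library. According to the documentation, Apache Tika supports text extraction from over 1500 file formats.

Note: it also works well with pyinstaller.

Install with pip:

pip install tika

Sample: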

#!/usr/bin/env python
from tika import parser
parsed = parser.from_file('/path/to/file')
print(parsed["metadata"]) #To get the meta data of the file
print(parsed["content"]) # To get the content of the file

Link to the official GitHub

score 1 · Accepted Answer

Open Office has an API

answered 2008-09-03T20:20:00.107

score 1 · Accepted Answer

For docx files, check out the Python script docx2txt, available at

http://cobweb.ecn.purdue.edu/~kak/distMisc/docx2txt

for extracting the plain text from a docx document.

score 1 · Accepted Answer

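This worked well for .doc and .odt.

It calls openoffice on the command line to convert your file to text, which you can then simply load into Python.

(It seems to have other format options, though they are apparently not documented.)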
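
The tool this answer refers to is not named in the text, so the following is only a sketch of the same idea using the office suite's own headless converter. It assumes a LibreOffice/OpenOffice binary called `soffice` is on the PATH, that the `--headless --convert-to txt:Text --outdir` options are available in your version, and that the resulting file is UTF-8; all three are assumptions to check locally. The function names are hypothetical.

```python
import os
import subprocess


def build_soffice_command(doc_path, out_dir, soffice='soffice'):
    """Return the argument list for a headless conversion to plain text."""
    return [soffice, '--headless',
            '--convert-to', 'txt:Text',
            '--outdir', out_dir,
            doc_path]


def office_doc_to_text(doc_path, out_dir, soffice='soffice', encoding='utf-8'):
    """Convert doc_path to text in out_dir, then load the result."""
    # check=True raises CalledProcessError if the converter reports failure
    subprocess.run(build_soffice_command(doc_path, out_dir, soffice), check=True)
    # The converter keeps the base name and swaps the extension for .txt
    base = os.path.splitext(os.path.basename(doc_path))[0]
    txt_path = os.path.join(out_dir, base + '.txt')
    # Assumption: the output encoding matches the `encoding` argument
    with open(txt_path, encoding=encoding) as handle:
        return handle.read()
```

Keeping the command builder separate from the function that runs it makes it easy to log or inspect the exact command before pointing it at real uploads.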

score 0 · Accepted Answer

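Honestly, don't use "pip install tika"; it was developed for single-user use (one developer working on his laptop), not for multi-user use (multiple developers).

The small class TikaWrapper.py below, which calls Tika on the command line, is more than enough to meet our needs.

You just have to instantiate this class with the JAVA_HOME path and the Tika jar path, and that's all! It works for many formats (e.g. PDF, DOCX, ODT, XLSX, PPT, and so on).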

#!/bin/python
# -*- coding: utf-8 -*-

# Class to extract metadata and text from different file types (such as PPT, XLS, and PDF)
# Developed by Philippe ROSSIGNOL
#####################
# TikaWrapper class #
#####################
class TikaWrapper:

    java_home = None
    tika_lib_path = None  # same name as the attribute set in __init__

    # Constructor
    def __init__(self, java_home, tikalib_path):
        self.java_home = java_home
        self.tika_lib_path = tikalib_path

    def extractMetadata(self, filePath, encoding="UTF-8", returnTuple=False):
        '''
        - Description:
          Extract metadata from a document
        
        - Params:
          filePath: The document file path
          encoding: The encoding (default = "UTF-8")
          returnTuple: If True return a tuple which contains both the output and the error (default = False)
        
        - Examples:
          metadata = extractMetadata(filePath="MyDocument.docx")
          metadata, error = extractMetadata(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
        '''
        cmd = self._getCmd(self._cmdExtractMetadata, filePath, encoding)
        out, err = self._execute(cmd, encoding)
        if (returnTuple): return out, err
        return out

    def extractText(self, filePath, encoding="UTF-8", returnTuple=False):
        '''
        - Description:
          Extract text from a document
        
        - Params:
          filePath: The document file path
          encoding: The encoding (default = "UTF-8")
          returnTuple: If True return a tuple which contains both the output and the error (default = False)
        
        - Examples:
          text = extractText(filePath="MyDocument.docx")
          text, error = extractText(filePath="MyDocument.docx", encoding="UTF-8", returnTuple=True)
        '''
        cmd = self._getCmd(self._cmdExtractText, filePath, encoding)
        out, err = self._execute(cmd, encoding)
        # Honor returnTuple as documented, mirroring extractMetadata
        if (returnTuple): return out, err
        return out

    # ===========
    # = PRIVATE =
    # ===========

    _cmdExtractMetadata = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --metadata ${FILE_PATH}"
    _cmdExtractText = "${JAVA_HOME}/bin/java -jar ${TIKALIB_PATH} --encoding=${ENCODING} --text ${FILE_PATH}"

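    def _getCmd(self, cmdModel, filePath, encoding):
        import shlex
        # Quote the paths: the command is run through a shell, so spaces or
        # shell metacharacters in a file name would otherwise break it
        cmd = cmdModel.replace("${JAVA_HOME}", shlex.quote(self.java_home))
        cmd = cmd.replace("${TIKALIB_PATH}", shlex.quote(self.tika_lib_path))
        cmd = cmd.replace("${ENCODING}", encoding)
        cmd = cmd.replace("${FILE_PATH}", shlex.quote(filePath))
        return cmd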

    def _execute(self, cmd, encoding):
        import subprocess
        process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
        out, err = process.communicate()
        out = out.decode(encoding=encoding)
        err = err.decode(encoding=encoding)
        return out, err
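
The same command-line call can also be made without a shell, which sidesteps quoting entirely. The sketch below is not part of the class above: `build_tika_text_command` and `tika_extract_text` are hypothetical names, and the only Tika options used are the `--encoding=` and `--text` flags that already appear in the class.

```python
import os
import subprocess


def build_tika_text_command(java_home, tika_jar, file_path, encoding='UTF-8'):
    """Return the argument list for extracting text with the Tika jar."""
    java = os.path.join(java_home, 'bin', 'java')
    return [java, '-jar', tika_jar,
            '--encoding=' + encoding,
            '--text', file_path]


def tika_extract_text(java_home, tika_jar, file_path, encoding='UTF-8'):
    """Run Tika and return (stdout, stderr) decoded with the given encoding."""
    cmd = build_tika_text_command(java_home, tika_jar, file_path, encoding)
    # A list of arguments is passed directly to the process, so no shell
    # is involved and file names with spaces need no special handling
    proc = subprocess.run(cmd, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return proc.stdout.decode(encoding), proc.stderr.decode(encoding)
```

A call would look like `text, err = tika_extract_text(java_home, tika_jar, "MyDocument.docx")`, with both paths supplied by your own deployment.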

score 0 · Accepted Answer

Just in case someone wants to do this in Java, there is the Apache POI API. extractor.getText() will extract plain text from a docx. Here is the link: https://www.tutorialspoint.com/apache_poi_word/apache_poi_word_text_extraction.htm

score 0 · Accepted Answer

Textract-Plus

Use textract-plus, which can extract text from most document extensions, including doc, docm, dotx and docx. (It uses antiword as the backend for doc files.) See the docs.

Install -

pip install textract-plus

Sample -

import textractplus as tp
text = tp.process('path/to/yourfile.doc')
