linux - How to use "catdoc" to display docx files encoded in UTF-8

Question

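I have a lot of docx files that I would like to read in the terminal. I found catdoc: http://www.wagner.pp.ru/~vitus/software/catdoc/

When I use it, the output is just unreadable characters. My docx files are encoded in UTF-8. I tried "catdoc -u my_file.docx" but it doesn't work.

Please help. Thanks a lot.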

score 2 · Accepted Answer

docx files are zipped XML files.

To extract the text and strip the XML tags, try something based on

unzip -p "*.docx" word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'

from commandlinefu.
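Because word/document.xml is typically stored as one long line, stripping the tags alone tends to run all paragraphs together. A minimal sketch of a variant that breaks lines at paragraph ends and decodes the three most common XML entities could look like this. The function names `docx_xml_to_text` and `docx_cat` are made up for illustration, and the sketch assumes the text lives in word/document.xml with paragraphs closed by `</w:p>`:

```shell
# Hypothetical helper: turn WordprocessingML read from stdin into plain text.
docx_xml_to_text() {
  # 1. end of paragraph -> newline, 2. drop all tags, 3. decode basic entities
  sed -e 's/<\/w:p>/\
/g' \
      -e 's/<[^>]\{1,\}>//g' \
      -e 's/&lt;/</g' -e 's/&gt;/>/g' -e 's/&amp;/\&/g'
}

# Hypothetical wrapper: print the text of every .docx file given as argument.
docx_cat() {
  for f in "$@"; do
    unzip -p "$f" word/document.xml | docx_xml_to_text
  done
}
```

Something like `docx_cat *.docx | less` would then page through all documents in the current directory. Headers, footers and footnotes are stored in other parts of the archive and are not covered by this sketch.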

score 1 · Accepted Answer

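My naive understanding is that catdoc can generally only be used for DOC files. A DOCX file is more like a zipped container with a bunch of information in it; inside, you can find the original document in some kind of XML format.

That said, I have had pleasant success extracting the contents of DOCX files, and even DOTX files, with either the docx2txt tool or the unoconv tool (the latter requires the OpenOffice or LibreOffice suite to be installed).

Here are some example workflows that I have used successfully in the past: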
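To make the difference between the two formats concrete, here is a small sketch that guesses the container type from the first four bytes of a file. The function name `word_container_type` is hypothetical; the signatures are the ZIP local file header (`PK\003\004`) and the OLE2 compound file header used by legacy DOC files. A ZIP signature only shows that the file is some ZIP archive, not that it is a valid DOCX:

```shell
# Hypothetical helper: guess the container type from the first four bytes.
word_container_type() {
  # Dump the first 4 bytes as hex and remove all whitespace from od's output.
  magic=$(od -An -tx1 -N4 "$1" | tr -d '[:space:]')
  case "$magic" in
    504b0304) echo zip ;;      # ZIP archive: DOCX, DOTX, ...
    d0cf11e0) echo ole2 ;;     # OLE2 compound file: legacy DOC
    *)        echo unknown ;;
  esac
}
```

A file reported as `zip` can be inspected further with `unzip -l my_file.docx`, which lists the XML parts inside the container; catdoc is only a candidate for files reported as `ole2`.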

# This one, contrary to the unoconv case, does not fire up an instance
# of either LibreOffice or OpenOffice.
docx2txt.pl < ./pesky-word-doc.docx > ./pesky-word-doc.txt

# This one, however, does fire up a rather heavy 'headless' OpenOffice
# or LibreOffice instance process per conversion. You can get around this
# using the next approach below.
unoconv -f txt -o ./pesky-word-doc.txt ./pesky-word-doc.docx

# If you need to convert a couple of dozens such documents, you might want
# to run it via a service port (you get the idea):
unoconv --listener --port=2002 &
unoconv -f txt -o outdir *.docx
unoconv -f pdf -o outdir *.docx && open ./outdir/*.pdf # Convenient, if you run MacOSX
kill -15 %-

# Kind of introducing catdoc: The sed was needed for German documents where
# somehow I couldn't find the proper encoding settings.
unoconv -f doc -o ./pesky-word-doc.doc ./pesky-word-doc.docx && \
          catdoc -u ./pesky-word-doc.doc | sed 's/ь/ü/g;s/д/ä/g;s/ц/ö/g'
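The three replacements in that sed call (ь for ü, д for ä, ц for ö) are exactly the characters that share the byte values 0xFC, 0xE4 and 0xF6 between Windows-1251 and Windows-1252, which suggests the text was being decoded as Cyrillic. catdoc documents a `-s` option for the source charset and a `-d` option for the destination charset, so a sketch along the following lines might make the sed step unnecessary. This is an untested assumption: the wrapper name `doc_to_utf8` and the `cp1252` default are illustrative, and whether `-s` takes effect depends on how the document stores its text:

```shell
# Hypothetical wrapper: decode a legacy .doc with an explicit source charset.
# $1 = .doc file, $2 = source charset (assumed default: cp1252, Western European)
doc_to_utf8() {
  catdoc -s "${2:-cp1252}" -d utf-8 "$1"
}
```

For example, `doc_to_utf8 ./pesky-word-doc.doc` would try Windows-1252, and `doc_to_utf8 ./pesky-word-doc.doc cp1250` would try the Central European code page instead.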

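There are other options as well, such as using one of the available Java-based parsers. The output quality differs between tools, so you will need to pick one of these approaches depending on your intended use.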
