php - Counting characters in DOC and DOCX files with PHP on Linux

Question

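Update: the closest I have come to counting the lines is to run the Linux command "antiword" on DOC files, which returns a text version of the DOC. For DOCX, I use a call that retrieves the content from the DOCX and pushes the data through the same text function as the antiword output.

The problem now is that when the file contains tables, antiword adds a lot of whitespace.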
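One way to keep that table padding from inflating the count is to collapse every run of whitespace into a single space before counting. This is only a minimal sketch: `count_text_chars` is a made-up helper name, and it assumes UTF-8 text and the mbstring extension.

```php
<?php
// Hypothetical helper: count characters after collapsing whitespace runs,
// so padding added around table cells is not counted many times over.
function count_text_chars($text)
{
    // Replace every run of spaces, tabs and newlines with one space.
    $normalized = trim(preg_replace('/\s+/u', ' ', $text));

    // Count characters rather than bytes (assumes UTF-8 and mbstring).
    return mb_strlen($normalized, 'UTF-8');
}

// Example with padded, table-like text.
echo count_text_chars("Name     |   Age\n\nAlice    |   30"), "\n";
```

Whether the remaining single spaces should count as characters is a separate decision; removing them entirely would mean replacing with an empty string instead.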

===

I have a script that counts the characters in a DOCX file:

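$striped_content = '';
$content = '';

// Bail out early if no file was given or it does not exist.
if (!$filename || !file_exists($filename)) return false;

// zip_open() returns a resource on success and an integer error code on failure.
$zip = zip_open($filename);

if (!$zip || is_numeric($zip)) return false;

while ($zip_entry = zip_read($zip)) {

    // Only the main document part is needed; skip other entries before opening them.
    if (zip_entry_name($zip_entry) != "word/document.xml") continue;

    if (zip_entry_open($zip, $zip_entry) == FALSE) continue;

    $content .= zip_entry_read($zip_entry, zip_entry_filesize($zip_entry));

    zip_entry_close($zip_entry);
}// end while

// Close the archive handle itself, not the last entry.
zip_close($zip);

// Separate table cells with a space and paragraphs with a line break, then drop the XML tags.
$content = str_replace('</w:r></w:p></w:tc><w:tc>', " ", $content);
$content = str_replace('</w:r></w:p>', "\r\n", $content);
$striped_content = trim(strip_tags($content));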

If I have a DOC file, I basically use the LibreOffice command line to convert it to DOCX and then run the script above.
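For reference, that conversion step might look roughly like the sketch below. It assumes the `soffice` binary is on the PATH (on some installs it is called `libreoffice`) and that the installed version supports `--headless` and `--convert-to`. `build_convert_command` and the paths are made-up names for illustration.

```php
<?php
// Hypothetical helper: build the shell command for a headless
// LibreOffice conversion of a DOC file to DOCX.
function build_convert_command($docPath, $outDir)
{
    // escapeshellarg() protects against spaces and shell metacharacters in paths.
    return 'soffice --headless --convert-to docx --outdir '
        . escapeshellarg($outDir) . ' ' . escapeshellarg($docPath);
}

$command = build_convert_command('/tmp/report.doc', '/tmp/converted');
echo $command, "\n";

// To actually run it, something like:
// exec($command, $output, $exitCode);
// and then treat any non-zero $exitCode as a failed conversion.
```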

The problem is that I cannot work out how many words the file has in its "HEADER" and "FOOTER" areas. How can I do that?

My server runs: PHP 5.3, LibreOffice, CentOS 6.5

I'm not sure what other information I need to provide. Thanks for your answers.

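P.S.

I have tried converting DOC and DOCX to TXT, but the "HEADER" and "FOOTER" areas are not preserved in the TXT output.

Also, the closest solution I have found is: https://github.com/nagilum/DOCx

The library breaks the whole DOCX file apart, so you get the header, body, and footer as plain text, and I could try to work out the word count from those. However, LibreOffice sometimes seems to have trouble converting files to DOCX: a 1-page DOC file can end up as 2 pages in DOCX after conversion.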
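Pulling the header and footer text out directly can be sketched as below. This is not how the linked library does it. It assumes the header and footer parts use the usual names such as word/header1.xml and word/footer1.xml, and `docx_header_footer_text` is a made-up helper name.

```php
<?php
// Hypothetical helper: return the plain text of every header and footer
// part found in a DOCX file, keyed by the part name inside the archive.
function docx_header_footer_text($filename)
{
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return false;
    }

    $texts = array();
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);

        // Assumption: parts are named word/header1.xml, word/footer2.xml, and so on.
        if (!preg_match('#^word/(header|footer)\d*\.xml$#', $name)) {
            continue;
        }

        $xml = $zip->getFromIndex($i);
        // End each paragraph with a newline before dropping the markup.
        $xml = str_replace('</w:p>', "\n", $xml);
        $texts[$name] = trim(strip_tags($xml));
    }

    $zip->close();
    return $texts;
}
```

Each returned string could then go through str_word_count() or mb_strlen(), the same way as the body text. Note that strip_tags() leaves XML entities such as &amp;amp; undecoded, which would slightly skew a character count.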

score 0 · Accepted Answer

Any *.docx file is a zip archive. It contains an app.xml file (at docProps/app.xml), in which you can find the node:

<Characters>8657</Characters>

and extract the value with a regular expression.
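A minimal sketch of that idea, assuming the zip extension is available and the file was saved by an application that writes this property. `docx_stored_char_count` is a made-up helper name.

```php
<?php
// Hypothetical helper: read the character count stored in docProps/app.xml.
function docx_stored_char_count($filename)
{
    $zip = new ZipArchive();
    if ($zip->open($filename) !== true) {
        return false;
    }

    // getFromName() returns false when the archive has no such entry.
    $xml = $zip->getFromName('docProps/app.xml');
    $zip->close();
    if ($xml === false) {
        return false;
    }

    // Pull the number out of <Characters>...</Characters>.
    if (preg_match('#<Characters>(\d+)</Characters>#', $xml, $m)) {
        return (int) $m[1];
    }
    return false;
}
```

The number is whatever the application that last saved the file wrote there, so it may be missing or out of date. Whether it covers header and footer text is worth checking against a document with a known count.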
