php - PHP cannot read formatted text from a COM .doc to .txt conversion

Question

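I have a lot of .doc files containing specifications for database entries. I need to parse all of these documents and create entries from the information in them. I have been trying to use the COM approach. Each file has plain text at the top and bottom of the page, but the specifications are in a table in the center of the page. If I don't unlink the new .txt file, I can see that the content was transferred to the new document, but it has a bunch of invalid characters, displayed as [], scattered throughout. When I use file_get_contents(), it completely ignores all of the text in the table.

Is there a way to solve this programmatically? I can't really find any information on the API of the word.application COM object. Ideally, I think I should strip the formatting and then save the file as a .txt file or something similar.

Any help would be greatly appreciated.

Here is my code: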

    $dir   = $PATH."/scripts/specsheets/doc";
    $files = scandir($dir);
    foreach( $files as $file ) {
        if( strtolower(substr($file, -3)) == "doc" ) {

            $word = new COM("word.application") or die("Unable to instantiate Word");
            $word->Documents->Open($dir."/".$file);
            $new_file = substr($dir."/txt/".$file, 0, -4).".txt";

            $word->Documents[1]->SaveAs($new_file, 2);
            $word->Documents[1]->Close(false);
            $word->Quit();
            $word = NULL;
            unset($word);

            $output = file_get_contents($new_file);
            rename($dir."/".$file, $dir."/archive/".$file);

            echo utf8_encode($output);
        }
    }

score 0 · Accepted Answer

I couldn't find a solution using the COM approach, but you can use the Windows build of the antiword program to get the output by running this command in PHP:

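    // escapeshellarg() keeps file names containing spaces or shell metacharacters intact
    $content = shell_exec("C:/antiword/antiword.exe ".escapeshellarg($filename));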

The link to the Windows version is:

http://www-stud.rbi.informatik.uni-frankfurt.de/~markus/antiword/

It works well, and it even extracts the data in the tables. It definitely solved my problem.
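As a rough sketch of how this could replace the COM loop from the question, the helpers below collect the .doc files in a directory and run antiword on each one. The function names (`list_doc_files`, `antiword_command`, `extract_doc_text`) are made up for illustration, and the sketch assumes the antiword path from the answer contains no spaces, since only the file argument is quoted.

```php
<?php
// Hypothetical helpers; only the antiword.exe path comes from the answer above.

/** Return the full paths of all .doc files (case-insensitive) in $dir. */
function list_doc_files(string $dir): array
{
    $docs = [];
    foreach (scandir($dir) ?: [] as $file) {
        $path = $dir . DIRECTORY_SEPARATOR . $file;
        if (is_file($path) && strtolower(pathinfo($file, PATHINFO_EXTENSION)) === 'doc') {
            $docs[] = $path;
        }
    }
    return $docs;
}

/** Build the command line; the document path is escaped, the executable path is not. */
function antiword_command(string $antiword, string $docPath): string
{
    return $antiword . ' ' . escapeshellarg($docPath);
}

/** Run antiword on one document and return its text, or null if nothing came back. */
function extract_doc_text(string $antiword, string $docPath): ?string
{
    $output = shell_exec(antiword_command($antiword, $docPath));
    return is_string($output) ? $output : null;
}
```

Calling `extract_doc_text("C:/antiword/antiword.exe", $path)` for each `$path` returned by `list_doc_files($dir)` would then yield the text of every document without starting Word at all. A null result is worth checking before moving a file to the archive directory, so that a failed extraction does not lose track of the source document.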
