17

我想用 PHP 从 word 文档中提取文本内容。

我在 Microsoft Word for Mac 2011 中创建了一个新的 Word 文档。编辑:还通过在 Windows 7 下的 Microsoft Word 中创建相同的文档进行了测试。

文件的内容是

The quick brown fox jumps over the lazy dog

我已将它作为 Word 97-2004 文档 (.doc) 保存到磁盘。

我正在使用phpoffice/phpword和这段代码来提取文本:

<?php

$source = "word.doc";

$phpWord = \PhpOffice\PhpWord\IOFactory::load($source, 'MsDoc');

$text = '';

$sections = $phpWord->getSections();

foreach ($sections as $s) {
    $els = $s->getElements();
    foreach ($els as $e) {
        if (get_class($e) === 'PhpOffice\PhpWord\Element\Text') {
            $text .= $e->getText();
        } elseif (get_class($e) === 'PhpOffice\PhpWord\Section\TextBreak') {
            $text .= " \n";
        } else {
            throw new Exception('Unknown class type ' . get_class($e));
        }
    }
}

print $text;

此代码的输出只是文本的一部分:

The quick brown fox j

代码有问题,还是某种兼容性问题?

编辑:

如果我在输出var_dump($els);之前添加一个foreach ($els as $e) {是这样的:

array(1) {
  [0]=>
  object(PhpOffice\PhpWord\Element\Text)#1265 (14) {
    ["text":protected]=>
    string(21) "The quick brown fox j"
    ["fontStyle":protected]=>
    object(PhpOffice\PhpWord\Style\Font)#1267 (25) {
      ["aliases":protected]=>
      array(1) {
        ["line-height"]=>
        string(10) "lineHeight"
      }
      ["type":"PhpOffice\PhpWord\Style\Font":private]=>
      string(4) "text"
      ["name":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["hint":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["size":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["color":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["bold":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["italic":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["underline":"PhpOffice\PhpWord\Style\Font":private]=>
      string(4) "none"
      ["superScript":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["subScript":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["strikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["doubleStrikethrough":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["smallCaps":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["allCaps":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["fgColor":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["scale":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["spacing":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["kerning":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["paragraph":"PhpOffice\PhpWord\Style\Font":private]=>
      object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
        ["aliases":protected]=>
        array(1) {
          ["line-height"]=>
          string(10) "lineHeight"
        }
        ["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        string(6) "Normal"
        ["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        string(0) ""
        ["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(true)
        ["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        bool(false)
        ["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        int(0)
        ["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        array(0) {
        }
        ["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
        NULL
        ["borderTopSize":protected]=>
        NULL
        ["borderTopColor":protected]=>
        NULL
        ["borderLeftSize":protected]=>
        NULL
        ["borderLeftColor":protected]=>
        NULL
        ["borderRightSize":protected]=>
        NULL
        ["borderRightColor":protected]=>
        NULL
        ["borderBottomSize":protected]=>
        NULL
        ["borderBottomColor":protected]=>
        NULL
        ["styleName":protected]=>
        NULL
        ["index":protected]=>
        NULL
        ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
        bool(false)
      }
      ["shading":"PhpOffice\PhpWord\Style\Font":private]=>
      NULL
      ["rtl":"PhpOffice\PhpWord\Style\Font":private]=>
      bool(false)
      ["styleName":protected]=>
      NULL
      ["index":protected]=>
      NULL
      ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
      bool(false)
    }
    ["paragraphStyle":protected]=>
    object(PhpOffice\PhpWord\Style\Paragraph)#1266 (26) {
      ["aliases":protected]=>
      array(1) {
        ["line-height"]=>
        string(10) "lineHeight"
      }
      ["basedOn":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      string(6) "Normal"
      ["next":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["alignment":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      string(0) ""
      ["indentation":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["spacing":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["lineHeight":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["widowControl":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(true)
      ["keepNext":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(false)
      ["keepLines":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(false)
      ["pageBreakBefore":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      bool(false)
      ["numStyle":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["numLevel":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      int(0)
      ["tabs":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      array(0) {
      }
      ["shading":"PhpOffice\PhpWord\Style\Paragraph":private]=>
      NULL
      ["borderTopSize":protected]=>
      NULL
      ["borderTopColor":protected]=>
      NULL
      ["borderLeftSize":protected]=>
      NULL
      ["borderLeftColor":protected]=>
      NULL
      ["borderRightSize":protected]=>
      NULL
      ["borderRightColor":protected]=>
      NULL
      ["borderBottomSize":protected]=>
      NULL
      ["borderBottomColor":protected]=>
      NULL
      ["styleName":protected]=>
      NULL
      ["index":protected]=>
      NULL
      ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
      bool(false)
    }
    ["phpWord":protected]=>
    object(PhpOffice\PhpWord\PhpWord)#1247 (3) {
      ["sections":"PhpOffice\PhpWord\PhpWord":private]=>
      array(1) {
        [0]=>
        object(PhpOffice\PhpWord\Element\Section)#1261 (16) {
          ["container":protected]=>
          string(7) "Section"
          ["style":"PhpOffice\PhpWord\Element\Section":private]=>
          object(PhpOffice\PhpWord\Style\Section)#1262 (28) {
            ["orientation":"PhpOffice\PhpWord\Style\Section":private]=>
            string(8) "portrait"
            ["paper":"PhpOffice\PhpWord\Style\Section":private]=>
            object(PhpOffice\PhpWord\Style\Paper)#1263 (8) {
              ["sizes":"PhpOffice\PhpWord\Style\Paper":private]=>
              array(6) {
                ["A3"]=>
                array(3) {
                  [0]=>
                  int(297)
                  [1]=>
                  int(420)
                  [2]=>
                  string(2) "mm"
                }
                ["A4"]=>
                array(3) {
                  [0]=>
                  int(210)
                  [1]=>
                  int(297)
                  [2]=>
                  string(2) "mm"
                }
                ["A5"]=>
                array(3) {
                  [0]=>
                  int(148)
                  [1]=>
                  int(210)
                  [2]=>
                  string(2) "mm"
                }
                ["Folio"]=>
                array(3) {
                  [0]=>
                  float(8.5)
                  [1]=>
                  int(13)
                  [2]=>
                  string(2) "in"
                }
                ["Legal"]=>
                array(3) {
                  [0]=>
                  float(8.5)
                  [1]=>
                  int(14)
                  [2]=>
                  string(2) "in"
                }
                ["Letter"]=>
                array(3) {
                  [0]=>
                  float(8.5)
                  [1]=>
                  int(11)
                  [2]=>
                  string(2) "in"
                }
              }
              ["size":"PhpOffice\PhpWord\Style\Paper":private]=>
              string(2) "A4"
              ["width":"PhpOffice\PhpWord\Style\Paper":private]=>
              int(11870)
              ["height":"PhpOffice\PhpWord\Style\Paper":private]=>
              int(16787)
              ["styleName":protected]=>
              NULL
              ["index":protected]=>
              NULL
              ["aliases":protected]=>
              array(0) {
              }
              ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
              bool(false)
            }
            ["pageSizeW":"PhpOffice\PhpWord\Style\Section":private]=>
            int(11906)
            ["pageSizeH":"PhpOffice\PhpWord\Style\Section":private]=>
            int(16838)
            ["marginTop":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["marginLeft":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["marginRight":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["marginBottom":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1417)
            ["gutter":"PhpOffice\PhpWord\Style\Section":private]=>
            int(0)
            ["headerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
            int(720)
            ["footerHeight":"PhpOffice\PhpWord\Style\Section":private]=>
            int(720)
            ["pageNumberingStart":"PhpOffice\PhpWord\Style\Section":private]=>
            NULL
            ["colsNum":"PhpOffice\PhpWord\Style\Section":private]=>
            int(1)
            ["colsSpace":"PhpOffice\PhpWord\Style\Section":private]=>
            int(720)
            ["breakType":"PhpOffice\PhpWord\Style\Section":private]=>
            NULL
            ["lineNumbering":"PhpOffice\PhpWord\Style\Section":private]=>
            NULL
            ["borderTopSize":protected]=>
            NULL
            ["borderTopColor":protected]=>
            NULL
            ["borderLeftSize":protected]=>
            NULL
            ["borderLeftColor":protected]=>
            NULL
            ["borderRightSize":protected]=>
            NULL
            ["borderRightColor":protected]=>
            NULL
            ["borderBottomSize":protected]=>
            NULL
            ["borderBottomColor":protected]=>
            NULL
            ["styleName":protected]=>
            NULL
            ["index":protected]=>
            NULL
            ["aliases":protected]=>
            array(0) {
            }
            ["isAuto":"PhpOffice\PhpWord\Style\AbstractStyle":private]=>
            bool(false)
          }
          ["headers":"PhpOffice\PhpWord\Element\Section":private]=>
          array(0) {
          }
          ["footers":"PhpOffice\PhpWord\Element\Section":private]=>
          array(0) {
          }
          ["elements":protected]=>
          array(1) {
            [0]=>
            *RECURSION*
          }
          ["phpWord":protected]=>
          *RECURSION*
          ["sectionId":protected]=>
          int(1)
          ["docPart":protected]=>
          string(7) "Section"
          ["docPartId":protected]=>
          int(1)
          ["elementIndex":protected]=>
          int(1)
          ["elementId":protected]=>
          NULL
          ["relationId":protected]=>
          NULL
          ["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
          int(0)
          ["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
          NULL
          ["mediaRelation":protected]=>
          bool(false)
          ["collectionRelation":protected]=>
          bool(false)
        }
      }
      ["collections":"PhpOffice\PhpWord\PhpWord":private]=>
      array(5) {
        ["Bookmarks"]=>
        object(PhpOffice\PhpWord\Collection\Bookmarks)#1248 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Titles"]=>
        object(PhpOffice\PhpWord\Collection\Titles)#1249 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Footnotes"]=>
        object(PhpOffice\PhpWord\Collection\Footnotes)#1250 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Endnotes"]=>
        object(PhpOffice\PhpWord\Collection\Endnotes)#1251 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
        ["Charts"]=>
        object(PhpOffice\PhpWord\Collection\Charts)#1252 (1) {
          ["items":"PhpOffice\PhpWord\Collection\AbstractCollection":private]=>
          array(0) {
          }
        }
      }
      ["metadata":"PhpOffice\PhpWord\PhpWord":private]=>
      array(3) {
        ["DocInfo"]=>
        object(PhpOffice\PhpWord\Metadata\DocInfo)#1253 (12) {
          ["creator":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["lastModifiedBy":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["created":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          int(1483515248)
          ["modified":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          int(1483515248)
          ["title":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["description":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["subject":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["keywords":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["category":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["company":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["manager":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          string(0) ""
          ["customProperties":"PhpOffice\PhpWord\Metadata\DocInfo":private]=>
          array(0) {
          }
        }
        ["Protection"]=>
        object(PhpOffice\PhpWord\Metadata\Protection)#1254 (1) {
          ["editing":"PhpOffice\PhpWord\Metadata\Protection":private]=>
          NULL
        }
        ["Compatibility"]=>
        object(PhpOffice\PhpWord\Metadata\Compatibility)#1255 (1) {
          ["ooxmlVersion":"PhpOffice\PhpWord\Metadata\Compatibility":private]=>
          int(12)
        }
      }
    }
    ["sectionId":protected]=>
    NULL
    ["docPart":protected]=>
    string(7) "Section"
    ["docPartId":protected]=>
    int(1)
    ["elementIndex":protected]=>
    int(1)
    ["elementId":protected]=>
    string(6) "5d531b"
    ["relationId":protected]=>
    NULL
    ["nestedLevel":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
    int(0)
    ["parentContainer":"PhpOffice\PhpWord\Element\AbstractElement":private]=>
    string(7) "Section"
    ["mediaRelation":protected]=>
    bool(false)
    ["collectionRelation":protected]=>
    bool(false)
  }
}
4

3 回答 3

4

而不是检查每个类的文本,您可以使用

                    $sections = $phpWord->getSections();

                    foreach ($sections as $s) {
                        $els = $s->getElements();
                        /** @var ElementTest $e */
                        foreach ($els as $e) {
                            $class = get_class($e);
                            if (method_exists($class, 'getText')) {
                                $text .= $e->getText();
                            } else {
                                $text .= "\n";
                            }
                        }
                    }

于 2019-05-29T16:23:21.247 回答
4

尝试创建您的阅读器之前

$source = "word.doc";
// create your reader object
$phpWordReader = \PhpOffice\PhpWord\IOFactory::createReader('MsDoc');
// read source
if($phpWordReader->canRead($source)) {
$phpWord = $phpWordReader->load($source);
... // rest of your code
}

答案基于此示例API 文档

于 2017-01-05T10:18:57.043 回答
2

您可以使用 catdoc http://www.wagner.pp.ru/~vitus/software/catdoc/从 word 文档中提取 txt

它可以安装在 Ubuntu 上,使用

sudo apt-get install catdoc

一旦你有 catdoc 在你的系统上工作,你可以使用 shell_exec() 从 php 调用它

<?php

$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');

print $text;

?>

请务必将 (fullpath) 替换为 catdoc 和您的 word doc 的实际路径。

编辑----加法

如果您可以将文件保存为.docx而不是.doc会更容易一些。您可以使用unzip而不是catdoc

只需更换:

$text = shell_exec('/(fullpath)/catdoc /(fullpath)/word.doc');

$text = shell_exec("/(fullpath)/unzip -p /(fullpath)/word.docx word/document.xml | sed -e 's/<[^>]\{1,\}>//g; s/[^[:print:]]\{1,\}//g'");

您可以对大多数其他命令行文档到文本转换器使用相同的技术。只需将 shell_exec() 中的命令替换为适用于您系统的命令即可。您可以检查如何从 .doc 和 .docx 文件中提取纯文本?(unix)用于其他 unix/linux 替代品

对于其他 PHP 替代方案,请查看如何从 word 文件 .doc、docx、.xlsx、.pptx php 中提取文本

于 2017-01-05T15:50:45.763 回答