java - How to extract text and images from a DOC file using Apache HWPF

Question

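I downloaded Apache HWPF. I want to use it to read a doc file and write its text to a plain text file. I don't know HWPF very well.

My very simple program is below.

I have 3 questions:

1. Some packages have errors (they can't find apache hdf). How do I fix them?
2. How can I use HWPF's methods to find and extract the images?
3. Some parts of my program are incomplete and incorrect, so please help me complete it.

I have to finish this program within 2 days.

Once again, please help me complete this.

Thanks a lot for your help!

Here is my basic code: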

public class test {
  public void m1() {
    String filesname = "Hello.doc";
    POIFSFileSystem fs = null;
    fs = new POIFSFileSystem(new FileInputStream(filesname));
    HWPFDocument doc = new HWPFDocument(fs);
    WordExtractor we = new WordExtractor(doc);
    String str = we.getText();
    String[] paragraphs = we.getParagraphText();
    Picture pic = new Picture(. . .);
    pic.writeImageContent(. . .);
    PicturesTable picTable = new PicturesTable(. . .);
    if (picTable.hasPicture(. . .)) {
      picTable.extractPicture(..., ...);
      picTable.getAllPictures();
    }
  }
}

score 1 · Accepted Answer

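Apache Tika will do this for you. It handles talking to POI to do the HWPF work and gives you the file's contents as either XHTML or plain text. If you register a recursive parser, you will also get all the embedded images.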

score 1 · Accepted Answer

    import java.awt.image.BufferedImage;
    import java.io.ByteArrayInputStream;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.util.List;
    import javax.imageio.ImageIO;
    import org.apache.poi.hwpf.HWPFDocument;
    import org.apache.poi.hwpf.extractor.WordExtractor;
    import org.apache.poi.hwpf.usermodel.Picture;

    public class DocExtractExample {
        public static void main(String[] args) throws IOException {
            String fileName = "example.doc";
            try (FileInputStream in = new FileInputStream(fileName)) {
                HWPFDocument wordDoc = new HWPFDocument(in);

                // Use org.apache.poi.hwpf.extractor.WordExtractor to get the text.
                WordExtractor extractor = new WordExtractor(wordDoc);
                StringBuilder article = new StringBuilder(); // collects the text of the Word document
                for (String paragraph : extractor.getParagraphText()) {
                    String paragraphStr = paragraph.replaceAll("\r\n", "").replaceAll("\n", "").trim();
                    if (!paragraphStr.isEmpty()) {
                        // Strings are immutable: calling concat() without using its result keeps nothing.
                        article.append(paragraphStr);
                    }
                }
                String articleStr = article.toString();

                // Use org.apache.poi.hwpf.usermodel.Picture to get the images.
                List<Picture> picturesList = wordDoc.getPicturesTable().getAllPictures();
                for (int i = 0; i < picturesList.size(); ++i) {
                    Picture pic = picturesList.get(i);
                    BufferedImage image = ImageIO.read(new ByteArrayInputStream(pic.getContent()));
                    if (image != null) { // ImageIO.read returns null when no reader can decode the data
                        System.out.println("Image[" + i + "]" + " ImageWidth:" + image.getWidth() + " ImageHeight:" + image.getHeight() + " Suggest Image Format:" + pic.suggestFileExtension());
                    }
                }
            }
        }
    }
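
The question also asks for the text to end up in a plain text file, and the snippet above only inspects the images without saving them. Here is a minimal sketch of that last step using only the JDK. The class `DocOutputWriter`, its method names and the `imageN.ext` naming scheme are my own assumptions for illustration, not part of POI.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a POI class): persists already-extracted text and image bytes.
class DocOutputWriter {

    // Trims every paragraph, drops the empty ones and joins the rest with '\n'.
    static String joinParagraphs(String[] paragraphs) {
        List<String> kept = new ArrayList<>();
        for (String paragraph : paragraphs) {
            String trimmed = paragraph.trim();
            if (!trimmed.isEmpty()) {
                kept.add(trimmed);
            }
        }
        return String.join("\n", kept);
    }

    // Assumed naming scheme for saved pictures: image0.png, image1.jpg, ...
    static String imageFileName(int index, String extension) {
        return "image" + index + "." + extension;
    }

    // Writes the joined paragraphs to the target file as UTF-8.
    static void writeText(String[] paragraphs, Path target) throws IOException {
        Files.write(target, joinParagraphs(paragraphs).getBytes(StandardCharsets.UTF_8));
    }

    // Writes the raw image bytes into dir and returns the path of the new file.
    static Path writeImage(byte[] content, String extension, Path dir, int index) throws IOException {
        return Files.write(dir.resolve(imageFileName(index, extension)), content);
    }
}
```

With the objects from the answer's code, this would be called as `DocOutputWriter.writeText(extractor.getParagraphText(), Paths.get("out.txt"))` and, inside the picture loop, `DocOutputWriter.writeImage(pic.getContent(), pic.suggestFileExtension(), Paths.get("."), i)`. Writing the bytes from `getContent()` directly avoids re-encoding the image through `ImageIO`.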

score 0 · Accepted Answer

If this is all you want to do and you don't care about writing the code yourself, you can just use Antiword.

$ antiword file.doc > out.txt
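
If the conversion still has to be triggered from a Java program, the same command can be launched with `ProcessBuilder`. This is only a sketch: it assumes the `antiword` executable is installed and on the `PATH`, and the class `AntiwordRunner` and its method names are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical wrapper around the antiword command line shown above.
class AntiwordRunner {

    // Builds the argument list for "antiword <docPath>".
    static List<String> buildCommand(String docPath) {
        return Arrays.asList("antiword", docPath);
    }

    // Runs antiword and sends its standard output to outPath, like "> out.txt" in the shell.
    // Returns the exit code of the process.
    static int convert(String docPath, String outPath) throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(buildCommand(docPath));
        builder.redirectOutput(new File(outPath));
        builder.redirectError(ProcessBuilder.Redirect.INHERIT); // show antiword's error messages
        return builder.start().waitFor();
    }
}
```

For example, `AntiwordRunner.convert("file.doc", "out.txt")` corresponds to the shell command above. Note that this route only produces text and does not extract images.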

score 0 · Accepted Answer

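I only learned about this long after the fact, but I found TextMining on Google Code, which is more accurate and very easy to use. However, it is pretty much abandoned code.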
