java - Reading .doc files in Java with POI

Question

Hi, I am trying to read text from doc and docx files. For doc files, I am doing this:

package test;

import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;

public class ReadFile {
    public static void main(String[] args) {
        File file = null;
        WordExtractor extractor = null;
        try {
            file = new File("C:\\Users\\rijo\\Downloads\\r.doc");
            FileInputStream fis = new FileInputStream(file.getAbsolutePath());
            HWPFDocument document = new HWPFDocument(fis);
            extractor = new WordExtractor(document);
            String fileData = extractor.getText();
            System.out.println(fileData);
        } catch (Exception exep) {
            // Print the failure instead of silently swallowing it.
            exep.printStackTrace();
        }
    }
}

But this gives me an org/apache/poi/OldFileFormatException exception.

Any idea how to fix this?

I also need to read Docx and PDF files. Is there a good way to read all types of files?

score 6 · Accepted Answer

Using the following jars (in case version numbers matter here):

dom4j-1.7-20060614
poi-3.9-20121203
poi-ooxml-3.9-20121203
poi-ooxml-schemas-3.9-20121203
poi-scratchpad-3.9-20121203
xmlbeans-2.4.0

I wrote this:

import java.io.FileInputStream;
import java.io.IOException;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class SO {
    public static void main(String[] args) {
        // Alternate between the two to check what works.
        // String filePath = "D:\\Users\\username\\Desktop\\Doc1.docx";
        String filePath = "D:\\Users\\username\\Desktop\\Bob.doc";

        // try-with-resources closes the stream even if parsing fails.
        try (FileInputStream fis = new FileInputStream(filePath)) {
            if (filePath.toLowerCase().endsWith(".docx")) { // is a docx
                XWPFDocument doc = new XWPFDocument(fis);
                XWPFWordExtractor extract = new XWPFWordExtractor(doc);
                System.out.println(extract.getText());
            } else { // is not a docx
                HWPFDocument doc = new HWPFDocument(fis);
                WordExtractor extractor = new WordExtractor(doc);
                System.out.println(extractor.getText());
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

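This lets me read text from a .docx and a .doc file respectively. If it does not work on your PC, there may be a problem with the external jars you are using.

Give it a try, though :) Good luck!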
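
Since the code above picks the parser from the file name, a renamed or mislabeled file will go down the wrong branch. A rough sketch of an alternative is to look at the first bytes of the file instead. The class WordFormatSniffer and its Kind enum are hypothetical names made up for this illustration, not part of POI; the sketch uses only the JDK and relies on two well-known signatures: binary .doc files are OLE2 compound documents, and .docx files are ZIP archives.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;

// Hypothetical helper (not a POI class): guesses the container format
// of a file from its leading bytes rather than from its extension.
final class WordFormatSniffer {
    enum Kind { OLE2, ZIP, UNKNOWN }

    // OLE2 compound document signature, used by binary .doc files.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };
    // ZIP local file header signature; .docx files are ZIP archives.
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    // Classifies a buffer holding the first bytes of a file.
    static Kind detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) return Kind.OLE2;
        if (startsWith(header, ZIP_MAGIC)) return Kind.ZIP;
        return Kind.UNKNOWN;
    }

    // Reads up to 8 bytes from the file and classifies them.
    static Kind detect(Path file) throws IOException {
        byte[] header = new byte[OLE2_MAGIC.length];
        int total = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (total < header.length) {
                int n = in.read(header, total, header.length - total);
                if (n < 0) break; // file is shorter than the header
                total += n;
            }
        }
        return detect(Arrays.copyOf(header, total));
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java WordFormatSniffer <file>");
            return;
        }
        System.out.println(detect(Paths.get(args[0])));
    }
}
```

An OLE2 result would go to the HWPFDocument branch and a ZIP result to the XWPFDocument branch. This is only a coarse check: other Office files (.xls, .ppt) are also OLE2, any ZIP archive matches the second signature, and the header alone does not tell a Word 97-2003 file from an older one, so the parser can still reject the file.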

score 1 · Accepted Answer

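If you look at the javadocs for OldFileFormatException, you will see the reason:

Base class of all the exceptions that POI throws in the event that it's given a file that is older than currently supported.

This means that the r.doc you are using is not supported by HWPFDocument. It possibly supports only the latest format, docx (which has also been around for quite some time now; I am not sure whether Apache POI supports the doc format in HWPFDocument).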

score 0 · Accepted Answer

I don't know why you are using WordExtractor just to get the text out of a .doc. For me, a single method is enough:

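import java.io.File;
import java.io.FileInputStream;
import org.apache.poi.hwpf.HWPFDocument;
...
File fin = new File(yourFilePath);
FileInputStream fis = new FileInputStream(fin);
HWPFDocument doc = new HWPFDocument(fis);
// getDocumentText() returns the document text without going through WordExtractor.
String text = doc.getDocumentText();
System.out.println(text);
...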

To work with .pdf files, use another Apache project: PDFBox.
