是否有机会在办公文件(doc、docx、xls、xlsx、ppt、pptx、...)中列出所有嵌入的对象(doc、...、txt)?
我正在使用 Apache POI (Java) Library 从办公文件中提取文本。我不需要从嵌入对象中提取所有文本,包含所有嵌入文档的文件名的日志文件会很好(类似于:)string objectFileNames = getEmbeddedFileNames(fileInputStream)
。
示例:我有一个 Word 文档“test.doc”,其中包含另一个名为“excel.xls”的文件。我想将 excel.xls 的文件名(在这种情况下)写入日志文件。
我使用 apache 主页 ( https://poi.apache.org/text-extraction.html )中的一些示例代码进行了尝试。但我的代码总是返回相同的(“页脚文本:页眉文本”)。
我尝试的是:
private static void test(String inputfile, String outputfile) throws Exception {
String[] extractedText = new String[100];
int emb = 0;//used for counter of embedded objects
InputStream fis = new FileInputStream(inputfile);
PrintWriter out = new PrintWriter(outputfile);//Text in File (txt) schreiben
System.out.println("Emmbedded Search started. Inputfile: " + inputfile);
//Based on Apache sample Code
emb = 0;//Reset Counter
POIFSFileSystem emb_fileSystem = new POIFSFileSystem(fis);
// Firstly, get an extractor for the Workbook
POIOLE2TextExtractor oleTextExtractor =
ExtractorFactory.createExtractor(emb_fileSystem);
// Then a List of extractors for any embedded Excel, Word, PowerPoint
// or Visio objects embedded into it.
POITextExtractor[] embeddedExtractors =
ExtractorFactory.getEmbededDocsTextExtractors(oleTextExtractor);
for (POITextExtractor textExtractor : embeddedExtractors) {
// If the embedded object was an Excel spreadsheet.
if (textExtractor instanceof ExcelExtractor) {
ExcelExtractor excelExtractor = (ExcelExtractor) textExtractor;
extractedText[emb] = (excelExtractor.getText());
}
// A Word Document
else if (textExtractor instanceof WordExtractor) {
WordExtractor wordExtractor = (WordExtractor) textExtractor;
String[] paragraphText = wordExtractor.getParagraphText();
for (String paragraph : paragraphText) {
extractedText[emb] = paragraph;
}
// Display the document's header and footer text
System.out.println("Footer text: " + wordExtractor.getFooterText());
System.out.println("Header text: " + wordExtractor.getHeaderText());
}
// PowerPoint Presentation.
else if (textExtractor instanceof PowerPointExtractor) {
PowerPointExtractor powerPointExtractor =
(PowerPointExtractor) textExtractor;
extractedText[emb] = powerPointExtractor.getText();
emb++;
extractedText[emb] = powerPointExtractor.getNotes();
}
// Visio Drawing
else if (textExtractor instanceof VisioTextExtractor) {
VisioTextExtractor visioTextExtractor =
(VisioTextExtractor) textExtractor;
extractedText[emb] = visioTextExtractor.getText();
}
emb++;//Count Embedded Objects
}//Close For Each Loop POIText...
for(int x = 0; x <= extractedText.length; x++){//Write Results to TXT
if (extractedText[x] != null){
System.out.println(extractedText[x]);
out.println(extractedText[x]);
}
else {
break;
}
}
out.close();
}
Inputfile 是 xls,其中包含一个 doc 文件作为 object,而 outputfile 是 txt。
谢谢如果有人可以帮助我。