I am using the following code snippets to extract text from a .doc file:
HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
Range range = document.getRange();
int len = range.numParagraphs();
StringBuilder builder = new StringBuilder();
for (int i = 0; i < len; i++) {
    builder.append(range.getParagraph(i).text());
}
and
HWPFDocument document = new HWPFDocument(new FileInputStream(inputFile));
WordExtractor wordExtractor = new WordExtractor(document);
String[] paragraphs = wordExtractor.getParagraphText();
StringBuilder builder = new StringBuilder();
for (String p : paragraphs) {
    builder.append(p);
}
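However, both of them always output some strange characters. For example:
PAGEREF_Toc351848910\h10HYPERLINK\l
_Toc351848911
CITATIONPla\l1033[HYPERLINK\l"Pla"13]
So I would like to know where they come from and how to remove them when extracting text from a .doc file.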
Thanks in advance.
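
For context, strings such as PAGEREF, HYPERLINK and CITATION look like Word field codes (table-of-contents entries, links, citations). In the binary .doc format a field is delimited by three control characters: 0x13 (field begin), 0x14 (separator between the field code and its visible result) and 0x15 (field end). Below is a minimal sketch of a filter that drops the field code and keeps only the visible result. It assumes the extracted text still contains those three marker characters, which I have not verified for every POI version. The class name `FieldCodeStripper` and the method `stripFieldCodes` are hypothetical names made up for this illustration and are not part of POI. POI's `WordExtractor` also appears to offer a static `stripFields(String)` helper that may be worth trying first.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of Apache POI.
public class FieldCodeStripper {
    // Field delimiters used by the Word binary format.
    private static final char FIELD_BEGIN = '\u0013';
    private static final char FIELD_SEPARATOR = '\u0014';
    private static final char FIELD_END = '\u0015';

    // Removes field codes (e.g. "PAGEREF _Toc... \h") and keeps field results.
    // Assumption: the input still contains the three marker characters.
    public static String stripFieldCodes(String text) {
        StringBuilder out = new StringBuilder(text.length());
        // One entry per open field: false = inside the code part, true = inside the result part.
        Deque<Boolean> inResult = new ArrayDeque<>();
        int codeDepth = 0; // number of open fields that are still in their code part

        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == FIELD_BEGIN) {
                inResult.push(Boolean.FALSE);
                codeDepth++;
            } else if (c == FIELD_SEPARATOR && !inResult.isEmpty() && !inResult.peek()) {
                // Switch the innermost field from its code part to its result part.
                inResult.pop();
                inResult.push(Boolean.TRUE);
                codeDepth--;
            } else if (c == FIELD_END && !inResult.isEmpty()) {
                // A field with no separator has no visible result at all.
                if (!inResult.pop()) {
                    codeDepth--;
                }
            } else if (codeDepth == 0) {
                // Ordinary text, or text inside field results only.
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "See \u0013 PAGEREF _Toc351848910 \\h \u001410\u0015 for details";
        System.out.println(stripFieldCodes(raw)); // prints "See 10 for details"
    }
}
```

With either snippet above, this would be applied to each paragraph before appending, for example `builder.append(FieldCodeStripper.stripFieldCodes(p));`. If the marker characters have already been removed by the time the text reaches this point, the filter has nothing to match on and a different approach would be needed.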