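I am using iText to convert a PDF to a text file. It works well for the most part, but for some words it drops the spaces between them. For example, the PDF contains the phrase "present the main ideas", but iText produces "presentthemainideas" in the output. Is there any way to correct this behavior?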
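// Imports this snippet needs (iText 5.x package names):
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.StringTokenizer;

    // Body of the conversion method:
    String pdf = "/home/can/Downloads/NLP/textSummarization/A New Approach for Multi-Document Update Summarization.pdf";
    String txt = "/home/can/myWorkSpace/PDFConverterProject/outputs/bb.txt";
    String lines = "/home/can/myWorkSpace/PDFConverterProject/outputs/line.txt";
    StringBuilder text = new StringBuilder();
    PdfReader reader = null;
    // try-with-resources closes both writers even if extraction fails
    try (PrintWriter out = new PrintWriter(new FileOutputStream(txt));
         PrintWriter lineWriter = new PrintWriter(new FileOutputStream(lines))) {
        reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        // Extract the text of every page (iText page numbers are 1-based)
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            TextExtractionStrategy strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            text.append(strategy.getResultantText());
        }
        // Rejoin words that were hyphenated across a line break
        String resultText = text.toString().replace("-\n", "");
        out.println("-->" + resultText);
        // Write each non-empty line of the extracted text to a second file
        StringTokenizer stringTokenizer = new StringTokenizer(resultText, "\n");
        while (stringTokenizer.hasMoreTokens()) {
            lineWriter.println("line-->" + stringTokenizer.nextToken());
        }
    } catch (IOException e) {
        e.printStackTrace();
    } finally {
        // Release the file handle held by the reader
        if (reader != null) {
            reader.close();
        }
    }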
}