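I am using the PDFBox parser in Java to read data from a PDF file. It reads all of the content in the PDF file.
Below is my sample code, which reads the data from a PDF file and stores it in a text file. Sample code: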
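// Imports assume PDFBox 1.8.x, where PDFParser is constructed from an InputStream
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.PrintWriter;
import org.apache.pdfbox.cos.COSDocument;
import org.apache.pdfbox.pdfparser.PDFParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentInformation;
import org.apache.pdfbox.util.PDFTextStripper;
public class PDFTextParser {
PDFParser parser;
String parsedText;
PDFTextStripper pdfStripper;
PDDocument pdDoc;
COSDocument cosDoc;
PDDocumentInformation pdDocInfo;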
// PDFTextParser Constructor
public PDFTextParser() {
}
// Extract text from PDF Document
String pdftoText(String fileName) {
System.out.println("Parsing text from PDF file " + fileName + "....");
File f = new File(fileName);
if (!f.isFile()) {
System.out.println("File " + fileName + " does not exist.");
return null;
}
try {
parser = new PDFParser(new FileInputStream(f));
} catch (Exception e) {
System.out.println("Unable to open PDF Parser.");
return null;
}
try {
parser.parse();
cosDoc = parser.getDocument();
pdfStripper = new PDFTextStripper();
pdDoc = new PDDocument(cosDoc);
parsedText = pdfStripper.getText(pdDoc);
} catch (Exception e) {
System.out.println("An exception occurred in parsing the PDF Document.");
e.printStackTrace();
return null;
} finally {
// Release the document on success as well as on failure
try {
if (pdDoc != null) pdDoc.close();
else if (cosDoc != null) cosDoc.close();
} catch (Exception e1) {
e1.printStackTrace();
}
}
System.out.println("Done.");
return parsedText;
}
// Write the parsed text from PDF to a file
void writeTexttoFile(String pdfText, String fileName) {
System.out.println("\nWriting PDF text to output text file " + fileName + "....");
try {
PrintWriter pw = new PrintWriter(fileName);
pw.print(pdfText);
pw.close();
} catch (Exception e) {
System.out.println("An exception occurred in writing the pdf text to file.");
e.printStackTrace();
}
System.out.println("Done.");
}
//Extracts text from a PDF Document and writes it to a text file
public static void test() {
String args[]={"C://Sample_Voice.pdf","C://CNP/Sample.txt"};
if (args.length != 2) {
System.out.println("Usage: java PDFTextParser <InputPDFFilename> <OutputTextFile>");
System.exit(1);
}
PDFTextParser pdfTextParserObj = new PDFTextParser();
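String pdfToText = pdfTextParserObj.pdftoText(args[0]);
if (pdfToText == null) {
System.out.println("PDF to Text Conversion failed.");
}
else {
// Strip the registered-trademark sign only after the null check, to avoid a NullPointerException
pdfToText = pdfToText.replace("®", "");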
System.out.println("\nThe text parsed from the PDF Document....\n" + pdfToText);
pdfTextParserObj.writeTexttoFile(pdfToText, args[1]);
}
}
public static void main(String args[]) throws IOException
{
test();
}
}
My requirement is to get only the raw text, without:
1) headers
2) footers
3) hyperlinks
How can I do this? Please give me suggestions.
Thanks
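To make the requirement concrete, here is a minimal sketch of one possible post-processing heuristic. It is not a PDFBox feature. It assumes the text has already been extracted one page at a time (for example by calling setStartPage and setEndPage on the PDFTextStripper before each getText call), that a header or footer is a line repeated at the top or bottom of every page apart from digits such as page numbers, and that "hyperlinks" means URL text visible in the output. The class RawTextFilter and its method clean are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox: cleans text that was extracted page by page.
final class RawTextFilter {

    // Assumption: links show up in the text as literal http(s):// or www. tokens.
    private static final Pattern URL = Pattern.compile("(?i)\\b(?:https?://|www\\.)\\S+");

    // Replaces digit runs so that "Page 1 of 2" and "Page 2 of 2" compare as equal.
    private static String normalize(String line) {
        return line.trim().replaceAll("\\d+", "#");
    }

    // Index of the first (fromTop) or last non-blank line, or -1 if every line is blank.
    private static int edgeIndex(String[] lines, boolean fromTop) {
        for (int k = 0; k < lines.length; k++) {
            int i = fromTop ? k : lines.length - 1 - k;
            if (!lines[i].trim().isEmpty()) return i;
        }
        return -1;
    }

    // True when every page carries the same normalized line at the given edge.
    private static boolean repeatsOnEveryPage(List<String[]> pages, boolean fromTop) {
        if (pages.size() < 2) return false; // a single page gives no evidence of repetition
        String expected = null;
        for (String[] lines : pages) {
            int i = edgeIndex(lines, fromTop);
            if (i < 0) return false;
            String key = normalize(lines[i]);
            if (expected == null) expected = key;
            else if (!expected.equals(key)) return false;
        }
        return true;
    }

    // Drops repeated header/footer lines and URL tokens, returning one cleaned line per row.
    static String clean(List<String> pageTexts) {
        List<String[]> pages = new ArrayList<>();
        for (String text : pageTexts) pages.add(text.split("\\r?\\n"));
        boolean dropHeader = repeatsOnEveryPage(pages, true);
        boolean dropFooter = repeatsOnEveryPage(pages, false);

        StringBuilder out = new StringBuilder();
        for (String[] lines : pages) {
            int head = dropHeader ? edgeIndex(lines, true) : -1;
            int foot = dropFooter ? edgeIndex(lines, false) : -1;
            for (int i = 0; i < lines.length; i++) {
                if (i == head || i == foot) continue;
                String line = URL.matcher(lines[i]).replaceAll("")
                        .replaceAll("\\s{2,}", " ").trim();
                if (!line.isEmpty()) out.append(line).append('\n');
            }
        }
        return out.toString();
    }
}
```

This only handles one-line headers and footers and will miss links that have no visible URL text. If the headers and footers sit in fixed bands of the page, restricting extraction to a body rectangle with PDFTextStripperByArea may be a more robust alternative.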