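I am currently trying to automatically extract important keywords from a PDF file. I am able to get the text out of the PDF document, but now I need to know which font size and font family these keywords have. This is the code I use to retrieve the text: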



public static void main(String[] args) throws IOException {
    String src = "SEM_081145.pdf";

    PdfReader reader = new PdfReader(src);

    SemTextExtractionStrategy semTextExtractionStrategy = new SemTextExtractionStrategy();

    PrintWriter out = new PrintWriter(new FileOutputStream(src + ".txt"));
    // Region filter; only used by the commented-out strategy below.
    Rectangle rect = new Rectangle(70, 80, 490, 580);
    RenderFilter filter = new RegionTextRenderFilter(rect);

    for (int i = 1; i <= reader.getNumberOfPages(); i++) {
        // strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
        out.println(PdfTextExtractor.getTextFromPage(reader, i, semTextExtractionStrategy));
    }

    // Flush and release the output file and the reader.
    out.flush();
    out.close();
    reader.close();
}

I have implemented the TextExtractionStrategy SemTextExtractionStrategy as follows:

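// Imports assume iText 5.x (com.itextpdf).
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class SemTextExtractionStrategy implements TextExtractionStrategy {

    private String text;

    public void beginTextBlock() {
    }

    public void renderText(TextRenderInfo renderInfo) {
        // Note: this keeps only the most recently rendered text chunk.
        text = renderInfo.getText();
    }

    public void endTextBlock() {
    }

    public void renderImage(ImageRenderInfo renderInfo) {
    }

    public String getResultantText() {
        return text;
    }
}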

I can get the FontType, but there is no method for getting the font size. Is there another way to get the font size of the current text segment?

Or are there any other libraries that can extract the font size from text segments? I have already looked into PDFBox and PDFTextStream. The PDF library from Aspose would do the job perfectly, but it is very expensive and I need to use an open source project.


4 Answers


Thanks to Alexis I could convert his C# solution into Java code:

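text = renderInfo.getText();

// Bottom-left corner: start of the baseline. Top-right corner: end of the ascent line.
Vector curBaseline = renderInfo.getBaseline().getStartPoint();
Vector topRight = renderInfo.getAscentLine().getEndPoint();

// The height of the rectangle spanned by these two points approximates the font size.
Rectangle rect = new Rectangle(curBaseline.get(0), curBaseline.get(1), topRight.get(0), topRight.get(1));
float curFontSize = rect.getHeight();
answered 2012-06-06T15:51:25.183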

I had some trouble using Alexis' and Prine's solutions, since they don't deal with rotated text correctly. So this is what I do (sorry, in Scala):

val x0 = info.getAscentLine.getEndPoint
val x1 = info.getBaseline.getStartPoint
val x2 = info.getBaseline.getEndPoint
// |(x2 - x1) x (x1 - x0)|^2 / |x2 - x1|^2 is the squared distance from x0 to the baseline
val length1 = (x2.subtract(x1)).cross((x1.subtract(x0))).lengthSquared
val length2 = x2.subtract(x1).lengthSquared
(length1, length2) match {
  case (0, 0) => 0
  case _ => math.sqrt(length1 / length2) // the square root gives the height itself
}
answered 2012-06-15T12:41:13.713
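For readers who want the rotation-safe approach in Java, the geometry can be sketched without any iText types: the text height is the perpendicular distance from the end point of the ascent line to the line through the baseline, that is, the magnitude of the cross product divided by the baseline length. The class `TextHeight` and the method `heightAboveBaseline` are hypothetical names for illustration. With iText one would presumably pass in the coordinates read via `get(Vector.I1)` and `get(Vector.I2)` from the points used in the answer above.

```java
/** Minimal sketch: rotation-safe text height from plain 2D coordinates. */
final class TextHeight {
    private TextHeight() {
    }

    /**
     * Perpendicular distance from the ascent line's end point (ax, ay)
     * to the line through the baseline start (bx0, by0) and end (bx1, by1).
     */
    static float heightAboveBaseline(float ax, float ay,
                                     float bx0, float by0,
                                     float bx1, float by1) {
        float dx = bx1 - bx0;
        float dy = by1 - by0;
        float baselineLengthSquared = dx * dx + dy * dy;
        if (baselineLengthSquared == 0f) {
            // Degenerate baseline (zero-width chunk): no direction to measure against.
            return 0f;
        }
        // z component of the 2D cross product (baseline direction) x (baselineStart - ascentPoint).
        float cross = dx * (by0 - ay) - dy * (bx0 - ax);
        return (float) (Math.abs(cross) / Math.sqrt(baselineLengthSquared));
    }
}
```

Because only the distance to the baseline is measured, the result is the same whether the text runs horizontally, vertically, or at an arbitrary angle.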

You can adapt the code provided in this answer, in particular this code snippet:

Vector curBaseline = renderInfo.GetBaseline().GetStartPoint();
Vector topRight = renderInfo.GetAscentLine().GetEndPoint();
iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(curBaseline[Vector.I1], curBaseline[Vector.I2], topRight[Vector.I1], topRight[Vector.I2]);
Single curFontSize = rect.Height;

This answer is in C#, but the API is so similar that the conversion to Java should be straightforward.

answered 2012-06-05T11:26:34.073

If you want the exact font size, use the following code in your renderText method:

float fontsize = renderInfo.getAscentLine().getStartPoint().get(1)
     - renderInfo.getDescentLine().getStartPoint().get(1);

Modify this as indicated in the other answers for rotated text.

answered 2015-11-03T23:21:15.233
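One possible way to make the ascent-minus-descent idea independent of rotation is to measure the straight-line distance between the start point of the ascent line and the start point of the descent line, instead of subtracting only their y coordinates. This is a sketch under an assumption: that both start points sit at the same position along the text direction, offset only perpendicular to the baseline, which is how I understand iText to construct these lines. The class `AscentDescentHeight` and its method `height` are hypothetical names, and the coordinates are passed as plain floats so the snippet does not depend on iText.

```java
/** Minimal sketch: ascent-to-descent height that does not depend on text rotation. */
final class AscentDescentHeight {
    private AscentDescentHeight() {
    }

    /**
     * Euclidean distance between the ascent line's start point and the
     * descent line's start point. For unrotated text this reduces to
     * ascentY - descentY.
     */
    static float height(float ascentX, float ascentY,
                        float descentX, float descentY) {
        return (float) Math.hypot(ascentX - descentX, ascentY - descentY);
    }
}
```

Inside renderText, the arguments would presumably come from `renderInfo.getAscentLine().getStartPoint()` and `renderInfo.getDescentLine().getStartPoint()`, read with `get(Vector.I1)` and `get(Vector.I2)`.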