0

我想逐个字符地解析整个 PDF 字符,并能够在该 PDF 文档上获取该字符的 ASCII 值、字体和矩形,以后可以将其保存为位图。我尝试使用 PdfTextExtractor.GetTextFromPage 但这会将 PDF 中的整个文本作为字符串提供。

4

1 回答 1

4

与 iTextSharp 捆绑的文本提取策略(特别是LocationTextExtractionStrategy默认情况下由PdfTextExtractor.GetTextFromPage无策略参数的重载使用)仅允许直接访问收集的纯文本,而不是位置。

克里斯·哈斯MyLocationTextExtractionStrategy

@Chris Haas 在他的旧答案中介绍了以下扩展LocationTextExtractionStrategy

public class MyLocationTextExtractionStrategy : LocationTextExtractionStrategy {
    //Hold each coordinate
    public List<RectAndText> myPoints = new List<RectAndText>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo renderInfo) {
        base.RenderText(renderInfo);

        //Get the bounding box for the chunk of text
        var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
        var topRight = renderInfo.GetAscentLine().GetEndPoint();

        //Create a rectangle from it
        var rect = new iTextSharp.text.Rectangle(
                                                bottomLeft[Vector.I1],
                                                bottomLeft[Vector.I2],
                                                topRight[Vector.I1],
                                                topRight[Vector.I2]
                                                );

        //Add this to our main collection
        this.myPoints.Add(new RectAndText(rect, renderInfo.GetText()));
    }
}

它利用了这个助手类

//Helper class that stores our rectangle and text
public class RectAndText {
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public RectAndText(iTextSharp.text.Rectangle rect, String text) {
        this.Rect = rect;
        this.Text = text;
    }
}

此策略使文本块及其封闭矩形在公共成员List<RectAndText> myPoints中可用,您可以像这样访问:

//Create an instance of our strategy
var t = new MyLocationTextExtractionStrategy();

//Parse page 1 of the document above
using (var r = new PdfReader(testFile)) {
    var ex = PdfTextExtractor.GetTextFromPage(r, 1, t);
}

//Loop through each chunk found
foreach (var p in t.myPoints) {
    Console.WriteLine(string.Format("Found text {0} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom));
}

对于您逐个字符解析整个 PDF 字符并能够获取该字符的 ASCII 值、字体和矩形的任务,这里只有两个细节是错误的:

  • 像这样返回的文本块可能包含多个字符
  • 未提供字体信息。

因此,我们必须稍微调整一下:

一个新的CharLocationTextExtractionStrategy

除了MyLocationTextExtractionStrategy类之外CharLocationTextExtractionStrategy,它还通过字形拆分输入并提供字体名称:

public class CharLocationTextExtractionStrategy : LocationTextExtractionStrategy
{
    //Hold each coordinate
    public List<RectAndTextAndFont> myPoints = new List<RectAndTextAndFont>();

    //Automatically called for each chunk of text in the PDF
    public override void RenderText(TextRenderInfo wholeRenderInfo)
    {
        base.RenderText(wholeRenderInfo);

        foreach (TextRenderInfo renderInfo in wholeRenderInfo.GetCharacterRenderInfos())
        {
            //Get the bounding box for the chunk of text
            var bottomLeft = renderInfo.GetDescentLine().GetStartPoint();
            var topRight = renderInfo.GetAscentLine().GetEndPoint();

            //Create a rectangle from it
            var rect = new iTextSharp.text.Rectangle(
                                                    bottomLeft[Vector.I1],
                                                    bottomLeft[Vector.I2],
                                                    topRight[Vector.I1],
                                                    topRight[Vector.I2]
                                                    );

            //Add this to our main collection
            this.myPoints.Add(new RectAndTextAndFont(rect, renderInfo.GetText(), renderInfo.GetFont().PostscriptFontName));
        }
    }
}

//Helper class that stores our rectangle, text, and font
public class RectAndTextAndFont
{
    public iTextSharp.text.Rectangle Rect;
    public String Text;
    public String Font;
    public RectAndTextAndFont(iTextSharp.text.Rectangle rect, String text, String font)
    {
        this.Rect = rect;
        this.Text = text;
        this.Font = font;
    }
}

像这样使用这种策略

CharLocationTextExtractionStrategy strategy = new CharLocationTextExtractionStrategy();

using (var pdfReader = new PdfReader(testFile))
{
    PdfTextExtractor.GetTextFromPage(pdfReader, 1, strategy);
}

foreach (var p in strategy.myPoints)
{
    Console.WriteLine(string.Format("<{0}> in {3} at {1}x{2}", p.Text, p.Rect.Left, p.Rect.Bottom, p.Font));
}

您按字符获取信息,包括字体。

于 2016-01-21T14:09:01.910 回答