c# - How to read hyperlinks with their anchor text from a PDF file in C#

Question

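I can already extract link targets from a PDF file, for example http://google.com, but I also need the anchor text, such as "click here". How can I get the anchor text of a link?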

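I extracted the URL values by following the approach from this question: "Reading hyperlinks from a pdf file". The link in my document was created like this: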

Anchor a = new Anchor("Test Anchor");
a.Reference = "http://www.google.com";
myParagraph.Add(a);

Here I get http://www.google.com, but I need the anchor value, i.e. Test Anchor.

Any suggestions would be appreciated.

score 5 · Accepted Answer

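For each link in the PDF file, you need to determine the region where the link is placed and then use iTextSharp to read the text that lies beneath that region.

This lets you extract the text under the link. The limitation of this approach is that if the link region is wider than the text, the extraction returns all of the text under that region, not just the anchor text.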

using System.Collections.Generic;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// BinaryHyperlink is not defined in the original answer;
// this minimal shape is an assumption.
public class BinaryHyperlink
{
    public string Text { get; set; }
    public string Url { get; set; }
}

public static class PdfHyperlinkReader
{
    public static List<BinaryHyperlink> GetAllHyperlinksFromPDFDocument(string pdfFilePath)
    {
        List<BinaryHyperlink> ret = new List<BinaryHyperlink>();
        PdfReader reader = new PdfReader(pdfFilePath);

        try
        {
            // Loop through each page (page numbers are 1-based)
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Get the current page
                PdfDictionary pageDictionary = reader.GetPageN(i);

                // Get all of the annotations for the current page
                PdfArray annots = pageDictionary.GetAsArray(PdfName.ANNOTS);

                // Make sure we have something
                if (annots == null || annots.Size == 0)
                    continue;

                // Loop through each annotation
                foreach (PdfObject a in annots.ArrayList)
                {
                    // Resolve the (possibly indirect) object to a dictionary
                    PdfDictionary annotation = PdfReader.GetPdfObject(a) as PdfDictionary;
                    if (annotation == null)
                        continue;

                    // Make sure this annotation is a link
                    if (!PdfName.LINK.Equals(annotation.Get(PdfName.SUBTYPE)))
                        continue;

                    // Make sure this annotation has a URI action
                    PdfDictionary action = annotation.GetAsDict(PdfName.A);
                    if (action == null || !PdfName.URI.Equals(action.Get(PdfName.S)))
                        continue;

                    // Get the link URL
                    PdfString link = action.GetAsString(PdfName.URI);
                    if (link == null)
                        continue;

                    // Get the link text: whatever text is rendered inside the annotation rectangle
                    string linkText = "";
                    PdfArray rectArray = annotation.GetAsArray(PdfName.RECT);
                    if (rectArray != null && rectArray.Size == 4)
                    {
                        iTextSharp.text.Rectangle rect = new iTextSharp.text.Rectangle(
                            rectArray.GetAsNumber(0).FloatValue,
                            rectArray.GetAsNumber(1).FloatValue,
                            rectArray.GetAsNumber(2).FloatValue,
                            rectArray.GetAsNumber(3).FloatValue);
                        RenderFilter[] renderFilter = { new RegionTextRenderFilter(rect) };
                        ITextExtractionStrategy strategy = new FilteredTextRenderListener(
                            new LocationTextExtractionStrategy(), renderFilter);
                        linkText = PdfTextExtractor.GetTextFromPage(reader, i, strategy).Trim();
                    }

                    ret.Add(new BinaryHyperlink { Text = linkText, Url = link.ToString() });
                }
            }
        }
        finally
        {
            reader.Close();
        }

        return ret;
    }
}
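
One detail the code above takes for granted is the order of the four numbers in the annotation's /Rect entry. The PDF specification allows a rectangle to be given by any two diagonally opposite corners, so the first pair is not guaranteed to be the lower-left corner. The following is a minimal sketch of normalizing those numbers before building a region from them. `LinkBox` and `FromRectArray` are hypothetical names introduced here for illustration; they are not part of iTextSharp, and whether iTextSharp's region filter tolerates swapped corners is not something the answer states.

```csharp
using System;

// Hypothetical helper type for illustration; not part of iTextSharp.
public readonly struct LinkBox
{
    public float Left { get; }
    public float Bottom { get; }
    public float Right { get; }
    public float Top { get; }

    public LinkBox(float left, float bottom, float right, float top)
    {
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }

    // Builds a box from the four numbers of a /Rect entry, which may
    // name any two diagonally opposite corners in either order.
    public static LinkBox FromRectArray(float[] rect)
    {
        if (rect == null || rect.Length != 4)
            throw new ArgumentException("A /Rect array must hold exactly four numbers.", nameof(rect));

        return new LinkBox(
            Math.Min(rect[0], rect[2]),
            Math.Min(rect[1], rect[3]),
            Math.Max(rect[0], rect[2]),
            Math.Max(rect[1], rect[3]));
    }
}
```

The resulting Left, Bottom, Right and Top values could then be passed, in that order, to the iTextSharp.text.Rectangle constructor instead of the raw array entries.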

score 1 · Accepted Answer

Unfortunately, I don't think you can do this, at least not without a lot of guesswork. In HTML it is easy because a hyperlink and its text are stored together:

<a href="http://www.example.com/">Click here</a>

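In a PDF, however, these two entities are not stored with any kind of relationship to each other. What we think of as a "hyperlink" in a PDF is technically a PDF annotation that just happens to sit on top of some text. You can see this by opening a PDF in an editing program such as Adobe Acrobat Pro. You can change the text, but the "clickable" area does not change. You can also move and resize the "clickable" area and place it anywhere in the document.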

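When creating a PDF, iText/iTextSharp abstracts this away so that you don't have to think about it. You can create a "hyperlink" with clickable text, but when the PDF is generated the library ends up writing the text as normal text, computing the coordinates of its rectangle, and then placing an annotation over that rectangle.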

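I did say that you could try to guess, and that may or may not work for you. To do so, you need to get the rectangle used for the annotation and then find the text that is located at those same coordinates. Because of padding, however, it won't be an exact match. If you absolutely must get the text under a hyperlink, this is the only way I know of. Good luck!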
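
The guessing step described above can be sketched without any PDF library, assuming the text has already been extracted as chunks with bounding boxes. This is one possible heuristic, not the answerer's method: it grows the annotation rectangle by a padding tolerance and keeps every chunk whose centre falls inside the grown rectangle. `TextChunk`, `AnchorTextGuesser`, `Guess` and the `padding` parameter are all hypothetical names introduced for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical text chunk with its bounding box in page coordinates;
// not an iTextSharp type.
public sealed class TextChunk
{
    public string Text { get; }
    public float Left { get; }
    public float Bottom { get; }
    public float Right { get; }
    public float Top { get; }

    public TextChunk(string text, float left, float bottom, float right, float top)
    {
        Text = text;
        Left = left;
        Bottom = bottom;
        Right = right;
        Top = top;
    }
}

public static class AnchorTextGuesser
{
    // Joins, in the order given, the text of every chunk whose centre lies
    // inside the link rectangle after it is grown by `padding` on each side.
    public static string Guess(
        IEnumerable<TextChunk> chunks,
        float left, float bottom, float right, float top,
        float padding)
    {
        var hits = chunks
            .Where(c =>
            {
                float cx = (c.Left + c.Right) / 2f;
                float cy = (c.Bottom + c.Top) / 2f;
                return cx >= left - padding && cx <= right + padding
                    && cy >= bottom - padding && cy <= top + padding;
            })
            .Select(c => c.Text);

        return string.Join(" ", hits);
    }
}
```

Testing the chunk centre rather than requiring full containment is a deliberate choice here: a chunk that sticks out slightly past the annotation still counts, while a neighbouring word that only touches the edge does not. A suitable padding value depends on the document and has to be found by experiment.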
