c# - How do I get the text from sticky notes using iTextSharp?

Question

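I'm trying to extract all the text from a PDF using iTextSharp. Currently I can only get the actual text on the page, not the text contained in user comments, which Adobe calls "sticky notes". Is there a way to do this? Here is my code so far, but I only get empty strings: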

    PdfReader pdfRead = new PdfReader(pdfFilePath);
    AcroFields form = pdfRead.AcroFields;            

    string txt = "";
    for (int page = 1; page <= pdfRead.NumberOfPages; ++page)
    {
           PdfDictionary pagedic = pdfRead.GetPageN(page);
           PdfArray annotarray = (PdfArray)PdfReader.GetPdfObject(pagedic.Get(PdfName.ANNOTS));

           if (annotarray == null || annotarray.Size == 0)
                 continue;

           foreach (PdfObject A in annotarray.ArrayList)
           {
                 PdfDictionary AnnotationDictionary = (PdfDictionary)PdfReader.GetPdfObject(A);

                 txt += AnnotationDictionary.GetAsString(PdfName.NOTE);
                 txt += "\n";
           }
     }

score 3 · Accepted Answer

I don't know C#, but you can find the corresponding part here (the file used in this example is pages.pdf). The output of this example is:

Annotation 1
/Contents: This is a post-it annotation
/Subtype: /Text
/Rect: [36, 768, 56, 788]
/T: Example
Annotation 2
/C: [0, 0, 1]
/Border: [0, 0, 0]
/A: Dictionary
/Subtype: /Link
/Rect: [66.67, 785.52, 98, 796.62]

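The first annotation is a sticky note annotation (a text annotation, in ISO-32000-1 terms). The keys you are looking for are not PdfName.NOTE, but PdfName.T for the title and PdfName.CONTENTS for the content.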

score 2 · Accepted Answer

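    // Only text ("sticky note") annotations are of interest here.
    if (PdfName.TEXT.Equals(AnnotationDictionary.GetAsName(PdfName.SUBTYPE)))
    {
        // GetAsString returns null when a key is missing, so check before converting.
        PdfString title = AnnotationDictionary.GetAsString(PdfName.T);
        PdfString content = AnnotationDictionary.GetAsString(PdfName.CONTENTS);
        string Title = title != null ? title.ToUnicodeString() : "";
        string Content = content != null ? content.ToUnicodeString() : "";
    }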
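
Combining the page loop from the question with the /T and /Contents keys, a complete extraction might look roughly like the sketch below. It assumes iTextSharp 5.x. The class and method names (`StickyNoteExtractor`, `ExtractStickyNotes`) and the "title: contents" output format are made up for illustration and are not part of the library.

```csharp
using System.Text;
using iTextSharp.text.pdf;

public static class StickyNoteExtractor
{
    // Hypothetical helper: returns one "title: contents" line per sticky note.
    public static string ExtractStickyNotes(string pdfFilePath)
    {
        StringBuilder sb = new StringBuilder();
        PdfReader reader = new PdfReader(pdfFilePath);
        try
        {
            for (int page = 1; page <= reader.NumberOfPages; ++page)
            {
                PdfDictionary pageDict = reader.GetPageN(page);

                // GetAsArray resolves indirect references and returns null
                // when the page has no /Annots entry.
                PdfArray annots = pageDict.GetAsArray(PdfName.ANNOTS);
                if (annots == null)
                    continue;

                for (int i = 0; i < annots.Size; ++i)
                {
                    PdfDictionary annot = annots.GetAsDict(i);
                    if (annot == null)
                        continue;

                    // Skip links, highlights, and anything else that is not /Text.
                    if (!PdfName.TEXT.Equals(annot.GetAsName(PdfName.SUBTYPE)))
                        continue;

                    // Either key may be absent, so guard against null.
                    PdfString title = annot.GetAsString(PdfName.T);
                    PdfString contents = annot.GetAsString(PdfName.CONTENTS);

                    sb.Append(title != null ? title.ToUnicodeString() : "")
                      .Append(": ")
                      .AppendLine(contents != null ? contents.ToUnicodeString() : "");
                }
            }
        }
        finally
        {
            reader.Close();
        }
        return sb.ToString();
    }
}
```

Calling `StickyNoteExtractor.ExtractStickyNotes(pdfFilePath)` would then replace the loop in the question. Filtering on /Subtype keeps link annotations, like the second one in the sample output, out of the result.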
