c# - Extracting text line by line from a PDF using iTextSharp in C#

Question

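I need to run some analysis on data extracted from a PDF document.

Using iTextSharp, I call the PdfTextExtractor.GetTextFromPage method to extract the content of a PDF document, and it returns the text to me as one long line.

Is there a way to get the text line by line so that I can store the lines in an array? That would let me analyze the data one line at a time, which would be more flexible.

Below is the code I am using: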

    string urlFileName1 = "pdf_link";
    PdfReader reader = new PdfReader(urlFileName1);
    string text = string.Empty;
    for (int page = 1; page <= reader.NumberOfPages; page++)
    {
        text += PdfTextExtractor.GetTextFromPage(reader, page);
    }
    reader.Close();
    candidate3.Text = text.ToString();

score 11 · Accepted Answer

    public void ExtractTextFromPdf(string path)
    {
        using (PdfReader reader = new PdfReader(path))
        {
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                // Create a fresh strategy for every page: LocationTextExtractionStrategy
                // accumulates text chunks, so a reused instance would repeat earlier pages.
                ITextExtractionStrategy strategy = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();

                string page = PdfTextExtractor.GetTextFromPage(reader, i, strategy);

                // The strategy ends each detected line with '\n', so splitting yields the lines.
                string[] lines = page.Split('\n');
                foreach (string line in lines)
                {
                    MessageBox.Show(line);
                }
            }
        }
    }
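
To store the lines in an array, as the question asks, the text of each page can be split and collected. The following is a minimal sketch that assumes the page text has already been obtained from PdfTextExtractor.GetTextFromPage. The PdfLineHelper class and its SplitLines method are hypothetical names introduced here and are not part of iTextSharp. Stripping a trailing '\r' and dropping blank lines are design choices of this sketch that may not suit every document.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of iTextSharp.
public static class PdfLineHelper
{
    // Splits extracted page text into lines, removing a trailing '\r'
    // and skipping lines that are empty or contain only whitespace.
    public static string[] SplitLines(string pageText)
    {
        var result = new List<string>();
        if (string.IsNullOrEmpty(pageText))
        {
            return result.ToArray();
        }

        foreach (string raw in pageText.Split('\n'))
        {
            string line = raw.TrimEnd('\r');
            if (line.Trim().Length > 0)
            {
                result.Add(line);
            }
        }
        return result.ToArray();
    }
}
```

Inside the page loop above, calling PdfLineHelper.SplitLines(page) and adding the result to a List<string> with AddRange would gather the lines of every page in one collection.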

score 3 · Accepted Answer

None of the other code examples here worked for me, probably because of API changes in iText 7.

This minimal example works fine for me:

var pdfReader = new iText.Kernel.Pdf.PdfReader(fileName);
var pdfDocument = new iText.Kernel.Pdf.PdfDocument(pdfReader);
var contents = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(pdfDocument.GetFirstPage());

score 3 · Accepted Answer

I know this is an older post, but I spent a lot of time trying to figure this out, so I will share it for anyone who ends up googling this in the future:

using System;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

namespace PDFApp2
{
class Program
{
    static void Main(string[] args)
    {
        string filePath = @"Your said path\the file name.pdf";
        string outPath = @"the output said path\the text file name.txt";
        int pagesToScan = 2;

        try
        {
            PdfReader reader = new PdfReader(filePath);

            // Never read past the last page. Use reader.NumberOfPages here to scan all the pages in a PDF.
            int lastPage = Math.Min(pagesToScan, reader.NumberOfPages);

            // Open the output file once, in append mode, instead of reopening it for every line.
            using (System.IO.StreamWriter file = new System.IO.StreamWriter(outPath, true))
            {
                for (int page = 1; page <= lastPage; page++)
                {
                    // A fresh strategy per page keeps text from earlier pages out of the result.
                    ITextExtractionStrategy its = new iTextSharp.text.pdf.parser.LocationTextExtractionStrategy();

                    // GetTextFromPage returns an ordinary .NET string, so no encoding round trip is needed.
                    string strText = PdfTextExtractor.GetTextFromPage(reader, page, its);

                    // Create the string array and store the page line by line.
                    string[] lines = strText.Split('\n');
                    foreach (string line in lines)
                    {
                        file.WriteLine(line);
                    }
                }
            }

            reader.Close();
        }
        catch (Exception ex)
        {
            Console.Write(ex);
        }
    }
}
}

I have the program read a PDF from a set path and write the output to a text file, but you can adapt it to do whatever you need. This builds on Snziv Gupta's response.

score 1 · Accepted Answer

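LocationTextExtractionStrategy automatically inserts '\n' into the output text. However, it sometimes inserts '\n' where it should not. In that case you need to build a custom TextExtractionStrategy or RenderListener. The code that detects line breaks is essentially this method: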

    public virtual bool SameLine(ITextChunkLocation other)
    {
        return OrientationMagnitude == other.OrientationMagnitude &&
               DistPerpendicular == other.DistPerpendicular;
    }

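In some cases, '\n' should not be inserted when there is only a small difference between DistPerpendicular and other.DistPerpendicular, so you need to change the second comparison to Math.Abs(DistPerpendicular - other.DistPerpendicular) < 10

Alternatively, you can put this check in the RenderText method of your custom TextExtractionStrategy/RenderListener class.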
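
The tolerance idea can be illustrated without iTextSharp at all. The following is a minimal sketch that assumes each text chunk has been reduced to its text plus a perpendicular distance, the role DistPerpendicular plays above. The Chunk and LineGrouper types and the tolerance parameter are hypothetical names introduced here, not iTextSharp types, and the threshold of 10 is only the value suggested above, which likely needs tuning per document.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical stand-in for a text chunk's location; not an iTextSharp type.
public class Chunk
{
    public string Text;
    public int DistPerpendicular;

    public Chunk(string text, int distPerpendicular)
    {
        Text = text;
        DistPerpendicular = distPerpendicular;
    }

    // Two chunks count as the same line when their perpendicular distances
    // differ by less than the tolerance, instead of having to be exactly equal.
    public bool SameLine(Chunk other, int tolerance)
    {
        return Math.Abs(DistPerpendicular - other.DistPerpendicular) < tolerance;
    }
}

public static class LineGrouper
{
    // Joins chunks (assumed to be in reading order already) into lines, starting
    // a new line only when a chunk is not on the same line as the previous chunk.
    public static List<string> GroupIntoLines(IList<Chunk> chunks, int tolerance)
    {
        var lines = new List<string>();
        var current = new StringBuilder();
        Chunk previous = null;

        foreach (Chunk chunk in chunks)
        {
            if (previous != null && !chunk.SameLine(previous, tolerance))
            {
                lines.Add(current.ToString());
                current.Clear();
            }
            current.Append(chunk.Text);
            previous = chunk;
        }

        if (previous != null)
        {
            lines.Add(current.ToString());
        }
        return lines;
    }
}
```

With a tolerance of 10, chunks at distances 100 and 103 end up on one line, while a tolerance of 1 reproduces the exact-equality behaviour and splits them. Because each chunk is compared with the previous one, a slowly drifting baseline stays on one line, which may or may not be what a given document needs.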

score 0 · Accepted Answer

Use LocationTextExtractionStrategy instead of SimpleTextExtractionStrategy. The text extracted by LocationTextExtractionStrategy contains a newline character at the end of each line.

ITextExtractionStrategy Strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), renderFilter);
string pdftext = PdfTextExtractor.GetTextFromPage(reader,pageno, Strategy);
string[] words = pdftext.Split('\n');
return words;

score -2 · Accepted Answer

Try

 String page = PdfTextExtractor.getTextFromPage(reader, 2);
 String s1[]=page.split("\n");

answered 2013-05-09T12:52:14.127
