c# - PdfDocument.GetTextWithFormatting() does not take all pages

Question

I am trying to open a large PDF file, but with this code

using BitMiracle.Docotic.Pdf;

PdfDocument pdf = new PdfDocument("document.pdf");
string document = pdf.GetTextWithFormatting();

the string document contains only the first 87 pages (of 174). Why does it take only the first half of the file?

Edit: This is an evaluation mode limitation of the library. Are there any alternatives?

score 2 · Accepted Answer

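The behavior you observe is due to evaluation mode restrictions. When used in trial mode, the library imposes the following restrictions:

Documents produced with the library contain an evaluation notice printed across each page.
For all existing documents, the library reads only half of the pages.

To evaluate the library without the evaluation mode restrictions, you can get a free time-limited license on our site.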

score 0 · Accepted Answer

You can try reading the text of each page:

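using System;
using System.Text;
using BitMiracle.Docotic.Pdf;

StringBuilder sb = new StringBuilder();

// Extract plain text only and ignore text that is not visible on the page.
var options = new PdfTextExtractionOptions
{
    WithFormatting = false,
    SkipInvisibleText = true
};

using (PdfDocument pdf = new PdfDocument("document.pdf"))
{
    int pageIndex = 1;
    foreach (var page in pdf.Pages)
    {
        // Print one line per page so you can see how many pages were visited.
        Console.WriteLine("Page {0}", pageIndex++);
        sb.AppendLine(page.GetText(options));
    }
}

string allText = sb.ToString();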

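After doing this, you should see one line in the console for each page in the PDF.

It may be that the pages after 87 contain no text. For example, they might be images of scanned pages.

You can test this by trying to select, copy, and paste text from the pages after page 87 in a PDF viewer. If you can, then it is most likely a bug in the BitMiracle DLL.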
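The same check can be done in code rather than by hand. The following is only a rough sketch: the class `PageTextCheck` and the helper `FindPagesWithoutText` are hypothetical names introduced here, and it assumes `document.pdf` is the file from the question. It prints how many pages the library exposes and which of those pages yield no text. If the reported page count is already half of the real one, the evaluation mode limit is the likely cause, not scanned pages.

```csharp
using System;
using System.Collections.Generic;
using BitMiracle.Docotic.Pdf;

static class PageTextCheck
{
    // Hypothetical helper (not part of the library): returns the 1-based
    // numbers of pages whose extracted text is null, empty, or whitespace.
    public static List<int> FindPagesWithoutText(IReadOnlyList<string> pageTexts)
    {
        var emptyPages = new List<int>();
        for (int i = 0; i < pageTexts.Count; i++)
        {
            if (string.IsNullOrWhiteSpace(pageTexts[i]))
                emptyPages.Add(i + 1);
        }
        return emptyPages;
    }

    static void Main()
    {
        var pageTexts = new List<string>();

        // "document.pdf" is assumed to be the file from the question.
        using (PdfDocument pdf = new PdfDocument("document.pdf"))
        {
            // Compare this number with the page count shown by a PDF viewer.
            Console.WriteLine("Pages visible to the library: {0}", pdf.PageCount);

            foreach (PdfPage page in pdf.Pages)
                pageTexts.Add(page.GetText());
        }

        Console.WriteLine(
            "Pages without text: {0}",
            string.Join(", ", FindPagesWithoutText(pageTexts)));
    }
}
```

A page listed as having no text is a candidate for a scanned image. A page that is missing from the count altogether was never read by the library.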
