text - Extracting Arabic text with iTextSharp returns only numbers?

Question

I am trying to extract Arabic text from a PDF file, but only the numbers are extracted. The result looks like this:

: 7234569 1439/08/07 : : 1 2375173941 14 08 6 39266 1050672243 2280 30 400 24 415 24 15 720 30 402 30 499 14 07 1 610117038085 0 1069508677 0 :

My code:

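using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;
public static string GetTextFromAllPages(string pdfPath) {
    PdfReader reader = new PdfReader(pdfPath);
    StringBuilder result = new StringBuilder();
    // Page numbers are 1-based; a fresh strategy is used for each page.
    for (int i = 1; i <= reader.NumberOfPages; i++)
        result.AppendLine(PdfTextExtractor.GetTextFromPage(reader, i, new LocationTextExtractionStrategy()));
    reader.Close();
    return result.ToString();
}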

Any help, please?

score 0 · Accepted Answer

The embedded fonts for the Arabic glyphs in your PDF contain this ToUnicode CMap:

/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<0000> <FFFF>
endcodespacerange
endcmap
CMapName currentdict /CMap defineresource pop
end
end
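If you want to look at such a CMap yourself, the following is a rough sketch against the iTextSharp 5.x low-level API. The class name `ToUnicodeDump` is made up for illustration, and the sketch assumes the fonts are listed in the page's own resource dictionary; fonts used only inside form XObjects would not be found this way.

```csharp
using System;
using System.Text;
using iTextSharp.text.pdf;

// Hypothetical helper, not part of iTextSharp.
public static class ToUnicodeDump
{
    // Prints the ToUnicode CMap (if any) of each font in the page resources.
    public static void Dump(string pdfPath, int pageNumber)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            PdfDictionary page = reader.GetPageN(pageNumber);
            PdfDictionary resources = page.GetAsDict(PdfName.RESOURCES);
            PdfDictionary fonts = resources == null ? null : resources.GetAsDict(PdfName.FONT);
            if (fonts == null)
            {
                Console.WriteLine("No font resources on this page.");
                return;
            }

            foreach (PdfName fontName in fonts.Keys)
            {
                PdfDictionary font = fonts.GetAsDict(fontName);
                PRStream toUnicode = font == null
                    ? null
                    : font.GetAsStream(PdfName.TOUNICODE) as PRStream;

                Console.WriteLine("Font " + fontName + ":");
                if (toUnicode == null)
                {
                    Console.WriteLine("  (no ToUnicode CMap)");
                    continue;
                }

                // GetStreamBytes applies the stream filters (e.g. FlateDecode).
                byte[] bytes = PdfReader.GetStreamBytes(toUnicode);
                Console.WriteLine(Encoding.ASCII.GetString(bytes));
            }
        }
        finally
        {
            reader.Close();
        }
    }
}
```

Calling `ToUnicodeDump.Dump(pdfPath, 1)` on a file like yours should print a CMap similar to the one above for each Arabic font.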

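According to ISO 32000-1, section 9.10.3 ToUnicode CMaps:

It shall use the beginbfchar, endbfchar, beginbfrange, and endbfrange operators to define the mapping from character codes to Unicode character sequences expressed in UTF-16BE encoding.

Unfortunately, your CMap does not use these operators at all and therefore does not define any mapping to Unicode.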
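To make the difference concrete, here is a minimal sketch that counts the entries a CMap announces in its beginbfchar and beginbfrange sections. It is a simplified text scan, not a full CMap parser, and the class name `CMapCheck` as well as the two mappings in the `usable` sample are invented for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: a simplified scan, not a complete CMap parser.
public static class CMapCheck
{
    // Sums the counts announced by "<n> beginbfchar" and "<n> beginbfrange".
    public static int CountUnicodeMappings(string cmapText)
    {
        int total = 0;
        foreach (Match m in Regex.Matches(cmapText, @"(\d+)\s+beginbf(?:char|range)\b"))
            total += int.Parse(m.Groups[1].Value);
        return total;
    }

    public static void Main()
    {
        // The relevant part of the CMap from the question: a code space range only.
        string fromQuestion = "1 begincodespacerange\n<0000> <FFFF>\nendcodespacerange\n";

        // The same CMap with two made-up bfchar entries added.
        string usable = fromQuestion + "2 beginbfchar\n<0003> <0020>\n<0024> <0627>\nendbfchar\n";

        Console.WriteLine(CountUnicodeMappings(fromQuestion)); // 0
        Console.WriteLine(CountUnicodeMappings(usable));       // 2
    }
}
```

A count of zero, as for your file, means the ToUnicode CMap is present but carries no information an extractor could use.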

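Furthermore, the font has an encoding of Identity-H and its descendant CIDFont has the ROS Adobe-Identity-0. This means that the character code, CID, and GID values are equal for a given character, but it does not imply any mapping of them to Unicode.

Thus, the font lacks the information required for text extraction as described in ISO 32000-1, section 9.10.2 "Mapping Character Codes to Unicode Values".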
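As an illustration of what an extractor is left with, the sketch below splits a string operand into 2-byte codes, as Identity-H prescribes, and looks them up in a ToUnicode table. The class name `IdentityHDecode` and the byte values are hypothetical; with an empty table, which is effectively what your CMap provides, nothing can be resolved.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper illustrating the lookup, not iTextSharp code.
public static class IdentityHDecode
{
    // Splits a string operand into 2-byte big-endian character codes.
    // With Identity-H these codes equal the CIDs (and here the GIDs).
    public static List<int> ToCodes(byte[] operand)
    {
        var codes = new List<int>();
        for (int i = 0; i + 1 < operand.Length; i += 2)
            codes.Add((operand[i] << 8) | operand[i + 1]);
        return codes;
    }

    // Maps codes to text via a ToUnicode table; returns null if any code is unmapped.
    public static string ToText(IEnumerable<int> codes, IDictionary<int, string> toUnicode)
    {
        var sb = new StringBuilder();
        foreach (int code in codes)
        {
            string s;
            if (!toUnicode.TryGetValue(code, out s))
                return null;
            sb.Append(s);
        }
        return sb.ToString();
    }

    public static void Main()
    {
        // Made-up operand bytes: glyph IDs 0x0024 and 0x031A.
        byte[] operand = { 0x00, 0x24, 0x03, 0x1A };
        List<int> codes = ToCodes(operand);

        // An empty table stands in for a CMap without bfchar/bfrange entries.
        var empty = new Dictionary<int, string>();
        Console.WriteLine(ToText(codes, empty) == null); // True
    }
}
```

The codes themselves are only glyph indices into the embedded font, so without a table there is no reliable way to turn them into Arabic characters.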

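(In such a situation a text extractor can only guess, and such guessing usually works only for the special kinds of documents the extractor has been optimized for. You may want to try to enhance iText so that it guesses correctly in your case, but this requires you to study the PDF specification, the iText text extraction code, and your sample file in detail.)

By the way, a good first test of whether text extraction is feasible at all is to open the PDF in Adobe Reader and copy and paste the text in question into an editor or word processor. If that does not work (and in the case at hand it does not), the file most likely contains incomplete or misleading information for text extraction (or none at all).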
