I'm trying to read a PDF file. The callbacks below do fire and print their messages, but I can't get any actual content out of the PDF.

    let pdfBundlePath = Bundle.main.path(forResource: "sample", ofType: "pdf")
    let pdfURL = URL(fileURLWithPath: pdfBundlePath!)
    let pdf = CGPDFDocument(pdfURL as CFURL)

    let operatorTableRef = CGPDFOperatorTableCreate()

    CGPDFOperatorTableSetCallback(operatorTableRef!, "BT") { (scanner, info) in
        print("Begin text object")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef!, "ET") { (scanner, info) in
        print("End text object")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef!, "Tf") { (scanner, info) in
        print("Select font")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef!, "Tj") { (scanner, info) in
        print("Show text")
    }
    CGPDFOperatorTableSetCallback(operatorTableRef!, "TJ") { (scanner, info) in
        print("Show text, allowing individual glyph positioning")
    }

    // CGPDFDocument pages are numbered starting at 1.
    let page = pdf!.page(at: 1)
    let stream = CGPDFContentStreamCreateWithPage(page!)
    let scanner = CGPDFScannerCreate(stream, operatorTableRef, nil)
    CGPDFScannerScan(scanner)
    CGPDFScannerRelease(scanner)
    CGPDFContentStreamRelease(stream)

Output:

Begin text object
Select font
Show text, allowing individual glyph positioning
End text object

// The same four lines repeat at least 10 more times.

But I'm not sure how to get the actual strings out of this. Any suggestions would be appreciated.

1 Answer

I have a PDF containing the text "hello, world" (created by exporting from TextEdit to PDF).

This callback function

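CGPDFOperatorTableSetCallback(operatorTableRef!, "TJ") { (scanner, info) in
    print("Show text, allowing individual glyph positioning")
    // TJ takes one operand: an array mixing strings and numeric position adjustments.
    var array: CGPDFArrayRef?
    let popped = CGPDFScannerPopArray(scanner, &array)
    print("TJ", popped)
    guard popped, let operands = array else { return }
    var j = 0  // counts the string elements only
    for i in 0..<CGPDFArrayGetCount(operands) {
        var str: CGPDFStringRef?
        // CGPDFArrayGetString returns false for the numeric elements, so they are skipped.
        if CGPDFArrayGetString(operands, i, &str), let str = str,
           let bytes = CGPDFStringGetBytePtr(str) {
            // Reads the raw bytes as a C string; adequate for simple Latin text only.
            let string = String(cString: bytes)
            print(string, i, j)
            j += 1
        }
    }
}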

prints the following for me:

Show text, allowing individual glyph positioning
TJ true
h 0 0
e 2 1
l 4 2
l 6 3
o 8 4
, 10 5
  12 6
w 14 7
o 16 8
rl 18 9
d 20 10

I think this shows that, at least for Latin characters, getting the strings is possible :-).
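The first index in the output above jumps 0, 2, 4, ... because a TJ array alternates strings with numbers: each number is a position adjustment in thousandths of a text-space unit, and a negative value pushes the next glyph to the right. If you want one joined string instead of separate pieces, something along these lines might work. Treat it as a sketch: `TJElement`, `joinTJ`, `tjElements(from:)` and the `-200` threshold are names and values I made up for illustration, and the right threshold depends on the font.

```swift
import Foundation
#if canImport(CoreGraphics)
import CoreGraphics
#endif

/// One element of a TJ operand array (hypothetical model type).
enum TJElement {
    case text(String)
    case adjustment(Double)  // thousandths of a text-space unit
}

/// Joins the text runs and inserts a space wherever an adjustment
/// opens a wide gap. The default threshold is a guess, not a standard value.
func joinTJ(_ elements: [TJElement], gapThreshold: Double = -200) -> String {
    var result = ""
    for element in elements {
        switch element {
        case .text(let run):
            result += run
        case .adjustment(let amount):
            if amount <= gapThreshold && !result.hasSuffix(" ") {
                result += " "
            }
        }
    }
    return result
}

#if canImport(CoreGraphics)
/// Converts the array popped in a TJ callback into TJElement values.
func tjElements(from array: CGPDFArrayRef) -> [TJElement] {
    var elements: [TJElement] = []
    for i in 0..<CGPDFArrayGetCount(array) {
        var pdfString: CGPDFStringRef?
        var number: CGPDFReal = 0
        if CGPDFArrayGetString(array, i, &pdfString), let found = pdfString,
           let text = CGPDFStringCopyTextString(found) {
            elements.append(.text(text as String))
        } else if CGPDFArrayGetNumber(array, i, &number) {
            elements.append(.adjustment(Double(number)))
        }
    }
    return elements
}
#endif
```

Inside the TJ callback you would then call `joinTJ(tjElements(from: array))` on the popped array instead of printing each piece separately.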

For the Tj operator, the callback can be simpler:

CGPDFOperatorTableSetCallback(operatorTableRef!, "Tj") { (scanner, info) in
    print("Show text")
    // Tj takes a single string operand.
    var text: CGPDFStringRef?
    if CGPDFScannerPopString(scanner, &text), let text = text,
       let bytes = CGPDFStringGetBytePtr(text) {
        // Reads the raw bytes as a C string; adequate for simple Latin text only.
        print(String(cString: bytes))
    }
}
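Printing is fine for experiments, but the callbacks are C function pointers and cannot capture Swift variables, so to collect the text you have to pass your own object through the `info` pointer of `CGPDFScannerCreate`. Here is a minimal sketch of that idea, assuming a helper class `TextCollector` and a function `collectTjText(from:)` that I invented for this example. It uses `CGPDFStringCopyTextString` rather than the raw byte pointer, which still does not apply the font's encoding.

```swift
import Foundation
#if canImport(CoreGraphics)
import CoreGraphics
#endif

/// Hypothetical accumulator handed to the callbacks through `info`.
final class TextCollector {
    var pieces: [String] = []
    var text: String { pieces.joined() }
}

#if canImport(CoreGraphics)
/// Scans one page and returns the strings shown with the Tj operator.
func collectTjText(from page: CGPDFPage) -> String {
    let collector = TextCollector()
    guard let table = CGPDFOperatorTableCreate() else { return "" }

    CGPDFOperatorTableSetCallback(table, "Tj") { scanner, info in
        // Recover the collector from the raw pointer given to CGPDFScannerCreate.
        guard let info = info else { return }
        let sink = Unmanaged<TextCollector>.fromOpaque(info).takeUnretainedValue()
        var operand: CGPDFStringRef?
        if CGPDFScannerPopString(scanner, &operand), let pdfString = operand,
           let text = CGPDFStringCopyTextString(pdfString) {
            sink.pieces.append(text as String)
        }
    }

    let stream = CGPDFContentStreamCreateWithPage(page)
    // The collector is passed unretained; it stays alive until the function returns.
    let info = Unmanaged.passUnretained(collector).toOpaque()
    let scanner = CGPDFScannerCreate(stream, table, info)
    CGPDFScannerScan(scanner)
    CGPDFScannerRelease(scanner)
    CGPDFContentStreamRelease(stream)
    CGPDFOperatorTableRelease(table)
    return collector.text
}
#endif
```

The same `info` trick applies to the TJ callback; only the operand handling differs.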

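Warning! To decode all characters correctly you have to use the font information, but that is another story. For Latin characters this solution should work as is.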

To be able to "extract" all strings, you have to implement every text-showing operator (Tj, TJ, ' and ").
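As far as I can tell from the PDF specification, the text-showing operators are Tj, TJ, ' and ". Three of them end with a string operand, so one callback can serve all three, while TJ needs the array-popping callback shown earlier. A rough sketch, where `textShowingOperators` and `registerStringShowingOperators(on:)` are names I made up:

```swift
import Foundation
#if canImport(CoreGraphics)
import CoreGraphics
#endif

/// The PDF text-showing operators (listed here for reference).
let textShowingOperators = ["Tj", "TJ", "'", "\""]

#if canImport(CoreGraphics)
/// Registers one shared callback for the operators whose last operand is a string.
func registerStringShowingOperators(on table: CGPDFOperatorTableRef) {
    let showString: CGPDFOperatorCallback = { scanner, _ in
        // The string is the last operand, so it is the first one popped.
        // For the " operator, two spacing numbers sit below it and are ignored here.
        var operand: CGPDFStringRef?
        if CGPDFScannerPopString(scanner, &operand), let pdfString = operand,
           let text = CGPDFStringCopyTextString(pdfString) {
            print(text as String)
        }
    }
    CGPDFOperatorTableSetCallback(table, "Tj", showString)
    CGPDFOperatorTableSetCallback(table, "'", showString)
    CGPDFOperatorTableSetCallback(table, "\"", showString)
}
#endif
```

Note that ' and " also move to the next line before showing the text, so you may want to emit a line break for them.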

Update: since PDFKit is now available on both Apple platforms (from iOS 11), I suggest using it for text extraction. The process is very simple.
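With PDFKit it could look roughly like this. The function names `extractText(fromPDFAt:)` and `extractFirstPageText(fromPDFAt:)` are just placeholders of mine; the `string` properties on `PDFDocument` and `PDFPage` do the actual work.

```swift
import Foundation
#if canImport(PDFKit)
import PDFKit

/// Returns the text of the whole document, or nil if the file cannot be opened.
func extractText(fromPDFAt url: URL) -> String? {
    guard let document = PDFDocument(url: url) else { return nil }
    return document.string
}

/// Returns the text of the first page only.
func extractFirstPageText(fromPDFAt url: URL) -> String? {
    // PDFKit page indexes start at 0, unlike CGPDFDocument.page(at:), which starts at 1.
    return PDFDocument(url: url)?.page(at: 0)?.string
}
#endif
```

For the bundled file from the question you would pass `Bundle.main.url(forResource: "sample", withExtension: "pdf")` after unwrapping it.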

Answered 2017-05-31T10:41:36.440