python-3.x - How to strip (cid:) when extracting text from a PDF with Camelot

Question

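I am using Camelot to extract text from a PDF. The PDF also contains Chinese characters, for which Camelot prints CIDs, for example (cid:3634).

I want to remove those CIDs, because the Chinese characters don't matter to me.

I tried this: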

>>> tables = camelot.read_pdf('D:/iolo/1.  Hangcha/1.  FORKLIFTS ELECTRIC/2.  NK15E - 3 WHEEL - NEW-(2014)/copy.pdf',pages='12',strip_text='(cid:[0-9])')

But this only removes the (cid:) wrapper, not the number inside it.

Please see the sample output image here. Please help.

score 0 · Accepted Answer

Currently, Camelot's strip_text parameter does not support regular expressions (see the official repository).
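To see why the digits survive, it may help to model strip_text as a set of individual characters to delete, which is how Camelot's documentation describes it. The sketch below is not Camelot's code: `strip_chars` is a hypothetical stand-in written only to illustrate that behaviour on the string from the question.

```python
def strip_chars(text, strip):
    """Hypothetical stand-in for strip_text: delete every character listed in `strip`.

    Assumption: `strip` is treated as a set of characters, not as a pattern.
    """
    unwanted = set(strip)
    return "".join(ch for ch in text if ch not in unwanted)


# The "pattern" from the question is read as the characters ( c i d : [ 0 - 9 ] )
question_value = "(cid:[0-9])"

# The wrapper characters are deleted, but 3, 6 and 4 are not in the set.
leftover = strip_chars("(cid:3634)", question_value)

# Side effect: the literal characters 0 and 9 are deleted everywhere else too.
damaged = strip_chars("NK15E (cid:3634) 1500", question_value)
```

Under this assumption the digits are left behind and unrelated zeros and nines are lost, so a pattern cannot be expressed through this parameter.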

Instead, you can use the pandas replace method:

for table in tables:
    # Raw string keeps the regex escapes intact; removes every "(cid:N)" token.
    table.df.replace(to_replace=r'\(cid:[0-9]+\)', value='', inplace=True, regex=True)
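The same regular expression can be tried without a PDF at hand. This is a minimal sketch using only the standard `re` module; `strip_cid` and the sample `rows` are hypothetical names and data made up for illustration, standing in for extracted table cells.

```python
import re

# Matches one placeholder such as "(cid:3634)"; "+" covers IDs of any length.
CID_PATTERN = re.compile(r"\(cid:[0-9]+\)")


def strip_cid(text):
    """Remove every (cid:N) placeholder from a single cell value."""
    return CID_PATTERN.sub("", text)


# Hypothetical rows standing in for cells extracted from a table.
rows = [
    ["Model", "(cid:3634)(cid:120)NK15E"],
    ["Capacity", "1500 kg (cid:7)"],
]

# Clean each cell; surrounding text and ordinary digits are left untouched.
cleaned = [[strip_cid(cell) for cell in row] for row in rows]
```

Because the parentheses are escaped and the digits are matched as a run, whole placeholders are removed while values such as 1500 are preserved. Whitespace next to a removed placeholder stays in place, so a cell may need a further `.strip()`.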
