python-3.x - How to parse tables from a PDF in a non-English language

Question

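I am using Camelot and tabula to parse PDF files that contain Cyrillic characters. In the output CSV file, however, I get garbled characters instead of the Russian text.

What can help me parse PDF tables in a non-English language?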

import camelot
file = 'file-name.pdf'
tables = camelot.read_pdf(file, pages = "1-end", encoding='utf-8')

Output: 0055529-1295-06-UT. Р“Р§Р§45

score 0 · Accepted Answer

So, basically, Camelot handles Cyrillic just fine.

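# Install first, from a shell: pip install "camelot-py[cv]"
import pandas as pd
import camelot

file = 'file-name.pdf'
# Parse only the pages that contain the tables (pages 4 and 5 here).
tables = camelot.read_pdf(file, pages="4, 5", encoding='utf-8')
# Each parsed table exposes its contents as a pandas DataFrame.
df_p4 = tables[0].df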

The output will be quite raw and will need cleaning, but the characters will not be corrupted, which I consider a good result.
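One possible explanation for the garbled text in the question, offered as an assumption rather than a confirmed diagnosis: the fragment looks like UTF-8 bytes that were later decoded as Windows-1251 (cp1251), which can happen when a viewer such as a spreadsheet program opens a UTF-8 CSV with a legacy code page. The sketch below reverses that decoding on the fragment from the question and then writes a CSV with a byte order mark, which some viewers use to detect UTF-8. The helper names `undo_utf8_read_as_cp1251` and `write_csv_with_bom` and the output path are hypothetical, and the row written is only an illustration.

```python
import csv
import os
import tempfile

# The garbled fragment copied from the question's output.
garbled = "Р“Р§Р§45"


def undo_utf8_read_as_cp1251(text):
    """Hypothetical helper: reverse mojibake caused by reading UTF-8 bytes as cp1251."""
    return text.encode("cp1251").decode("utf-8")


def write_csv_with_bom(rows, path):
    """Hypothetical helper: write rows as UTF-8 with a byte order mark (utf-8-sig)."""
    with open(path, "w", newline="", encoding="utf-8-sig") as f:
        csv.writer(f).writerows(rows)


repaired = undo_utf8_read_as_cp1251(garbled)
print(repaired)  # prints ГЧЧ45

# Illustrative row only; a real table would come from Camelot's DataFrame.
out_path = os.path.join(tempfile.mkdtemp(), "table.csv")
write_csv_with_bom([["0055529-1295-06-UT", repaired]], out_path)
```

If the fragment decodes cleanly like this, the extraction itself was probably fine and only the way the CSV was opened went wrong. Since `tables[0].df` is a pandas DataFrame, the same idea should apply when saving it, for example `df_p4.to_csv("page4.csv", encoding="utf-8-sig")`.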
