Questions tagged "python-camelot"

0 votes

1 answer

870 views

ghostscript - How do I install Ghostscript in AWS Lambda?

Error message: "Please make sure that Ghostscript is installed", "errorType": "RuntimeError"

I installed the dependencies with pip install -t,
but I still get the error.

How do I install the "ghostscript" dependency for my Python code?
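
A quick diagnostic can help narrow this down. As far as I know, pip install -t only vendors Python packages, while Ghostscript itself is a native program and shared library that has to be provided separately (for example through a Lambda layer; that suggestion is an assumption, not something stated in the question). The sketch below uses only the standard library to report whether a gs executable and a gs shared library can be found in the current environment. The function name ghostscript_status is made up for illustration.

```python
import ctypes.util
import shutil


def ghostscript_status():
    """Report where (if anywhere) Ghostscript can be found."""
    return {
        # Path to the `gs` command-line program, or None if not on PATH.
        "executable": shutil.which("gs"),
        # Name of the shared library as resolved by ctypes, or None.
        "library": ctypes.util.find_library("gs"),
    }


if __name__ == "__main__":
    for kind, location in ghostscript_status().items():
        print(f"{kind}: {location or 'not found'}")
```

Running this inside the Lambda handler (and logging the result) shows whether the runtime can see Ghostscript at all, which is a separate question from whether the Python wrapper is installed.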

ghostscript python-camelot

2019-11-15T12:34:14.020

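0 votes

1 answer

1065 views

python - Camelot PDF dimensions

Before posting this, I searched Stack Overflow extensively and could not find anything on Camelot page dimensions. There is this question, which suggests using table_regions, but that solves neither the OP's problem nor mine. Unfortunately, I can't comment to follow up with the OP and see whether they found a solution.

What I am trying to do:

I am using Camelot to identify tables (obviously). Sometimes, when I know which area of the page is likely to contain the table of interest, I want to search only in that area. This is easy to do with the table_regions kwarg of camelot.read_pdf(): I just need to provide a pair of coordinates for Camelot to search.

The problem is that I obtain these coordinates with PyMuPDF, so they are in PyMuPDF's coordinate system. I have worked out how to translate the coordinates, but I am missing one key piece of information from Camelot: the dimensions of the page. These values are easy to get in PyMuPDF (the Page class's .bound() method), and I need the Camelot equivalent. I can give a further explanation of the algebra here if anyone thinks there might be another option.

What I have tried so far:

I read the documentation. Because of this line in the docs, I wondered whether it might offer a way to get the dimensions: "There might be cases while using Lattice when smaller lines don't get detected. The size of the smallest line that gets detected is calculated by dividing the PDF page's dimensions with a scaling factor called line_scale. By default, its value is 15."

I am open to alternatives. Basically, I want either to check whether a region of the page contains a table (a region described in PyMuPDF's coordinate system, where a PDF page typically has dimensions (612, 792) and the origin is in the top-left corner, whereas Camelot's origin is in the bottom-left corner), or to check whether any table on the page lies within a given region, if that makes sense.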
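
The conversion itself only needs the page height. Below is a minimal sketch, assuming the PyMuPDF rectangle is given as (x0, y0, x1, y1) with a top-left origin, and that Camelot expects "x1,y1,x2,y2" strings where (x1, y1) is the left-top corner and (x2, y2) the right-bottom corner in bottom-left-origin PDF space, as its documentation describes. The helper name to_camelot_region is hypothetical, and the page height is taken from the PyMuPDF side rather than from Camelot.

```python
def to_camelot_region(rect, page_height):
    """Convert a top-left-origin rectangle to a Camelot region string.

    rect: (x0, y0, x1, y1) with y growing downward (PyMuPDF convention).
    page_height: height of the page in points, e.g. 792 for US Letter.
    """
    x0, y0, x1, y1 = rect
    # Flip the vertical axis: a point y from the top is page_height - y
    # from the bottom. The top edge (small y0) becomes the larger value.
    top = page_height - y0
    bottom = page_height - y1
    return f"{x0:g},{top:g},{x1:g},{bottom:g}"


# A box starting 100 points from the top, 200 points tall, on a 612 x 792 page.
region = to_camelot_region((50, 100, 400, 300), page_height=792)
print(region)  # 50,692,400,492
```

The resulting string could then be passed as table_regions=[region] to camelot.read_pdf(). Rotated pages or pages whose media box does not start at (0, 0) would need extra handling that this sketch leaves out.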

2019-12-03T19:19:35.197

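0 votes

0 answers

179 views

odoo-12 - How to add a file path in Camelot (Python)

In Python, camelot only accepts files in the current directory. If the file is placed in a temporary location, it is not accepted. Is there a workaround?

I get the following error:

How do I add the correct path? Please explain with an example; that would help me.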
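
One thing worth trying, sketched under the assumption that the failure comes from a relative path being resolved against the wrong working directory (the actual error is not shown above): build an absolute path first and pass that string to camelot.read_pdf(). The helper names resolve_pdf_path and read_tables are invented for illustration.

```python
from pathlib import Path
import tempfile


def resolve_pdf_path(path):
    """Return an absolute path string, failing early if the file is missing."""
    resolved = Path(path).expanduser().resolve()
    if not resolved.is_file():
        raise FileNotFoundError(f"No such PDF: {resolved}")
    return str(resolved)


def read_tables(path):
    """Read tables from the first page of a PDF located anywhere on disk."""
    import camelot  # imported lazily so the helper above works without it

    return camelot.read_pdf(resolve_pdf_path(path), pages="1")
```

A call such as read_tables(Path(tempfile.gettempdir()) / "report.pdf") would then work regardless of the process's current directory, provided the file exists and the process has permission to read it. The file name report.pdf is a placeholder.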

odoo-12 python-camelot

2019-12-30T21:56:21.103

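0 votes

0 answers

76 views

machine-learning - How do I retrieve the coordinates of a table using Tabula?

I am trying to remove tables from a PDF document (converted to an image). The idea is to identify the table coordinates and then use them to set the corresponding pixel values to 255 (white). I obtained the coordinates from Camelot using the .bbox option, but Camelot identifies non-tables as tables. I feel Tabula is better at extracting tables. Can you help me retrieve the table coordinates with Tabula?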

machine-learning computer-vision data-science tabula python-camelot

2019-12-31T06:23:08.910

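0 votes

1 answer

195 views

python-camelot - Difference between table_regions and table_areas

I have read and re-read the documentation, but I still don't understand the difference between table_regions and table_areas. To me the two parameters do the same thing, but the documentation specifies that table_regions works on an approximate region.

I think the documentation could be more specific about what "approximate region" means and how it differs from table_areas.

I hope someone can clearly explain the difference between these two options.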
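
As I read the Camelot documentation, table_areas gives the exact boundaries of a table, so the parser treats that box as the table, whereas table_regions only narrows down where Camelot should look, and table detection still runs inside the region. Both take a list of "x1,y1,x2,y2" strings (left-top and right-bottom corners in PDF coordinates). The sketch below makes the choice explicit. The names area_kwargs and read_both are hypothetical, and any coordinates passed in are up to the caller.

```python
def area_kwargs(box, exact):
    """Build the keyword argument for camelot.read_pdf().

    box: (x1, y1, x2, y2), left-top and right-bottom corners in PDF space.
    exact: True if the box is the table boundary, False if it is only an
           approximate region in which Camelot should search.
    """
    spec = ",".join(str(v) for v in box)
    key = "table_areas" if exact else "table_regions"
    return {key: [spec]}


def read_both(pdf_path, box):
    """Read the same box both ways so the results can be compared."""
    import camelot  # requires camelot-py and a real PDF file

    # Exact boundary: the box itself is parsed as the table.
    exact = camelot.read_pdf(pdf_path, **area_kwargs(box, exact=True))
    # Approximate region: tables are detected only inside the box.
    approx = camelot.read_pdf(pdf_path, **area_kwargs(box, exact=False))
    return exact, approx
```

Comparing the two results on the same page is a practical way to see the difference: with an exact area the output should cover the whole box, while with a region it covers only what the detector finds inside it.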

python-camelot

2020-01-20T15:53:07.323

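0 votes

1 answer

297 views

python - Multiprocessing in Python 3

I have been trying to create a multiprocessing pool for a series of tasks in Python 3. The tasks are: 1. read PDF files and capture the tables in them; 2. create a pickle file to store the table objects; 3. load the pickle files.

For testing, I ran the Python code on three PDF files in both sequential and parallel mode. The sequential run takes 200 seconds for the whole process and creates the pickle files in the working directory. The multiprocessing run, however, does not produce pickle files in the directory, although it finishes in 39 seconds.

The sequential code is as follows:

The output of the code is as follows:

Output of the sequential run. The multiprocessing code is as follows:

I would really appreciate your feedback. This is critical because a 20 MB PDF file can sometimes take a very long time to convert into a pickle file of the table objects stored in it. As a result, the process gets stuck on the first job (the 20 MB PDF) and cannot move on to the next job until the first one finishes.

Thanks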

python multithreading multiprocessing pickle python-camelot

2020-01-21T08:50:14.750

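0 votes

1 answer

418 views

loops - Saving individual CSV files containing tables with Camelot.py without overwriting

I am trying to write code that extracts tables from PDFs and saves them to CSV files in a loop.

In my folder I have about 250 PDF files, each containing one table that I want to extract and put into a CSV file. I am using Camelot.py to extract the tables, and the program works perfectly with a single file.

I want the program to extract the table from a PDF and then save a CSV file with the same file name as the PDF that contains the table. I tried to build code (see below) that loops through the PDF files, but I cannot save the output for each PDF file to a separate CSV file.

I don't know how to specify in the code that, inside the Camelot loop, the program should export each PDF file's table to a CSV file with the same name as the PDF file.

I hope someone can offer some advice on how to proceed from here. Thanks in advance.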
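
A sketch of one way to do this, assuming each PDF holds a single table on its first page and that the PDFs sit in one folder. The CSV name is derived from the PDF name, so each output file is distinct and nothing is overwritten. The names csv_path_for and convert_folder are hypothetical.

```python
from pathlib import Path


def csv_path_for(pdf_path, out_dir=None):
    """Return the CSV path that shares the PDF's base name."""
    pdf_path = Path(pdf_path)
    target_dir = Path(out_dir) if out_dir is not None else pdf_path.parent
    return target_dir / pdf_path.with_suffix(".csv").name


def convert_folder(folder, out_dir=None):
    """Export the first table of every PDF in `folder` to its own CSV."""
    import camelot  # requires camelot-py

    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        tables = camelot.read_pdf(str(pdf_path), pages="1")
        if tables.n == 0:
            print(f"No table found in {pdf_path.name}")
            continue
        # Table.to_csv writes one table to the given path.
        tables[0].to_csv(str(csv_path_for(pdf_path, out_dir)))
```

Calling convert_folder("pdfs") would write pdfs/a.csv next to pdfs/a.pdf, and so on for every file. If a PDF turned out to contain several tables, the table index could be appended to the file name in the same way.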

loops csv pdf python-camelot

2020-01-28T08:24:12.993

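0 votes

1 answer

1856 views

python - Cannot import camelot in Python 3.7 (Anaconda) on macOS Catalina

My environment specs

python --version

Python 3.7.6
anaconda --version

anaconda Command line client (version 1.7.2)
sw_vers

ProductName: Mac OS X

ProductVersion: 10.15.2

BuildVersion: 19C57

I installed camelot from conda-forge using the following command.

Below is the log I get when I try to import camelot for PDF parsing and text extraction.

I googled the above error and found this question. However, I could not resolve the problem, because I cannot work out what I need to do about libglib.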

python python-3.x macos anaconda python-camelot

2020-02-02T13:15:16.370

0 votes

0 answers

862 views

python - Same table is extracted twice from a pdf by Camelot-py

I am trying to extract tables from a multiple page PDF file using camelot-py v0.7.3.

So far it has been the best PDF reader tool for me. I just needed to read the PDF line by line and detect tables manually. I tried many other tools such as tabula, PyPDF2/4, pdfminer, etc. Some of them could not detect the text itself properly, and some of them disturbed the word order or the spacing between columns.

But camelot-py gave me the data in the format best suited to my application.

When extracting data from the PDF, camelot-py detects nearly all of the table data well, apart from a few errors:

It groups multiple tables together in the same 'TableList' element. But I am able to separate these grouped tables, so there is no need to worry here.
The last table from these grouped tables is repeated in a separate 'TableList' element. This repetition is my main concern.

The code used for the above process is as below:

Why is camelot-py repeating some tables? Is there any way to handle this repetition?

More info:

Input PDF file: I can't share the PDF files because they contain sensitive data, but here are some details that will give you a good idea of the structure. All pages contain only tables.

Page 1: Contains Table 1, which holds the customer's info, and Tables 2 to 4, which share the same structure

Page 2: Contains some rows from Table 4, and Tables 5 to 7 with the same structure as Table 2

Page 3: Tables 8 to 10 with the same structure as Table 2

Output CSV files:

foo-page-1-table-1: Contains Table 1

foo-page-1-table-2: Contains the last row (repeated) from Table 1, and Tables 2 to 4

foo-page-2-table-1: Contains Table 7 (repeated, with the first row missing)

foo-page-2-table-2: Contains some end rows from Table 4, and Tables 5 to 7

foo-page-3-table-1: Contains Table 10 (repeated in full)

foo-page-3-table-2: Contains Tables 8 to 10

python pdf-reader pdf-parsing python-camelot

2020-02-21T18:12:57.123

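0 votes

1 answer

1411 views

python - Camelot: table_area and table_regions not working as expected

For several days I have been trying to get Camelot to work on a specific area of a PDF page, and it keeps confusing me. I have looked at and tried the documentation's suggestions, some bug reports, and this SO question, to no avail. I could use some help.

I took an example from the documentation because it has more than one table: this one. I modified the original command to extract only one of the two tables, from: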

tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text=' .\n')

to:

tables = camelot.read_pdf('12s0324.pdf', flavor='stream', strip_text='\n', table_area=['33,297,386,65'], pages = '1')

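However:

I changed the regex because it was removing the spaces between words,
I used table_area rather than the documentation's table_areas, because the former triggered a detailed message while the latter raised an error (the error is explained here; the documentation still seems to be wrong),
I tried extracting both tables and checked each area with Camelot's plotting feature, as described in the documentation here, so the areas should be correct,
I also tried table_regions; at least it pulls out one table rather than two, but it is still rather inaccurate (see the comments below).

So here are the results of my trials on the PDF mentioned above:

First: table_area used on the PDF area '35,591,385,343' (top table)

Note that there are two tables, and the output includes unwanted text at both the top and the bottom, which should not be inside the area selected using plot().

Second: table_regions used on the same PDF area '35,591,385,343' (top table)

Clearly only one table, with the same problem of unwanted text appearing from outside the selected area.

Third: table_area used on the PDF area '33,297,386,65' (bottom table)

It picked up two tables, and apparently the first one is still the first. Same problem with unwanted text, but that is expected by now.

Fourth: table_regions used on the PDF area '33,297,386,65' (bottom table)

Better, but it picks up unwanted text as above.

I would really appreciate suggestions or pointers. Thanks in advance!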

python pdf python-camelot

2020-03-01T17:24:19.367
