Questions tagged "pdfminer" - Stack Overflow Chinese site

0 votes

1 answer

15783 views

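python - Python PDFMiner - PDF to CSV

I want to be able to convert a PDF to a CSV file, and I have found several useful scripts, but as a Python newcomer I have one question:

Where do you specify the file paths for the PDF to read and the CSV to write to?

I am using Python 2.7.11 and PDFMiner 20140328.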
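One common way to supply both paths is on the command line. The sketch below is only an illustration and is not taken from any of the scripts mentioned above. It assumes Python 3 and the standard library, and the names `build_parser`, `write_rows`, `main`, and `extract_lines` are hypothetical. The PDF text extraction step itself (done with PDFMiner in those scripts) is deliberately left as a function passed in by the caller.

```python
import argparse
import csv


def build_parser():
    """Build a parser that takes the input PDF path and the output CSV path."""
    parser = argparse.ArgumentParser(description="Convert a PDF to CSV")
    parser.add_argument("pdf_path", help="path of the PDF file to read")
    parser.add_argument("csv_path", help="path of the CSV file to write")
    return parser


def write_rows(csv_path, lines):
    """Write each non-empty text line as one CSV row, split on whitespace."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        for line in lines:
            if line.strip():
                writer.writerow(line.split())


def main(argv, extract_lines):
    """Parse the two paths, extract text lines, and write them as CSV.

    extract_lines is an assumed callable: it receives the PDF path and
    returns an iterable of text lines (for example from a PDFMiner script).
    """
    args = build_parser().parse_args(argv)
    write_rows(args.csv_path, extract_lines(args.pdf_path))
    return args
```

With this layout, the paths appear in exactly one place: a call such as `main(["input.pdf", "output.csv"], my_extractor)`, or the equivalent arguments typed after the script name on the command line.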

2016-04-27T23:10:04.567

0 votes

2 answers

2486 views

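jquery - PDFQuery: getting the page number an element is on

This is my first time using PDFQuery to scrape a PDF.

What I need to do is get prices from a price list that spans several pages. I want to give PDFQuery a product code, and it should find the code and return the price next to it. The problem is getting the position of the text using the first example on the GitHub page, which clearly states "Note that we don't have to know where the name is on the page, or what page it is on". That is exactly the case with my price list, but all the other examples specify a page number (LTPage[pageid=1]), and I can't see where we get the page number from.

If I don't specify a page number, it returns all the text at the same position on every page.

Also, I added an exactText function, because the codes can be, for example, "92005", "92005C", "92005G", so :contains on its own doesn't help much.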
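To illustrate why substring matching is not enough for these codes, here is a minimal sketch in plain Python. It does not use PDFQuery's API and is not the exactText function from the question; `contains_match`, `exact_match`, and the sample `labels` list are hypothetical names and data made up for the illustration.

```python
def contains_match(texts, code):
    """Substring test, comparable in spirit to a :contains selector."""
    return [text for text in texts if code in text]


def exact_match(texts, code):
    """Keep only entries equal to the code once outer whitespace is removed."""
    return [text for text in texts if text.strip() == code]


# Hypothetical text fragments as they might come out of a price list.
labels = ["92005", "92005C ", "92005G", "Price 92005"]

substring_hits = contains_match(labels, "92005")  # matches all four entries
exact_hits = exact_match(labels, "92005")         # matches only "92005"
```

The substring test also returns "92005C" and "92005G", which is the ambiguity described above; the exact comparison keeps only the entry that equals the code.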

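I tried selecting the page the element is on, and I tried jQuery's .closest, but had no luck with either.

I checked the PDFMiner documentation and the PyQuery documentation, but I didn't see anything that helped me =(

My code currently looks like this:

Any help is much appreciated, guys and girls!!!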

jquery python pdf pdfminer pyquery

2016-05-07T17:37:08.473

0 votes

2 answers

14133 views

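python - I want to use PDFMiner to extract text from a PDF to a .text file. I found code, but I don't know how to use it

This is code I found somewhere on this site. I don't know how to use it. Can someone help me complete it and help me convert a sample PDF?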

python python-2.7 pdfminer

2016-05-21T21:32:31.150

0 votes

1 answer

2590 views

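python - Extracting data under a specific heading from a PDF in Python

I want to parse PDF files with Python. The PDFMiner examples I have seen do not cover my requirement.

For example, if I want to parse a résumé, it contains various fields such as Summary, Experience, and Hobbies.

I am only interested in extracting Experience. That field may come first, second, or anywhere else, so I need to locate the Experience field and extract its data.

How can I do this?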
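One possible starting point, once the PDF has already been converted to plain text, is to scan for the heading line and collect everything up to the next heading. The sketch below assumes each heading sits alone on its own line and that the list of possible headings is known in advance; `extract_section` and its parameters are hypothetical names, and the PDF-to-text step is not shown.

```python
def extract_section(text, heading, known_headings):
    """Return the lines under `heading`, stopping at the next known heading.

    Headings are compared case-insensitively, ignoring a trailing colon.
    Returns an empty string if the heading is not found.
    """
    wanted = heading.lower()
    others = {name.lower() for name in known_headings} - {wanted}
    collected = []
    inside = False
    for line in text.splitlines():
        label = line.strip().rstrip(":").lower()
        if label == wanted:
            inside = True       # start collecting after the heading line
            continue
        if inside and label in others:
            break               # the next section begins here
        if inside:
            collected.append(line)
    return "\n".join(collected).strip()
```

Because the function searches for the heading instead of relying on its position, it behaves the same whether Experience is the first, second, or last section. Real résumés often have headings that share a line with other text or vary in wording, so this would need adapting to the actual layout.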

python parsing pdf pdfminer pdf-parsing

2016-06-07T09:16:21.163

0 votes

0 answers

1140 views

python - Losing information when extracting text from PDF using PDFMiner

I'm using Python 3.4 on Windows 7 and hoping to extract text from PDF files using PDFMiner. However, text was lost quite often during my testing. For some files it was only a few sentences, but I've also encountered cases where half of the text could not be extracted, depending on the file format. Here's my full code:

I wonder if there's a way to extract the full text using PDFMiner. I've heard of poppler, but I can't seem to find out how to use it as a Python library. Besides, I don't want to use the command line. Can anyone help?

Here's an example: a thesis. Several paragraphs were lost when extracting with the code above. On the 2nd page, for instance, I could only extract the first half of the page, up to "Pereira, Tishby, and Lee (1993)" in the middle. Then it just skips right to the next page for no apparent reason.

python python-3.x pdf poppler pdfminer

2016-06-16T02:27:13.440

0 votes

0 answers

548 views

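python - Unable to convert a PDF to text in Python even after trying pdfminer, pdf2txt, textract

I am unable to extract text from PDF files that were originally converted from InDesign and Illustrator. I am working on a project that needs the data in these PDF files. I tried the pdfminer and pdf2txt libraries in Python, but neither works in this case. For ordinary PDFs, extraction works fine. For these particular PDF files, however, the output is only whitespace. Can anyone help me solve this? Thanks.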

python text adobe-indesign pdf-conversion pdfminer

2016-06-21T18:09:06.647

0 votes

1 answer

147 views