I am trying to extract text from a PDF file using the slate module. I installed it as follows:

$sudo pip install https://codeload.github.com/timClicks/slate/zip/master
Collecting https://codeload.github.com/timClicks/slate/zip/master
  Downloading https://codeload.github.com/timClicks/slate/zip/master
Requirement already satisfied: distribute in /usr/lib/python3.5/site-packages (from slate==0.5.2)
Requirement already satisfied: pdfminer3k in /usr/lib/python3.5/site-packages (from slate==0.5.2)
Requirement already satisfied: setuptools>=0.7 in /usr/lib/python3.5/site-packages (from distribute->slate==0.5.2)
Requirement already satisfied: pytest>=2.0 in /usr/lib/python3.5/site-packages (from pdfminer3k->slate==0.5.2)
Requirement already satisfied: ply>=3.4 in /usr/lib/python3.5/site-packages (from pdfminer3k->slate==0.5.2)
Requirement already satisfied: py>=1.4.29 in /usr/lib/python3.5/site-packages (from pytest>=2.0->pdfminer3k->slate==0.5.2)
Installing collected packages: slate
  Found existing installation: slate 0.3
    Uninstalling slate-0.3:
      Successfully uninstalled slate-0.3
  Running setup.py install for slate ... done
Successfully installed slate-0.5.2

This is the script I am running:

#!/usr/bin/python3
import slate

with open('/var/tmp/PhysRevB.93.014203.pdf') as fp:
    doc = slate.PDF(fp)
print(len(doc))
print(doc[0])

This gives me the following error:

$python3 tstslt.py 
Traceback (most recent call last):
  File "tstslt.py", line 2, in <module>
    import slate
  File "/usr/lib/python3.5/site-packages/slate/__init__.py", line 66, in <module>
    from .classes import PDF
  File "/usr/lib/python3.5/site-packages/slate/classes.py", line 25, in <module>
    import utils
ImportError: No module named 'utils'

I can extract the text using PyPDF2, but I want to see whether slate does a better job.

3 Answers

slate3k is a fork of the original slate for Python 3.

You can install slate3k with pip install slate3k
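
For context on the traceback in the question: the failing line in slate/classes.py is a bare `import utils`, which looks like a Python 2 style implicit relative import. Python 3 only resolves a sibling module through an explicit relative import (`from . import utils`). The sketch below reproduces that behavior with a throwaway package; the names `try_import`, `demo_utils`, `demo_py2_style` and `demo_py3_style` are hypothetical and have nothing to do with slate's actual source.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def try_import(import_line, pkg_name):
    """Create a throwaway package whose classes.py starts with import_line, then import it."""
    root = Path(tempfile.mkdtemp())
    pkg = root / pkg_name
    pkg.mkdir()
    # Package layout: __init__.py, classes.py and a sibling helper module.
    (pkg / "__init__.py").write_text("from .classes import VALUE\n")
    (pkg / "demo_utils.py").write_text("VALUE = 42\n")
    (pkg / "classes.py").write_text(import_line + "\nVALUE = demo_utils.VALUE\n")
    sys.path.insert(0, str(root))
    importlib.invalidate_caches()
    try:
        importlib.import_module(pkg_name)
        return "ok"
    except ImportError as exc:
        return "ImportError: {}".format(exc)
    finally:
        sys.path.remove(str(root))


# Python 2 style: the sibling module is not found on Python 3.
print(try_import("import demo_utils", "demo_py2_style"))
# Python 3 style: an explicit relative import finds it.
print(try_import("from . import demo_utils", "demo_py3_style"))
```

On Python 3 the first call should report an ImportError and the second should succeed, which is consistent with a Python 3 fork such as slate3k being needed.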

Answered 2019-02-06T10:09:37.377

According to this issue, one of slate's dependencies (pdfminer) does not support Python 3:

(...)

The required "pdfminer" does not work because it is currently incompatible with Python 3.5.

Their README says so:

https://github.com/euske/pdfminer

"Install Python 2.6 or newer. (Python 3 is not supported.)"

Answered 2017-10-10T06:40:32.963

After installing slate3k, you also have to specify the mode in which the file is opened (binary, 'rb'):

#!/usr/bin/python3
import slate

with open('/var/tmp/PhysRevB.93.014203.pdf', 'rb') as fp:
    doc = slate.PDF(fp)
print(len(doc))
print(doc[0])
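
To illustrate why the mode matters, here is a minimal sketch that does not use slate at all. It writes a small stand-in file with a PDF-like header and a few non-text bytes, then reads it in both modes. The helper names `make_sample`, `read_as_text` and `read_as_bytes` are made up for this example, and the text-mode read pins `encoding="utf-8"` as an assumption, since the default encoding depends on the platform.

```python
import os
import tempfile


def make_sample():
    """Write a tiny stand-in file: a PDF-like header plus bytes that are not valid UTF-8."""
    with tempfile.NamedTemporaryFile(suffix=".pdf", delete=False) as tmp:
        tmp.write(b"%PDF-1.4\n%\xe2\xe3\xcf\xd3\n")
        return tmp.name


def read_as_text(path):
    """Text mode decodes the bytes, which fails on binary content."""
    try:
        with open(path, encoding="utf-8") as fp:
            return fp.read()
    except UnicodeDecodeError:
        return "UnicodeDecodeError"


def read_as_bytes(path):
    """Binary mode returns the raw bytes untouched, which is what a PDF parser needs."""
    with open(path, "rb") as fp:
        return fp.read()


sample = make_sample()
print(read_as_text(sample))
print(read_as_bytes(sample)[:8])
os.remove(sample)
```

Real PDF files usually contain compressed streams, so a text-mode open is likely to fail or hand the parser corrupted data in the same way.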
Answered 2020-06-21T09:08:12.890