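I'm trying to extract text from a PDF with PDFMiner by giving it coordinates. I've searched the internet but couldn't find any documentation or code for this. So far, I've only found code that extracts the text and prints its coordinates: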
LTTextBoxHorizontal
(317.564, 91.32756, 580.93228, 116.24235999999999)
SHOULD ANY OF THE ABOVE DESCRIBED POLICIES BE CANCELLED BEFORE
THE EXPIRATION DATE THEREOF, NOTICE WILL BE DELIVERED IN
ACCORDANCE WITH THE POLICY PROVISIONS.
This is one of the coordinate-and-text outputs I get. I also tried pdfquery, but I get a lot of errors:
File "C:\Python27\lib\site-packages\pyquery-1.2.11-py2.7.egg\pyquery\pyquery.py", line 268, in __call__
result = self._copy(*args, parent=self, **kwargs)
File "C:\Python27\lib\site-packages\pyquery-1.2.11-py2.7.egg\pyquery\pyquery.py", line 253, in _copy
return self.__class__(*args, **kwargs)
File "C:\Python27\lib\site-packages\pyquery-1.2.11-py2.7.egg\pyquery\pyquery.py", line 239, in __init__
xpath = self._css_to_xpath(selector)
File "C:\Python27\lib\site-packages\pyquery-1.2.11-py2.7.egg\pyquery\pyquery.py", line 249, in _css_to_xpath
return self._translator.css_to_xpath(selector, prefix)
File "build\bdist.win32\egg\cssselect\xpath.py", line 192, in css_to_xpath
File "build\bdist.win32\egg\cssselect\parser.py", line 355, in parse
File "build\bdist.win32\egg\cssselect\parser.py", line 370, in parse_selector_group
File "build\bdist.win32\egg\cssselect\parser.py", line 378, in parse_selector
File "build\bdist.win32\egg\cssselect\parser.py", line 437, in parse_simple_selector
File "build\bdist.win32\egg\cssselect\parser.py", line 535, in parse_attrib
cssselect.parser.SelectorSyntaxError: Expected string or ident, got <NUMBER '1' at 14>
Can anyone help me?
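
To make the coordinate lookup described above concrete, here is a minimal sketch of the reverse direction: given (bbox, text) pairs like the one printed earlier, return the text whose box falls inside a rectangle you supply. It is plain Python and does not call PDFMiner itself; it assumes the pairs have already been collected, for example from each layout object's `bbox` attribute and `get_text()` method. The names `bbox_inside` and `text_in_region`, the tolerance value, the query rectangle, and the second sample box are all hypothetical and only there for illustration.

```python
from typing import Iterable, List, Tuple

# (x0, y0, x1, y1) in PDF points, in the same order as the output shown above.
BBox = Tuple[float, float, float, float]


def bbox_inside(inner: BBox, outer: BBox, tol: float = 0.5) -> bool:
    """Return True if `inner` lies within `outer`, allowing `tol` points of slack."""
    ix0, iy0, ix1, iy1 = inner
    ox0, oy0, ox1, oy1 = outer
    return (ix0 >= ox0 - tol and iy0 >= oy0 - tol
            and ix1 <= ox1 + tol and iy1 <= oy1 + tol)


def text_in_region(boxes: Iterable[Tuple[BBox, str]],
                   region: BBox,
                   tol: float = 0.5) -> List[str]:
    """Collect the text of every box that falls inside `region`."""
    return [text for bbox, text in boxes if bbox_inside(bbox, region, tol)]


# Sample data: the first entry is the box from the output above,
# the second is a made-up box elsewhere on the page.
boxes = [
    ((317.564, 91.32756, 580.93228, 116.24235999999999),
     "SHOULD ANY OF THE ABOVE DESCRIBED POLICIES BE CANCELLED BEFORE\n"
     "THE EXPIRATION DATE THEREOF, NOTICE WILL BE DELIVERED IN\n"
     "ACCORDANCE WITH THE POLICY PROVISIONS."),
    ((36.0, 700.0, 200.0, 720.0), "hypothetical header text"),
]

# Hypothetical query rectangle drawn slightly larger than the first box.
region = (317.0, 91.0, 581.0, 117.0)
matches = text_in_region(boxes, region)
print(matches)
```

The small tolerance is there because the extracted coordinates are floats such as 116.24235999999999, so an exact comparison against hand-typed coordinates can miss a box by a fraction of a point. Whether containment or mere overlap is the right test depends on how the query rectangle is chosen.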