So I read every page of the PDF with Page.get_text("xml") and appended each page's XML output to a single string variable. The text output consists of many units like this:
<page id="page0" width="595.276" height="841.89">\n<block bbox="84.95639 235.90979 382.4564 316.3398">\n<line bbox="84.96 235.90979 382.4564 278.3298" wmode="0" dir="1 0">\n<font name="AkkuratPro-Bold" size="35">
I understand that these are bounding boxes around the text, and the documentation says this output is best parsed with lxml. So I tried the following:
from lxml import etree
root = etree.fromstring(texts)
and got the following error:
Traceback (most recent call last):
  File "C:\Users\z34534534\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-11-209409d1172d>", line 3, in <module>
    root = etree.fromstring(texts)
  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring
  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument
  File "src/lxml/parser.pxi", line 1777, in lxml.etree._parseDoc
  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc
  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc
  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult
  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError
  File "<string>", line 196
XMLSyntaxError: Extra content at the end of the document, line 196, column 2
I would really like to know the current way to use lxml together with the bounding boxes to extract text from a PDF document.
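For context, one likely cause, assuming texts is the concatenation of every page's XML: each page's output is a complete document with its own <page> root, and an XML document may have only one root element, so the parser reports "Extra content" where the second page begins. Below is a minimal sketch of parsing each page's string separately instead. The helper name lines_with_bboxes and the sample_pages strings are hypothetical stand-ins for real Page.get_text("xml") output, and the <char c="..."/> elements are an assumption about the output format that should be checked against your PyMuPDF version. The fallback import exists only so the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # the parser used in the question
except ImportError:
    # Fallback so the sketch still runs where lxml is not installed.
    import xml.etree.ElementTree as etree


def lines_with_bboxes(page_xml):
    """Return (bbox, text) pairs for every <line> in ONE page's XML string."""
    # Encoding to bytes avoids trouble if the string carries an XML declaration.
    root = etree.fromstring(page_xml.encode("utf-8"))
    results = []
    for line in root.iter("line"):
        # bbox is stored as four space-separated numbers: x0 y0 x1 y1.
        bbox = tuple(float(v) for v in line.get("bbox").split())
        # Assumption: each character is a <char> element with the glyph in "c".
        text = "".join(ch.get("c", "") for ch in line.iter("char"))
        results.append((bbox, text))
    return results


# Hypothetical stand-ins for what Page.get_text("xml") returns, one per page.
sample_pages = [
    '<page id="page0" width="595.276" height="841.89">'
    '<block bbox="84.95639 235.90979 382.4564 316.3398">'
    '<line bbox="84.96 235.90979 382.4564 278.3298" wmode="0" dir="1 0">'
    '<font name="AkkuratPro-Bold" size="35">'
    '<char c="H"/><char c="i"/>'
    '</font></line></block></page>',
    '<page id="page1" width="595.276" height="841.89">'
    '<block bbox="10 20 30 40">'
    '<line bbox="10 20 30 40" wmode="0" dir="1 0">'
    '<font name="AkkuratPro-Bold" size="12">'
    '<char c="O"/><char c="k"/>'
    '</font></line></block></page>',
]

# Parse page by page rather than joining the strings into one document.
all_lines = []
for page_xml in sample_pages:
    all_lines.extend(lines_with_bboxes(page_xml))
```

With a real document, the loop would iterate over the pages and pass each page's own get_text("xml") result to the helper, rather than building one large string first.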