I read every page of the PDF with Page.get_text("xml") and appended each page's XML output to a single string variable. The text output consists of many units like this:

<page id="page0" width="595.276" height="841.89">\n<block bbox="84.95639 235.90979 382.4564 316.3398">\n<line bbox="84.96 235.90979 382.4564 278.3298" wmode="0" dir="1 0">\n<font name="AkkuratPro-Bold" size="35">

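I understand that these are the bounding boxes around the text, and the documentation says this output is best parsed with lxml. So I tried the following: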

from lxml import etree

root = etree.fromstring(texts)

and got the following error:

Traceback (most recent call last):

  File "C:\Users\z34534534\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3418, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)

  File "<ipython-input-11-209409d1172d>", line 3, in <module>
    root = etree.fromstring(texts)

  File "src/lxml/etree.pyx", line 3237, in lxml.etree.fromstring

  File "src/lxml/parser.pxi", line 1896, in lxml.etree._parseMemoryDocument

  File "src/lxml/parser.pxi", line 1777, in lxml.etree._parseDoc

  File "src/lxml/parser.pxi", line 1082, in lxml.etree._BaseParser._parseUnicodeDoc

  File "src/lxml/parser.pxi", line 615, in lxml.etree._ParserContext._handleParseResultDoc

  File "src/lxml/parser.pxi", line 725, in lxml.etree._handleParseResult

  File "src/lxml/parser.pxi", line 654, in lxml.etree._raiseParseError

  File "<string>", line 196
XMLSyntaxError: Extra content at the end of the document, line 196, column 2

I would really like to know the current way to use lxml together with the bounding boxes to get text out of a PDF document.
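For reference, a likely cause of the error is that the concatenated string holds several <page> elements one after another, while an XML document may have only one root element, so the parser stops at the second <page> with "Extra content at the end of the document". Below is a minimal sketch of parsing each page's XML separately instead. The helper name lines_with_bboxes and the sample strings are hypothetical. It also assumes that each character is a <char> element carrying its text in a "c" attribute, which the excerpt above does not show. The stdlib fallback is only there so the sketch runs without lxml installed.

```python
try:
    from lxml import etree  # the parser used in the question
except ImportError:
    # Stdlib fallback; fromstring() and iter() are called the same way.
    import xml.etree.ElementTree as etree


def lines_with_bboxes(page_xml_strings):
    """Yield (page_id, bbox, text) for every <line> in each page's XML."""
    for page_xml in page_xml_strings:
        # Parse one page at a time: each string has exactly one <page> root.
        page = etree.fromstring(page_xml)
        for line in page.iter("line"):
            # bbox is four space-separated numbers: x0 y0 x1 y1.
            bbox = tuple(float(v) for v in line.get("bbox").split())
            # Assumption: each character is a <char> element with its text in "c".
            text = "".join(ch.get("c", "") for ch in line.iter("char"))
            yield page.get("id"), bbox, text


# Hypothetical stand-in for: [page.get_text("xml") for page in doc]
SAMPLE_PAGES = [
    '<page id="page0" width="595.276" height="841.89">'
    '<block bbox="84.95639 235.90979 382.4564 316.3398">'
    '<line bbox="84.96 235.90979 382.4564 278.3298" wmode="0" dir="1 0">'
    '<font name="AkkuratPro-Bold" size="35">'
    '<char c="H"/><char c="i"/>'
    '</font></line></block></page>',
    '<page id="page1" width="595.276" height="841.89"></page>',
]

results = list(lines_with_bboxes(SAMPLE_PAGES))
for page_id, bbox, text in results:
    print(page_id, bbox, text)
```

With a real document, the list would be built by collecting Page.get_text("xml") per page rather than appending everything to one string. If a single tree is preferred, wrapping the concatenated string in one enclosing element before calling etree.fromstring may also work, provided the individual page strings carry no XML declaration of their own.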
