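I don't know much about PDF encoding, but I think you can solve your particular problem by modifying pdf.py. In the PageObject.extractText method, you can see what's going on: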
def extractText(self):
    [...]
    for operands, operator in content.operations:
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += _text
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            text += "\n"
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += operands[0]
        elif operator == '"':
            _text = operands[2]
            if isinstance(_text, TextStringObject):
                text += "\n"
                text += _text
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += i
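If the operator is Tj or TJ (it's Tj in your example PDF), the text is simply appended and no newline is added. You wouldn't necessarily want a newline there, at least if I'm reading the PDF reference right: Tj and TJ are just the single and multiple show-string operators, and nothing requires a separator of any kind between the strings they show.
Anyway, if you modify this code to be something like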
def extractText(self, Tj_sep="", TJ_sep=""):
    [...]
        if operator == "Tj":
            _text = operands[0]
            if isinstance(_text, TextStringObject):
                text += Tj_sep
                text += _text
    [...]
        elif operator == "TJ":
            for i in operands[0]:
                if isinstance(i, TextStringObject):
                    text += TJ_sep
                    text += i
then the default behaviour should be the same:
In [1]: pdf.getPage(1).extractText()[1120:1250]
Out[1]: u'ing an individual which, because of name, identifyingnumber, mark or description can be readily associated with a particular indiv'
but you can change it when you want to:
In [2]: pdf.getPage(1).extractText(Tj_sep=" ")[1120:1250]
Out[2]: u'ta" means any information concerning an individual which, because of name, identifying number, mark or description can be readily '
or
In [3]: pdf.getPage(1).extractText(Tj_sep="\n")[1120:1250]
Out[3]: u'ta" means any information concerning an individual which, because of name, identifying\nnumber, mark or description can be readily '
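Alternatively, you could add the separators by modifying the operands themselves in place, but that could break something else (methods like get_original_bytes make me nervous).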
Finally, you don't have to edit pdf.py itself if you don't want to: you can simply pull this method out into a standalone function.
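A minimal sketch of that last option, under a few assumptions: the name extract_text and the text_type parameter are mine, not part of the library, and the function takes an already-parsed list of (operands, operator) pairs (the same shape as content.operations) instead of a page object. The sample data is a stand-in, not taken from a real PDF.

```python
def extract_text(operations, Tj_sep="", TJ_sep="", text_type=str):
    """Join the strings shown by text operators, mirroring extractText.

    `operations` is an iterable of (operands, operator) pairs.
    `text_type` is a hypothetical hook: pass TextStringObject here to get
    the same isinstance check as the original method.
    """
    text = ""
    for operands, operator in operations:
        if operator == "Tj":
            # Single show-string: the separator goes before the string.
            if isinstance(operands[0], text_type):
                text += Tj_sep + operands[0]
        elif operator == "T*":
            text += "\n"
        elif operator == "'":
            text += "\n"
            if isinstance(operands[0], text_type):
                text += operands[0]
        elif operator == '"':
            if isinstance(operands[2], text_type):
                text += "\n" + operands[2]
        elif operator == "TJ":
            # Multiple show-string: skip the numeric spacing adjustments.
            for item in operands[0]:
                if isinstance(item, text_type):
                    text += TJ_sep + item
    return text


# Stand-in operations for illustration only.
sample = [
    (["identifying"], "Tj"),
    (["number, mark"], "Tj"),
    ([], "T*"),
    ([["or ", -20, "description"]], "TJ"),
]

print(repr(extract_text(sample)))              # 'identifyingnumber, mark\nor description'
print(repr(extract_text(sample, Tj_sep=" ")))  # ' identifying number, mark\nor description'
```

As in the modified method, the separator is prepended to every Tj string, including the first one, so you may want to strip the result. To use this on a real page you would still have to build the operations list the way the elided part of extractText does.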