python - How to remove spaces between English words after extracting text with pdfplumber

Question

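I extracted text from a PDF into a txt file using pdfplumber, but the output contains spaces between words that are not present in the PDF file.

I tried combining "previous_word" + "current_word" and checking whether the result exists in NLTK.words, to find where the extra spaces were inserted, but it did not work well.

I am looking for suggestions. Thanks.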

score 0 · Accepted Answer

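I suggest looking for occurrences of two consecutive words that are both absent from your corpus. This should reveal every case where the split does not happen to produce other valid English words.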
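The suggestion above could be sketched roughly as follows. This is only a minimal illustration, not a complete solution: `merge_split_words` is a hypothetical helper name, and `vocabulary` is a small hand-written set standing in for a real word list such as the NLTK words corpus. It merges two adjacent tokens only when neither is a known word and their concatenation is.

```python
def merge_split_words(text, vocabulary):
    """Rejoin adjacent tokens that look like one word split by a stray space.

    Two neighbours are merged only if neither is a known word on its own
    and their concatenation is a known word.
    """
    tokens = text.split()
    merged = []
    i = 0
    while i < len(tokens):
        current = tokens[i]
        if i + 1 < len(tokens):
            following = tokens[i + 1]
            joined = current + following
            if (current.lower() not in vocabulary
                    and following.lower() not in vocabulary
                    and joined.lower() in vocabulary):
                merged.append(joined)
                i += 2  # both fragments were consumed
                continue
        merged.append(current)
        i += 1
    return " ".join(merged)


# Assumption: a tiny stand-in vocabulary; in practice, load a full word list.
vocabulary = {"the", "extraction", "adds", "spaces", "between", "words"}
sample = "the extrac tion adds spa ces between words"
result = merge_split_words(sample, vocabulary)
print(result)
```

This sketch ignores punctuation and words split into more than two fragments, and it cannot detect splits where both halves are themselves valid words, which is the limitation the answer points out.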

score 0 · Accepted Answer

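Here is some example logic that collects the words separated by two spaces into a list, after which you can implement whatever handling you like: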

text = """
asdasd  asd asdd d
uuurr ii ii  rrr
"""

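dictionary = set()  # known words to compare against (fill this in)
words_wrapper = []  # word pairs that were separated by two spaces

for line in text.splitlines():
    # Splitting on one space leaves an empty string wherever two spaces occurred
    words = line.split(" ")
    # Skip the first and last positions so idx-1 and idx+1 stay in range
    for idx in range(1, len(words) - 1):
        if words[idx] == "":
            word = f"{words[idx-1]} {words[idx+1]}"
            words_wrapper.append(word)
            if word in dictionary:
                pass  # handle the match here

# Print the collected word pairs
print(words_wrapper)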

Alternatively, you can use .join to glue together the words separated by 2 spaces:

text = """
asdasd  asd asdd d
uuurr ii ii  rrr
"""

print("".join(text.split("  ")))
