python - Compare with the previous value in a loop and append to a string if within a tolerance

Question

I have a list like the following:

word_list = [
    {'bottom': Decimal('58.650'),
     'text': 'Welcome'
    },
    {'bottom': Decimal('74.101'),
     'text': 'This'
    },
    {'bottom': Decimal('74.101'),
     'text': 'is'
    },
    {'bottom': Decimal('77.280'),
     'text': 'Oliver'
    }]

It represents a series of words, Welcome This is Oliver, extracted from a PDF file. The bottom value is the distance from the bottom to the top of the page.

The list is sorted by the bottom key:

from operator import itemgetter
words = sorted(word_list, key=itemgetter('bottom'))

I am trying to iterate over the list and, for each word, decide whether it belongs to the current line or should start a new line.

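My idea is to compare the bottom value on each iteration, with a tolerance of xx. For example, the words This is Oliver are all on the same line in the PDF file, but their bottom values are not equal (hence the tolerance).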

Expected output

What I am trying to end up with is:

[{'text': 'Welcome',
  'line': 1
},
{'text': 'This is Oliver',
  'line': 2
}]

This is what I have so far:

for i, word in enumerate(word_list):
    previous_element = word_list[i-1] if i > 0 else None
    current_element = word
    next_element = word_list[i +1] if i < len(word_list) - 1 else None

    if math.isclose(current_element['bottom'], next_element['bottom'], abs_tol=5):
       # Append the word to the line

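I am a bit stuck in the loop above. I can't figure out whether math.isclose() is the right tool, or how to actually append the line[i] and the word itself to build up a line sentence.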

score 0 · Accepted Answer

I don't think you need the math function; you can check the threshold yourself. Maybe something like this:

from decimal import Decimal

word_list = [
    {
        'bottom': Decimal('58.650'),
        'text': 'Welcome',
    },
    {
        'bottom': Decimal('74.101'),
        'text': 'This',
    },
    {
        'bottom': Decimal('77.280'),
        'text': 'Oliver',
    },
    {
        'bottom': Decimal('74.101'),
        'text': 'is',
    },
]
word_list = sorted(word_list, key=lambda x: x['bottom'])

threshold = Decimal('10')
current_row = [word_list[0], ]
row_list = [current_row, ]

for word in word_list[1:]:
    if abs(current_row[-1]['bottom'] - word['bottom']) <= threshold:
        # distance is small, use same row
        current_row.append(word)
    else:
        # distance is big, create new row
        current_row = [word, ]
        row_list.append(current_row)

print('final output')
for i, row in enumerate(row_list):
    data = {
        'line': i,
        'text': ' '.join(elem['text'] for elem in row),
    }
    print(data)

The output of this code is:

final output
{'line': 0, 'text': 'Welcome'}
{'line': 1, 'text': 'This is Oliver'}
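
If you would rather keep math.isclose() from the question, the same grouping could be sketched roughly as below. This is only a sketch under a few assumptions: the words are already sorted by bottom, the tolerance is the question's abs_tol of 5, and lines are numbered from 1 as in the expected output. The name group_words_into_lines is a hypothetical helper, not part of any library.

```python
import math
from decimal import Decimal


def group_words_into_lines(words, abs_tol=5):
    """Group words (already sorted by 'bottom') into lines.

    Hypothetical helper: a word joins the current line when its 'bottom'
    is within abs_tol of the previous word's 'bottom'.
    """
    lines = []
    previous_bottom = None
    for word in words:
        # math.isclose converts its arguments to float, so Decimal values work.
        if previous_bottom is not None and math.isclose(
            word['bottom'], previous_bottom, rel_tol=0, abs_tol=abs_tol
        ):
            lines[-1]['text'] += ' ' + word['text']
        else:
            lines.append({'text': word['text'], 'line': len(lines) + 1})
        previous_bottom = word['bottom']
    return lines


words = [
    {'bottom': Decimal('58.650'), 'text': 'Welcome'},
    {'bottom': Decimal('74.101'), 'text': 'This'},
    {'bottom': Decimal('74.101'), 'text': 'is'},
    {'bottom': Decimal('77.280'), 'text': 'Oliver'},
]
result = group_words_into_lines(words)
print(result)
```

Tracking only the previous bottom value also avoids looking ahead to a next element, which does not exist for the last word.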

score 0 · Accepted Answer

# Assumes word_list is already sorted by 'bottom'
line_sentence_map = {}
tolerance = 5
line = 1
what_you_want = []
for i in range(len(word_list)):
    if i == 0:
        # The first word starts line 1 and sets the reference 'bottom'
        previous_line_threshold = word_list[i]['bottom']
        line_sentence_map[line] = []
    if word_list[i]['bottom'] - previous_line_threshold > tolerance:
        # Too far from the start of the current line: store it and begin a new one
        what_you_want.append({"line": line, "text": ' '.join(line_sentence_map[line])})
        line += 1
        previous_line_threshold = word_list[i]['bottom']
        line_sentence_map[line] = []
    line_sentence_map[line].append(word_list[i]['text'])
    if i == len(word_list) - 1:
        # Last word: store the final line
        what_you_want.append({"line": line, "text": ' '.join(line_sentence_map[line])})

Here, what_you_want will give you what you want:

[{'text': 'Welcome', 'line': 1}, {'text': 'This is Oliver', 'line': 2}]
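
One thing worth noting: the loop above compares each word with the first word of the current line, not with the word immediately before it. When bottom values drift gradually, the two choices can group words differently. Below is a small sketch with made-up values; the numbers and the helper name split_lines are assumptions for illustration only.

```python
from decimal import Decimal


def split_lines(bottoms, tolerance, compare_to_line_start):
    """Group sorted 'bottom' values into lines (hypothetical helper)."""
    lines = [[bottoms[0]]]
    for value in bottoms[1:]:
        # The reference is either the first value of the line or the previous value.
        reference = lines[-1][0] if compare_to_line_start else lines[-1][-1]
        if value - reference > tolerance:
            lines.append([value])
        else:
            lines[-1].append(value)
    return lines


# Made-up values that drift by 3 each step, with a tolerance of 5.
bottoms = [Decimal('10'), Decimal('13'), Decimal('16'), Decimal('19')]
anchored = split_lines(bottoms, 5, compare_to_line_start=True)
chained = split_lines(bottoms, 5, compare_to_line_start=False)
print(len(anchored))  # 2 lines: [10, 13] and [16, 19]
print(len(chained))   # 1 line: every step is within the tolerance
```

Which behaviour is preferable probably depends on the PDF: comparing with the line start keeps a line from growing without bound, while comparing with the previous word tolerates a slightly slanted baseline.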

Cheers!
