I have the following comparator:
public static class WordComparator implements Comparator<Word> {
    @Override
    public int compare(Word word1, Word word2) {
        //TODO find a better way to determine threshold
        int threshold = 10; //allowed difference in height
        int word1y = (int) Math.round(word1.bbox.y1 * 1.0 / threshold);
        int word2y = (int) Math.round(word2.bbox.y1 * 1.0 / threshold);
        if (word1y == word2y) {
            return word1.bbox.x1 - word2.bbox.x1;
        } else {
            return word1y - word2y;
        }
    }
}
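Wherever this comparator is used on a Collection<Word>, it should sort the words first by y1 (the y coordinate, here word1.bbox.y1) and then by x1 (the x coordinate, here word1.bbox.x1). The current implementation also uses a mechanism that normalizes all words lying within a y range of 10 of each other.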
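To make the normalization step concrete, here is a small sketch of the arithmetic it performs. The class name BucketDemo and the helper bucket are hypothetical names introduced only for this illustration; the expression inside bucket is the one from the comparator above, and the y1 values are taken from the example data further down.

```java
// Hypothetical demo class; only the rounding expression comes from the question.
final class BucketDemo {

    // Same normalization as in WordComparator: divide y1 by the threshold and round.
    static int bucket(int y1, int threshold) {
        return (int) Math.round(y1 * 1.0 / threshold);
    }

    public static void main(String[] args) {
        int threshold = 10;
        System.out.println(bucket(1455, threshold)); // word_50: 145.5 rounds to 146
        System.out.println(bucket(1448, threshold)); // word_54: 144.8 rounds to 145
        System.out.println(bucket(1454, threshold)); // 145.4 rounds to 145
    }
}
```

Note that the rounding produces fixed intervals rather than a true "within 10 of each other" test: y1 values of 1454 and 1455 differ by 1 but land in different buckets, and word_54 (y1 = 1448) ends up one bucket before word_50 (y1 = 1455) even though they are only 7 apart.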
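However, I have come to the conclusion that my current code does not work. My question now is: how do I write a comparator that compares on two different fields? I already have the return values and so on; I just need to find the right way to put it together.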
I hope you can help me with this.
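For reference, one common way to compare on two fields in Java 8 and later is to chain Comparator.comparingInt with thenComparingInt. The sketch below is only an illustration under assumptions: the Word and BBox classes are minimal hypothetical stand-ins that carry just the fields used for sorting, and TwoFieldSortSketch, row, BY_ROW_THEN_X and sortedIds are names invented here. The coordinates are taken from the example data.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

final class TwoFieldSortSketch {

    // Hypothetical stand-ins for the real classes; only the sort fields are modeled.
    static final class BBox {
        final int x1, y1;
        BBox(int x1, int y1) { this.x1 = x1; this.y1 = y1; }
    }

    static final class Word {
        final String id;
        final BBox bbox;
        Word(String id, int x1, int y1) { this.id = id; this.bbox = new BBox(x1, y1); }
    }

    static final int THRESHOLD = 10; // allowed difference in height, as in the question

    // Normalized row index, using the same rounding as the original comparator.
    static int row(Word w) {
        return (int) Math.round(w.bbox.y1 * 1.0 / THRESHOLD);
    }

    // Primary key: normalized row. Secondary key: x1 within the same row.
    static final Comparator<Word> BY_ROW_THEN_X =
            Comparator.comparingInt(TwoFieldSortSketch::row)
                      .thenComparingInt(w -> w.bbox.x1);

    // Sorts a copy of the input and returns the ids in sorted order.
    static List<String> sortedIds(List<Word> words) {
        List<Word> copy = new ArrayList<>(words);
        copy.sort(BY_ROW_THEN_X);
        List<String> ids = new ArrayList<>();
        for (Word w : copy) {
            ids.add(w.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        List<Word> words = Arrays.asList(
                new Word("word_57", 341, 1557),
                new Word("word_51", 294, 1455),
                new Word("word_56", 300, 1558),
                new Word("word_50", 188, 1455),
                new Word("word_55", 188, 1558));
        System.out.println(sortedIds(words));
        // prints [word_50, word_51, word_55, word_56, word_57]
    }
}
```

The chained form compares the keys with Integer.compare internally, so it avoids the overflow that subtraction-based comparisons can hit with very large values. For coordinates of the size shown here it should order words the same way as the hand-written comparator, so it is a restatement of the intended logic rather than a confirmed fix for the ordering problem.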
Example output, as requested:
w = [word_50, [188, 1455, 280, 1482, 92, 27], false, Totaal]
w = [word_58, [1324, 1547, 1370, 1573, 46, 26], false, EU]
w = [word_59, [1465, 1546, 1568, 1577, 103, 31], false, 173,50]
w = [word_56, [300, 1558, 329, 1583, 29, 25], false, te]
w = [word_62, [381, 2082, 605, 2119, 224, 37], false, verkrijgbaar!]
w = [word_61, [305, 2093, 369, 2114, 64, 21], false, ons]
w = [word_65, [605, 2114, 650, 2166, 45, 52], false, ]
w = [word_68, [184, 2258, 319, 2382, 135, 124], false, ]
w = [word_72, [296, 2278, 349, 2319, 53, 41], false, J]
w = [word_73, [411, 2302, 470, 2322, 59, 20], false, ‚n.]
w = [word_74, [571, 2319, 602, 2320, 31, 1], false, ]
w = [word_76, [434, 2330, 635, 2357, 201, 27], false, Kerstkaarten]
w = [word_77, [338, 2367, 436, 2393, 98, 26], false, Bestel]
w = [word_69, [184, 2382, 338, 2409, 154, 27], false, ]
w = [word_80, [1805, 2392, 1979, 2413, 174, 21], false, 37.45.08.070]
w = [word_82, [1745, 2430, 1881, 2458, 136, 28], false, Groningen]
w = [word_84, [1666, 2470, 1741, 2492, 75, 22], false, B.T.W.]
w = [word_86, [1795, 2469, 1981, 2492, 186, 23], false, 821.82.468.501]
w = [word_88, [1741, 2547, 1873, 2575, 132, 28], false, Algemene]
w = [word_108, [841, 2584, 1018, 2624, 177, 40], false, Betaling:]
w = [word_111, [1295, 2582, 1336, 2613, 41, 31], false, 14]
w = [word_102, [203, 2590, 261, 2630, 58, 40], false, Wij]
w = [word_107, [640, 2585, 825, 2627, 185, 42], false, opdracht.]
w = [word_90, [1666, 2593, 1695, 2609, 29, 16], false, en]
w = [word_104, [431, 2597, 454, 2620, 23, 23], false, u]
w = [word_106, [570, 2595, 628, 2619, 58, 24], false, uw]
w = [word_92, [1666, 2625, 1709, 2654, 43, 29], false, zijn]
w = [word_96, [1875, 2664, 1933, 2686, 58, 22], false, 1181]
w = [word_116, [561, 2683, 751, 2715, 190, 32], false, factuurnr.]
w = [word_119, [1108, 2678, 1321, 2710, 213, 32], false, vermelden.]
w = [word_114, [265, 2685, 423, 2724, 158, 39], false, betaling]
w = [word_117, [769, 2690, 815, 2713, 46, 23], false, en]
w = [word_98, [1708, 2703, 1739, 2726, 31, 23], false, de]
w = [word_101, [1863, 2703, 1999, 2730, 136, 27], false, Groningen]
w = [word_125, [828, 2772, 1359, 2813, 531, 41], false, administratie@biuemule.nl]
w = [word_123, [555, 2778, 646, 2809, 91, 31], false, deze]
w = [word_121, [309, 2787, 441, 2819, 132, 32], false, vragen]
w = [word_122, [455, 2787, 544, 2809, 89, 22], false, over]
w = [word_124, [660, 2777, 814, 2808, 154, 31], false, factuur:]
w = [word_120, [204, 2782, 298, 2812, 94, 30], false, Voor]
w = [word_100, [1829, 2705, 1853, 2725, 24, 20], false, te]
w = [word_99, [1750, 2704, 1816, 2725, 66, 21], false, K.v‚K.]
w = [word_97, [1668, 2704, 1696, 2733, 28, 29], false, bij]
w = [word_115, [435, 2692, 548, 2724, 113, 32], false, graag]
w = [word_113, [200, 2687, 254, 2727, 54, 40], false, Bij]
w = [word_118, [830, 2682, 1090, 2713, 260, 31], false, debiteurennr.]
w = [word_95, [1754, 2670, 1863, 2687, 109, 17], false, nummer]
w = [word_94, [1666, 2664, 1744, 2687, 78, 23], false, onder]
w = [word_93, [1721, 2624, 1893, 2654, 172, 30], false, gedeponeerd]
w = [word_105, [469, 2595, 559, 2620, 90, 25], false, voor]
w = [word_91, [1709, 2585, 1998, 2614, 289, 29], false, betalingsvoorwaarden]
w = [word_109, [1031, 2585, 1130, 2615, 99, 30], false, netto]
w = [word_103, [274, 2589, 416, 2622, 142, 33], false, danken]
w = [word_112, [1350, 2580, 1481, 2622, 131, 42], false, dagen.]
w = [word_110, [1144, 2583, 1278, 2614, 134, 31], false, binnen]
w = [word_89, [1883, 2547, 2006, 2575, 123, 28], false, leverings-]
w = [word_87, [1666, 2549, 1733, 2570, 67, 21], false, Onze]
w = [word_85, [1754, 2470, 1786, 2492, 32, 22], false, NL]
w = [word_83, [1894, 2430, 2020, 2452, 126, 22], false, 02045251]
w = [word_81, [1666, 2432, 1733, 2453, 67, 21], false, K.v.K.]
w = [word_79, [1666, 2391, 1794, 2414, 128, 23], false, Rabobank]
w = [word_78, [449, 2365, 528, 2398, 79, 33], false, tijdig]
w = [word_70, [528, 2339, 685, 2409, 157, 70], false, ]
w = [word_75, [225, 2332, 420, 2359, 195, 27], false, INTERCARD]
w = [word_71, [224, 2323, 254, 2324, 30, 1], false, ]
w = [word_67, [635, 2290, 685, 2339, 50, 49], false, ]
w = [word_66, [349, 2258, 650, 2290, 301, 32], false, ]
w = [word_63, [425, 2123, 434, 2138, 9, 15], false, \I]
w = [word_64, [206, 2114, 650, 2258, 444, 144], false, ]
w = [word_60, [248, 2085, 290, 2120, 42, 35], false, Bij]
w = [word_57, [341, 1557, 458, 1583, 117, 26], false, betalen]
w = [word_55, [188, 1558, 288, 1584, 100, 26], false, Totaal]
w = [word_51, [294, 1455, 368, 1480, 74, 25], false, BTW]
w = [word_54, [1536, 1448, 1571, 1473, 35, 25], false, 70]
The input is the same list, but in arbitrary, random order. The "encoding" currently used is: w = [word.id, [word.bbox.x1, word.bbox.y1, word.bbox.x2, word.bbox.y2, word.bbox.width, word.bbox.height], word.isStrong, word.content].
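So you only need to look at the word.bbox.y1 and word.bbox.x1 values. As you can see, the output is clearly not random; it currently comes out as a kind of parabola around the y values.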