2

i have a set of strings along with their co-ordinates and rectangular bounds int two similar pages. these strings are different in three possible ways. (i) a string can be moved to a new location on a page. (ii) a string is in the same location but it is modified. say ( abc --> abd or ABC) (iii) a combination of (i) and (ii). (iv) a string might be missing.

i tried to use locality sensitive hashing but couldn't find a good hash function for this. Can anyone please suggest me a good hash function or another algorithm to solve this problem. thanks in advance

4

1 回答 1

2

所以我们有一个源字符串列表和一个最大大小为S的目标字符串列表。我们希望找到一种方法将每个字符串分配给一个不同的字符串,以使不匹配字符的总数最小化T|S|TS

(请注意,因为我们正在寻找一种与 匹配的方法T,所以会隐式处理缺少S字符串的情况)S

如果这是对您的问题@programer8的准确解释,我相信这是一个分配问题,可以通过匈牙利算法在立方时间内解决:wiki文章中提到的“工人”是您的目标字符串,“任务”是源字符串,源字符串和目标字符串之间不匹配的字符数是工作人员执行任务的成本。

唯一的问题是您的工作人员/目标字符串少于任务/源字符串,但您可以通过添加虚拟工作人员来解决这个问题。

于 2013-12-10T01:11:53.053 回答