我正在尝试在 Java 中实现以下基本滑动窗口算法。我明白了它的基本概念,但我对一些措辞有点困惑,特别是粗体的句子:
一个固定宽度 w 的滑动窗口在文件中移动,并且在文件中的每个位置 k 处,计算其内容的指纹。令 k 为块边界(即,Fk mod n = 0)。我们不采用整个块的哈希值,而是选择该块中滑动窗口的数字最小指纹。然后我们计算块内这个随机选择的窗口的哈希值。直观地说,这种方法将允许块内的小编辑对相似度计算的影响较小。该方法产生一个可变长度的文档签名,其中签名中的指纹数量与文档长度成正比。
请在下面查看我的代码/结果。我了解算法的基本思想吗?根据粗体文本,“在其块中选择滑动窗口的数字最小指纹”是什么意思?我目前只是散列整个块。
代码:
public class BSW {
/**
* @param args
*/
public static void main(String[] args) {
int w = 15; // fixed width of sliding window
char[] chars = "Once upon a time there lived in a certain village a little
country girl, the prettiest creature who was ever seen. Her mother was
excessively fond of her; and her grandmother doted on her still more. This
good woman had a little red riding hood made for her. It suited the girl so
extremely well that everybody called her Little Red Riding Hood."
.toCharArray();
List<String> fingerprints = new ArrayList<String>();
for (int i = 0; i < chars.length; i = i + w) {
StringBuffer sb = new StringBuffer();
if (i + w < chars.length) {
sb.append(chars, i, w);
System.out.println(i + ". " + sb.toString());
} else {
sb.append(chars, i, chars.length - i);
System.out.println(i + ". " + sb.toString());
}
fingerprints.add(hash(sb));
}
}
private static String hash(StringBuffer sb) {
// Implement hash (MD5)
return sb.toString();
}
}
结果:
0. Once upon a tim
15. e there lived i
30. n a certain vil
45. lage a little c
60. ountry girl, th
75. e prettiest cre
90. ature who was e
105. ver seen. Her m
120. other was exces
135. sively fond of
150. her; and her gr
165. andmother doted
180. on her still m
195. ore. This good
210. woman had a lit
225. tle red riding
240. hood made for h
255. er. It suited t
270. he girl so extr
285. emely well that
300. everybody call
315. ed her Little R
330. ed Riding Hood.