I'm considering writing an Accumulo iterator that returns a random sample of a given percentage of a table.
I'd appreciate any suggestions.
Thanks,
Chris
Expanding slightly on Ben Tse's answer to allow a configurable sampling ratio:
    import java.io.IOException;
    import java.util.Map;
    import java.util.Random;

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.Filter;
    import org.apache.accumulo.core.iterators.IteratorEnvironment;
    import org.apache.accumulo.core.iterators.SortedKeyValueIterator;

    public class RandomAcceptFilter extends Filter {
        public static final String RATIO = "ratio";
        public static final String DEFAULT = "0.05";

        private Random rand = new Random();
        // Fraction of entries to let through, expected to be between 0 and 1.
        private double percentToAllow;

        @Override
        public void init(SortedKeyValueIterator<Key, Value> source, Map<String, String> options, IteratorEnvironment env) throws IOException {
            super.init(source, options, env);
            // Fall back to the default when no ratio option was supplied.
            String option = options.containsKey(RATIO) ? options.get(RATIO) : DEFAULT;
            this.percentToAllow = Double.parseDouble(option);
        }

        @Override
        public boolean accept(Key k, Value v) {
            // Each entry passes independently with probability percentToAllow.
            return rand.nextDouble() < this.percentToAllow;
        }
    }
Then, when you attach the iterator to a scanner in your code, you would do something like this:
    IteratorSetting itr = new IteratorSetting(15, "myIterator", RandomAcceptFilter.class);
    itr.addOption(RandomAcceptFilter.RATIO, "0.20");
    myScanner.addScanIterator(itr);
Obviously you'd want to add bounds checking and so on, but you get the idea.
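One way the bounds checking could look is sketched below. `RatioOption` and its `parse` method are hypothetical names made up for illustration and are not part of Accumulo. The sketch assumes the ratio arrives as a string in the iterator's options map and that a value outside 0 to 1 (or one that is not a number) should be rejected instead of silently accepted.

```java
import java.util.Map;

// Hypothetical helper, not part of Accumulo: parses and validates the "ratio" option.
public final class RatioOption {
    public static final String RATIO = "ratio";
    public static final String DEFAULT = "0.05";

    private RatioOption() {}

    /** Returns the sampling ratio from the options, or throws if it is not a number in [0, 1]. */
    public static double parse(Map<String, String> options) {
        String raw = options.get(RATIO);
        if (raw == null) {
            raw = DEFAULT; // option missing: use the default ratio
        }
        final double ratio;
        try {
            ratio = Double.parseDouble(raw.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException("ratio is not a number: " + raw, e);
        }
        // The negated comparison also rejects NaN, which fails every comparison.
        if (!(ratio >= 0.0 && ratio <= 1.0)) {
            throw new IllegalArgumentException("ratio must be between 0 and 1: " + raw);
        }
        return ratio;
    }
}
```

Inside the filter's `init`, the `Double.parseDouble(option)` line could then be swapped for a call such as `RatioOption.parse(options)`, so a bad option fails when the iterator is set up and not partway through a scan.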
You can extend org.apache.accumulo.core.iterators.Filter and randomly accept x% of the entries. The following iterator will randomly return roughly 5% of the entries.
    import java.util.Random;

    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.Filter;

    public class RandomAcceptFilter extends Filter {
        private Random rand = new Random();

        @Override
        public boolean accept(Key k, Value v) {
            // Accept each entry independently with probability 0.05.
            return rand.nextDouble() < .05;
        }
    }
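Because each entry is accepted by an independent coin flip, the size of the result is only approximately 5% of the table, and with an unseeded `Random` each scan returns a different sample. The standalone sketch below simulates the same acceptance test without Accumulo, to give a feel for how close the count gets. `SampleRateDemo` and `countAccepted` are hypothetical names, and the fixed seed is only there to make the demo repeatable.

```java
import java.util.Random;

// Standalone simulation of the filter's acceptance test; no Accumulo classes involved.
public final class SampleRateDemo {
    private SampleRateDemo() {}

    /** Counts how many of n simulated entries pass a coin flip with the given probability. */
    public static int countAccepted(int n, double ratio, long seed) {
        Random rand = new Random(seed);
        int accepted = 0;
        for (int i = 0; i < n; i++) {
            // Same test as RandomAcceptFilter.accept: nextDouble() is uniform in [0, 1).
            if (rand.nextDouble() < ratio) {
                accepted++;
            }
        }
        return accepted;
    }

    public static void main(String[] args) {
        int n = 1_000_000;
        int accepted = countAccepted(n, 0.05, 42L);
        System.out.printf("accepted %d of %d (%.3f%%)%n", accepted, n, 100.0 * accepted / n);
    }
}
```

If you need an exact sample size instead of an approximate percentage, this coin-flip approach will not give it to you. A different technique, such as reservoir sampling on the client side, would be needed.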