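I trained a topic model with Mallet and want to serialize it for later use. I ran it on two test documents, then deserialized it and ran the loaded model on the same documents, and the results were completely different.
Is there anything wrong with the way I save and load the model (code attached)?
Thanks!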
List<Pipe> pipeList = initPipeList();
// Begin by importing documents from text to feature sequences
InstanceList instances = new InstanceList(new SerialPipes(pipeList));
for (String document : documents) {
    Instance inst = new Instance(document, "", "", "");
    instances.addThruPipe(inst);
}
ParallelTopicModel model = new ParallelTopicModel(numTopics, alpha_t * numTopics, beta_w);
model.addInstances(instances);
model.setNumThreads(numThreads);
model.setNumIterations(numIterations);
model.estimate();
printProbabilities(model, "doc 1"); // I replaced the contents of the docs due to copyright issues
printProbabilities(model, "doc 2");
model.write(new File("model.bin"));
model = ParallelTopicModel.read(new File("model.bin")); // read() takes a File, matching write() above
printProbabilities(model, "doc 1");
printProbabilities(model, "doc 2");
The definition of printProbabilities():
public void printProbabilities(ParallelTopicModel model, String doc) {
    List<Pipe> pipeList = initPipeList();
    InstanceList instances = new InstanceList(new SerialPipes(pipeList));
    instances.addThruPipe(new Instance(doc, "", "", ""));
    double[] probabilities = model.getInferencer().getSampledDistribution(instances.get(0), 10, 1, 5);
    for (int i = 0; i < probabilities.length; i++) {
        double probability = probabilities[i];
        if (probability > 0.01) {
            System.out.println("Topic " + i + ", probability: " + probability);
        }
    }
}
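
To put a number on "completely different", one option is to compare the two probability arrays from before and after the round trip directly. Since getSampledDistribution is sampling-based, some run-to-run variation may be expected even without serialization, so a numeric gap is easier to judge than printed output. Below is a minimal sketch using only the standard library; the class name DistributionDiff, the helper maxAbsDiff, and the sample values are hypothetical and not part of Mallet or the code above.

```java
public class DistributionDiff {

    /**
     * Returns the largest absolute difference between corresponding
     * entries of two topic distributions of equal length.
     */
    static double maxAbsDiff(double[] p, double[] q) {
        if (p.length != q.length) {
            throw new IllegalArgumentException("distributions differ in length");
        }
        double max = 0.0;
        for (int i = 0; i < p.length; i++) {
            max = Math.max(max, Math.abs(p[i] - q[i]));
        }
        return max;
    }

    public static void main(String[] args) {
        // Made-up values standing in for the arrays returned by
        // getSampledDistribution before and after reloading the model.
        double[] before = {0.70, 0.20, 0.10};
        double[] after = {0.68, 0.22, 0.10};
        System.out.println("max abs difference: " + maxAbsDiff(before, after));
    }
}
```

Running the same comparison on two inferences from the unsaved model gives a baseline for how much variation sampling alone produces.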