java - Folding in (estimating topics for new documents) in LDA using Mallet in Java

Question

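I am using Mallet through Java, and I can't figure out how to evaluate new documents against an existing topic model that I have already trained.

My initial code for building the model is very similar to the code in the Mallet Developer's Guide for Topic Modeling, after which I simply save the model as a Java object. Later in the process, I reload that Java object from file, add new instances via .addInstances(), and would then like to evaluate only these new instances against the topics found in the original training set.

This stats.SE thread gives some high-level advice, but I can't see how to apply it within the Mallet framework.

Any help is much appreciated.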

score 5 · Accepted Answer

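Inference is actually also covered in the example linked in the question (its last few lines).

For anyone interested in the complete code for saving/loading a trained model and then using it to infer the topic distribution of new documents, here are some snippets:

Once model.estimate() has finished, you have a trained model, so you can serialize it with a standard Java ObjectOutputStream (since ParallelTopicModel implements Serializable):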

// Requires: java.io.FileOutputStream, java.io.IOException, java.io.ObjectOutputStream
// try-with-resources closes the stream even if writeObject() throws
try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream("model.ser"))) {
    oos.writeObject(model);
} catch (IOException ex) {
    // handle this error (FileNotFoundException is a subclass of IOException)
}

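Note, however, that at inference time you also need to pass the new sentences (as Instance objects) through the same pipeline, so that they are preprocessed (tokenized, etc.) in the same way as the training data. You therefore also need to save the pipe list (since we are using SerialPipes, we can create an instance of it and serialize that):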

// initialize the pipe list (the same one used in model training)
SerialPipes pipes = new SerialPipes(pipeList);

// try-with-resources closes the stream even if writeObject() throws
try (ObjectOutputStream oos = new ObjectOutputStream(new FileOutputStream("pipes.ser"))) {
    oos.writeObject(pipes);
} catch (IOException ex) {
    // handle error (FileNotFoundException is a subclass of IOException)
}
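The two save steps above, and the load step that mirrors them, follow the same pattern, so they could be factored into one small helper. The sketch below is only an illustration: SerializationHelper, save and load are hypothetical names that are not part of Mallet, and a plain String stands in for the trained model so that the example runs without Mallet on the classpath.

```java
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper (not part of Mallet): saves and loads any Serializable object.
final class SerializationHelper {

    // Writes the object to the given file; the stream is closed in every case.
    static void save(Serializable object, Path file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(Files.newOutputStream(file))) {
            out.writeObject(object);
        }
    }

    // Reads an object back and checks that it has the expected type.
    static <T> T load(Path file, Class<T> type) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(Files.newInputStream(file))) {
            return type.cast(in.readObject());
        }
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("demo", ".ser");
        // A String stands in for the model here; any Serializable object works the same way.
        save("stand-in for a trained model", file);
        String restored = load(file, String.class);
        System.out.println(restored);
        Files.delete(file);
    }
}
```

With Mallet available, the same calls would presumably look like save(model, Path.of("model.ser")) and load(Path.of("pipes.ser"), SerialPipes.class), since both classes are serialized directly in the answer's snippets.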

To load the model and the pipes and use them for inference, we need to deserialize them:

// Requires: java.io.FileInputStream, java.io.IOException, java.io.ObjectInputStream,
// cc.mallet.pipe.SerialPipes, cc.mallet.topics.ParallelTopicModel,
// cc.mallet.topics.TopicInferencer, cc.mallet.types.Instance, cc.mallet.types.InstanceList
private static void InferByModel(String sentence) {
    // define model and pipeline
    ParallelTopicModel model = null;
    SerialPipes pipes = null;

    // load the model; try-with-resources closes the stream in every case
    try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("model.ser"))) {
        model = (ParallelTopicModel) in.readObject();
    } catch (IOException ex) {
        System.out.println("Could not read model from file: " + ex);
    } catch (ClassNotFoundException ex) {
        System.out.println("Could not load the model: " + ex);
    }

    // load the pipeline
    try (ObjectInputStream in = new ObjectInputStream(new FileInputStream("pipes.ser"))) {
        pipes = (SerialPipes) in.readObject();
    } catch (IOException ex) {
        System.out.println("Could not read pipes from file: " + ex);
    } catch (ClassNotFoundException ex) {
        System.out.println("Could not load the pipes: " + ex);
    }

    // if both are properly loaded
    if (model != null && pipes != null){

        // Create a new instance named "test instance" with empty target
        // and source fields. Note that the InstanceList is built from the
        // loaded pipes, so the sentence is preprocessed like the training data.
        InstanceList testing = new InstanceList(pipes);   
        testing.addThruPipe(
            new Instance(sentence, null, "test instance", null));

        // here we get an inferencer from our loaded model and use it
        TopicInferencer inferencer = model.getInferencer();
        double[] testProbabilities = inferencer
                   .getSampledDistribution(testing.get(0), 10, 1, 5);
        System.out.println("0\t" + testProbabilities[0]);
    }
}
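The method above prints only the probability of topic 0. Since getSampledDistribution returns one probability per topic, a natural follow-up is to list the most probable topics for the new document. The sketch below is a minimal illustration in plain Java: TopTopics and rank are hypothetical names, and the hard-coded array merely stands in for the testProbabilities array returned by the inferencer.

```java
import java.util.Arrays;

// Hypothetical helper (not part of Mallet): ranks topics by inferred probability.
final class TopTopics {

    // Returns the indices of the k largest entries, most probable first.
    // Ties keep the lower topic index first because the object sort is stable.
    static int[] rank(double[] probabilities, int k) {
        Integer[] order = new Integer[probabilities.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(probabilities[b], probabilities[a]));

        int n = Math.max(0, Math.min(k, order.length));
        int[] top = new int[n];
        for (int i = 0; i < n; i++) {
            top[i] = order[i];
        }
        return top;
    }

    public static void main(String[] args) {
        // Made-up values standing in for the inferencer's output.
        double[] testProbabilities = {0.05, 0.70, 0.10, 0.15};
        for (int topic : rank(testProbabilities, 2)) {
            System.out.println(topic + "\t" + testProbabilities[topic]);
        }
    }
}
```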

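For some reason I do not get exactly the same inference as with the original model, but that is a matter for another question (if anyone knows why, I would be glad to hear it).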

score 3 · Accepted Answer

I found the answer in slides by Mallet's lead developer:

TopicInferencer inferencer = model.getInferencer();
double[] topicProbs = inferencer.getSampledDistribution(newInstance, 100, 10, 10);
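The three integer arguments are, in order, the number of sampling iterations, the thinning interval and the burn-in length. To get a feel for how they interact, the sketch below counts how many samples would be averaged for a given setting. It rests on an assumption about the sampler's schedule (a sample is kept whenever the iteration is past burn-in and the number of iterations since burn-in is a multiple of the thinning interval); SamplingSchedule and savedSamples are hypothetical names, not Mallet API.

```java
// Hypothetical illustration (not Mallet code): counts the samples kept by a
// burn-in/thinning schedule, under the assumption stated above.
final class SamplingSchedule {

    static int savedSamples(int numIterations, int thinning, int burnIn) {
        int saved = 0;
        for (int iteration = 1; iteration <= numIterations; iteration++) {
            // Assumed rule: skip the burn-in, then keep every thinning-th iteration.
            if (iteration > burnIn && (iteration - burnIn) % thinning == 0) {
                saved++;
            }
        }
        return saved;
    }

    public static void main(String[] args) {
        // Settings used in the two answers: (10, 1, 5) and (100, 10, 10).
        System.out.println(savedSamples(10, 1, 5));
        System.out.println(savedSamples(100, 10, 10));
    }
}
```

Under that assumption, the longer run averages more widely spaced samples, which would tend to give a more stable estimate than the short (10, 1, 5) setting.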
