
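I want to classify some data into different classes based on its content. I did this with a naive Bayes classifier, and the output is the best-matching category for each item. Now, however, I want news that is unlike anything in the training set to be classified into an "others" class. I can't manually add every piece of data outside the training data to some class, because there is a huge number of other categories. Is there any way to classify this other data?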

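import com.aliasi.classify.Classification;
import com.aliasi.classify.Classified;
import com.aliasi.classify.DynamicLMClassifier;
import com.aliasi.lm.NGramProcessLM;
import com.aliasi.util.Files;

import java.io.File;
import java.io.IOException;

private static File TRAINING_DIR = new File("4news-train");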
private static File TESTING_DIR = new File("4news-test");
private static String[] CATEGORIES = { "c1", "c2", "c3", "others" };

private static int NGRAM_SIZE = 6;

public static void main(String[] args) throws ClassNotFoundException, IOException {
    DynamicLMClassifier<NGramProcessLM> classifier = DynamicLMClassifier.createNGramProcess(CATEGORIES, NGRAM_SIZE);
    for (int i = 0; i < CATEGORIES.length; ++i) {
        File classDir = new File(TRAINING_DIR, CATEGORIES[i]);
        if (!classDir.isDirectory()) {
            String msg = "Could not find training directory=" + classDir + "\nTraining directory not found";
            System.out.println(msg); // in case exception gets lost in shell
            throw new IllegalArgumentException(msg);
        }

        String[] trainingFiles = classDir.list();
        for (int j = 0; j < trainingFiles.length; ++j) {
            File file = new File(classDir, trainingFiles[j]);
            String text = Files.readFromFile(file, "ISO-8859-1");
            System.out.println("Training on " + CATEGORIES[i] + "/" + trainingFiles[j]);
            Classification classification = new Classification(CATEGORIES[i]);
            Classified<CharSequence> classified = new Classified<CharSequence>(text, classification);
            classifier.handle(classified);
        }
    }
}

2 Answers


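Naive Bayes gives you a "confidence" for each classification, because it computes

P(y|x) ~ P(y)P(x|y)

which, once normalized by P(x), is the probability that x belongs to class y. You can simply threshold this value and say

cl(x) = "other" iff max_{over y}(P(y|x)) < T

where T can be, for example, the minimum confidence observed on the training set: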

T = min_{over x and y in Training set}( P(y|x) )
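
The thresholding rule above can be sketched in plain Java. This is a minimal illustration, not the asker's LingPipe setup: the class `OtherThreshold`, its method names, the example scores, and the threshold value 0.6 are all hypothetical. It assumes you can obtain an unnormalized log score log(P(y)P(x|y)) per category from your classifier; how to get those from a particular library is left out.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class OtherThreshold {

    /** Normalizes log(P(y) * P(x|y)) scores into posteriors P(y|x) (log-sum-exp for stability). */
    static Map<String, Double> posteriors(Map<String, Double> logJoint) {
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logJoint.values()) max = Math.max(max, v);
        double sum = 0.0;
        for (double v : logJoint.values()) sum += Math.exp(v - max);
        Map<String, Double> result = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : logJoint.entrySet()) {
            result.put(e.getKey(), Math.exp(e.getValue() - max) / sum);
        }
        return result;
    }

    /** Returns the most probable category, or "others" if its posterior is below the threshold. */
    static String classify(Map<String, Double> logJoint, double threshold) {
        String best = null;
        double bestP = -1.0;
        for (Map.Entry<String, Double> e : posteriors(logJoint).entrySet()) {
            if (e.getValue() > bestP) {
                best = e.getKey();
                bestP = e.getValue();
            }
        }
        return bestP < threshold ? "others" : best;
    }

    /** One reading of T: the smallest posterior any training item gets for its own true label. */
    static double minTrainingConfidence(List<Map<String, Double>> logJoints, List<String> labels) {
        double t = 1.0;
        for (int i = 0; i < logJoints.size(); i++) {
            t = Math.min(t, posteriors(logJoints.get(i)).get(labels.get(i)));
        }
        return t;
    }

    static Map<String, Double> scores(double c1, double c2, double c3) {
        Map<String, Double> m = new LinkedHashMap<>();
        m.put("c1", c1);
        m.put("c2", c2);
        m.put("c3", c3);
        return m;
    }

    public static void main(String[] args) {
        double t = 0.6; // hypothetical threshold, chosen only for this example
        // One category clearly dominates, so its label is kept.
        System.out.println(classify(scores(-10.0, -20.0, -25.0), t));
        // All categories score about the same, so the item falls into "others".
        System.out.println(classify(scores(-10.0, -10.1, -10.2), t));

        List<Map<String, Double>> train = new ArrayList<>();
        train.add(scores(-10.0, -20.0, -25.0));
        List<String> labels = new ArrayList<>();
        labels.add("c1");
        System.out.println(minTrainingConfidence(train, labels) > 0.99);
    }
}
```

Naive Bayes posteriors tend to be pushed toward 0 or 1, so a threshold taken from the training set may reject very little; checking T against held-out data that contains some out-of-category items is probably worthwhile.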
answered 2014-02-18T10:01:45.230

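Just serialize the object... that is, write the intermediate object to a file, and that file will be your model...

Then, for testing, you only need to pass the data to the model, without retraining every time... that will be easier for you.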
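
As a rough illustration of the serialize-once, reuse-later idea, here is a sketch using standard Java object serialization. The `ModelStore` class and its nested `Model` are hypothetical stand-ins for whatever trained object you have; this assumes the model implements `java.io.Serializable`, which may not hold for a given library's classifier, so check what persistence mechanism your library offers.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

public class ModelStore {

    /** Hypothetical stand-in for a trained model: just a map of counts. */
    static class Model implements Serializable {
        private static final long serialVersionUID = 1L;
        final Map<String, Integer> counts = new HashMap<>();
    }

    /** Writes the trained model to disk once, after training. */
    static void save(Serializable model, File file) throws IOException {
        try (ObjectOutputStream out = new ObjectOutputStream(new FileOutputStream(file))) {
            out.writeObject(model);
        }
    }

    /** Reads the model back, so testing does not need to retrain. */
    static Object load(File file) throws IOException, ClassNotFoundException {
        try (ObjectInputStream in = new ObjectInputStream(new FileInputStream(file))) {
            return in.readObject();
        }
    }

    public static void main(String[] args) throws IOException, ClassNotFoundException {
        Model trained = new Model();
        trained.counts.put("c1", 42);

        File file = File.createTempFile("model", ".bin");
        file.deleteOnExit();
        save(trained, file);

        Model restored = (Model) load(file);
        System.out.println(restored.counts.get("c1"));
    }
}
```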

answered 2014-09-19T08:18:54.137