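I just read the paper Distributed Representations of Sentences and Documents. In the sentiment analysis experiment section, it says, "After learning the vector representations for training sentences and their subphrases, we feed them to a logistic regression to learn a predictor of the movie rating." So it uses logistic regression as the classifier that determines the label.
Then I moved on to dl4j and read the example "ParagraphVectorsClassifierExample". The code is shown below: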
void makeParagraphVectors() throws Exception {
    ClassPathResource resource = new ClassPathResource("paravec/labeled");

    // build an iterator for our dataset
    iterator = new FileLabelAwareIterator.Builder()
            .addSourceFolder(resource.getFile())
            .build();

    tokenizerFactory = new DefaultTokenizerFactory();
    tokenizerFactory.setTokenPreProcessor(new CommonPreprocessor());

    // ParagraphVectors training configuration
    paragraphVectors = new ParagraphVectors.Builder()
            .learningRate(0.025)
            .minLearningRate(0.001)
            .batchSize(1000)
            .epochs(20)
            .iterate(iterator)
            .trainWordVectors(true)
            .tokenizerFactory(tokenizerFactory)
            .build();

    // Start model training
    paragraphVectors.fit();
}
void checkUnlabeledData() throws IOException {
    /*
    At this point we assume that we have a model built and we can check
    which categories our unlabeled document falls into.
    So we'll start loading our unlabeled documents and checking them
    */
    ClassPathResource unClassifiedResource = new ClassPathResource("paravec/unlabeled");
    FileLabelAwareIterator unClassifiedIterator = new FileLabelAwareIterator.Builder()
            .addSourceFolder(unClassifiedResource.getFile())
            .build();

    /*
    Now we'll iterate over unlabeled data, and check which label it could be assigned to
    Please note: for many domains it's normal to have 1 document fall into a few labels at once,
    with a different "weight" for each.
    */
    MeansBuilder meansBuilder = new MeansBuilder(
            (InMemoryLookupTable<VocabWord>) paragraphVectors.getLookupTable(),
            tokenizerFactory);
    LabelSeeker seeker = new LabelSeeker(iterator.getLabelsSource().getLabels(),
            (InMemoryLookupTable<VocabWord>) paragraphVectors.getLookupTable());

    while (unClassifiedIterator.hasNextDocument()) {
        LabelledDocument document = unClassifiedIterator.nextDocument();
        INDArray documentAsCentroid = meansBuilder.documentAsVector(document);
        List<Pair<String, Double>> scores = seeker.getScores(documentAsCentroid);

        /*
        please note, document.getLabels() is used just to show which document we're looking at now,
        as a substitute for printing out the whole document name.
        So, labels on these two documents are used like titles,
        just to visualize that our classification is done properly
        */
        log.info("Document '" + document.getLabels() + "' falls into the following categories: ");
        for (Pair<String, Double> score : scores) {
            log.info("        " + score.getFirst() + ": " + score.getSecond());
        }
    }
}
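It demonstrates how doc2vec associates arbitrary documents with labels, but it hides the implementation behind the scenes. My question is: does it also do this with logistic regression? If not, what does it use instead? And how can I do the same thing with logistic regression?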
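To make the last part of the question concrete, here is a minimal sketch of the kind of step I have in mind: a binary logistic regression trained by batch gradient descent on fixed-size document vectors. It is plain Java with no dl4j calls. The class name `DocVectorLogisticRegression` and the toy 2-dimensional vectors are made up for illustration; in practice I assume the inputs would be the paragraph vectors taken from the trained model, and the paper's setup may well differ from this (for example in the number of classes).

```java
/** Binary logistic regression over fixed-size document vectors (illustrative sketch only). */
class DocVectorLogisticRegression {
    private final double[] weights;
    private double bias;

    DocVectorLogisticRegression(int dimensions) {
        this.weights = new double[dimensions];
    }

    private static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    /** Probability that the given document vector belongs to the positive class. */
    double predictProbability(double[] vector) {
        double z = bias;
        for (int i = 0; i < weights.length; i++) {
            z += weights[i] * vector[i];
        }
        return sigmoid(z);
    }

    /** Batch gradient descent on the mean log-loss; labels must be 0 or 1. */
    void fit(double[][] vectors, int[] labels, int epochs, double learningRate) {
        int n = vectors.length;
        for (int epoch = 0; epoch < epochs; epoch++) {
            double[] gradW = new double[weights.length];
            double gradB = 0.0;
            for (int d = 0; d < n; d++) {
                // Gradient of the log-loss w.r.t. the logit is (prediction - label).
                double error = predictProbability(vectors[d]) - labels[d];
                for (int i = 0; i < weights.length; i++) {
                    gradW[i] += error * vectors[d][i];
                }
                gradB += error;
            }
            for (int i = 0; i < weights.length; i++) {
                weights[i] -= learningRate * gradW[i] / n;
            }
            bias -= learningRate * gradB / n;
        }
    }

    public static void main(String[] args) {
        // Toy stand-ins for paragraph vectors (assumption: real ones come from the trained model).
        double[][] train = { {0.9, 0.1}, {0.8, 0.3}, {0.1, 0.9}, {0.2, 0.7} };
        int[] labels = { 1, 1, 0, 0 };

        DocVectorLogisticRegression model = new DocVectorLogisticRegression(2);
        model.fit(train, labels, 2000, 0.5);

        System.out.println("P(positive | doc A) = " + model.predictProbability(new double[] {0.85, 0.2}));
        System.out.println("P(positive | doc B) = " + model.predictProbability(new double[] {0.15, 0.8}));
    }
}
```

What I cannot tell is whether the dl4j example does something equivalent to this internally, or whether I would have to extract the document vectors myself and train a separate classifier like the one above.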