nlp - Limiting the number of iterations in Stanford NER

Question

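I am training a Stanford NER CRF model on a custom dataset, but training has now reached 333 iterations, i.e. the process has been running for several hours. These are the messages printed in the terminal: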

Iter 335 evals 400 <D> [M 1.000E0] 2.880E3 38054.87s |5.680E1| {6.652E-6} 4.488E-4 - 
Iter 336 evals 401 <D> [M 1.000E0] 2.880E3 38153.66s |1.243E2| {1.456E-5} 4.415E-4 -
 -

The properties file I am using is given below. Is there any way to limit the number of iterations to 20?

#location of the training file
trainFile = TRAIN5000.tsv
#location where you would like to save (serialize to) your
#classifier; adding .gz at the end automatically gzips the file,
#making it faster and smaller
serializeTo = ner-model_TRAIN5000.ser.gz

#structure of your training file; this tells the classifier
#that the word is in column 0 and the correct answer is in
#column 1
map = word=0,answer=1

#these are the features we'd like to train with
#some are discussed below, the rest can be
#understood by looking at NERFeatureFactory
useClassFeature=true
useWord=true
useNGrams=true
#no ngrams will be included that do not contain either the
#beginning or end of the word
noMidNGrams=true
useDisjunctive=true
maxNGramLeng=6
usePrev=true
useNext=true
useSequences=true
usePrevSequences=true
maxLeft=1
#the next 4 deal with word shape features
useTypeSeqs=true
useTypeSeqs2=true
useTypeySequences=true
wordShape=chris2useLC
saveFeatureIndexToDisk = true
printFeatures=true
flag useObservedSequencesOnly=true
featureDiffThresh=0.05

score 1 · Accepted Answer

I tried to train a biomedical (BioNER) model on IOB-tagged, tokenized text, as described in the Stanford CoreNLP CRF classifier FAQ (https://nlp.stanford.edu/software/crf-faq.html).

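My corpus, built from downloaded sources, is very large (about 1.5 million lines; 6 features: GENE; ...). Since training appeared to run indefinitely, I plotted the ratio of the values to gauge progress.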

Reading the Java source, I found that the default TOL (tolerance; used to decide when to terminate the training session) is 1E-6 (0.000001), in .../CoreNLP/src/edu/stanford/nlp/optimization/QNMinimizer.java.
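To track progress against a tolerance without waiting for the run to finish, the progress lines printed during training (like those quoted in the question) can be parsed. The following is a minimal Python sketch, not part of Stanford NER. It assumes the columns are iteration, evaluations, scaling, line search, objective value, elapsed time, gradient norm, relative gradient norm, and an average-improvement figure, and that this last figure is what the optimizer compares against TOL. The column meanings and the names `parse_progress` and `below_tolerance` are assumptions made for illustration.

```python
import re

# Assumed layout of one progress line (column meanings are an assumption):
# Iter N evals M <scaling> [line search] value time |gnorm| {relnorm} avg_improve
LINE_RE = re.compile(
    r"Iter\s+(?P<iter>\d+)\s+evals\s+(?P<evals>\d+)\s+<\w+>\s+\[[^\]]*\]\s+"
    r"(?P<value>\S+)\s+(?P<secs>[\d.]+)s\s+\|(?P<gnorm>[^|]+)\|\s+"
    r"\{(?P<relnorm>[^}]+)\}\s+(?P<avg_improve>\S+)"
)


def parse_progress(line):
    """Return the numeric fields of one progress line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "iter": int(m["iter"]),
        "evals": int(m["evals"]),
        "value": float(m["value"]),
        "seconds": float(m["secs"]),
        "gnorm": float(m["gnorm"]),
        "relnorm": float(m["relnorm"]),
        "avg_improve": float(m["avg_improve"]),
    }


def below_tolerance(line, tol):
    """True if the assumed average-improvement column is smaller than tol."""
    row = parse_progress(line)
    return row is not None and row["avg_improve"] < tol
```

Feeding each log line to `parse_progress` and plotting `avg_improve` against `iter` gives a rough view of how far a run is from a chosen tolerance, under the column assumptions stated above.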

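Looking at that plot, my original training session would never have finished. [The plot also shows that setting a larger TOL value, e.g. tolerance=0.05, triggers premature termination of training, because that TOL value is tripped by the "noise" that occurs near the start of the training session. I confirmed this with a tolerance=0.05 entry in my .prop file; however, TOL values of 0.01, 0.005, etc. were "OK".]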

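Adding maxIterations=20 to the properties file, as described by @StanfordNLPHelp (elsewhere in this thread), appeared to be ignored unless I also added, and varied, the tolerance= value in my properties file (bioner.prop); e.g.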

tolerance=0.005
# optional
maxIterations=20

In that case, the classifier trains the model (bioner.ser.gz) quickly. [When I added the maxIterations line to my .prop file without adding the tolerance line, the model ran "forever" as before.]
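When experimenting with several tolerance values, it can help to generate the .prop variants rather than edit them by hand. The following is a small Python sketch under stated assumptions: `set_props` is a hypothetical helper (not part of Stanford NER), it handles only simple key=value lines, and it does not check whether the trainer actually honors a given key.

```python
def set_props(text, updates):
    """Return .prop text with each key in `updates` set, replacing existing entries."""
    remaining = dict(updates)
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        # Only simple "key=value" lines are recognized; comment lines are left alone.
        key = stripped.split("=", 1)[0].strip() if "=" in stripped else None
        if key in remaining and not stripped.startswith("#"):
            out.append(f"{key}={remaining.pop(key)}")
        else:
            out.append(line)
    # Append keys that were not already present, one per line, with no trailing comments.
    out.extend(f"{k}={v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Hypothetical starting file content, for illustration only.
    prop = "trainFile = TRAIN5000.tsv\nmaxIterations=100\n"
    print(set_props(prop, {"tolerance": "0.005", "maxIterations": "20"}))
```

Each setting is written on its own line because the Java .properties format treats everything after the separator as part of the value, so a comment placed after a value on the same line would become part of that value.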

A list of the parameters that can be included in the .prop file can be found here:

https://nlp.stanford.edu/nlp/javadoc/javanlp-3.5.0/edu/stanford/nlp/ie/NERFeatureFactory.html

score 1 · Accepted Answer


Short answer: use tolerance (default 1e-4). There is also a maxIterations parameter, which is ignored.

answered 2018-09-24T13:57:08.580

score 0 · Accepted Answer

Use maxQNItr=21 in your prop file. It will run up to 20 iterations. I got help from David's answer.

score -1 · Accepted Answer


Add maxIterations=20 to the properties file.

answered 2017-04-09T07:20:53.070
