java - Improving the performance of a program based on the Stanford tagger

Question

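I have just implemented a program that uses the Stanford POS tagger in Java.

My input file is a few KB in size and contains a few hundred words. I even set the heap size to 600 MB.

It is still slow, though, and sometimes runs out of heap memory. How can I improve its execution speed and memory usage? I would like to be able to use inputs of a few MB.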

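  import java.io.BufferedWriter;
  import java.io.File;
  import java.io.FileWriter;
  import java.io.IOException;
  import org.apache.commons.io.FileUtils;   // FileUtils is presumably Apache Commons IO
  import edu.stanford.nlp.tagger.maxent.MaxentTagger;

  public static void postag(String args) throws ClassNotFoundException
  {
      try
      {
          File filein = new File("c://input.txt");
          // Reads the whole input file into memory as a single String.
          String content = FileUtils.readFileToString(filein);

          // Loads the tagger model from disk on every call.
          MaxentTagger tagger = new MaxentTagger("postagging/wsj-0-18-bidirectional-distsim.tagger");
          String tagged = tagger.tagString(content);

          File file = new File("c://output.txt");
          // FileWriter creates the file if it is missing; try-with-resources
          // closes the writer even if write() throws.
          try (BufferedWriter bw = new BufferedWriter(new FileWriter(file.getAbsoluteFile())))
          {
              bw.write("\n" + tagged);
          }
      }
      catch (IOException e)
      {
          e.printStackTrace();
      }
  }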

score 8 · Accepted Answer

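My first and main piece of advice is to use wsj-0-18-left3words-distsim.tagger (or, probably better for general text, english-left3words-distsim.tagger from recent releases) rather than wsj-0-18-bidirectional-distsim.tagger. The bidirectional tagger is slightly more accurate, but it is about 6 times slower and uses about twice as much memory. A figure, FWIW: on a 2012 MacBook Pro, given enough text to "warm up", the left3words tagger tags text at about 35,000 words per second.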
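If you want to compare the two models on your own machine, a rough throughput measurement along the following lines may help. This is only a sketch, not a rigorous benchmark: the class and method names (TaggerThroughput, wordsPerSecond, countWords) are made up for illustration, words are counted by whitespace rather than by the tagger's own tokenizer, and the main method uses a stand-in function so that it runs without the Stanford jar. With the library on the classpath you would pass tagger::tagString instead.

```java
import java.util.function.UnaryOperator;

final class TaggerThroughput {

    /** Counts whitespace-separated words; only a rough proxy for the tagger's token count. */
    static int countWords(String text) {
        String trimmed = text.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /**
     * Applies tagFn to the text warmupRuns times without timing (the "warm up"),
     * then timedRuns times with timing, and returns the average words per second
     * over the timed runs.
     */
    static double wordsPerSecond(UnaryOperator<String> tagFn, String text,
                                 int warmupRuns, int timedRuns) {
        for (int i = 0; i < warmupRuns; i++) {
            tagFn.apply(text);
        }
        long start = System.nanoTime();
        for (int i = 0; i < timedRuns; i++) {
            tagFn.apply(text);
        }
        // Guard against a zero elapsed time on very fast stand-in functions.
        long elapsedNanos = Math.max(1, System.nanoTime() - start);
        return (double) countWords(text) * timedRuns * 1e9 / elapsedNanos;
    }

    public static void main(String[] args) {
        // Stand-in for the real tagger (assumption): replace with tagger::tagString.
        UnaryOperator<String> standIn = s -> s.toUpperCase();
        String sample = "The quick brown fox jumps over the lazy dog .";
        System.out.printf("%.0f words/sec%n", wordsPerSecond(standIn, sample, 100, 1000));
    }
}
```

Running the same measurement once per model, on the same text, gives a like-for-like comparison; the absolute numbers will depend on your hardware and JVM settings.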

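The other piece of advice, regarding memory use: if you have a large amount of text, pass it to tagString() in reasonably sized chunks rather than as one huge String, because the whole String is tokenized at once, which adds to the memory requirements.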
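The chunking advice could be sketched roughly as follows. The helper name tagInChunks, the ChunkedTagging class, and the maxChars threshold are hypothetical, and the sketch assumes that line breaks in the input do not fall in the middle of a sentence (for example, one paragraph per line). To keep it runnable without the Stanford jar, the tagger is passed in as a function; with the library available you would load the model once and pass tagger::tagString.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.function.UnaryOperator;

final class ChunkedTagging {

    /**
     * Reads the input line by line, groups whole lines into chunks of at least
     * maxChars characters, applies tagChunk to each chunk, and writes each result
     * followed by a newline. Returns the number of chunks processed.
     */
    static int tagInChunks(Reader in, Writer out, int maxChars,
                           UnaryOperator<String> tagChunk) throws IOException {
        BufferedReader reader = new BufferedReader(in);
        StringBuilder chunk = new StringBuilder();
        int chunks = 0;
        String line;
        while ((line = reader.readLine()) != null) {
            chunk.append(line).append('\n');
            if (chunk.length() >= maxChars) {
                out.write(tagChunk.apply(chunk.toString()));
                out.write('\n');
                chunk.setLength(0);   // reuse the buffer for the next chunk
                chunks++;
            }
        }
        if (chunk.length() > 0) {     // tag whatever is left after the last full chunk
            out.write(tagChunk.apply(chunk.toString()));
            out.write('\n');
            chunks++;
        }
        out.flush();
        return chunks;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the real tagger (assumption): appends a dummy tag to each word.
        UnaryOperator<String> fakeTagger = s -> s.trim().replaceAll("\\s+", "_X ") + "_X";
        StringWriter out = new StringWriter();
        int n = tagInChunks(new StringReader("a b\nc d\ne f\n"), out, 8, fakeTagger);
        System.out.println(n + " chunks");
        System.out.print(out);
    }
}
```

In real use the Reader and Writer would wrap the input and output files, so only one chunk of text and its tagged result are held in memory at a time, and the model is loaded once rather than per chunk.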
