java - Reading multiple files from a directory to create a bag of words in Java, plus operations on Integer objects

Question

Is a bag of words the same thing as a document-term matrix?

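I have a training data set made up of many files. I would like to read them all into a data structure (a hash map?) to create a bag-of-words model for a particular class of documents (science, religion, sports, or sex), in preparation for a perceptron implementation.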
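As a rough illustration of how per-document word counts could later feed a perceptron, the sketch below turns count maps into fixed-length vectors. The class name `BagVectors` and both method names are hypothetical, and sorting the vocabulary alphabetically is just one assumed way to fix the column order. Stacking one such vector per document gives a matrix with a row per document and a column per term, which is how a bag of words relates to a document-term matrix.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Hypothetical helper, not part of the original post.
final class BagVectors {
    // Collects every word seen in any document into one sorted vocabulary.
    static List<String> vocabulary(List<Map<String, Integer>> documents) {
        TreeSet<String> words = new TreeSet<>();
        for (Map<String, Integer> doc : documents) {
            words.addAll(doc.keySet());
        }
        return new ArrayList<>(words);
    }

    // Turns one document's counts into a vector aligned with the vocabulary.
    // Words the document does not contain get a count of 0.
    static double[] toVector(Map<String, Integer> counts, List<String> vocabulary) {
        double[] vector = new double[vocabulary.size()];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = counts.getOrDefault(vocabulary.get(i), 0);
        }
        return vector;
    }
}
```

Every vector built against the same vocabulary has the same length and column meaning, which is what a weight vector in a perceptron would need.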

Right now I have only the simplest possible Java I/O construct, namely:

    String text; 
    BufferedReader br = new BufferedReader(new FileReader("file"));

    while ((text = br.readLine()) != null) 
    {
        //read in multiple files
        //generate a hash map with each unique word
        //as a key and the frequency with which that
        //word appears as the value
    }

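So what I want to do is read input from multiple files in a directory and save all of the data into one underlying structure. How can I do that? Should I write it out to a file somewhere?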

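From my understanding of bag of words, I think a hash map, as described in the comments of the code above, would work. Is that correct? How could I implement something like this to read input from multiple files in sync? How should I store it so that I can later incorporate it into my perceptron algorithm?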
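The hash-map idea described above (each unique word as a key, its frequency as the value) can be sketched as follows. The class `WordCounts` and its methods are hypothetical names, and lower-casing plus splitting on whitespace are assumed normalisation choices rather than anything the question specifies.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of the original post.
final class WordCounts {
    // Adds the words of one line to an existing count map.
    static void addLine(String line, Map<String, Integer> counts) {
        for (String word : line.trim().toLowerCase().split("\\s+")) {
            if (word.isEmpty()) {
                continue; // a blank line produces a single empty token
            }
            // Look up the old count (0 if the word is new) and store it plus one.
            counts.put(word, counts.getOrDefault(word, 0) + 1);
        }
    }

    // Counts the words of many lines into one fresh map.
    static Map<String, Integer> countLines(Iterable<String> lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            addLine(line, counts);
        }
        return counts;
    }
}
```

Because `addLine` updates a map that the caller passes in, the same map can be reused across every line of every file, so all files end up in one structure.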

I have seen it done like this:

  String[] names = new String[]{"a.txt", "b.txt", "c.txt"};
  StringBuffer strContent = new StringBuffer("");

  for (String name : names) {
      File file = new File(name);
      int ch;
      FileInputStream stream = null;
      try {
          stream = new FileInputStream(file);
          // Read one byte at a time; the cast to char assumes single-byte text
          while ((ch = stream.read()) != -1) {
              strContent.append((char) ch);
          }
      } finally {
          // stream is still null if the constructor threw
          if (stream != null) {
              stream.close();
          }
      }
  }

But this is a poor solution because you have to specify all of the files in advance. I think it should be more dynamic, if possible.
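One possible way to avoid listing file names by hand is to let the JDK discover them. The sketch below uses `Files.walk` from `java.nio.file`; the class name `DirectoryReader` is hypothetical, and UTF-8 is an assumed encoding for the training files.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of the original post.
final class DirectoryReader {
    // Returns every line of every regular file found under root, recursively.
    static List<String> readAllLines(Path root) throws IOException {
        List<Path> files;
        // The stream returned by Files.walk holds open directories, so close it.
        try (Stream<Path> paths = Files.walk(root)) {
            files = paths.filter(Files::isRegularFile)
                         .sorted()
                         .collect(Collectors.toList());
        }
        List<String> lines = new ArrayList<>();
        for (Path file : files) {
            // Assumes the files are UTF-8 text; other encodings may throw.
            lines.addAll(Files.readAllLines(file, StandardCharsets.UTF_8));
        }
        return lines;
    }
}
```

Reading whole files into memory is only reasonable for a modest data set; for large corpora it would probably be better to process each file line by line.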

score 1 · Accepted Answer

You can try the program below. It is dynamic: you only need to provide your directory path.

import java.io.File;
import java.io.IOException;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

public class BagOfWords {

ConcurrentHashMap<String, Set<String>> map = new ConcurrentHashMap<String, Set<String>>();

public static void main(String[] args) throws IOException {
    File file = new File("F:/Downloads/Build/");
    new BagOfWords().iterateDirectory(file);
}

private void iterateDirectory(File file) throws IOException {
    for (File f : file.listFiles()) {
        if (f.isDirectory()) {
            // Recurse into the subdirectory f; passing file again would never terminate
            iterateDirectory(f);
        } else {
            // Read File
            // Split and put it in a set
            // add to map
        }
    }
}

}
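The three placeholder comments in the `else` branch could be filled in roughly as follows. This is only a sketch: the class name `UniqueWordsPerFile` is hypothetical, and keying the map by file path is an assumption about what the `Map<String, Set<String>>` is meant to hold. Note that a `Set` records only which words occur, not how often, so word frequencies would need a count per word instead.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of the original answer.
final class UniqueWordsPerFile {
    // Assumed meaning: file path -> distinct words found in that file.
    final Map<String, Set<String>> map = new ConcurrentHashMap<>();

    void addFile(File f) throws IOException {
        Set<String> words = new HashSet<>();
        // Read File (try-with-resources closes the reader)
        try (BufferedReader br = new BufferedReader(new FileReader(f))) {
            String line;
            while ((line = br.readLine()) != null) {
                // Split and put it in a set
                for (String w : line.split("\\s+")) {
                    if (!w.isEmpty()) {
                        words.add(w);
                    }
                }
            }
        }
        // add to map
        map.put(f.getPath(), words);
    }
}
```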

score 0 · Accepted Answer

I think this is very close, but there is some kind of mismatch between int and Integer. How can that be reconciled?

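// word -> count; the value type must be the wrapper Integer, since generics cannot take the primitive int
ConcurrentHashMap<String, Integer> map = new ConcurrentHashMap<String, Integer>();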

        public static void main(String[] args) throws IOException 
        {
            String path = "path";
            File file = new File( path );
            new BagOfWords().iterateDirectory(file);
        }    

        private void iterateDirectory(File file) throws IOException 
        {
            for (File f : file.listFiles()) 
            {
                if (f.isDirectory()) 
                {
                    // Recurse into the subdirectory f, not the parent
                    iterateDirectory(f);
                } 
                else 
                {

                    String line;
                    // Open the file currently being visited; try-with-resources closes it
                    try (BufferedReader br = new BufferedReader(new FileReader(f)))
                    {
                        while ((line = br.readLine()) != null)
                        {
                            // Split on runs of whitespace to get the words of this line
                            String[] words = line.split("\\s+");

                            for (String word : words)
                            {
                                if (word.isEmpty())
                                {
                                    continue;
                                }
                                // Start unseen words at 0, then add 1 to the stored count
                                if (!map.containsKey(word))
                                {
                                    map.put(word, 0);
                                }
                                map.put(word, map.get(word) + 1);
                            }
                        }
                    }

                }
            }
        }
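On the int versus Integer question: a generic map cannot use the primitive `int` as a type argument, so the values have to be declared as `Integer`, and Java then converts between the two automatically (autoboxing and unboxing). A minimal sketch, assuming the map is meant to hold word counts and using the hypothetical class name `IntegerCounts`:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper, not part of the original answer.
final class IntegerCounts {
    static Map<String, Integer> count(String[] words) {
        // Type arguments must be reference types: Integer, not int.
        Map<String, Integer> map = new ConcurrentHashMap<>();
        for (String word : words) {
            // The literal 1 is boxed to an Integer. For a new key it is stored
            // as is; otherwise Integer.sum adds it to the existing count.
            map.merge(word, 1, Integer::sum);
        }
        return map;
    }
}
```

`merge` replaces the `containsKey` / `put` / `get` sequence with a single call, and on a `ConcurrentHashMap` that call is performed atomically.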
