I am trying to implement the perceptron algorithm in Java, just the single-layer kind, not the full neural network kind. This is the classification problem I am trying to solve.
What I need to do is create a bag-of-words feature vector for each document, where every document belongs to one of four categories: politics, science, sports, or atheism. This is the data.
I am trying to implement this (quoting directly from the first answer to that question):
Example:
Document 1 = ["I", "am", "awesome"]
Document 2 = ["I", "am", "great", "great"]
The dictionary is:
["I", "am", "awesome", "great"]
So the documents as vectors would look like:
Document 1 = [1, 1, 1, 0]
Document 2 = [1, 1, 0, 2]
With that you can do all kinds of fancy math and feed it into your perceptron.
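Counting like this seems straightforward once the dictionary order is fixed; here is a tiny self-contained sketch of how I understand the quoted example (the class and variable names are my own, not from the answer):

import java.util.Arrays;
import java.util.List;

public class TinyBowExample
{
    public static void main(String[] args)
    {
        // the fixed dictionary order decides which count goes into which slot
        List<String> dictionary = Arrays.asList("I", "am", "awesome", "great");
        List<String> document2 = Arrays.asList("I", "am", "great", "great");

        int[] vector = new int[dictionary.size()];
        for (String word : document2)
        {
            int index = dictionary.indexOf(word);
            if (index >= 0)
            {
                vector[index]++;
            }
        }
        System.out.println(Arrays.toString(vector)); // prints [1, 1, 0, 2]
    }
}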
I have been able to generate the global dictionary, and now I need to make one for each document, but how do I keep them consistent? The folder structure is very simple, i.e. `/politics/` contains lots of articles, and for each one I need to create a feature vector against the global dictionary. I think the iterator I am using is confusing me.
Here is the main class:
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class BagOfWords
{
    static Set<String> global_dict = new HashSet<String>();
    static boolean global_dict_complete = false;
    static String path = "/home/Workbench/SUTD/ISTD_50.570/assignments/data/train";

    public static void main(String[] args) throws IOException
    {
        // each of the different categories
        String[] categories = { "/atheism", "/politics", "/science", "/sports" };

        // cycle through all categories once to populate the global dict
        for (int cycle = 0; cycle <= 3; cycle++)
        {
            String general_data_partition = path + categories[cycle];
            File file = new File(general_data_partition);
            Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
        }

        // after the global dict has been filled up, cycle through again
        // to populate a set of words for each document and compare it
        // to the global dict
        global_dict_complete = true;
        for (int cycle = 0; cycle <= 3; cycle++)
        {
            String general_data_partition = path + categories[cycle];
            File file = new File(general_data_partition);
            Iterateur.iterateDirectory(file, global_dict, global_dict_complete);
        }

        // print the data structure
        // for (String s : global_dict)
        //     System.out.println(s);
    }
}
This iterates over the directory structure:
import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.Set;

public class Iterateur
{
    static void iterateDirectory(File file,
                                 Set<String> global_dict,
                                 boolean global_dict_complete) throws IOException
    {
        for (File f : file.listFiles())
        {
            if (f.isDirectory())
            {
                // recurse into the subdirectory (f, not file, otherwise this never terminates)
                iterateDirectory(f, global_dict, global_dict_complete);
            }
            else
            {
                String line;
                BufferedReader br = new BufferedReader(new FileReader(f));
                while ((line = br.readLine()) != null)
                {
                    if (global_dict_complete == false)
                    {
                        // first pass: collect words into the global dictionary
                        Dictionary.populate_dict(file, f, line, br, global_dict);
                    }
                    else
                    {
                        // second pass: build a per-document representation
                        FeatureVecteur.generateFeatureVecteur(file, f, line, br, global_dict);
                    }
                }
                br.close();
            }
        }
    }
}
This fills up the global dictionary:
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.util.Set;

public class Dictionary
{
    public static void populate_dict(File file,
                                     File f,
                                     String line,
                                     BufferedReader br,
                                     Set<String> global_dict) throws IOException
    {
        // process the line handed over by the iterator first, then the rest of the file
        do
        {
            String[] words = line.split(" "); // those are your words
            for (String word : words)
            {
                global_dict.add(word); // a Set silently ignores duplicates
            }
        } while ((line = br.readLine()) != null);
    }
}
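My current idea for keeping the per-document vectors consistent is to freeze global_dict into one fixed order after the first pass, so that every document maps words to the same positions. Roughly something like this helper (DictionaryIndex is just a name I made up, and I have not tested it):

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DictionaryIndex
{
    // freeze the global dictionary into a fixed order so every document
    // maps words to the same vector positions
    public static Map<String, Integer> buildIndex(Set<String> global_dict)
    {
        List<String> ordered_dict = new ArrayList<String>(global_dict);
        Collections.sort(ordered_dict); // any fixed order works; sorting just makes it reproducible

        Map<String, Integer> word_index = new HashMap<String, Integer>();
        for (int i = 0; i < ordered_dict.size(); i++)
        {
            word_index.put(ordered_dict.get(i), i);
        }
        return word_index;
    }
}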
This is an initial attempt at filling in the document-specific dictionary:
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

public class FeatureVecteur
{
    public static void generateFeatureVecteur(File file,
                                              File f,
                                              String line,
                                              BufferedReader br,
                                              Set<String> global_dict) throws IOException
    {
        Set<String> file_dict = new HashSet<String>();

        // process the line handed over by the iterator first, then the rest of the file
        do
        {
            String[] words = line.split(" "); // those are your words
            for (String word : words)
            {
                file_dict.add(word);
            }
        } while ((line = br.readLine()) != null);
    }
}
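What I think generateFeatureVecteur should eventually return is a count vector in the frozen dictionary order, along these lines (only a sketch; FeatureVecteurSketch, generateCounts and the ordered_dict argument are my own guesses, not working code from my project):

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.util.List;

public class FeatureVecteurSketch
{
    // read one document and count its words against the frozen dictionary order
    public static int[] generateCounts(File f, List<String> ordered_dict) throws IOException
    {
        int[] counts = new int[ordered_dict.size()];
        BufferedReader br = new BufferedReader(new FileReader(f));
        String line;
        while ((line = br.readLine()) != null)
        {
            for (String word : line.split(" "))
            {
                int index = ordered_dict.indexOf(word); // the same index for every document
                if (index >= 0)
                {
                    counts[index]++;
                }
            }
        }
        br.close();
        return counts;
    }
}

Is that roughly the right way to keep each document's vector consistent with the global dictionary, or am I missing something about how the iterator should hand things over?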