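I am trying to count the frequency of words in a text file, but I have to do it in a slightly different way: a hyphenated pair and its reverse should be counted as the same entry. For example, if the file contains BRAIN-ISCHEMIA and ISCHEMIA-BRAIN, I need to count BRAIN-ISCHEMIA twice (and leave out ISCHEMIA-BRAIN), or vice versa. Here is the relevant part of my code: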
// Mapping of String->Integer (word -> frequency)
HashMap<String, Integer> frequencyMap = new HashMap<String, Integer>();

// Iterate through each line of the file
// (`in` is assumed to be a BufferedReader opened on the input file)
String[] temp;
String currentLine;
String currentLine2;
while ((currentLine = in.readLine()) != null) {
    // Remove this line if you want words to be case sensitive
    currentLine = currentLine.toLowerCase();

    // Build the reversed pair, e.g. "a-b" -> "b-a"
    temp = currentLine.split("-");
    currentLine2 = temp[1] + "-" + temp[0];

    // Iterate through each word of the current line
    // StringTokenizer with no delimiter argument splits on whitespace only
    StringTokenizer parser = new StringTokenizer(currentLine);
    while (parser.hasMoreTokens()) {
        String currentWord = parser.nextToken();
        Integer frequency = frequencyMap.get(currentWord);
        // Add the word if it doesn't already exist, otherwise increment the
        // frequency counter.
        if (frequency == null) {
            frequency = 0;
        }
        frequencyMap.put(currentWord, frequency + 1);
    }

    // Same counting again, this time for the reversed pair
    StringTokenizer parser2 = new StringTokenizer(currentLine2);
    while (parser2.hasMoreTokens()) {
        String currentWord2 = parser2.nextToken();
        Integer frequency = frequencyMap.get(currentWord2);
        // Add the word if it doesn't already exist, otherwise increment the
        // frequency counter.
        if (frequency == null) {
            frequency = 0;
        }
        frequencyMap.put(currentWord2, frequency + 1);
    }
}

// Display our nice little Map
System.out.println(frequencyMap);
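But for the following file:
ischemia-glutamate
ischemia-brain
glutamate-brain
brain-tolerance
brain-tolerance
tolerance-brain
glutamate-ischemia
ischemia-glutamate
I get the following output:
{glutamate-brain=1, ischemia-glutamate=3, ischemia-brain=1, glutamate-ischemia=3, brain-tolerance=3, brain-ischemia=1, tolerance-brain=3, brain-glutamate=1}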
I think the problem is in the second loop (the one over currentLine2). Any pointers on this would be highly appreciated.
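For what it is worth, the doubled counts seem to follow from the code as posted: every line is counted once as written and once reversed, so both orientations of a pair get incremented each time either one appears. One possible way to get the intended result, offered as a sketch rather than a definitive fix, is to look up the reversed key first and reuse it if that orientation is already in the map. The class name `PairFrequency` and the method `countPairs` below are made up for illustration, and the sketch assumes each token has exactly the form `a-b`, with tokens separated by whitespace.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PairFrequency {

    // Counts "a-b" pairs, treating "a-b" and "b-a" as the same entry.
    // Whichever orientation is seen first is the key kept in the map.
    static Map<String, Integer> countPairs(List<String> lines) {
        // LinkedHashMap keeps insertion order, which makes the output predictable.
        Map<String, Integer> frequencyMap = new LinkedHashMap<>();
        for (String line : lines) {
            // Assumption: tokens are separated by whitespace.
            for (String token : line.trim().toLowerCase().split("\\s+")) {
                String[] parts = token.split("-");
                // Skip blanks and anything that is not exactly "a-b".
                if (parts.length != 2 || parts[0].isEmpty()) {
                    continue;
                }
                String reversed = parts[1] + "-" + parts[0];
                // Reuse the reversed key if that orientation was counted first.
                String key = frequencyMap.containsKey(reversed) ? reversed : token;
                frequencyMap.merge(key, 1, Integer::sum);
            }
        }
        return frequencyMap;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "ischemia-glutamate",
                "ischemia-brain",
                "glutamate-brain",
                "brain-tolerance",
                "brain-tolerance",
                "tolerance-brain",
                "glutamate-ischemia",
                "ischemia-glutamate");
        // prints {ischemia-glutamate=3, ischemia-brain=1, glutamate-brain=1, brain-tolerance=3}
        System.out.println(countPairs(lines));
    }
}
```

If the orientation that ends up as the key does not matter, an alternative would be to sort the two halves alphabetically before counting, which avoids the extra lookup.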