我建议您逐行阅读文件,然后split
在单词边界上调用以获取单词数。
public static void main(String[] args) throws IOException {
final File file = new File("myFile");
try (final BufferedReader bufferedReader =
new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
final String[] words = line.split("\\b");
System.out.println(words.length + " words in line \"" + line + "\".");
}
}
}
这样可以避免从您的程序中调用 grep。
您得到的奇怪字符很可能与使用错误的编码有关。你确定你的文件是“UTF-8”吗?
编辑
OP 希望逐行读取一个文件,然后在另一个文件中搜索读取行的出现。
这仍然可以使用 java 更轻松地完成。根据您的其他文件有多大,您可以先将其读入内存并进行搜索,也可以逐行搜索
将文件读入内存的简单示例:
public static void main(String[] args) throws UnsupportedEncodingException, IOException {
final File corpusFile = new File("corpus");
final String corpusFileContent = readFileToString(corpusFile);
final File file = new File("myEngramFile");
try (final BufferedReader bufferedReader =
new BufferedReader(new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
final int matches = countOccurencesOf(line, corpusFileContent);
};
}
}
private static String readFileToString(final File file) throws IOException {
final StringBuilder stringBuilder = new StringBuilder();
try (final FileChannel fc = new RandomAccessFile(file, "r").getChannel()) {
final ByteBuffer byteBuffer = ByteBuffer.allocate(4096);
final CharsetDecoder charsetDecoder = Charset.forName("UTF-8").newDecoder();
while (fc.read(byteBuffer) > 0) {
byteBuffer.flip();
stringBuilder.append(charsetDecoder.decode(byteBuffer));
byteBuffer.reset();
}
}
return stringBuilder.toString();
}
private static int countOccurencesOf(final String countMatchesOf, final String inString) {
final Matcher matcher = Pattern.compile("\\b" + countMatchesOf + "\\b").matcher(inString);
int count = 0;
while (matcher.find()) {
++count;
}
return count;
}
如果您的“语料库”文件小于一百兆字节左右,这应该可以正常工作。任何更大的,你会想要改变“countOccurencesOf”方法像这样
private static int countOccurencesOf(final String countMatchesOf, final File inFile) throws IOException {
final Pattern pattern = Pattern.compile("\\b" + countMatchesOf + "\\b");
int count = 0;
try (final BufferedReader bufferedReader =
new BufferedReader(new InputStreamReader(new FileInputStream(inFile), "UTF-8"))) {
String line;
while ((line = bufferedReader.readLine()) != null) {
final Matcher matcher = pattern.matcher(line);
while (matcher.find()) {
++count;
}
};
}
return count;
}
现在您只需将“文件”对象传递给方法而不是字符串化文件。
请注意,流式传输方法逐行读取文件并因此删除换行符,String
如果您Pattern
依赖它们存在,则需要在解析之前将它们添加回来。