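I work with huge data files, and sometimes I only need to know the number of lines in them. Usually I open the file and read it line by line until I reach the end of the file.
I was wondering if there is a smarter way to do that.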
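This is the fastest version I have found so far, about 6 times faster than readLines. On a 150MB log file this takes 0.35 seconds, versus 2.40 seconds when using readLines(). Just for fun, Linux's wc -l command takes 0.15 seconds.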
public static int countLinesOld(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean empty = true;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
}
return (count == 0 && !empty) ? 1 : count;
} finally {
is.close();
}
}
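EDIT, 9 1/2 years later: I have practically no Java experience, but I have tried to benchmark this code against the LineNumberReader solution below anyway, since it bothered me that nobody had done it. It seems that my solution is faster, especially for large files, although it takes a few runs before the optimizer does a decent job. I played with the code a bit and produced a new version that is consistently the fastest: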
public static int countLinesNew(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int readChars = is.read(c);
if (readChars == -1) {
// bail out if nothing to read
return 0;
}
// make it easy for the optimizer to tune this loop
int count = 0;
while (readChars == 1024) {
for (int i=0; i<1024;) {
if (c[i++] == '\n') {
++count;
}
}
readChars = is.read(c);
}
// count remaining characters
while (readChars != -1) {
for (int i=0; i<readChars; ++i) {
if (c[i] == '\n') {
++count;
}
}
readChars = is.read(c);
}
return count == 0 ? 1 : count;
} finally {
is.close();
}
}
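Benchmark results for a 1.3GB text file, y axis in seconds. I performed 100 runs with the same file and measured each run with System.nanoTime(). You can see that countLinesOld has a few outliers and countLinesNew has none. Although the new version is only a bit faster, the difference is statistically significant. LineNumberReader is clearly slower.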
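The measurement procedure described above (a few warm-up runs, then repeated runs timed with System.nanoTime()) could be sketched roughly as follows. This is not the harness that produced the numbers above; the class `TimingSketch`, its `time` and `median` helpers, and the in-memory workload are all hypothetical names chosen for illustration.

```java
import java.util.Arrays;
import java.util.function.LongSupplier;

final class TimingSketch {
    // Runs `task` a few times untimed (warm-up), then returns the elapsed
    // nanoseconds of each of the following `runs` timed executions.
    static long[] time(LongSupplier task, int warmups, int runs) {
        long sink = 0;
        for (int i = 0; i < warmups; i++) {
            sink += task.getAsLong(); // give the JIT a chance to compile the hot loop
        }
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            sink += task.getAsLong();
            nanos[i] = System.nanoTime() - start;
        }
        if (sink == Long.MIN_VALUE) {
            System.out.println(sink); // consume the results so they are not dead code
        }
        return nanos;
    }

    // Middle element of the sorted timings (upper median for even lengths).
    static long median(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        return sorted[sorted.length / 2];
    }

    public static void main(String[] args) {
        // Stand-in workload (an assumption): count '\n' bytes in a 16 MB in-memory buffer.
        byte[] data = new byte[16 << 20];
        for (int i = 79; i < data.length; i += 80) {
            data[i] = '\n';
        }
        LongSupplier countNewlines = () -> {
            long count = 0;
            for (byte b : data) {
                if (b == '\n') count++;
            }
            return count;
        };
        long[] nanos = time(countNewlines, 10, 100);
        System.out.println("median ms: " + median(nanos) / 1_000_000.0);
    }
}
```

To compare the file-based methods, the lambda would call countLinesOld or countLinesNew on the same file instead. A dedicated benchmarking tool would give more trustworthy numbers than a hand-rolled loop like this one.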
I have implemented another solution to the problem, which I found to be more efficient at counting lines:
try
(
FileReader input = new FileReader("input.txt");
LineNumberReader count = new LineNumberReader(input);
)
{
while (count.skip(Long.MAX_VALUE) > 0)
{
// Loop just in case the file is > Long.MAX_VALUE or skip() decides to not read the entire file
}
result = count.getLineNumber() + 1; // +1 because line index starts at 0
}
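The accepted answer has an off-by-one error for multi-line files that don't end with a newline. A one-line file without a trailing newline returns 1, but a two-line file without a trailing newline also returns 1. Here is an implementation of the accepted solution that fixes this. The endsWithoutNewLine check is wasted work on every read except the last one, but its cost should be trivial compared to the overall function.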
public int count(String filename) throws IOException {
InputStream is = new BufferedInputStream(new FileInputStream(filename));
try {
byte[] c = new byte[1024];
int count = 0;
int readChars = 0;
boolean endsWithoutNewLine = false;
while ((readChars = is.read(c)) != -1) {
for (int i = 0; i < readChars; ++i) {
if (c[i] == '\n')
++count;
}
endsWithoutNewLine = (c[readChars - 1] != '\n');
}
if(endsWithoutNewLine) {
++count;
}
return count;
} finally {
is.close();
}
}
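The difference between the two behaviours can be checked on small in-memory inputs. The following is only a sketch: `TrailingLineDemo`, `countAccepted`, `countFixed` and `bytes` are hypothetical names, and the two methods re-state the logic of the accepted answer and of the fix above on an InputStream rather than a file name.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class TrailingLineDemo {
    // Mirrors the accepted answer: newline count, or 1 for non-empty input without any newline.
    static int countAccepted(InputStream in) throws IOException {
        byte[] buf = new byte[1024];
        int count = 0;
        boolean empty = true;
        int n;
        while ((n = in.read(buf)) != -1) {
            empty = false;
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') count++;
            }
        }
        return (count == 0 && !empty) ? 1 : count;
    }

    // Mirrors the fix: newline count, plus one if the last byte read is not '\n'.
    static int countFixed(InputStream in) throws IOException {
        byte[] buf = new byte[1024];
        int count = 0;
        boolean endsWithoutNewLine = false;
        int n;
        while ((n = in.read(buf)) != -1) {
            for (int i = 0; i < n; i++) {
                if (buf[i] == '\n') count++;
            }
            if (n > 0) endsWithoutNewLine = buf[n - 1] != '\n';
        }
        return endsWithoutNewLine ? count + 1 : count;
    }

    static InputStream bytes(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) throws IOException {
        for (String sample : new String[] {"one", "one\ntwo", "one\ntwo\n"}) {
            System.out.println(countAccepted(bytes(sample)) + " vs " + countFixed(bytes(sample)));
        }
    }
}
```

For the input "one\ntwo" the first method reports 1 and the second reports 2; for the other two samples both agree.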
With Java 8, you can use streams:
try (Stream<String> lines = Files.lines(path, Charset.defaultCharset())) {
long numOfLines = lines.count();
...
}
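A self-contained version of the same idea might look like the sketch below. The class name `StreamLineCount`, the temporary file and the choice of UTF-8 are assumptions for illustration. Note that Files.lines decodes the file, so bytes that are invalid in the chosen charset surface as an UncheckedIOException while the stream is consumed.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class StreamLineCount {
    // try-with-resources closes the underlying file handle once counting is done.
    static long countLines(Path path) throws IOException {
        try (Stream<String> lines = Files.lines(path, StandardCharsets.UTF_8)) {
            return lines.count();
        }
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("lines", ".txt");
        try {
            // Three lines, the last one without a trailing newline.
            Files.write(tmp, "one\ntwo\nthree".getBytes(StandardCharsets.UTF_8));
            System.out.println(countLines(tmp));
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```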
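The answer with the count() method above gave me wrong line counts when a file did not have a newline at the end: it failed to count the last line of the file.
This method works better for me: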
public int countLines(String filename) throws IOException {
LineNumberReader reader = new LineNumberReader(new FileReader(filename));
int cnt = 0;
String lineRead = "";
while ((lineRead = reader.readLine()) != null) {}
cnt = reader.getLineNumber();
reader.close();
return cnt;
}
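I tested the above methods for counting lines; here are my observations for the different methods as tested on my system.
File size: 1.6 GB. Methods:
Also, the Java 8 approach seems quite handy: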
Files.lines(Paths.get(filePath), Charset.defaultCharset()).count()
[Return type : long]
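I know this is an old question, but the accepted solution didn't quite match what I needed it to do. So I refined it to accept various line terminators (rather than just line feed) and to use a specified character encoding (rather than ISO-8859-n). All in one method (refactor as appropriate):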
public static long getLinesCount(String fileName, String encodingName) throws IOException {
long linesCount = 0;
File file = new File(fileName);
FileInputStream fileIn = new FileInputStream(file);
try {
Charset encoding = Charset.forName(encodingName);
Reader fileReader = new InputStreamReader(fileIn, encoding);
int bufferSize = 4096;
Reader reader = new BufferedReader(fileReader, bufferSize);
char[] buffer = new char[bufferSize];
int prevChar = -1;
int readCount = reader.read(buffer);
while (readCount != -1) {
for (int i = 0; i < readCount; i++) {
int nextChar = buffer[i];
switch (nextChar) {
case '\r': {
// The current line is terminated by a carriage return or by a carriage return immediately followed by a line feed.
linesCount++;
break;
}
case '\n': {
if (prevChar == '\r') {
// The current line is terminated by a carriage return immediately followed by a line feed.
// The line has already been counted.
} else {
// The current line is terminated by a line feed.
linesCount++;
}
break;
}
}
prevChar = nextChar;
}
readCount = reader.read(buffer);
}
if (prevChar != -1) {
switch (prevChar) {
case '\r':
case '\n': {
// The last line is terminated by a line terminator.
// The last line has already been counted.
break;
}
default: {
// The last line is terminated by end-of-file.
linesCount++;
}
}
}
} finally {
fileIn.close();
}
return linesCount;
}
This solution is comparable in speed to the accepted solution, about 4% slower in my tests (though timing tests in Java are notoriously unreliable).
/**
* Count file rows.
*
* @param file file
* @return file row count
* @throws IOException
*/
public static long getLineCount(File file) throws IOException {
try (Stream<String> lines = Files.lines(file.toPath())) {
return lines.count();
}
}
Tested on JDK8_u31. But its performance is indeed slow compared to this method:
/**
* Count file rows.
*
* @param file file
* @return file row count
* @throws IOException
*/
public static long getLineCount(File file) throws IOException {
try (BufferedInputStream is = new BufferedInputStream(new FileInputStream(file), 1024)) {
byte[] c = new byte[1024];
boolean empty = true,
lastEmpty = false;
long count = 0;
int read;
while ((read = is.read(c)) != -1) {
for (int i = 0; i < read; i++) {
if (c[i] == '\n') {
count++;
lastEmpty = true;
} else if (lastEmpty) {
lastEmpty = false;
}
}
empty = false;
}
if (!empty) {
if (count == 0) {
count = 1;
} else if (!lastEmpty) {
count++;
}
}
return count;
}
}
Tested, and very fast.
A straightforward way using Scanner:
static void lineCounter (String path) throws IOException {
int lineCount = 0, commentsCount = 0;
Scanner input = new Scanner(new File(path));
while (input.hasNextLine()) {
String data = input.nextLine();
if (data.startsWith("//")) commentsCount++;
lineCount++;
}
System.out.println("Line Count: " + lineCount + "\t Comments Count: " + commentsCount);
}
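I came to the conclusion that wc -l's method of counting newlines is fine, but it returns non-intuitive results on files where the last line doesn't end with a newline.
The @er.vikas solution, which is based on LineNumberReader but adds one to the line count, returns non-intuitive results on files where the last line does end with a newline.
I therefore made an algorithm that handles the following cases: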
@Test
public void empty() throws IOException {
assertEquals(0, count(""));
}
@Test
public void singleNewline() throws IOException {
assertEquals(1, count("\n"));
}
@Test
public void dataWithoutNewline() throws IOException {
assertEquals(1, count("one"));
}
@Test
public void oneCompleteLine() throws IOException {
assertEquals(1, count("one\n"));
}
@Test
public void twoCompleteLines() throws IOException {
assertEquals(2, count("one\ntwo\n"));
}
@Test
public void twoLinesWithoutNewlineAtEnd() throws IOException {
assertEquals(2, count("one\ntwo"));
}
@Test
public void aFewLines() throws IOException {
assertEquals(5, count("one\ntwo\nthree\nfour\nfive\n"));
}
And it looks like this:
static long countLines(InputStream is) throws IOException {
try(LineNumberReader lnr = new LineNumberReader(new InputStreamReader(is))) {
char[] buf = new char[8192];
int n, previousN = -1;
//Read will return at least one byte, no need to buffer more
while((n = lnr.read(buf)) != -1) {
previousN = n;
}
int ln = lnr.getLineNumber();
if (previousN == -1) {
//No data read at all, i.e file was empty
return 0;
} else {
char lastChar = buf[previousN - 1];
if (lastChar == '\n' || lastChar == '\r') {
//Ending with newline, deduct one
return ln;
}
}
//normal case, return line number + 1
return ln + 1;
}
}
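If you want intuitive results, you can use this. If you just want wc -l compatibility, simply use the @er.vikas solution, but don't add one to the result, and retry the skip: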
try(LineNumberReader lnr = new LineNumberReader(new FileReader(new File("File1")))) {
while(lnr.skip(Long.MAX_VALUE) > 0){};
return lnr.getLineNumber();
}
How about using the Process class from within Java code, and then reading the output of the command?
Process p = Runtime.getRuntime().exec("wc -l " + yourfilename);
p.waitFor();
BufferedReader b = new BufferedReader(new InputStreamReader(p.getInputStream()));
String line = "";
int lineCount = 0;
while ((line = b.readLine()) != null) {
System.out.println(line);
// wc -l prints "<count> <filename>", so parse only the first token
lineCount = Integer.parseInt(line.trim().split("\\s+")[0]);
}
I still need to try it, though. I will post the results.
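A variant of this idea, as a hedged sketch only: passing the arguments to ProcessBuilder as a list avoids problems with spaces in the file name, and parsing just the first token copes with wc printing the file name after the count. `WcLineCount`, `parseWcOutput` and `countWithWc` are hypothetical names, and the sketch assumes a wc binary is available on the PATH.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;

final class WcLineCount {
    // `wc -l FILE` prints the count followed by the file name, e.g. "42 data.log";
    // some platforms pad the count with leading spaces.
    static long parseWcOutput(String line) {
        return Long.parseLong(line.trim().split("\\s+")[0]);
    }

    static long countWithWc(String filename) throws IOException, InterruptedException {
        Process p = new ProcessBuilder("wc", "-l", filename)
                .redirectErrorStream(true) // merge stderr so error text does not fill an unread pipe
                .start();
        String firstLine;
        try (BufferedReader out = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            firstLine = out.readLine(); // read before waitFor() so the child cannot block on output
        }
        int exit = p.waitFor();
        if (exit != 0 || firstLine == null) {
            throw new IOException("wc failed: " + firstLine);
        }
        return parseWcOutput(firstLine);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        if (args.length == 1) {
            System.out.println(countWithWc(args[0]));
        }
    }
}
```

This ties the program to platforms that ship wc, so it is a trade-off rather than a general replacement for the pure Java versions.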
This funny solution actually works really well!
public static int countLines(File input) throws IOException {
try (InputStream is = new FileInputStream(input)) {
int count = 1;
for (int aChar = 0; aChar != -1;aChar = is.read())
count += aChar == '\n' ? 1 : 0;
return count;
}
}
It seems that there are a few different approaches you can take with LineNumberReader.
I did this:
int lines = 0;
FileReader input = new FileReader(fileLocation);
LineNumberReader count = new LineNumberReader(input);
String line = count.readLine();
if(count.ready())
{
while(line != null) {
lines = count.getLineNumber();
line = count.readLine();
}
lines+=1;
}
count.close();
System.out.println(lines);
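Even more simply, you can use the Java BufferedReader lines() method to return a stream of the lines, and then use the Stream count() method to count all of the elements. Then simply add one to the result to get the number of lines in the text file.
For example: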
FileReader input = new FileReader(fileLocation);
LineNumberReader count = new LineNumberReader(input);
int lines = (int)count.lines().count() + 1;
count.close();
System.out.println(lines);
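Whether the `+ 1` is appropriate is easy to check on small in-memory inputs. The sketch below uses a hypothetical helper, `countViaLines`, and relies only on the documented behaviour that lines() yields one element per line that readLine() would return, including a final line without a terminator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

final class LinesCountCheck {
    // Counts the elements produced by BufferedReader.lines() for the given text.
    static long countViaLines(String text) {
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            return reader.lines().count();
        } catch (IOException e) {
            throw new UncheckedIOException(e); // close() declares IOException
        }
    }

    public static void main(String[] args) {
        System.out.println(countViaLines("one\ntwo"));   // last line without a terminator
        System.out.println(countViaLines("one\ntwo\n")); // last line with a terminator
        System.out.println(countViaLines(""));           // empty input
    }
}
```

For these inputs the plain count() already equals the number of lines (2, 2 and 0), so it is worth verifying the adjustment against your own definition of a line before relying on it.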
On Unix-based systems, use the wc command on the command line.
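The only way to know how many lines there are in a file is to count them. You can of course derive a metric from your data that gives you the average length of a line, then take the file size and divide it by that average length, but the result won't be accurate.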
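The estimate described here could be sketched as follows. The names (`LineCountEstimate`, `estimate`) and the idea of sampling only the beginning of the file are assumptions for illustration; the result is only as good as the sample is representative of the whole file.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class LineCountEstimate {
    // Estimates the line count as fileSize / (average bytes per line in the sample).
    static long estimate(long fileSize, byte[] sample, int sampleLen) {
        int newlines = 0;
        for (int i = 0; i < sampleLen; i++) {
            if (sample[i] == '\n') newlines++;
        }
        if (newlines == 0) {
            return fileSize == 0 ? 0 : 1; // no terminator seen: nothing better than "one line"
        }
        double avgLineLength = (double) sampleLen / newlines;
        return Math.round(fileSize / avgLineLength);
    }

    // Samples up to sampleSize bytes from the start of the file.
    static long estimate(Path path, int sampleSize) throws IOException {
        byte[] sample = new byte[sampleSize];
        int len;
        try (InputStream in = Files.newInputStream(path)) {
            // A single read may return fewer bytes than requested; that only shrinks the sample.
            len = Math.max(in.read(sample), 0);
        }
        return estimate(Files.size(path), sample, len);
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 1) {
            System.out.println(estimate(Path.of(args[0]), 1 << 16));
        }
    }
}
```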
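If you don't have any index structure, you won't get around reading the complete file. But you can optimize it by avoiding reading it line by line and using a regex to match all line terminators instead.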
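A minimal sketch of the regex idea, assuming the text is already in memory as a CharSequence (the class `RegexTerminatorCount` and its method are hypothetical names). For a file read in chunks, a "\r\n" pair split across two chunks would need extra handling that is not shown here.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class RegexTerminatorCount {
    // "\r\n" is tried first so a Windows line ending is counted once, not twice.
    private static final Pattern TERMINATOR = Pattern.compile("\r\n|\r|\n");

    // Counts line terminators; a final line without a terminator is not included.
    static long countTerminators(CharSequence text) {
        Matcher matcher = TERMINATOR.matcher(text);
        long count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countTerminators("a\r\nb\rc\nd"));
    }
}
```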
Best optimized code for multi-line files that have no newline ('\n') character at EOF.
/**
*
* @param filename
* @return
* @throws IOException
*/
public static int countLines(String filename) throws IOException {
int count = 0;
boolean empty = true;
FileInputStream fis = null;
InputStream is = null;
try {
fis = new FileInputStream(filename);
is = new BufferedInputStream(fis);
byte[] c = new byte[1024];
int readChars = 0;
boolean isLine = false;
while ((readChars = is.read(c)) != -1) {
empty = false;
for (int i = 0; i < readChars; ++i) {
if ( c[i] == '\n' ) {
isLine = false;
++count;
}else if(!isLine && c[i] != '\n' && c[i] != '\r'){ //Case to handle line count where no New Line character present at EOF
isLine = true;
}
}
}
if(isLine){
++count;
}
}catch(IOException e){
e.printStackTrace();
}finally {
if(is != null){
is.close();
}
if(fis != null){
fis.close();
}
}
LOG.info("count: "+count);
return (count == 0 && !empty) ? 1 : count;
}
Scanner with regex:
public int getLineCount() {
Scanner fileScanner = null;
int lineCount = 0;
Pattern lineEndPattern = Pattern.compile("(?m)$");
try {
fileScanner = new Scanner(new File(filename)).useDelimiter(lineEndPattern);
while (fileScanner.hasNext()) {
fileScanner.next();
++lineCount;
}
}catch(FileNotFoundException e) {
e.printStackTrace();
return lineCount;
}
fileScanner.close();
return lineCount;
}
I haven't timed it.
If you use this
public int countLines(String filename) throws IOException {
LineNumberReader reader = new LineNumberReader(new FileReader(filename));
int cnt = 0;
String lineRead = "";
while ((lineRead = reader.readLine()) != null) {}
cnt = reader.getLineNumber();
reader.close();
return cnt;
}
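you cannot handle files with more lines than an int can hold (Integer.MAX_VALUE, about 2.1 billion; 100K lines is still fine), because reader.getLineNumber() returns an int. You need a long counter to handle the maximum number of lines.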
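One way to get a long result is to keep the counter yourself instead of asking LineNumberReader for it. A minimal sketch, with `LongLineCounter` as a hypothetical name and a Reader parameter so it can be tried on in-memory text:

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

final class LongLineCounter {
    // Keeps its own long counter instead of relying on LineNumberReader.getLineNumber() (an int).
    static long countLines(Reader source) throws IOException {
        long count = 0;
        try (BufferedReader reader = new BufferedReader(source)) {
            while (reader.readLine() != null) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(countLines(new StringReader("one\ntwo\nthree\n")));
    }
}
```

For a file, the Reader would be a FileReader (or an InputStreamReader with an explicit charset) in place of the StringReader.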