python - 如何在 Python 中廉价地获取大文件的行数？

Question

我需要在 python 中获取一个大文件（数十万行）的行数。记忆和时间方面最有效的方法是什么？

目前我这样做：

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

有可能做得更好吗？

score 741 · Accepted Answer

741

一行，可能相当快：

num_lines = sum(1 for line in open('myfile.txt'))

于 2009-06-19T19:07:06.063 回答

score 422 · Accepted Answer

你没有比这更好的了。

毕竟，任何解决方案都必须读取整个文件，弄清楚\n你有多少，然后返回结果。

你有没有更好的方法来做到这一点而无需阅读整个文件？不确定...最好的解决方案将始终是 I/O-bound，你能做的最好的就是确保你不使用不必要的内存，但看起来你已经涵盖了。

score 221 · Accepted Answer

我相信内存映射文件将是最快的解决方案。我尝试了四个函数：OP（opcount）发布的函数；对文件中的行进行简单的迭代 ( simplecount)；带有内存映射文件 (mmap) ( mapcount) 的 readline；以及 Mykola Kharechko ( bufcount) 提供的缓冲读取解决方案。

我将每个函数运行了五次，并计算了一个 120 万行文本文件的平均运行时间。

Windows XP、Python 2.5、2GB RAM、2 GHz AMD 处理器

这是我的结果：

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

编辑：Python 2.6 的数字：

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

所以缓冲区读取策略似乎是 Windows/Python 2.6 最快的

这是代码：

from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict

def mapcount(filename):
    f = open(filename, "r+")
    buf = mmap.mmap(f.fileno(), 0)
    lines = 0
    readline = buf.readline
    while readline():
        lines += 1
    return lines

def simplecount(filename):
    lines = 0
    for line in open(filename):
        lines += 1
    return lines

def bufcount(filename):
    f = open(filename)                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    print key.__name__, ":", sum(vals) / float(len(vals))

score 184 · Accepted Answer

我不得不在一个类似的问题上发布这个，直到我的声誉得分上升一点（感谢撞到我的人！）。

所有这些解决方案都忽略了一种使该程序运行得更快的方法，即使用无缓冲（原始）接口、使用字节数组和进行自己的缓冲。（这只适用于 Python 3。在 Python 2 中，默认情况下可能使用也可能不使用原始接口，但在 Python 3 中，您将默认使用 Unicode。）

使用计时工具的修改版本，我相信以下代码比提供的任何解决方案都更快（并且稍微更像 Python）：

def rawcount(filename):
    f = open(filename, 'rb')
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.raw.read

    buf = read_f(buf_size)
    while buf:
        lines += buf.count(b'\n')
        buf = read_f(buf_size)

    return lines

使用单独的生成器函数，这运行得更快：

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)

def rawgencount(filename):
    f = open(filename, 'rb')
    f_gen = _make_gen(f.raw.read)
    return sum( buf.count(b'\n') for buf in f_gen )

这可以通过使用 itertools 内联的生成器表达式完全完成，但看起来很奇怪：

from itertools import (takewhile,repeat)

def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )

以下是我的时间安排：

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46

score 103 · Accepted Answer

你可以执行一个子进程并运行wc -l filename

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])

score 46 · Accepted Answer

这是一个 python 程序，用于使用多处理库在机器/内核之间分配行数。我的测试使用 8 核 windows 64 服务器将 2000 万行文件的计数从 26 秒提高到 7 秒。注意：不使用内存映射会使事情变得更慢。

import multiprocessing, sys, time, os, mmap
import logging, logging.handlers

def init_logger(pid):
    console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
    logger = logging.getLogger()  # New logger at root level
    logger.setLevel( logging.INFO )
    logger.handlers.append( logging.StreamHandler() )
    logger.handlers[0].setFormatter( logging.Formatter( console_format, '%d/%m/%y %H:%M:%S' ) )

def getFileLineCount( queues, pid, processes, file1 ):
    init_logger(pid)
    logging.info( 'start' )

    physical_file = open(file1, "r")
    #  mmap.mmap(fileno, length[, tagname[, access[, offset]]]

    m1 = mmap.mmap( physical_file.fileno(), 0, access=mmap.ACCESS_READ )

    #work out file size to divide up line counting

    fSize = os.stat(file1).st_size
    chunk = (fSize / processes) + 1

    lines = 0

    #get where I start and stop
    _seedStart = chunk * (pid)
    _seekEnd = chunk * (pid+1)
    seekStart = int(_seedStart)
    seekEnd = int(_seekEnd)

    if seekEnd < int(_seekEnd + 1):
        seekEnd += 1

    if _seedStart < int(seekStart + 1):
        seekStart += 1

    if seekEnd > fSize:
        seekEnd = fSize

    #find where to start
    if pid > 0:
        m1.seek( seekStart )
        #read next line
        l1 = m1.readline()  # need to use readline with memory mapped files
        seekStart = m1.tell()

    #tell previous rank my seek start to make their seek end

    if pid > 0:
        queues[pid-1].put( seekStart )
    if pid < processes-1:
        seekEnd = queues[pid].get()

    m1.seek( seekStart )
    l1 = m1.readline()

    while len(l1) > 0:
        lines += 1
        l1 = m1.readline()
        if m1.tell() > seekEnd or len(l1) == 0:
            break

    logging.info( 'done' )
    # add up the results
    if pid == 0:
        for p in range(1,processes):
            lines += queues[0].get()
        queues[0].put(lines) # the total lines counted
    else:
        queues[0].put(lines)

    m1.close()
    physical_file.close()

if __name__ == '__main__':
    init_logger( 'main' )
    if len(sys.argv) > 1:
        file_name = sys.argv[1]
    else:
        logging.fatal( 'parameters required: file-name [processes]' )
        exit()

    t = time.time()
    processes = multiprocessing.cpu_count()
    if len(sys.argv) > 2:
        processes = int(sys.argv[2])
    queues=[] # a queue for each process
    for pid in range(processes):
        queues.append( multiprocessing.Queue() )
    jobs=[]
    prev_pipe = 0
    for pid in range(processes):
        p = multiprocessing.Process( target = getFileLineCount, args=(queues, pid, processes, file_name,) )
        p.start()
        jobs.append(p)

    jobs[0].join() #wait for counting to finish
    lines = queues[0].get()

    logging.info( 'finished {} Lines:{}'.format( time.time() - t, lines ) )

score 31 · Accepted Answer

与此答案类似的单行 bash 解决方案，使用现代subprocess.check_output功能：

def line_count(filename):
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])

score 28 · Accepted Answer

在进行perfplot分析后，必须推荐缓冲读取解决方案

def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        while True:
            b = reader(2 ** 16)
            if not b: break
            yield b

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count

它快速且节省内存。大多数其他解决方案的速度要慢约 20 倍。

重现情节的代码：

import mmap
import subprocess
from functools import partial

import perfplot


def setup(n):
    fname = "t.txt"
    with open(fname, "w") as f:
        for i in range(n):
            f.write(str(i) + "\n")
    return fname


def for_enumerate(fname):
    i = 0
    with open(fname) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1


def sum1(fname):
    return sum(1 for _ in open(fname))


def mmap_count(fname):
    with open(fname, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)

    lines = 0
    while buf.readline():
        lines += 1
    return lines


def for_open(fname):
    lines = 0
    for _ in open(fname):
        lines += 1
    return lines


def buf_count_newlines(fname):
    lines = 0
    buf_size = 2 ** 16
    with open(fname) as f:
        buf = f.read(buf_size)
        while buf:
            lines += buf.count("\n")
            buf = f.read(buf_size)
    return lines


def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        b = reader(2 ** 16)
        while b:
            yield b
            b = reader(2 ** 16)

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count


def wc_l(fname):
    return int(subprocess.check_output(["wc", "-l", fname]).split()[0])


def sum_partial(fname):
    with open(fname) as f:
        count = sum(x.count("\n") for x in iter(partial(f.read, 2 ** 16), ""))
    return count


def read_count(fname):
    return open(fname).read().count("\n")


b = perfplot.bench(
    setup=setup,
    kernels=[
        for_enumerate,
        sum1,
        mmap_count,
        for_open,
        wc_l,
        buf_count_newlines,
        buf_count_newlines_gen,
        sum_partial,
        read_count,
    ],
    n_range=[2 ** k for k in range(27)],
    xlabel="num lines",
)
b.save("out.png")
b.show()

score 18 · Accepted Answer

我会使用 Python 的文件对象方法readlines，如下：

with open(input_file) as foo:
    lines = len(foo.readlines())

这将打开文件，在文件中创建行列表，计算列表的长度，将其保存到变量并再次关闭文件。

score 13 · Accepted Answer

这是我使用纯 python 发现的最快的东西。您可以通过设置缓冲区来使用所需的任何内存量，尽管 2**16 在我的计算机上似乎是一个最佳位置。

from functools import partial

buffer=2**16
with open(myfile) as f:
        print sum(x.count('\n') for x in iter(partial(f.read,buffer), ''))

我在这里找到了答案为什么在 C++ 中从标准输入读取行比 Python 慢得多？并稍微调整了一下。这是一本很好的读物，可以理解如何快速计算行数，但wc -l仍然比其他任何东西快 75%。

score 12 · Accepted Answer

def file_len(full_path):
  """ Count number of lines in a file."""
  f = open(full_path)
  nr_of_lines = sum(1 for line in f)
  f.close()
  return nr_of_lines

score 11 · Accepted Answer

一线解决方案：

import os
os.system("wc -l  filename")

我的片段：

>>> os.system('wc -l *.txt')

0 bar.txt
1000 command.txt
3 test_file.txt
1003 total

score 11 · Accepted Answer

这是我使用的，看起来很干净：

import subprocess

def count_file_lines(file_path):
    """
    Counts the number of lines in a file using wc utility.
    :param file_path: path to file
    :return: int, no of lines
    """
    num = subprocess.check_output(['wc', '-l', file_path])
    num = num.split(' ')
    return int(num[0])

更新：这比使用纯 python 略快，但以内存使用为代价。子进程在执行您的命令时将派生一个与父进程具有相同内存占用的新进程。

score 8 · Accepted Answer

凯尔的回答

num_lines = sum(1 for line in open('my_file.txt'))

可能是最好的，替代方法是

num_lines =  len(open('my_file.txt').read().splitlines())

这是两者的性能比较

In [20]: timeit sum(1 for line in open('Charts.ipynb'))
100000 loops, best of 3: 9.79 µs per loop

In [21]: timeit len(open('Charts.ipynb').read().splitlines())
100000 loops, best of 3: 12 µs per loop

score 7 · Accepted Answer

这个版本我得到了一个小的（4-8%）改进，它重用了一个常量缓冲区，所以它应该避免任何内存或 GC 开销：

lines = 0
buffer = bytearray(2048)
with open(filename) as f:
  while f.readinto(buffer) > 0:
      lines += buffer.count('\n')

您可以使用缓冲区大小，也许会看到一些改进。

score 5 · Accepted Answer

至于我，这个变体将是最快的：

#!/usr/bin/env python

def main():
    f = open('filename')                  
    lines = 0
    buf_size = 1024 * 1024
    read_f = f.read # loop optimization

    buf = read_f(buf_size)
    while buf:
        lines += buf.count('\n')
        buf = read_f(buf_size)

    print lines

if __name__ == '__main__':
    main()

原因：缓冲比逐行读取快，string.count也非常快

score 5 · Accepted Answer

为了完成上述方法，我尝试了 fileinput 模块的变体：

import fileinput as fi   
def filecount(fname):
        for line in fi.input(fname):
            pass
        return fi.lineno()

并将一个 60 百万行的文件传递给上述所有方法：

mapcount : 6.1331050396
simplecount : 4.588793993
opcount : 4.42918205261
filecount : 43.2780818939
bufcount : 0.170812129974

令我有点惊讶的是，fileinput 是如此糟糕，并且比所有其他方法都差得多......

score 4 · Accepted Answer

我已经像这样修改了缓冲区案例：

def CountLines(filename):
    f = open(filename)
    try:
        lines = 1
        buf_size = 1024 * 1024
        read_f = f.read # loop optimization
        buf = read_f(buf_size)

        # Empty file
        if not buf:
            return 0

        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

        return lines
    finally:
        f.close()

现在也计算空文件和最后一行（不带 \n）。

score 4 · Accepted Answer

这段代码更短更清晰。这可能是最好的方法：

num_lines = open('yourfile.ext').read().count('\n')

score 2 · Accepted Answer

如果想在 Linux 中用 Python 廉价地获取行数，我推荐这种方法：

import os
print os.popen("wc -l file_path").readline().split()[0]

file_path 既可以是抽象文件路径，也可以是相对路径。希望这可能会有所帮助。

score 2 · Accepted Answer

简单方法：

1)

>>> f = len(open("myfile.txt").readlines())
>>> f

430

>>> f = open("myfile.txt").read().count('\n')
>>> f
430
>>>

num_lines = len(list(open('myfile.txt')))

score 2 · Accepted Answer

已经有很多答案了，但不幸的是，它们中的大多数只是一个几乎无法优化问题的小经济体......

我参与了几个项目，其中行数是软件的核心功能，并且尽可能快地处理大量文件至关重要。

行数的主要瓶颈是 I/O 访问，因为您需要读取每一行以检测行返回字符，所以根本没有办法。第二个潜在瓶颈是内存管理：一次加载越多，处理速度越快，但与第一个瓶颈相比，这个瓶颈可以忽略不计。

因此，除了诸如禁用 gc 收集和其他微观管理技巧之类的微小优化之外，有 3 种主要方法可以减少行计数函数的处理时间：

硬件解决方案：主要且最明显的方式是非编程方式：购买速度非常快的 SSD/闪存硬盘。到目前为止，这是您获得最大速度提升的方法。
数据准备解决方案：如果您生成或可以修改您处理的文件的生成方式，或者如果您可以对它们进行预处理，则首先将行返回转换为 unix 样式（\n），因为这将比 Windows 节省 1 个字符或MacOS 样式（节省不多，但很容易获得），其次，最重要的是，您可以编写固定长度的行。如果您需要可变长度，您可以随时填充较小的行。这样，您可以立即从总文件大小中计算行数，访问速度要快得多。通常，问题的最佳解决方案是对其进行预处理，以使其更适合您的最终目的。
并行化+硬件方案：如果您可以购买多个硬盘（如果可能的话，还可以购买 SSD 闪存盘），那么您甚至可以通过利用并行化，通过在磁盘之间以平衡的方式（最简单的是按总大小平衡）存储文件来超越一个磁盘的速度，然后从所有这些磁盘中并行读取。然后，您可以期望获得与您拥有的磁盘数量成比例的乘数提升。如果购买多个磁盘不是您的选择，那么并行化可能无济于事（除非您的磁盘像某些专业级磁盘一样具有多个读取头，但即使这样，磁盘的内部高速缓存和 PCB 电路也可能成为瓶颈并阻止您完全并行使用所有磁头，而且您必须为这个硬盘设计一个特定的代码将使用，因为您需要知道确切的集群映射，以便将文件存储在不同头下的集群上，以便之后可以用不同的头读取它们）。确实，众所周知，顺序读取几乎总是比随机读取快，并且在单个磁盘上的并行化将具有比顺序读取更类似于随机读取的性能（例如，您可以使用 CrystalDiskMark 在这两个方面测试您的硬盘驱动器速度） .

如果这些都不是一个选项，那么您只能依靠微观管理技巧将您的行计数功能的速度提高几个百分点，但不要指望任何真正重要的东西。相反，您可以预计，与您将看到的速度改进回报相比，您花在调整上的时间将不成比例。

score 1 · Accepted Answer

打开文件的结果是一个迭代器，它可以转换成一个序列，它有一个长度：

with open(filename) as f:
   return len(list(f))

这比您的显式循环更简洁，并且避免了enumerate.

score 1 · Accepted Answer

那这个呢

def file_len(fname):
  counts = itertools.count()
  with open(fname) as f: 
    for _ in f: counts.next()
  return counts.next()

score 1 · Accepted Answer

1

count = max(enumerate(open(filename)))[0]

于 2011-03-11T21:09:52.690 回答

score 1 · Accepted Answer

这个怎么样？

import fileinput
import sys

counter=0
for line in fileinput.input([sys.argv[1]]):
    counter+=1

fileinput.close()
print counter

score 1 · Accepted Answer

这个单线怎么样：

file_length = len(open('myfile.txt','r').read().split('\n'))

使用此方法在 3900 行文件上计时需要 0.003 秒

def c():
  import time
  s = time.time()
  file_length = len(open('myfile.txt','r').read().split('\n'))
  print time.time() - s

score 1 · Accepted Answer

1

print open('file.txt', 'r').read().count("\n") + 1

于 2014-03-21T06:10:30.660 回答

score 1 · Accepted Answer

def line_count(path):
    count = 0
    with open(path) as lines:
        for count, l in enumerate(lines, start=1):
            pass
    return count

score 1 · Accepted Answer

def count_text_file_lines(path):
    with open(path, 'rt') as file:
        line_count = sum(1 for _line in file)
    return line_count

score 1 · Accepted Answer

大文件的替代方法是使用xreadlines():

count = 0
for line in open(thefilepath).xreadlines(  ): count += 1

对于 Python 3，请参阅：Python 3中的 xreadlines() 替代品是什么？

score 1 · Accepted Answer

这是对其他一些答案的元评论。

行读取和缓冲\n计数技术不会为每个文件返回相同的答案，因为某些文本文件在最后一行的末尾没有换行符。您可以通过检查最后一个非空缓冲区的最后一个字节并在不是时添加 1 来解决此问题b'\n'。
在 Python 3 中，以文本模式和二进制模式打开文件会产生不同的结果，因为默认情况下，文本模式将 CR、LF 和 CRLF 识别为行尾（将它们全部转换为'\n'），而在二进制模式下，只有 LF 和 CRLF 会算了算了b'\n'。无论您是按行阅读还是进入固定大小的缓冲区，这都适用。经典的 Mac OS 使用 CR 作为行尾；我不知道这些文件现在有多常见。
缓冲区读取方法使用与文件大小无关的有限 RAM，而行读取方法可以在最坏的情况下一次将整个文件读入 RAM（特别是如果文件使用 CR 行结尾）。在最坏的情况下，它可能使用比文件大小更多的 RAM，因为动态调整行缓冲区大小以及（如果您以文本模式打开）Unicode 解码和存储的开销。
您可以通过预先分配一个字节数组并使用readinto而不是read. 现有答案之一（投票很少）这样做，但它有问题（它重复计算了一些字节）。
顶部缓冲区读取答案使用大缓冲区 (1 MiB)。由于操作系统预读，使用较小的缓冲区实际上可以更快。如果您一次读取 32K 或 64K，操作系统可能会在您请求之前将下一个 32K/64K 读取到缓存中，并且每次访问内核几乎都会立即返回。如果您一次读取 1 MiB，则操作系统不太可能推测性地读取整个兆字节。它可能预读的数量较少，但您仍将花费大量时间坐在内核中等待磁盘返回其余数据。

score 0 · Accepted Answer

为什么不读取前 100 行和后 100 行并估计平均行长度，然后将总文件大小除以这些数字？如果您不需要确切的值，这可能会起作用。

score 0 · Accepted Answer

您可以通过os.path以下方式使用该模块：

import os
import subprocess
Number_lines = int( (subprocess.Popen( 'wc -l {0}'.format( Filename ), shell=True, stdout=subprocess.PIPE).stdout).readlines()[0].split()[0] )

，其中Filename是文件的绝对路径。

score 0 · Accepted Answer

如果文件可以放入内存，那么

with open(fname) as f:
    count = len(f.read().split(b'\n')) - 1

score 0 · Accepted Answer

创建一个名为的可执行脚本文件count.py：

#!/usr/bin/python

import sys
count = 0
for line in sys.stdin:
    count+=1

然后将文件的内容通过管道传输到 python 脚本中：cat huge.txt | ./count.py. Pipe 也适用于Powershell，因此您最终将计算行数。

对我来说，在 Linux 上它比简单的解决方案快 30%：

count=1
with open('huge.txt') as f:
    count+=1

score 0 · Accepted Answer

使用 Numba

我们可以使用 Numba 来 JIT（及时）将我们的函数编译为机器码。def numbacountparallel(fname) 的运行速度比问题中的 def file_len(fname) 快 2.8 倍。

笔记：

在运行基准测试之前，操作系统已经将文件缓存到内存中，因为我在我的 PC 上没有看到太多的磁盘活动。第一次读取文件时，时间会慢很多，使得使用 Numba 的时间优势变得微不足道。

第一次调用该函数时，JIT 编译需要额外的时间。

如果我们做的不仅仅是计算行数，这将很有用。

Cython 是另一种选择。

http://numba.pydata.org/

结论

由于计数行将受 IO 限制，请使用问题中的 def file_len(fname) ，除非您想做的不仅仅是计数行。

import timeit

from numba import jit, prange
import numpy as np

from itertools import (takewhile,repeat)

FILE = '../data/us_confirmed.csv' # 40.6MB, 371755 line file
CR = ord('\n')


# Copied from the question above. Used as a benchmark
def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


# Copied from another answer. Used as a benchmark
def rawincount(filename):
    f = open(filename, 'rb')
    bufgen = takewhile(lambda x: x, (f.read(1024*1024*10) for _ in repeat(None)))
    return sum( buf.count(b'\n') for buf in bufgen )


# Single thread
@jit(nopython=True)
def numbacountsingle_chunk(bs):

    c = 0
    for i in range(len(bs)):
        if bs[i] == CR:
            c += 1

    return c


def numbacountsingle(filename):
    f = open(filename, "rb")
    total = 0
    while True:
        chunk = f.read(1024*1024*10)
        lines = numbacountsingle_chunk(chunk)
        total += lines
        if not chunk:
            break

    return total


# Multi thread
@jit(nopython=True, parallel=True)
def numbacountparallel_chunk(bs):

    c = 0
    for i in prange(len(bs)):
        if bs[i] == CR:
            c += 1

    return c


def numbacountparallel(filename):
    f = open(filename, "rb")
    total = 0
    while True:
        chunk = f.read(1024*1024*10)
        lines = numbacountparallel_chunk(np.frombuffer(chunk, dtype=np.uint8))
        total += lines
        if not chunk:
            break

    return total

print('numbacountparallel')
print(numbacountparallel(FILE)) # This allows Numba to compile and cache the function without adding to the time.
print(timeit.Timer(lambda: numbacountparallel(FILE)).timeit(number=100))

print('\nnumbacountsingle')
print(numbacountsingle(FILE))
print(timeit.Timer(lambda: numbacountsingle(FILE)).timeit(number=100))

print('\nfile_len')
print(file_len(FILE))
print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100))

print('\nrawincount')
print(rawincount(FILE))
print(timeit.Timer(lambda: rawincount(FILE)).timeit(number=100))

100 次调用每个函数的时间（以秒为单位）

numbacountparallel
371755
2.8007332000000003

numbacountsingle
371755
3.1508585999999994

file_len
371755
6.7945494

rawincount
371755
6.815438

score 0 · Accepted Answer

0

我会使用的最简单和最短的方法是：

f = open("my_file.txt", "r")
len(f.readlines())

于 2021-08-11T03:08:56.780 回答

score 0 · Accepted Answer

0

我发现你可以。

f = open("data.txt")
linecout = len(f.readlines())

会给你一个答案

于 2021-08-23T04:58:52.470 回答

score -1 · Accepted Answer

-1

相似地：

lines = 0
with open(path) as f:
    for line in f:
        lines += 1

于 2013-09-05T14:08:16.617 回答

score -1 · Accepted Answer

另一种可能：

import subprocess

def num_lines_in_file(fpath):
    return int(subprocess.check_output('wc -l %s' % fpath, shell=True).strip().split()[0])

score -1 · Accepted Answer

如果文件中的所有行长度相同（并且仅包含 ASCII 字符）*，则可以非常便宜地执行以下操作：

fileSize     = os.path.getsize( pathToFile )  # file size in bytes
bytesPerLine = someInteger                    # don't forget to account for the newline character
numLines     = fileSize // bytesPerLine

*我怀疑如果使用像é这样的unicode 字符，将需要更多的努力来确定一行中的字节数。

score -2 · Accepted Answer

那这个呢？

import sys
sys.stdin=open('fname','r')
data=sys.stdin.readlines()
print "counted",len(data),"lines"

score -2 · Accepted Answer

为什么以下不起作用？

import sys

# input comes from STDIN
file = sys.stdin
data = file.readlines()

# get total number of lines in file
lines = len(data)

print lines

在这种情况下， len 函数使用输入行作为确定长度的方法。

python - 如何在 Python 中廉价地获取大文件的行数？

44 回答 44

使用 Numba

笔记：

结论

Related

Reference