python - IronPython vs. CPython file reading and parsing performance

Question

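I've been working on a file reader in Python that needs to read ~100MB ASCII files. There is a block of header information at the top, followed by tab-separated columns. Some columns contain non-numeric data (which I don't care about for now). I have a MATLAB implementation that reads a 30MB sample file in under 1.5 seconds. My Python reader takes about 2 seconds in CPython but about 4 seconds in IronPython. The difference seems to be where the string values are converted to floats, but I haven't been able to make it any faster in IronPython.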

My latest iteration uses the following loop to read and parse the lines:

#-Parse the actual data lines
istep = -1
while len(line) > 0:

    istep += 1
    #-Split the line and convert parsed values to floats
    timestep = line.split()
    for ichan in numericChannels:
        data[ichan].append(float(timestep[ichan]))

    line = f.readline().strip()

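numericChannels is a list of integers specifying which channels I want to read. data is a list of lists, where each sub-list is one column of data.

The performance difference seems to come from the float conversion. Any ideas on what I can do in IronPython to speed this up? I even tried reading the whole file up front and then using the System.Threading.Tasks.Parallel.ForEach construct to parse the lines. That didn't help at all.

Thanks.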

score 0 · Accepted Answer

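a) You say "the difference seems to be where the string values are converted to floats". What makes you think so? Have you run a profiler on the code?

b) If you have the memory, it may be faster to do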

for line in f.readlines():
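
One rough way to check point (a) without a full profiler is to time the tokenizing and the float conversion as separate passes. The following is only a sketch: `time_stages` and the two-line sample are hypothetical stand-ins for the real reader and file, and `timeit.default_timer` is used because it is available in both Python 2.7 and Python 3.

```python
import io
from timeit import default_timer

def time_stages(lines, channels):
    """Return (split_seconds, float_seconds, values) for the given lines."""
    t0 = default_timer()
    rows = [line.split() for line in lines]            # tokenizing only
    t1 = default_timer()
    values = [[float(row[i]) for i in channels] for row in rows]  # conversion only
    t2 = default_timer()
    return t1 - t0, t2 - t1, values

# Hypothetical stand-in for the real file: tab-separated, middle column non-numeric.
sample = io.StringIO(u"1.0\tabc\t2.5\n3.0\tdef\t4.5\n")
split_s, float_s, values = time_stages(sample.readlines(), [0, 2])
print("split: %.6f s, float: %.6f s" % (split_s, float_s))
```

On a real 30MB file the two numbers should show whether `float()` really dominates on a given interpreter, or whether reading and splitting account for most of the gap.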

score 0 · Accepted Answer

It seems to me that something like this might be a little faster.

import operator
data=[]
istep = -1
columngetter=operator.itemgetter(*numericChannels)
while len(line) > 0:
    istep += 1
    #-Split the line and convert parsed values to floats
    timestep = line.split()
    data.append(map(float,columngetter(timestep)))
    line = f.readline().strip()

data = list(zip(*data))  # transpose rows into columns
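
For reference, here is a self-contained variant of the same idea that can be run as-is. It is a sketch under a few assumptions: `read_columns` and the in-memory sample are hypothetical, the header has already been skipped, and at least two channels are selected (with a single index, `operator.itemgetter` returns a bare value rather than a tuple).

```python
import io
import operator

def read_columns(f, numeric_channels):
    """Parse whitespace-separated rows from f; return one tuple per selected column."""
    # Assumes len(numeric_channels) >= 2, see the note on itemgetter above.
    getter = operator.itemgetter(*numeric_channels)
    rows = []
    for line in f:
        fields = line.split()
        if not fields:           # stop at the first blank line, like the original loop
            break
        rows.append([float(v) for v in getter(fields)])
    return list(zip(*rows))      # transpose rows into columns

# Hypothetical sample: tab-separated, middle column non-numeric.
sample = io.StringIO(u"0.0\tidle\t1.5\n0.1\trun\t2.5\n")
columns = read_columns(sample, [0, 2])
print(columns)
```

Iterating over the file object directly avoids the explicit `readline().strip()` call per row; whether that matters on IronPython would need to be measured.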

score 0 · Accepted Answer

IronPython appears to be slower than CPython at reading text files. I ran this snippet on several versions of Python (the partest2.txt file has 200,000 numbers on a single line):

import sys
import timeit

tmr = timeit.Timer("with open(r'partest2.txt','r') as fid:fid.readlines()")
res = tmr.timeit(number=20)
print(res)

Results (total seconds for 20 runs):

CPython 2.7: 1.282
CPython 3.3: 1.562
IronPython 2.6 [.NET 2]: 2.196
IronPython 2.7 [.NET 4]: 2.880

The IronPython runs emit this warning (not sure whether it affects anything):

<string>:1: RuntimeWarning: IronPython has no support for disabling the GC

When the file being read (partest2.txt) is changed so that it holds the same 200,000 numbers, each on its own line, these are the timings (completely different):

CPython 2.7: 4.04
CPython 3.3: 7.61
IronPython 2.7 [.NET 4]: 20.22
IronPython 2.6 [.NET 2]: 21.46

Ouch!
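
To reproduce the comparison, the two file layouts can be generated and timed with a short script. This is a sketch only: the exact contents of partest2.txt are not given, so the integers, the file names, and the helpers `write_numbers` and `time_readlines` are all assumptions.

```python
import os
import tempfile
import timeit

def write_numbers(path, count, one_per_line):
    """Write `count` integers to `path`, space-separated on one line or one per line."""
    sep = "\n" if one_per_line else " "
    with open(path, "w") as fid:
        fid.write(sep.join(str(i) for i in range(count)))
        fid.write("\n")

def time_readlines(path, repeats):
    """Total seconds to open the file and call readlines() `repeats` times."""
    def read():
        with open(path, "r") as fid:
            return fid.readlines()
    return timeit.timeit(read, number=repeats)

# Hypothetical file names in a temporary directory.
tmpdir = tempfile.mkdtemp()
single = os.path.join(tmpdir, "one_line.txt")
multi = os.path.join(tmpdir, "many_lines.txt")
write_numbers(single, 200000, one_per_line=False)
write_numbers(multi, 200000, one_per_line=True)
print("one line:   %.3f s" % time_readlines(single, 20))
print("many lines: %.3f s" % time_readlines(multi, 20))
```

Running the same script under each interpreter gives directly comparable numbers; absolute values will differ from the ones above depending on hardware and file contents.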
