I have 16 to 20 files (tab-separated or comma-separated), each containing 2 columns: 1. interval (in milliseconds) 2. IOPS

I want to join all 20 files, for example:

File 1:

1000, 11217              
2000, 12789              
3001, 12022              
4000, 14028   

File 2:

1000, 11236
2000, 12789
3001, 12022
4000, 14028         

In the same way I have 20 input files. The files are generated per thread.

The output should look like

1000, 11217, 11236
2000, 12789, 12803
3000, 12156
3001, 12022
4000, 14028, 13889

Can anyone suggest the best way to merge the files?

Thanks in advance.

2 Answers

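To test the script below I generated some test data, namely 16 files of 12 records each, in which the value of t is occasionally not an exact multiple of 1000, using the following helper script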

from random import random, seed
seed(1234)

for n in range(1, 17):
    with open('f%02d'%n, 'w') as f:
        for t in range(1000,12001,1000):
            print(t+(1 if random()<0.1 else 0),int(random()*10000), sep=', ', file=f)

For example, the file f07 contains

1000, 3812
2000, 9600
3000, 9210
4000, 6852
5000, 8387
6001, 2603
7000, 8578
8000, 1137
9000, 2138
10000, 5875
11001, 8774
12000, 7768

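The following script uses heapq.merge to merge the data coming from the already-sorted files. Each file is read through a simple generator that splits every record and appends the index of the file the record belongs to, so that we can write unified records, possibly with empty entries.

Set debug = False to use the script on your own data.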

'''Usage: python3 merge.py filename1 filename2 ... filenameout'''
from heapq import merge
from sys import argv, stdout

def reader(fobj, n):
    for line in fobj:
        yield [x.strip() for x in line.split(',')]+[n]

# change to debug = False to run on your data
debug = True
argv = ['']+['f%02d'%n for n in range(1, 17)]+[''] if debug else argv
out  = stdout if debug else open(argv[-1], 'w')

readers = [reader(open(f), n) for n, f in enumerate(argv[1:-1])]
N = len(readers)

# header is "Interval,IOPS01,IOPS02,...,IOPSnn" modify as you like
print('Interval',*('IOPS%02d'%(n+1) for n in range(N)), sep=',', file=out)

# initialize for the loop
t0 = ""
# the loop
for t, iops, n in merge(*readers, key=lambda x:int(x[0])):
    if t != t0:
        if t0: print(t0, *iops_l, sep=',', file=out)
        t0 = t
        iops_l = ['····' if debug else '']*N
    iops_l[n] = '%4s'%iops if debug else iops

# the last record to output
print(t0, *iops_l, sep=',', file=out)

Running the program on f01, ..., f16 produces

Interval,IOPS01,IOPS02,IOPS03,IOPS04,IOPS05,IOPS06,IOPS07,IOPS08,IOPS09,IOPS10,IOPS11,IOPS12,IOPS13,IOPS14,IOPS15,IOPS16
1000,4407, 889,9569,2956, 477,5845,3812,3079,····,2571,4994,8863,6394,5339,8280,9311
1001,····,····,····,····,····,····,····,····,1400,····,····,····,····,····,····,····
2000,····,2695,3053,1856, 379,8128,9600,····,5880, 283,9501, 709,1964,5371,7242,3195
2001,9109,····,····,····,····,····,····,3949,····,····,····,····,····,····,····,····
3000,5822,6446,····,1194, 876,4995,9210,6827,7535,····,5323,8030,6204,3628,3730,5248
3001,····,····,2767,····,····,····,····,····,····,9906,····,····,····,····,····,····
4000, 839,3552,1773,8294,6993,5490,6852,4005,2986,6665,1455,9396,5018,7121,6750,9895
5000,2368,9335,9547,8177,9450,1309,8387,2812,3305,5156,2432,3797,9101,4568,8340,····
5001,····,····,····,····,····,····,····,····,····,····,····,····,····,····,····,1974
6000,····,5301,8338,6410,····,9253,····,6004,2937,5152,5064, 524, 495,8768,7844,3358
6001,7887,····,····,····,2574,····,2603,····,····,····,····,····,····,····,····,····
7000,6232,····,····,7026,····, 697,8578,5188,8142,····,7734, 801,4680,5435,5463,6949
7001,····,5081,3861,····,4263,····,····,····,····,7595,····,····,····,····,····,····
8000,1485,····,3417,1571,3966,  17,1137,4574,5340,6980,3265,9292,3717,1917,7696, 293
8001,····,1437,····,····,····,····,····,····,····,····,····,····,····,····,····,····
9000,1144,3773,4759,2693,8873,6382,2138,6598,3396,2195,5617,6769,7784,8677,6985,3160
10000,····,····,4708,8404,8185,1482,5875,7255,3210,6531,5991,6930,  87,9459,9566,····
10001,4867,5875,····,····,····,····,····,····,····,····,····,····,····,····,····, 556
11000, 645,5573,8815,7935,8586,4327,····,1037,7596,6827,3717,8642,9853,8470,3549,4239
11001,····,····,····,····,····,····,8774,····,····,····,····,····,····,····,····,····
12000,4658,9373,7810, 707,7024,2071,7768,7940,5482,5641,7628,3714,5999,7547,1883,9796

After replacing debug = True with debug = False, the program can be run like this

$ python3 merge.py f01 f02 f03 f04 f05 f06 f07 f08 f09 f10 f11 f12 f13 f14 f15 f16 OUT
$ cat OUT
Interval,IOPS01,IOPS02,IOPS03,IOPS04,IOPS05,IOPS06,IOPS07,IOPS08,IOPS09,IOPS10,IOPS11,IOPS12,IOPS13,IOPS14,IOPS15,IOPS16
1000,4407, 889,9569,2956, 477,5845,3812,3079,,2571,4994,8863,6394,5339,8280,9311
1001,,,,,,,,,1400,,,,,,,
2000,,2695,3053,1856, 379,8128,9600,,5880, 283,9501, 709,1964,5371,7242,3195
2001,9109,,,,,,,3949,,,,,,,,
3000,5822,6446,,1194, 876,4995,9210,6827,7535,,5323,8030,6204,3628,3730,5248
3001,,,2767,,,,,,,9906,,,,,,
4000, 839,3552,1773,8294,6993,5490,6852,4005,2986,6665,1455,9396,5018,7121,6750,9895
5000,2368,9335,9547,8177,9450,1309,8387,2812,3305,5156,2432,3797,9101,4568,8340,
5001,,,,,,,,,,,,,,,,1974
6000,,5301,8338,6410,,9253,,6004,2937,5152,5064, 524, 495,8768,7844,3358
6001,7887,,,,2574,,2603,,,,,,,,,
7000,6232,,,7026,, 697,8578,5188,8142,,7734, 801,4680,5435,5463,6949
7001,,5081,3861,,4263,,,,,7595,,,,,,
8000,1485,,3417,1571,3966,  17,1137,4574,5340,6980,3265,9292,3717,1917,7696, 293
8001,,1437,,,,,,,,,,,,,,
9000,1144,3773,4759,2693,8873,6382,2138,6598,3396,2195,5617,6769,7784,8677,6985,3160
10000,,,4708,8404,8185,1482,5875,7255,3210,6531,5991,6930,  87,9459,9566,
10001,4867,5875,,,,,,,,,,,,,, 556
11000, 645,5573,8815,7935,8586,4327,,1037,7596,6827,3717,8642,9853,8470,3549,4239
11001,,,,,,,8774,,,,,,,,,
12000,4658,9373,7810, 707,7024,2071,7768,7940,5482,5641,7628,3714,5999,7547,1883,9796
$ 
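
The grouping idea used by the script can also be sketched as a small function that works on in-memory data, which makes it easier to experiment with. This is only a sketch under assumptions: the names tag and merge_rows are hypothetical, the sample values are made up for illustration (loosely based on the question), every source is assumed to be already sorted by interval, and itertools.groupby replaces the explicit t0 bookkeeping of the script.

```python
from heapq import merge
from itertools import groupby

def tag(rows, n):
    """Yield (t, iops, n) so that every record remembers its source index n."""
    for t, iops in rows:
        yield t, iops, n

def merge_rows(*sources):
    """Merge sorted (t, iops) sequences into rows [t, iops_0, ..., iops_N-1]."""
    N = len(sources)
    tagged = [tag(rows, n) for n, rows in enumerate(sources)]
    # merge() keeps the global order as long as each source is sorted by t
    merged = merge(*tagged, key=lambda rec: rec[0])
    # consecutive records with the same t end up in the same output row
    for t, group in groupby(merged, key=lambda rec: rec[0]):
        cells = [''] * N          # empty cell = no record from that source
        for _, iops, n in group:
            cells[n] = iops
        yield [t, *cells]

# made-up sample data, not taken from real measurements
file1 = [(1000, 11217), (2000, 12789), (3001, 12022), (4000, 14028)]
file2 = [(1000, 11236), (2000, 12803), (3000, 12156), (4000, 13889)]

for row in merge_rows(file1, file2):
    print(*row, sep=',')
```

An interval that appears in only one source (3000 and 3001 here) gets an empty cell for the other source, exactly as in the OUT file above.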
answered 2020-05-08T17:21:31.633

This is not an answer... it is a very long, formatted comment


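When you run into this kind of problem, where the other party says that their script works fine and you see that it does not work, you can guess that there is an environment issue in between: their environment is different from yours...

What can you do to sort it out? Taking the current state of your question as an example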

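  1. See what happens when you recreate their environment:

    • in a new folder (please, not everything on the Desktop) copy the helper script (from random import random, seed ...etc etc) and copy merge.py
    • run the helper script
    • check that the sample files were generated correctly
    • run merge.py with debug=True

    If you get the output that was shown to you, the script itself works, so the problem must be in your environment

  2. In a new folder, please, a new folder,
    • copy your data files and merge.py, edit merge.py and set debug=False
    • run the script! python3 merge.py file1 file2 file3 output (using the correct file names)
    • if it works, something on your Desktop was interfering (what? for now, never mind, it works in a fresh environment and that's all)
    • if it does not work, you have to look into it yourself
      • edit merge.py and place prints to check what the script is actually doing, as opposed to what it is supposed to do, because you have a mismatch and you want to understand its cause...
      • e.g., does the argv list contain the expected items? It is easy to add for arg in argv: print(arg) to the code and see whether you get the program name (argv[0]), the input file names and finally the output file name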
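
The last check can be taken one step further. What follows is a minimal sketch, not part of merge.py: check_args is a hypothetical helper name, and it assumes the same argument layout as the script (program name, input files, output file last).

```python
import os
import sys

def check_args(args):
    """Return a list of human-readable problems found in an argv-style list."""
    problems = []
    if len(args) < 3:
        problems.append('expected: program, at least one input file, one output file')
        return problems
    inputs, output = args[1:-1], args[-1]
    for name in inputs:
        if not os.path.isfile(name):
            problems.append('input file not found: %s' % name)
    if output in inputs:
        # opening the output with 'w' would truncate that input file
        problems.append('output file is also an input file: %s' % output)
    return problems

if __name__ == '__main__':
    # show exactly what the script received, one item per line
    for n, arg in enumerate(sys.argv):
        print(n, repr(arg))
    for problem in check_args(sys.argv):
        print('PROBLEM:', problem)
```

Printing repr(arg) rather than arg makes stray spaces or commas in a file name visible.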

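To help you, I will show you a debug version of the script that prints plenty of details about what it is actually doing; but please make an effort to understand what is printed, why it is printed, and what the printed content means when you spot a problem...

Here we go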

$ cat mergedbg.py
'''Usage: python3 mergedbg.py filename1 filename2 ... filenameout'''

from heapq import merge
from sys import argv, stdout

def reader(fobj, n):
    for line in fobj:
        yield [x.strip() for x in line.split(',')]+[n]

print('Contents of argv',
      *((n, arg) for n, arg in enumerate(argv)), sep='\n\t')

# open the files for reading
readers = []
for n, fname in enumerate(argv[1:-1]):
    try:
        print('Trying to open', fname, end='... ')
        readers.append(reader(open(fname), n))
        print('OK')
    except OSError as err:
        # report the reason instead of hiding it behind a bare except
        print('oops, something went wrong!', err)

# open the file for writing
try:
    print('Trying to open', argv[-1], 'for writing', end='... ')
    out  = open(argv[-1], 'w')
    print('OK')
except OSError as err:
    print('oops, something went wrong!', err)
    raise SystemExit(1)  # nothing can be written without an output file

N = len(readers)
print("About to loop over %d files"%N)

# header is "Interval,IOPS01,IOPS02,...,IOPSnn" modify as you like
print('Interval',*('IOPS%02d'%(n+1) for n in range(N)), sep=',', file=out)

# initialize 
t0, icount, ocount = "", 0, 0
for t, iops, n in merge(*readers, key=lambda t:int(t[0])):
    icount += 1
    print((icount, n, t, iops), end = '')
    if t != t0:
        if t0:
            ocount += 1
            print(t0, *iops_l, sep=',', file=out)
        t0 = t
        iops_l = ['']*N 
    iops_l[n] = iops

# the last record to output
ocount += 1
print(t0, *iops_l, sep=',', file=out)
# summary
print("\n%4d input records read,"%icount,
      "\n%4d output records written (1 header, %d data)."%(ocount+1, ocount))

Of course, you have to run the script with your own file names...

$ python3 mergedbg.py f01 f02 f03 f04 OUTdbg
Contents of argv
        (0, 'mergedbg.py')
        (1, 'f01')
        (2, 'f02')
        (3, 'f03')
        (4, 'f04')
        (5, 'OUTdbg')
Trying to open f01... OK
Trying to open f02... OK
Trying to open f03... OK
Trying to open f04... OK
Trying to open OUTdbg for writing... OK
About to loop over 4 files
(1, 0, '1000', '4407')(2, 1, '1000', '889')(3, 2, '1000', '9569')(4, 3, '1000', '2956')(5, 1, '2000', '2695')(6, 2, '2000', '3053')(7, 3, '2000', '1856')(8, 0, '2001', '9109')(9, 0, '3000', '5822')(10, 1, '3000', '6446')(11, 3, '3000', '1194')(12, 2, '3001', '2767')(13, 0, '4000', '839')(14, 1, '4000', '3552')(15, 2, '4000', '1773')(16, 3, '4000', '8294')(17, 0, '5000', '2368')(18, 1, '5000', '9335')(19, 2, '5000', '9547')(20, 3, '5000', '8177')(21, 1, '6000', '5301')(22, 2, '6000', '8338')(23, 3, '6000', '6410')(24, 0, '6001', '7887')(25, 0, '7000', '6232')(26, 3, '7000', '7026')(27, 1, '7001', '5081')(28, 2, '7001', '3861')(29, 0, '8000', '1485')(30, 2, '8000', '3417')(31, 3, '8000', '1571')(32, 1, '8001', '1437')(33, 0, '9000', '1144')(34, 1, '9000', '3773')(35, 2, '9000', '4759')(36, 3, '9000', '2693')(37, 2, '10000', '4708')(38, 3, '10000', '8404')(39, 0, '10001', '4867')(40, 1, '10001', '5875')(41, 0, '11000', '645')(42, 1, '11000', '5573')(43, 2, '11000', '8815')(44, 3, '11000', '7935')(45, 0, '12000', '4658')(46, 1, '12000', '9373')(47, 2, '12000', '7810')(48, 3, '12000', '707')
  48 input records read, 
  19 output records written (1 header, 18 data).
$ cat OUTdbg 
Interval,IOPS01,IOPS02,IOPS03,IOPS04
1000,4407,889,9569,2956
2000,,2695,3053,1856
2001,9109,,,
3000,5822,6446,,1194
3001,,,2767,
4000,839,3552,1773,8294
5000,2368,9335,9547,8177
6000,,5301,8338,6410
6001,7887,,,
7000,6232,,,7026
7001,,5081,3861,
8000,1485,,3417,1571
8001,,1437,,
9000,1144,3773,4759,2693
10000,,,4708,8404
10001,4867,5875,,
11000,645,5573,8815,7935
12000,4658,9373,7810,707
$ 
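
One more sanity check is possible on the result: in the run above 48 input records were read, and the 18 data rows of OUTdbg contain exactly 48 non-empty IOPS cells. A small sketch of that count follows; count_cells is a hypothetical helper, the sample text is just the first three data rows of OUTdbg, and the tally only matches when no input file repeats an interval (a repeated interval would overwrite a cell).

```python
import csv
import io

def count_cells(merged_text):
    """Return (data_rows, non_empty_cells) for a merged CSV with a header line."""
    rows = list(csv.reader(io.StringIO(merged_text)))
    data = rows[1:]  # skip the header
    # column 0 is the interval, every other non-blank cell is one input record
    cells = sum(1 for row in data for cell in row[1:] if cell.strip())
    return len(data), cells

sample = """Interval,IOPS01,IOPS02,IOPS03,IOPS04
1000,4407,889,9569,2956
2000,,2695,3053,1856
2001,9109,,,
"""
print(count_cells(sample))  # → (3, 8)
```

For a real file, the text can be obtained with open('OUTdbg').read() and the second number compared with the "input records read" line of the debug output.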

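As a closing remark, I think there is a subtle misunderstanding between us, and I seem to have a hard time finding the problem... I can honestly suggest that you ask one of your colleagues for help, even one with limited knowledge of Python: having someone look over your shoulder at your procedure and ask you questions as you go along can be really helpful.

That said, feel free to ask me other questions :-)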

answered 2020-05-14T10:44:26.203