python - What is the best way to sort a sequence in Python?

Question

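I am trying to sort a table based on certain conditions that need to occur several times in a row. A simplified version of the table: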

Number  Time
   1    23
   2    45
   3    67
   4    23
   5    11
   6    45
   7    123
   8    34

...

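I need to check whether Time is <40 five times in a row. That is, I need to check rows 1-5, then 2-6, and so on. The first and last times should then be printed and saved to a file. For example, if the condition is met for rows 2-6, I need to print the times for rows 2 and 6. Checking should stop once the condition has been met; there is no need to check the remaining rows. So far I have implemented a counter with two temporary variables to check 3 items in a row. It works fine. But if I want to check for a condition that occurs 30 times in a row, I can't just create 30 temporary variables by hand. What is the best way to achieve this? I guess I just need some kind of loop. Thanks!

Here is part of my code: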

reader = csv.reader(open(filename))
counter, temp1, temp2, numrow = 0, 0, 0, 0

for row in reader:
    numrow+=1
    if numrow <5:
        col0, col1, col4, col5, col6, col23, col24, col25 = (
            float(row[0]), float(row[1]), float(row[4]), float(row[5]),
            float(row[6]), float(row[23]), float(row[24]), float(row[25]))
        if col1 <= 40:
            list1=(col1, col3, col4, col5, col6, col23, col24, col25)
            counter += 1
            if counter == 3:
                print("Cell# %s" %filename[-10:-5])
                print LAYOUT.format(*headers_short)
                print LAYOUT.format(*temp1)
                print LAYOUT.format(*temp2)
                print LAYOUT.format(*list1)
                print ""

            elif counter == 1:
                temp1=list1

            elif counter == 2:
                temp2=list1

        else:
            counter = 0

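I implemented the solution Bakuriu suggested and it seems to be working. But what is the best way to combine numerous tests? I need to check several conditions. Let's say:

Efficiency less than 40 for 10 consecutive cycles,
Capacity less than 40 for 5 consecutive cycles,
Time less than 40 for 25 consecutive cycles,
and some others…

Right now I just open a csv.reader for each test and run the function. I guess this is not the most efficient way, although it works. Sorry, I'm a complete noob.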

csvfiles = glob.glob('processed_data/*.stat')
for filename in csvfiles:

    flag = []
    flag.append(filename[-12:-5])
    reader = csv.reader(open(filename))
    for a, row_group in enumerate(row_grouper(reader, 10)):
        if all(float(row[1]) < 40 for row in row_group):
            # a is the index of the first row in the group.
            str1 = "Efficiency is less than 40 in cycles " + str(a+1) + '-' + str(a+10)
            flag.append(str1)
            break  # stop processing other rows.

    reader = csv.reader(open(filename))
    for b, row_group in enumerate(row_grouper(reader, 5)):
        if all(float(row[3]) < 40 for row in row_group):
            str1 = "Capacity is less than 40 minutes in cycles " + str(b+1) + '-' + str(b+5)
            flag.append(str1)
            break  # stop processing other rows.

    reader = csv.reader(open(filename))
    for c, row_group in enumerate(row_grouper(reader, 25)):
        if all(float(row[3]) < 40 for row in row_group):
            str1 = "Time is less than 40 in cycles " + str(c+1) + '-' + str(c+25)
            flag.append(str1)
            break  # stop processing other rows.

    if len(flag) > 1:
        for i in flag:
            print i
        print '\n'

score 2 · Accepted Answer

You don't have to sort your data at all. A simple solution could be:

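import csv

def row_grouper(reader, size=5):
    # Yield every group of `size` consecutive rows (5 by default).
    iterrows = iter(reader)
    current = [next(iterrows) for _ in range(size)]
    yield current
    for next_row in iterrows:
        current.pop(0)            # drop the oldest row of the group
        current.append(next_row)  # add the new row at the end
        yield current             # yield after updating, so the last group is not skipped


reader = csv.reader(open(filename))

for i, row_group in enumerate(row_grouper(reader)):
    if all(float(row[1]) < 40 for row in row_group):
        print i, i+5  # i is the index of the first row in the group.
        break  # stop processing other rows.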

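The row_grouper function is a generator that yields 5-element lists of consecutive rows. Each time, it removes the first row of the group and appends the new row at the end.

Instead of a plain list you can use a deque and replace the pop(0) in row_grouper with a call to popleft(), which is more efficient, although this doesn't matter much if the list has only 5 elements.

Alternatively, you can follow martineau's suggestion and use the maxlen keyword argument, avoiding the pop altogether. This is about twice as fast as using deque's popleft, which in turn is about twice as fast as using list's pop(0).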
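A minimal Python 3 sketch of the maxlen variant described above. The name row_grouper_deque and the sample rows are made up for illustration. Each group is copied into a tuple, so later appends to the deque cannot change a group that was already yielded.

```python
from collections import deque
from itertools import islice


def row_grouper_deque(rows, size=5):
    """Yield each group of `size` consecutive rows as a tuple."""
    iterrows = iter(rows)
    # Pre-fill the window; a deque with maxlen drops its oldest item on append.
    window = deque(islice(iterrows, size), maxlen=size)
    if len(window) < size:
        return  # not enough rows for even one group
    yield tuple(window)
    for row in iterrows:
        window.append(row)  # the oldest row falls off the left automatically
        yield tuple(window)


# Hypothetical sample data: (number, time) pairs as strings, like csv rows.
sample_rows = [("1", "23"), ("2", "45"), ("3", "30"), ("4", "23"),
               ("5", "11"), ("6", "39"), ("7", "12"), ("8", "34")]

first_match = None
for i, group in enumerate(row_grouper_deque(sample_rows, size=5)):
    if all(float(row[1]) < 40 for row in group):
        first_match = (i + 1, i + 5)  # 1-based first and last row numbers
        break
```

With this sample data the first group whose times are all below 40 spans rows 3 to 7.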

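Edit: to check several conditions, you can use row_grouper once per condition, each with its own group size, and use itertools.tee to obtain copies of the iterable.

For example: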

import itertools as it

def check_condition(group, row_index, limit, found):
    # group is None once its row_grouper has run out of groups; a condition
    # that was already found does not need to be checked again.
    if group is None or found:
        return False
    return all(float(row[row_index]) < limit for row in group)


# reader is the csv.reader from before; row_grouper is assumed to take the
# group size as its second argument.
f_iter, s_iter, t_iter = it.tee(iter(reader), 3)

groups = row_grouper(f_iter, 10), row_grouper(s_iter, 5), row_grouper(t_iter, 25)

found_first = found_second = found_third = False

for index, (first, second, third) in enumerate(it.izip_longest(*groups)):
    if check_condition(first, 1, 40, found_first):
        # stuff
        found_first = True
    if check_condition(second, 3, 40, found_second):
        # stuff
        found_second = True
    if check_condition(third, 3, 40, found_third):
        # stuff
        found_third = True
    if found_first and found_second and found_third:
        # stop the code if we matched all the conditions once.
        break

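The first part simply imports itertools (and assigns it the alias it, to avoid typing itertools every time).

I defined the check_condition function because the conditions are getting more complex and you don't want to repeat them over and over. As you can see, the last line of check_condition is the same condition as before: it checks whether the current "row group" satisfies the property. Since we plan to iterate over the file only once, and we cannot stop the loop when only one condition is met (because we would miss the others), we must use some flags that tell us whether, for example, the condition on the time has already been met. As you can see in the for loop, we break out of the loop once all the conditions have been satisfied.

Now, the line: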

f_iter, s_iter, t_iter = it.tee(iter(reader), 3)

creates an iterable over the rows of reader and makes 3 copies of it. This means that the loop:

for row in f_iter:
    print(row)

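will print all the rows of the file, just like for row in reader. Note, however, that itertools.tee lets us obtain copies of the rows without reading the file more than once.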
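A small Python 3 illustration of that behaviour. The in-memory list stands in for a csv.reader and is made up for the example. One thing to keep in mind is that tee buffers items until every copy has consumed them, so copies that drift far apart hold more rows in memory.

```python
import itertools

# Stand-in for csv.reader(open(filename)): an iterator that can be read only once.
rows = iter([["1", "23"], ["2", "45"], ["3", "67"]])

first_copy, second_copy = itertools.tee(rows, 2)

# Each copy yields every row, independently of the other copy.
from_first = list(first_copy)
from_second = list(second_copy)
```

After calling tee, only the copies should be used; advancing the original iterator directly would make the copies miss rows.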

After that, we must pass these rows to row_grouper in order to verify the conditions:

groups = row_grouper(f_iter, 10), row_grouper(s_iter, 5), row_grouper(t_iter, 25)

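Finally, we have to loop over the "row groups". To do this for all of them at once, we use itertools.izip_longest (renamed itertools.zip_longest, without the i, in python3). It works just like zip, which creates tuples of elements (e.g. zip([1, 2, 3], ["a", "b", "c"]) -> [(1, "a"), (2, "b"), (3, "c")]). The difference is that izip_longest pads the shorter iterables with None. This ensures that we check the conditions on all possible groups (and it is also why check_condition has to check if group is None).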
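The difference between the two can be seen in a short Python 3 example (using the Python 3 name zip_longest; the two input lists are made up):

```python
from itertools import zip_longest

short = [1, 2]
longer = ["a", "b", "c"]

# zip stops as soon as the shortest input is exhausted.
zipped = list(zip(short, longer))

# zip_longest keeps going and fills the missing values with None.
padded = list(zip_longest(short, longer))
```

Here zipped is [(1, "a"), (2, "b")], while padded also contains the final pair (None, "c").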

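To obtain the current row index we wrap everything in enumerate, just as before. The code inside the for is quite simple: you check each condition using check_condition and, if it is satisfied, you do whatever you have to do and set the flag for that condition (so that in the following iterations the condition will always be False).

(Note: I must say that I didn't test the code. I'll test it when I have time; in any case I hope I gave you some ideas. Also take a look at the itertools documentation.)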
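Since the code above is untested, here is a self-contained Python 3 sketch of the same single-pass idea that can be run on in-memory data. It is only one possible arrangement: the names windows and first_runs, the checks dictionary and the sample rows are all hypothetical, and a dictionary of results replaces the separate found_* flags.

```python
import itertools
from collections import deque


def windows(rows, size):
    """Yield each run of `size` consecutive rows as a tuple."""
    window = deque(maxlen=size)
    for row in rows:
        window.append(row)
        if len(window) == size:
            yield tuple(window)


def first_runs(rows, checks):
    """Return {label: (first_row, last_row)} for the first group passing each check.

    `checks` maps a label to (column, limit, size); row numbers are 1-based.
    """
    labels = list(checks)
    copies = itertools.tee(rows, len(labels))
    groups = [windows(copy, checks[label][2]) for copy, label in zip(copies, labels)]
    found = {}
    for index, current in enumerate(itertools.zip_longest(*groups)):
        for label, group in zip(labels, current):
            if group is None or label in found:
                continue  # no more groups of this size, or already matched
            column, limit, size = checks[label]
            if all(float(row[column]) < limit for row in group):
                found[label] = (index + 1, index + size)
        if len(found) == len(labels):
            break  # every condition matched once
    return found


# Made-up data: column 1 is below 40 on rows 3-6, column 3 on rows 2-4.
data = [
    ["1", "50", "0", "90"],
    ["2", "45", "0", "10"],
    ["3", "30", "0", "20"],
    ["4", "20", "0", "30"],
    ["5", "10", "0", "95"],
    ["6", "35", "0", "99"],
    ["7", "80", "0", "15"],
]
result = first_runs(iter(data), {"efficiency": (1, 40, 4), "capacity": (3, 40, 3)})
```

For this data, result maps "efficiency" to (3, 6) and "capacity" to (2, 4). A condition that never matches is simply absent from the result.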

score 1 · Accepted Answer

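You don't actually need to sort your data, just keep track of whether the condition you're looking for has occurred in each of the last N rows of data. Fixed-size collections.deques are good for that sort of thing.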

import csv
from collections import deque

filename = 'table.csv'
GROUP_SIZE = 5
THRESHOLD = 40
cond_deque = deque(maxlen=GROUP_SIZE)  # results of the last GROUP_SIZE checks

with open(filename) as datafile:
    reader = csv.reader(datafile)  # assume delimiter=','
    next(reader)  # skip header row
    for linenum, row in enumerate(reader, start=1):  # process rows of file
        col0, col1, col4, col5, col6, col23, col24, col25 = (
            float(row[i]) for i in (0, 1, 4, 5, 6, 23, 24, 25))
        cond_deque.append(col1 < THRESHOLD)
        if cond_deque.count(True) == GROUP_SIZE:
            print('lines {}-{} had {} consecutive rows with col1 < {}'.format(
                linenum-GROUP_SIZE+1, linenum, GROUP_SIZE, THRESHOLD))
            break  # found, so stop looking
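For a single condition, a plain run-length counter also works for any N without storing the rows. This is a different technique from the deque above, sketched in Python 3 under the assumption that the values have already been converted to numbers; the function name first_run_below is made up.

```python
def first_run_below(values, threshold, run_length):
    """Return the 1-based (first, last) positions of the first run, or None."""
    streak = 0
    for position, value in enumerate(values, start=1):
        # Extend the current streak, or reset it when the condition fails.
        streak = streak + 1 if value < threshold else 0
        if streak == run_length:
            return position - run_length + 1, position
    return None


times = [23, 45, 67, 23, 11, 45, 123, 34]  # the Time column from the question

pair = first_run_below(times, 40, 2)
five = first_run_below(times, 40, 5)
```

With the sample table from the question, the first two consecutive times below 40 are at rows 4 and 5, and there is no run of five, so five is None.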
