4

使用 bash 或 python (2.4.x)

我有一个文件 - 文件中有大约 100 行左右,文件的结构是这样的。

aaaaa,  100
aaaab,  75
aaaac,  150
aaaad,  135
aaaae,  144
aaaaf,  12
aaaag,  5
aaaah,  34
aaaai,  11
aaaaj,  43
aaaak,  88
aaaal,  3
baaaa,  25
baaab,  33
baaac,  87
baaad,  111
baaae,  45
baaaf,  99
baaag,  71
baaah,  68
baaai,  168
baaaj,  21
baaak,  11
baaal,  47
caaaa,  59
caaab,  85
caaac,  77
caaad,  33
caaae,  44
caaaf,  16
caaag,  111
caaah,  141
caaai,  87
caaaj,  59
caaak,  89
caaal,  3

我想要做的是将它分成 12 列,每列具有大致相同数量的传感器,并且每列的总和接近相同。

换句话说,如果我把上面的列表像这样拆分。

aaaaa   100     aaaab   75      baaab   33
aaaai   11      baaah   68      baaac   87
aaaak   88      caaaa   59      caaac   77
       199             202              197

aaaah   34      baaaf   99      caaad   33
baaad   111     baaal   47      aaaac   150
aaaaj   43      caaae   44      caaaf   16
       188             190              199

aaaag   5       aaaaf   12      baaaa   25
aaaad   135     caaai   87      caaag   111
caaaa   59      caaak   89      baaag   71
       199                 188          207

aaaae   144     baaaj   21      caaaj   59
aaaal   3       baaak   11      caaah   141
baaae   45      baaai   168     caaal   3
       192              200              203

它使 12 列相等的项目并且非常接近偶数。

我可以手动完成,但我们最终需要这样做几次。我什至不确定从哪里开始,除了将它放入一个数组中,计算数组中的项目并进行随机拆分。尽管如此,仍然停留在价值平衡上。

任何指针?

4

4 回答 4

3

如果您想要一个最佳解决方案,这对于大量输入来说不会很有趣。您正在寻找与 CS 中一些非常著名的难题相符的东西——背包装箱等。一些更简单、不太完美的解决方案可能已经足够接近了。

这并不准确,但是,鉴于您的示例数据集,我设法通过一个非常简单的方法获得了 214、197、194、199、205、182、195、192、199、199、206、208 的大小。它可能适用于也可能不适用于真实数据。

方法是:

  1. 按大小排序列表
  2. 将列表分为 3 部分 - 高、中和低
  3. 将 high 的每个成员放在一个集合中。
  4. 反转中低列表。
  5. 将它们(以相反的顺序)放入现有集合中

随着您越来越接近最佳分区,解决方案可能会变得更加复杂。

于 2012-05-15T16:43:57.127 回答
1

有趣的问题,我认为您在寻找最佳解决方案时遇到了相当耗时的过程。您可以计算每次拆分的项目数以及它们应具有的平均值。在整数上对项目进行排序并在值仍低于平均值时取最大的数字,然后重复此过程直到您只需要再添加一个项目,现在选择最小的并尝试尽可能接近平均值(超过或低于无关紧要)。

如果在除最新之外的任何步骤中遇到困难(例如值 > 平均值),请返回并选择下一个最大的步骤。

于 2012-05-15T16:09:12.830 回答
1

我写了两个非常简单的实现。第一个使用双端队列从右侧和左侧弹出(列表排序后)以将低值与高值放置。第二个是@Sean McSomething 建议的。

这是代码(quick'n'dirty - 不幸的是很少有评论):

import math
import itertools
import collections


def sum_column(data):
    return sum(zip(*data)[1], 0.0)


def split_groups(sensors):
    sensors.sort(key=lambda item: item[1], reverse=True)
    per_group = len(sensors) // 12
    average = sum_column(sensors) / len(sensors)
    data = collections.deque(sensors)
    groups = [[] for i in xrange(12)]
    cycle = itertools.cycle(groups)
    try:
        while True:
            current = cycle.next()
            if len(current) == per_group - 1:
                if sum_column(current) < average:
                    current.append(data.popleft())
                else:
                    current.append(data.pop())
                continue
            current.append(data.popleft())
            current.append(data.pop())
    except IndexError:
        return groups


def split_groups2(sensors):
    sensors.sort(key=lambda item: item[1], reverse=True)
    groups = [[] for i in xrange(12)]
    cycle = itertools.cycle(groups)
    per_group = int(math.ceil(len(sensors) / 3.))
    partitions = [sensors[i:i + per_group] for i in xrange(0, len(sensors)
                                                           per_group)]
    medium, low = map(reversed, partitions[1:])
    for sensor, value in itertools.chain(partitions[0], medium, low):
        cycle.next().append((sensor, value))
    return groups


def format_groups(result):
    ret = []
    for group in result:
        tmp = []
        tmp.append('\n'.join('{0}   {1}'.format(k, v) for k, v in group))
        tmp.append(' ' * 8 + str(int(sum_column(group))))
        ret.append('\n'.join(tmp))
    return '\n\n'.join(ret)


if __name__ == '__main__':
    import sys

    implementation = split_groups
    if '--second' in sys.argv:
        sys.argv.remove('--second')
        implementation = split_groups2

    with open(sys.argv[1]) as fobj:
        sensors = []
        for line in fobj:
            sensor, value = line.strip().split(',  ')
            sensors.append((sensor, int(value)))
        sys.stdout.write(format_groups(split_groups(sensors)))
        sys.stdout.write('\n')

要点:
https ://gist.github.com/2703965

我把简单的部分(格式化)留给你了。现在它只是垂直打印它们(而不是按照您的要求垂直打印)。应该不会太难。

这是它可以达到的最好效果(使用两种实现):

(max(sums) - min(sums)) / 2. = 16.0

这与示例相去甚远,但这是一个开始。您可以使用文件名和可选的开关从命令行启动它--second(以使用第二个实现)。我可以使用命令行解析器,但我习惯于 argparse,这在 Python 2.4 中不存在。所以我只是去那个尴尬的黑客。

示例运行:

$ python2 groupit.py filename.txt
baaai   168
caaal   3
aaaaj   43
        214

aaaac   150
aaaal   3
caaae   44
        197

aaaae   144
aaaag   5
baaae   45
        194

caaah   141
baaak   11
baaal   47
        199

aaaad   135
aaaai   11
caaaj   59
        205

baaad   111
aaaaf   12
caaaa   59
        182

caaag   111
caaaf   16
baaah   68
        195

aaaaa   100
baaaj   21
baaag   71
        192

baaaf   99
baaaa   25
aaaab   75
        199

caaak   89
caaad   33
caaac   77
        199

aaaak   88
baaab   33
caaab   85
        206

baaac   87
aaaah   34
caaai   87
        208
$ python2 groupit.py --second filename.txt
baaai   168
caaal   3
aaaaj   43
        214

aaaac   150
aaaal   3
caaae   44
        197

aaaae   144
aaaag   5
baaae   45
        194

caaah   141
baaak   11
baaal   47
        199

aaaad   135
aaaai   11
caaaj   59
        205

baaad   111
aaaaf   12
caaaa   59
        182

caaag   111
caaaf   16
baaah   68
        195

aaaaa   100
baaaj   21
baaag   71
        192

baaaf   99
baaaa   25
aaaab   75
        199

caaak   89
caaad   33
caaac   77
        199

aaaak   88
baaab   33
caaab   85
        206

baaac   87
aaaah   34
caaai   87
        208

对于问题中的示例,这两种算法给出了完全相同的答案。如果您可以提供更多测试用例,我会尝试改进它们。我在 Python 2.7 上测试了脚本,因为我没有安装 2.4。
对不起,答案很长。

于 2012-05-15T18:34:59.230 回答
0

到目前为止,这是我想出的。我认为它非常接近,并且没有真正高价值的东西有点抵消 - 它很接近。

很高兴听到任何建议,使其更加 Pythonic。

文件:columnsplit.py

#!/usr/bin/python
import sys, operator

# usage
# columnsplit.py <filename> <#cols>
# columnsplit.py test.csv 12
#

#determine number of devices per column
def devicelisting(fulllist,percolumn):
  devicelist=[]
  fobj=open(fulllist,'r')
  for line in fobj:
    (key, val) = line.split(',')
    devicelist.append((key,int(val)))
  devicespercol=(len(devicelist)/int(percolumn))
  return(devicelist,devicespercol)

def devicesplit(fulllist,numcolumns,roundnum):
  if roundnum == 0:
    devices=sorted(fulllist, key=lambda device: device[1], reverse=True)
    devicestemp=devices
  else:
    devices=sorted(fulllist, key=lambda device: device[1])
    devicestemp=devices
  deviceslice=[]
  for idx, val in zip(range(numcolumns), devices):
    deviceslice.append(val)
    devicestemp.remove(val)
  return(deviceslice,devicestemp)

def makecolumns(roundnumber,percol):
  column=[]
  for i in range(percol):
    exec('tempslice=deviceslice%s' % i)
    column.append(tempslice[roundnumber])
  return(column)

# what this is going to do is generate how many devices will fill each of the intended
# number of columns.  What is left over will be run again against the lowest value of columns

if __name__ == '__main__':
  tempslice=[]
  devices,percol=devicelisting(sys.argv[1],sys.argv[2])
  # devices is the devices/value as tuples nested in a list
  # percol is going to be how many devices per column
  # you can len(devices) to count how many devices we have

  # prints out the device list in reverse.
  # print sorted(devices, key=lambda x: x[1], reverse=True)

  # what we will need to do here is split the device list into number of desired slices.  i.e. if we want 12 columns
  # and we have 108 devices there should be 9 slices of 12.
  # this will leave a remaining slice - of less than 9 which will be added to the 12 columns in order of smallest column first

  devicesleft=devices
  numcolumns=int(sys.argv[2])
  for i in range(percol):
    sendcol,devicesleft=devicesplit(devicesleft,numcolumns,i)
    exec('deviceslice%s=sendcol' % i)
# and finally create the columns
  for i in range(0,numcolumns):
    sendcol=makecolumns(i,percol)
    exec('column%s=sendcol' % i)

  # add the left over devices
  j=numcolumns
  # sort remaining reverse.
  devices=sorted(devicesleft, key=lambda device: device[1], reverse=True)
  for i in range(len(devices)):
    j-=1
    exec('column%s.append(devices[i])' % j)

  # prints out the resulting columns
  for i in range(0,numcolumns):
    exec('tempcol=column%s' % i)
    print tempcol
    print sum([pair[1] for pair in tempcol])

我跑过的测试文件。

文件:test44a.csv

SQCIEOEO,1272
HIKTXYZH,281
JZHRZXKX,5793
UBGTOLUX,147
WBVYFNBN,9
VMHTKHBU,32
GILGFWDA,1334
YKUMWOKT,2066
PFSVTUIP,51
GPJRWKMD,673
TYJZUNZS,27
XTFUHPNX,2102
VFSPABFG,65
ROYOZKRS,189
IARDNRVL,587
LBFSQTQL,973
ZJBZKGFB,21301
UEPUOHMW,20
HEAVWVGH,0
XMANFQZE,719
ZADKGIMB,82
NCVBJIYR,27
NYMJUSQR,20646
EQFKHEOH,2050
ERRLAENN,19
HIPRQNIE,12557
MVNHODYT,20
UEDBIRIN,14
JAZJEMXL,28
UMDLALPN,36
GCUUGTNA,0
XRCGIKTR,12
KSBPEYBZ,20657
LELLPAYW,43792
DTRKMFLK,73
WNQEXJWI,41
CYXHXYHI,10
CSUSTTOX,120
NFHZLSJH,23
FAMDKJLM,25
HIUEHBNJ,261
UIBNCQKP,40
WSPHKYOQ,30025
ZBUJKFWR,0
OQWVSKFM,49
SHZUXKKU,21
CZBMYQDX,45
RXGBCCTR,17
SPMLASXS,15
ZWNXGXRI,59
WTVUJZSB,22
WYDZBWQU,19100
MDFMVCFV,6133
ZSSGQJPM,25
CKHMJZOG,85
YRFZOWTB,28
AYNWBSRA,14
LJGBTVOW,13110
GWJPWXWU,16
PCUDYNEY,179
MSVNLMOX,62
WUYPPNMW,2285
KVLGTIBI,11
KWMIKQHW,11
JDKUPYRM,1851
DARXQYDY,68
UUPXIDEP,139
SKQZMTFY,4377
ZEPOWAEA,189
BWXRVAPP,167
VFMDIRTA,561
BKANEGMD,2122
LBRICWID,1775
TGVOGLDC,3650
QQGZHAAJ,81
KAXPHJSS,122
LKAOHISA,32
ONOVZSYQ,41
IEPQEPZP,62
QWEXGXQS,0
IQGPZYQO,15
MEJLXIBG,10
MRWRHWHX,10
TMVAJLSS,57
BYIAXYOJ,173
DYUAGWGT,248
ODLVZSST,21
EOTOZLHA,6476
KPBHOQQR,30
OLSVIYOW,539
CZSCSLVX,17
ZPMYBTZL,11
IATWRKOF,12507
WGBEFQBH,41
PUJIFEFE,382
TSDULCGU,9070
DARUKFAG,209
MBLRRNYH,250
IIQNNWSG,25
OWBZYIUC,1808
ILXTRXZD,2012
ZLVRZUYH,269
CPVPLOWZ,108
KYZJGTMO,635
EJHWGHZG,25
TUXTOWBR,11
LXGXLCWW,2313
AVFHPRWT,915
AEPHMPNF,32
KLZZHAQT,56
XWQJZNFA,611
JKHYCDSC,1455

运行它的命令:python columnsplit.py test44a 12(12 是所需列的数量)。

示例输出列的值为第一列。:

1) 45577 [('LELLPAYW', 43792), ('HEAVWVGH', 0), ('XRCGIKTR', 12), ('ODLVZSST', 21), ('VMHTKHBU', 32), ('TMVAJLSS', 57), ('KAXPHJSS', 122), ('ZLVRZUYH', 269), ('SQCIEOEO', 1272)]

2) 31906 [('WSPHKYOQ', 30025), ('GCUUGTNA', 0), ('UEDBIRIN', 14), ('WTVUJZSB', 22), ('LKAOHISA', 32), ('ZWNXGXRI', 59), ('UUPXIDEP', 139), ('HIKTXYZH', 281), ('GILGFWDA', 1334)]

3) 23416 [('ZJBZKGFB', 21301), ('ZBUJKFWR', 0), ('AYNWBSRA', 14), ('NFHZLSJH', 23), ('AEPHMPNF', 32), ('MSVNLMOX', 62), ('UBGTOLUX', 147), ('PUJIFEFE', 382), ('JKHYCDSC', 1455)]

4) 23276 [('KSBPEYBZ', 20657), ('QWEXGXQS', 0), ('SPMLASXS', 15), ('FAMDKJLM', 25), ('UMDLALPN', 36), ('IEPQEPZP', 62), ('BWXRVAPP', 167), ('OLSVIYOW', 539), ('LBRICWID', 1775)]

5) 23342 [('NYMJUSQR', 20646), ('WBVYFNBN', 9), ('IQGPZYQO', 15), ('ZSSGQJPM', 25), ('UIBNCQKP', 40), ('VFSPABFG', 65), ('BYIAXYOJ', 173), ('VFMDIRTA', 561), ('OWBZYIUC', 1808)]

6) 21877 [('WYDZBWQU', 19100), ('CYXHXYHI', 10), ('GWJPWXWU', 16), ('IIQNNWSG', 25), ('WNQEXJWI', 41), ('DARXQYDY', 68), ('PCUDYNEY', 179), ('IARDNRVL', 587), ('JDKUPYRM', 1851)]

7) 16088 [('LJGBTVOW', 13110), ('MEJLXIBG', 10), ('RXGBCCTR', 17), ('EJHWGHZG', 25), ('ONOVZSYQ', 41), ('DTRKMFLK', 73), ('ROYOZKRS', 189), ('XWQJZNFA', 611), ('ILXTRXZD', 2012)]

8) 15607 [('HIPRQNIE', 12557), ('MRWRHWHX', 10), ('CZSCSLVX', 17), ('TYJZUNZS', 27), ('WGBEFQBH', 41), ('QQGZHAAJ', 81), ('ZEPOWAEA', 189), ('KYZJGTMO', 635), ('EQFKHEOH', 2050)]

9) 17952 [('IATWRKOF', 12507), ('KVLGTIBI', 11), ('ERRLAENN', 19), ('NCVBJIYR', 27), ('CZBMYQDX', 45), ('ZADKGIMB', 82), ('DARUKFAG', 209), ('GPJRWKMD', 673), ('YKUMWOKT', 2066), ('LXGXLCWW', 2313)]

10) 15982 [('TSDULCGU', 9070), ('KWMIKQHW', 11), ('UEPUOHMW', 20), ('JAZJEMXL', 28), ('OQWVSKFM', 49), ('CKHMJZOG', 85), ('DYUAGWGT', 248), ('XMANFQZE', 719), ('XTFUHPNX', 2102), ('TGVOGLDC', 3650)]

11) 14358 [('EOTOZLHA', 6476), ('ZPMYBTZL', 11), ('MVNHODYT', 20), ('YRFZOWTB', 28), ('PFSVTUIP', 51), ('CPVPLOWZ', 108), ('MBLRRNYH', 250), ('AVFHPRWT', 915), ('BKANEGMD', 2122), ('SKQZMTFY', 4377)]

12) 15683 [('MDFMVCFV', 6133), ('TUXTOWBR', 11), ('SHZUXKKU', 21), ('KPBHOQQR', 30), ('KLZZHAQT', 56), ('CSUSTTOX', 120), ('HIUEHBNJ', 261), ('LBFSQTQL', 973), ('WUYPPNMW', 2285), ('JZHRZXKX', 5793)]
于 2012-05-25T18:55:25.757 回答