python - Split a list based on when a pattern of consecutive numbers stops

Question

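I have a list as output. I want to break it up into separate lists whenever the following number does not equal the previous value.

 x = [1,4,4,5,5,8,8,10,10,25,25,70,70,90,90,100,2,3,3,4,4,5,5,8,8,9,20,21,21,22,23]

I want lists like this:

 a = [1,4,4,5,5,8,8,10,10,25,25,70,70,90,90,100]

 b = [2,3,3,4,4,5,5,8,8,9]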

 c = [20,21,21,22]

 d = [23]

score 4 · Accepted Answer

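To answer your question as stated:

I have [...] a list. I want to break it up into separate lists whenever the following number does not equal the previous value.

Have a look at itertools.groupby.

Example: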

import itertools
l = [38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13]
for x, v in itertools.groupby(l):
    # `v` is an iterator that yields all subsequent elements
    # that have the same value
    # `x` is that value
    print(list(v))

The output is:

$ python test.py
[38]
[1200, 1200]
[306, 306]
[391, 391]
[82, 82]
[35, 35]
[902, 902]
[955, 955]
[13]

Which is apparently what you asked for?
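
If the runs are wanted as one list of lists rather than printed one by one, a minimal sketch is the comprehension below. The names values and runs are chosen here for illustration and do not come from the answer.

```python
import itertools

values = [38, 1200, 1200, 306, 306, 13]
# groupby yields a (key, iterator) pair for each run of equal adjacent values
runs = [list(group) for _, group in itertools.groupby(values)]
print(runs)
```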

As for your pattern, here is a generator function that (at least) produces the output you expect for the given input:

import itertools

def split_sublists(input_list):
    sublist = []
    for val, l in itertools.groupby(input_list):
        l = list(l)
        if not sublist or len(l) == 2:
            sublist += l
        else:
            sublist += l
            yield sublist
            sublist = []
    yield sublist

input_list = [1,4,4,5,5,8,8,10,10,25,25,70,70,90,90,100,2,3,3,4,4,5,5,8,8,9,20,21,21,22,23]
for sublist in split_sublists(input_list):
    print(sublist)

Output:

$ python test.py
[1, 4, 4, 5, 5, 8, 8, 10, 10, 25, 25, 70, 70, 90, 90, 100]
[2, 3, 3, 4, 4, 5, 5, 8, 8, 9]
[20, 21, 21, 22]
[23]
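
One thing to be aware of: split_sublists ends with an unconditional yield, so an input whose last element closes a sublist (for example [1, 2, 2, 3]) also produces a trailing empty list. A possible variant, sketched here under the assumption that empty sublists are unwanted, guards the final yield. The name split_sublists_no_empty is made up for this sketch.

```python
import itertools

def split_sublists_no_empty(input_list):
    """Same splitting rule as split_sublists, but never yields an empty list."""
    sublist = []
    for _, run in itertools.groupby(input_list):
        run = list(run)
        starts_sublist = not sublist
        sublist += run
        # A run that is not a pair closes the sublist,
        # unless it is the run that opened it.
        if not starts_sublist and len(run) != 2:
            yield sublist
            sublist = []
    if sublist:  # assumption: an empty trailing sublist is not wanted
        yield sublist

print(list(split_sublists_no_empty([1, 2, 2, 3])))
```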

score 2 · Accepted Answer

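A numpy version:

>>> import numpy as np
>>> inds = np.where(np.diff(x))[0]
>>> # inds[:-1] keeps the boolean mask the same length as the indexed array
>>> out = np.split(x, inds[:-1][np.diff(inds)==1][0::2]+2)
>>> for n in out:
...     print(n)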

[  38 1200 1200  306  306  391  391   82   82   35   35  902  902  955  955
   13]
[955 847 847 835 835 698 698 777 777 896 896 923 923 940 940 569 569  53
  53 411]
[  53 1009 1009 1884]
[1009  878]
[ 923  886  886  511  511  942  942 1067 1067 1888 1888  243  243 1556]

Your new case works the same way:

>>> inds = np.where(np.diff(x))[0]
>>> out = np.split(x, inds[:-1][np.diff(inds)==1][0::2]+2)
>>> for n in out:
...     print(n)
...
[  1   4   4   5   5   8   8  10  10  25  25  70  70  90  90 100]
[2 3 3 4 4 5 5 8 8 9]
[20 21 21 22]
[23]
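
To see what the index arithmetic is doing, here is a rough plain-Python mirror of the same steps. It is only an illustration, not the numpy code itself; the helper names split_points and split_at are invented for this sketch.

```python
def split_points(values):
    # Indices i where values[i] != values[i + 1]
    # (the positions np.where(np.diff(x))[0] would report).
    inds = [i for i in range(len(values) - 1) if values[i] != values[i + 1]]
    # Keep a change index when the next change index follows immediately.
    adjacent = [a for a, b in zip(inds, inds[1:]) if b - a == 1]
    # Every other such index marks a break; +2 is where the next sublist starts.
    return [i + 2 for i in adjacent[0::2]]

def split_at(values, points):
    # Cut values at each point, like np.split does with a list of indices.
    bounds = [0] + points + [len(values)]
    return [values[a:b] for a, b in zip(bounds, bounds[1:])]

x = [1,4,4,5,5,8,8,10,10,25,25,70,70,90,90,100,2,3,3,4,4,5,5,8,8,9,20,21,21,22,23]
points = split_points(x)
parts = split_at(x, points)
print(points)
```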

Starting with x as a list:

%timeit inds = np.where(np.diff(x))[0];out = np.split(x,inds[:-1][np.diff(inds)==1][0::2]+2)
10000 loops, best of 3: 169 µs per loop

If x is a numpy array:

%timeit inds = np.where(np.diff(arr_x))[0];out = np.split(arr_x,inds[:-1][np.diff(inds)==1][0::2]+2)
10000 loops, best of 3: 135 µs per loop

For larger inputs you can expect numpy to perform better than pure Python.

score 1 · Accepted Answer

Here is my ugly solution:

x = [38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13, 955, 847, 847, 835, 83, 5698, 698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53, 411]

def weird_split(alist):
    sublist = []
    for i, n in enumerate(alist[:-1]):
        sublist.append(n)
        # make sure we only create a new list if the current one is not empty
        if len(sublist) > 1 and n != alist[i-1] and n != alist[i+1]:
            yield sublist
            sublist = []
    # always add the last element
    sublist.append(alist[-1])
    yield sublist

for sublist in weird_split(x):
    print(sublist)

And the output:

[38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13]
[955, 847, 847, 835]
[83, 5698]
[698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53, 411]

score 1 · Accepted Answer

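First, you haven't defined the behaviour for [1, 0, 0, 1, 0, 0, 1], so this splits it into [1, 0, 0, 1], [0, 0] and [1].

Second, there are a lot of special cases to get right, so this is longer than you might expect. It would also be shorter if it just put things straight into lists, but generators are a good thing, so I made sure not to do that.

This uses the full iterator interface instead of the yield shortcut, because that allows better sharing of state between the outer and inner iterations without creating a new subsection generator on every iteration. Nested defs with yields might be able to do this in less space, but in this case I think the verbosity is acceptable.

So, the setup: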

class repeating_sections:
    def __init__(self, iterable):
        self.iter = iter(iterable)

        try:
            self._cache = next(self.iter)
            self.finished = False
        except StopIteration:
            self.finished = True

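We need to define the sub-iterator, which yields until it finds a pair that does not match. Because the second item of that pair has already been removed from the iterator, and we need to yield it on the next call to _subsection, it is stored in _cache.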

    def _subsection(self):
        yield self._cache

        try:
            while True:
                item1 = next(self.iter)

                try:
                    item2 = next(self.iter)
                except StopIteration:
                    yield item1
                    raise

                if item1 == item2:
                    yield item1
                    yield item2

                else:
                    yield item1
                    self._cache = item2
                    return

        except StopIteration:
            self.finished = True

__iter__ should return self:

    def __iter__(self):
        return self

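__next__ returns a subsection unless finished. Note that it is important to exhaust each subsection before requesting the next one if the behaviour is to be reliable.

    def __next__(self):
        if self.finished:
            raise StopIteration

        # The caller must exhaust this subsection before calling next() again:
        # self.finished and self._cache are only updated as it is consumed.
        return self._subsection()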

Some tests:

for item in repeating_sections(x):
    print(list(item))
#>>> [38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13]
#>>> [955, 847, 847, 835, 835, 698, 698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53, 411]
#>>> [53, 1009, 1009, 1884]
#>>> [1009, 878]
#>>> [923, 886, 886, 511, 511, 942, 942, 1067, 1067, 1888, 1888, 243, 243, 1556]


for item in repeating_sections([1, 0, 0, 1, 0, 0, 1]):
    print(list(item))
#>>> [1, 0, 0, 1]
#>>> [0, 0]
#>>> [1]
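
For comparison, the same pairing rule can be sketched as a single generator that yields plain lists. This gives up the lazy sub-iterators for the sake of brevity, so it does not have the same memory behaviour as the class above; the name repeating_sections_lists is hypothetical.

```python
def repeating_sections_lists(iterable):
    """Yield each section as a list, using the same pair-by-pair rule."""
    it = iter(iterable)
    try:
        section = [next(it)]  # the first item always opens a section
    except StopIteration:
        return  # empty input: nothing to yield
    for item1 in it:
        section.append(item1)
        try:
            item2 = next(it)
        except StopIteration:
            break  # odd item at the end stays in the current section
        if item1 == item2:
            section.append(item2)
        else:
            # Mismatched pair: item1 closes this section, item2 opens the next.
            yield section
            section = [item2]
    yield section

for section in repeating_sections_lists([1, 0, 0, 1, 0, 0, 1]):
    print(section)
```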

Some timings to show that this is not entirely pointless:

SETUP="
x = [38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955, 13, 955, 847, 847, 835, 83, 5698, 698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53, 411]
x *= 5000

class repeating_sections:
    def __init__(self, iterable):
        self.iter = iter(iterable)

        try:
            self._cache = next(self.iter)
            self.finished = False
        except StopIteration:
            self.finished = True

    def _subsection(self):
        yield self._cache

        try:
            while True:
                item1 = next(self.iter)

                try:
                    item2 = next(self.iter)
                except StopIteration:
                    yield item1
                    raise

                if item1 == item2:
                    yield item1
                    yield item2

                else:
                    yield item1
                    self._cache = item2
                    return

        except StopIteration:
            self.finished = True

    def __iter__(self):
        return self

    def __next__(self):
        if self.finished:
            raise StopIteration

        return self._subsection()


def weird_split(alist):
    sublist = []
    for i, n in enumerate(alist[:-1]):
        sublist.append(n)
        # make sure we only create a new list if the current one is not empty
        if len(sublist) > 1 and n != alist[i-1] and n != alist[i+1]:
            yield sublist
            sublist = []
    # always add the last element
    sublist.append(alist[-1])
    yield sublist
"

python -m timeit -s "$SETUP" "for section in repeating_sections(x):" "    for item in section: pass"
python -m timeit -s "$SETUP" "for section in weird_split(x):"        "    for item in section: pass"

Results:

10 loops, best of 3: 150 msec per loop
10 loops, best of 3: 207 msec per loop

Not a huge difference, but it is faster.

score 1 · Accepted Answer

def group(l,skip=0):
    prevind = 0
    currind = skip+1
    for val in l[currind::2]:
        if val != l[currind-1]:
            if currind-prevind-1 > 1: yield l[prevind:currind-1]
            prevind = currind-1
        currind += 2
    if prevind != currind:
        yield l[prevind:currind]

For the list you defined, when called with skip=1, this returns:

[38, 1200, 1200, 306, 306, 391, 391, 82, 82, 35, 35, 902, 902, 955, 955]
[13, 955, 847, 847, 835, 835, 698, 698, 777, 777, 896, 896, 923, 923, 940, 940, 569, 569, 53, 53]
[411, 53, 1009, 1009]
[1884, 1009]
[878, 923, 886, 886, 511, 511, 942, 942, 1067, 1067, 1888, 1888, 243, 243, 1556]

And for a simpler example list, [1,1,3,3,2,5]:

l2 = [1,1,3,3,2,5]
for g in group(l2):
    print(g)

[1, 1, 3, 3]
[2, 5]

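The reason skip is an optional argument to the function is that in your example 38 was included even though it is not equal to 1200. If that was a mistake, just remove skip and set currind to 1 initially.

Explanation:

Consider a list [a,b,c,d,e,...]. We want to compare elements two at a time, i.e. a == b, c == d, and whenever a comparison does not return True, capture all the preceding elements (excluding those already captured). To do this we need to keep track of where the last capture happened, which is initially 0 (i.e. no capture yet). We then go through each pair by iterating over every second element of the list, starting at currind, which by default (when no elements are skipped) is 1. Each value we get from l[currind::2] is compared with the value before it, l[currind-1]; currind is the index of each of those second elements, starting from its initial value (1 by default). If the values don't match we need to perform a capture, but only if the resulting capture is long enough, hence the check currind-prevind-1 > 1 (the slice l[prevind:currind-1] has length currind-prevind-1, so it is yielded only when it holds 2 or more elements). l[prevind:currind-1] performs this capture, from the index where the last mismatch occurred (or 0 by default) up to, but not including, the first value of the pair being compared (a,b or c,d, etc.). Then prevind is set to currind-1, i.e. the index of the first element not yet captured. Then we increase currind by 2 to move to the index of the next val. Finally, if a pair is left over, we extract it.

So for [1,1,3,3,2,5]: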
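
The pairwise comparison described above can be illustrated in isolation with two stride slices. This is only a sketch of the comparison step, not the group function itself, and the names firsts, seconds, pairs and mismatches are introduced here for illustration.

```python
l = [1, 1, 3, 3, 2, 5]

seconds = l[1::2]  # every second element, starting at index 1
firsts = l[0::2]   # the element just before each of those
pairs = list(zip(firsts, seconds))

# Pair numbers where the two values differ; pair k covers indices 2*k and 2*k + 1.
mismatches = [k for k, (a, b) in enumerate(pairs) if a != b]
print(pairs)
print(mismatches)
```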

val is 1, at index 1. comparing to value at 0 i.e 1
make currind the index of last element of the next pair
val is 3, at index 3. comparing to value at 2 i.e 3
make currind the index of last element of the next pair
val is 5, at index 5. comparing to value at 4 i.e 2
not equal so get slice between 0,4
[1, 1, 3, 3]
make currind the index of last element of the next pair  #happens after the for loop
[2, 5]
