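I have a parking lot holding cars of different models (nr), and the cars are packed so tightly that letting one car out may require moving several others. It is a bit like the 15 Puzzle, except that I can take one or more cars out of the lot. Ordered_car_List contains the cars that will be picked up today; they need to be brought out of the lot while moving as few non-ordered cars as possible. The pandas DataFrame has more columns, but this is the part I can't figure out. I have a program that works for small datasets, but it doesn't seem to be the pandas way :-)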

I have this:

import pandas as pd

cars = pd.DataFrame({'x': [1,1,1,1,1,2,2,2,2],
                    'y': [1,2,3,4,5,1,2,3,4],
                   'order_number':[6,6,7,6,7,9,9,10,12]})
cars['order_number_no_dublicates_down'] = None
Ordered_car_List = [6,9,9,10,28]

# Walk the rows top to bottom; each entry in Ordered_car_List is consumed by at most one row
i=0
while i < len(cars):
    temp_val = cars.at[i, 'order_number']
    if temp_val in Ordered_car_List:
        cars.at[i, 'order_number_no_dublicates_down']  = temp_val
        Ordered_car_List.remove(temp_val)
    i+=1

If I use cars.apply(lambda ..., how can I change Ordered_car_List on each iteration? Is there another approach I could take?

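I found this page, which made me want to go faster. The lambda approach is mid-range in terms of speed, but it is still much faster than what I am doing now. https://towardsdatascience.com/how-to-make-your-pandas-loop-71-803-times-faster-805030df4f06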

1 Answer

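Updating cars

We can vectorize this based on two counters:

  1. cumcount() cumulatively counts each unique value in cars['order_number']
  2. collections.Counter() counts each unique value in Ordered_car_List

from collections import Counter

cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))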

#    order_number  cumcount  maxcount
# 0             6         1         1
# 1             6         2         1
# 2             7         1         0
# 3             6         3         1
# 4             7         2         0
# 5             9         1         2
# 6             9         2         2
# 7            10         1         1
# 8            12         1         0

Then we only want to keep cars['order_number'] where cumcount <= maxcount:

  • Either use DataFrame.loc[]

    cars.loc[cumcount <= maxcount, 'nodup'] = cars['order_number']
    
  • Or Series.where()

    cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)
    
  • Or Series.mask() with the condition inverted

    cars['nodup'] = cars['order_number'].mask(cumcount > maxcount)
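
To see that the three options are interchangeable, here is a minimal self-contained sketch that builds both counters on the question's data and compares the results. The names ordered, keep, via_where, via_mask and via_loc are introduced here only for illustration:

```python
from collections import Counter

import pandas as pd

cars = pd.DataFrame({'order_number': [6, 6, 7, 6, 7, 9, 9, 10, 12]})
ordered = [6, 9, 9, 10, 28]  # same values as Ordered_car_List in the question

# 1-based running count of each order_number, top to bottom
cumcount = cars.groupby('order_number').cumcount().add(1)
# How many times each order_number was ordered (Counter yields 0 for missing keys)
maxcount = cars['order_number'].map(Counter(ordered))

keep = cumcount <= maxcount

via_where = cars['order_number'].where(keep)   # keep where True, NaN elsewhere
via_mask = cars['order_number'].mask(~keep)    # blank out where the inverse is True

via_loc = cars.copy()
via_loc.loc[keep, 'nodup'] = via_loc['order_number']

print(via_where.equals(via_mask))
print(via_loc)
```

All three forms select the same rows (index 0, 5, 6 and 7 here), so the choice is mostly a matter of readability.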
    

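Updating Ordered_car_List

The final Ordered_car_List is a Counter() difference:

Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']

# [6, 9, 9, 10]
# .elements() keeps repeats; list() on a Counter would return only its unique keys
Ordered_car_List = list((Counter(Ordered_car_List) - Counter(Used_car_List)).elements())

# [28]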
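
Counter subtraction keeps only the keys whose count stays positive, and how the result is turned back into a list decides whether repeats survive. A minimal sketch with a hypothetical order list in which 28 is ordered twice (this is not the data from the question):

```python
from collections import Counter

ordered = [6, 9, 9, 10, 28, 28]  # hypothetical: 28 appears twice
used = [6, 9, 9, 10]

# Subtraction drops keys whose count falls to zero or below
left = Counter(ordered) - Counter(used)

unique_left = list(left)            # unique keys only: [28]
full_left = list(left.elements())   # keys repeated by count: [28, 28]

print(unique_left, full_left)
```

If the same order number can still be outstanding more than once, elements() is the form that keeps every remaining copy.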

Final output

cumcount = cars.groupby('order_number').cumcount().add(1)
maxcount = cars['order_number'].map(Counter(Ordered_car_List))
cars['nodup'] = cars['order_number'].where(cumcount <= maxcount)

#    x  y  order_number  nodup
# 0  1  1             6    6.0
# 1  1  2             6    NaN
# 2  1  3             7    NaN
# 3  1  4             6    NaN
# 4  1  5             7    NaN
# 5  2  1             9    9.0
# 6  2  2             9    9.0
# 7  2  3            10   10.0
# 8  2  4            12    NaN
Used_car_List = cars.loc[cumcount <= maxcount, 'order_number']
Ordered_car_List = list((Counter(Ordered_car_List) - Counter(Used_car_List)).elements())

# [28]

Timings

Note that for small data your loop is still very fast, but the vectorized counter approach scales much better:

Figure: timings of loop vs Series.where vs Series.mask vs df.loc
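
A rough way to reproduce such a comparison yourself is sketched below. The function names loop_version and vectorized_version, the synthetic data, and the sizes are all assumptions made for illustration; actual timings depend on the machine and on the pandas version, so no numbers are claimed here:

```python
import timeit
from collections import Counter

import numpy as np
import pandas as pd


def loop_version(cars, ordered):
    # Row-by-row approach from the question, working on copies
    remaining = list(ordered)
    out = cars.copy()
    out['nodup'] = None
    for i in range(len(out)):
        value = out.at[i, 'order_number']
        if value in remaining:
            out.at[i, 'nodup'] = value
            remaining.remove(value)
    return out


def vectorized_version(cars, ordered):
    # Counter-based approach using Series.where
    out = cars.copy()
    cumcount = out.groupby('order_number').cumcount().add(1)
    maxcount = out['order_number'].map(Counter(ordered))
    out['nodup'] = out['order_number'].where(cumcount <= maxcount)
    return out


# Synthetic data (hypothetical sizes); increase n to see how each approach scales
rng = np.random.default_rng(0)
n = 2_000
cars = pd.DataFrame({'order_number': rng.integers(0, 200, size=n)})
ordered = rng.integers(0, 250, size=300).tolist()

t_loop = timeit.timeit(lambda: loop_version(cars, ordered), number=3)
t_vec = timeit.timeit(lambda: vectorized_version(cars, ordered), number=3)
print(f"loop: {t_loop:.4f}s  vectorized: {t_vec:.4f}s")
```

Both functions mark the same rows, so the comparison measures only how the work is done, not what is computed.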

answered 2021-07-07T15:40:51.180