2

I'm working with a dataset in which some values need to be rounded to a lower or upper bound.

For example, if I want the upper bound to be 9 and the lower bound to be 3, and we have numbers like these:

[ 7.453511737983394, 
  8.10917072790058, 
  6.2377799380575, 
  5.225853201122676, 
  4.067932296134156 ]

we would want each entry in the list rounded to either 3 or 9, like this:

[ 9, 
  9, 
  9, 
  3, 
  3 ]

I know we can do this the good old way: iterate over the array, compute the difference to each bound, and take the closest one.

My approach code:

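for i, value in enumerate(the_list):
    three = abs(3 - value)  # distance to the lower bound
    nine = abs(9 - value)   # distance to the upper bound

    # store the nearer bound itself, not the distance to it
    if three < nine:
        the_list[i] = 3
    else:
        the_list[i] = 9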

I was wondering whether there is a quick and dirty way built into Python, something like:

hey_bound = round_the_num(number, bound_1, bound_2) 

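I know we can use my-approach-code, but I'm pretty sure this has already been implemented in a better way. I tried to find it but couldn't, so here we are.

Any guesses or direct links to a solution would be amazing.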

4

8 Answers

3

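Edit:
So far, I think the best approach is to use numpy (to avoid "manual" loops) and simply compute the arrays of differences between the_list and the two bounds (so no expensive multiplication here), then conditionally add one or the other, depending on which has the smaller absolute value: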

import numpy as np

the_list = np.array([ 7.453511737983394,
8.10917072790058, 
6.2377799380575, 
5.225853201122676, 
4.067932296134156 ])

dhi = 9 - the_list
dlo = 3 - the_list
idx = dhi + dlo < 0
the_rounded = the_list + np.where(idx, dhi, dlo)
# array([9., 9., 9., 3., 3.])
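
The condition `dhi + dlo < 0` is the midpoint test in disguise: `dhi + dlo` equals `hi + lo - 2*x`, which is negative exactly when `x > (hi + lo) / 2`. As a minimal sketch, the snippet above could be wrapped in a function that takes arbitrary bounds. The name `snap_to_bounds` and its parameters are hypothetical, not part of numpy:

```python
import numpy as np

def snap_to_bounds(values, lo, hi):
    """Hypothetical helper: move each value to the nearer of lo and hi."""
    arr = np.asarray(values, dtype=float)
    dhi = hi - arr  # signed distance up to the upper bound
    dlo = lo - arr  # signed distance down to the lower bound
    # dhi + dlo == hi + lo - 2*arr, negative exactly when arr > midpoint
    closer_to_hi = dhi + dlo < 0
    return arr + np.where(closer_to_hi, dhi, dlo)

the_list = [7.453511737983394, 8.10917072790058, 6.2377799380575,
            5.225853201122676, 4.067932296134156]
result = snap_to_bounds(the_list, 3, 9)
print(result)
```

A value exactly at the midpoint goes to the lower bound here, because the comparison is strict.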

I would apply the round function to the offset-free, normalized list, then scale the result back up and add the offset:

import numpy as np

the_list = np.array([ 7.453511737983394,
8.10917072790058, 
6.2377799380575, 
5.225853201122676, 
4.067932296134156 ])

hi = 9
lo = 3
dlt = hi - lo

the_rounded = np.round((the_list - lo)/dlt) * dlt + lo

# [9. 9. 9. 3. 3.]
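
One caveat of the rounding variant: a value outside [lo, hi] normalizes to something outside [0, 1], so it can round to a multiple of dlt beyond the bounds (12 becomes 15 below). A sketch of one way around that, assuming out-of-range inputs should snap to the nearest bound, is to clip first with `np.clip`. The sample values are made up for illustration:

```python
import numpy as np

hi = 9
lo = 3
dlt = hi - lo

# Illustrative values, two of them outside [lo, hi]
values = np.array([12.0, 7.4, 4.1, 0.0])

# Without clipping, 12.0 normalizes to 1.5 and 0.0 to -0.5
unclipped = np.round((values - lo) / dlt) * dlt + lo

# Clipping first keeps the normalized values inside [0, 1]
clipped = np.round((np.clip(values, lo, hi) - lo) / dlt) * dlt + lo

print(unclipped)
print(clipped)
```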
answered 2018-12-11T16:23:55.417
3

Timing comparison of the available answers

[Plot: seconds to run versus length of the list for each answer, both axes logarithmic]

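My interpretation:
From a performance point of view, you should go with Abhishek Patel or Carles Mitjans for smaller lists.
For lists containing a couple of dozen values or more, converting to a numpy array and then conditionally adding the difference with the smaller absolute value seems to be the fastest solution.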


Code used for the timing comparison:

import timeit
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

plt.style.use('ggplot')

rep = 5

timings = dict()

for n in range(7):
    print(f'N = 10^{n}')

    N = 10**n
    setup = f'''import numpy as np\nthe_list = np.random.random({N})*6+3\nhi = 9\nlo = 3\ndlt = hi - lo\nmid = (hi + lo) / 2\ndef return_the_num(l, lst, h):\n    return [l if abs(l-x) < abs(h-x) else h for x in lst]'''

    fct = 'np.round((the_list - lo)/dlt) * dlt + lo'
    t = timeit.Timer(fct, setup=setup)
    timings['SpghttCd_np'] = timings.get('SpghttCd_np', []) + [np.min(t.repeat(repeat=rep, number=1))]

    fct = 'return_the_num(3, the_list, 9)'
    t = timeit.Timer(fct, setup=setup)
    timings['Austin'] = timings.get('Austin', []) + [np.min(t.repeat(repeat=rep, number=1))]

    fct = '[(lo, hi)[mid < v] for v in the_list]'
    t = timeit.Timer(fct, setup=setup)
    timings['SpghttCd_lc'] = timings.get('SpghttCd_lc', []) + [np.min(t.repeat(repeat=rep, number=1))]

    setup += '\nround_the_num = lambda list, upper, lower: [upper if x > (upper + lower) / 2 else lower for x in list]'
    fct = 'round_the_num(the_list, 9, 3)'
    t = timeit.Timer(fct, setup=setup)
    timings['Carles Mitjans'] = timings.get('Carles Mitjans', []) + [np.min(t.repeat(repeat=rep, number=1))]

    setup += '\nupper_lower_bound_list=[3,9]'
    fct = '[min(upper_lower_bound_list, key=lambda x:abs(x-myNumber)) for myNumber in the_list]'
    t = timeit.Timer(fct, setup=setup)
    timings['mad_'] = timings.get('mad_', []) + [np.min(t.repeat(repeat=rep, number=1))]

    setup += '\ndef return_bound(x, l, h):\n    low = abs(x - l)\n    high = abs(x - h)\n    if low < high:\n        return l\n    else:\n        return h'
    fct = '[return_bound(x, 3, 9) for x in the_list]'
    t = timeit.Timer(fct, setup=setup)
    timings["Scratch'N'Purr"] = timings.get("Scratch'N'Purr", []) + [np.min(t.repeat(repeat=rep, number=1))]

    setup += '\ndef round_the_list(list, bound_1, bound_2):\n\tmid = (bound_1+bound_2)/2\n\tfor i in range(len(list)):\n\t\tif list[i] > mid:\n\t\t\tlist[i] = bound_2\n\t\telse:\n\t\t\tlist[i] = bound_1'
    fct = 'round_the_list(the_list, 3, 9)'
    t = timeit.Timer(fct, setup=setup)
    timings["Abhishek Patel"] = timings.get("Abhishek Patel", []) + [np.min(t.repeat(repeat=rep, number=1))]

    fct = 'dhi = 9 - the_list\ndlo = 3 - the_list\nidx = dhi + dlo < 0\nthe_list + np.where(idx, dhi, dlo)'
    t = timeit.Timer(fct, setup=setup)
    timings["SpghttCd_where"] = timings.get("SpghttCd_where", []) + [np.min(t.repeat(repeat=rep, number=1))]

print('done')

df = pd.DataFrame(timings, 10**np.arange(n+1))
ax = df.plot(logx=True, logy=True)
ax.set_xlabel('length of the list')
ax.set_ylabel('seconds to run')
ax.get_lines()[-1].set_c('g')
plt.legend()
print(df)
answered 2018-12-12T09:12:39.543
1

You can generalize this by finding the midpoint and checking which side of it each number in the list falls on:

def round_the_list(list, bound_1, bound_2):
    mid = (bound_1 + bound_2) / 2
    for i in range(len(list)):
        if list[i] > mid:         # or >= depending on your rounding decision
            list[i] = bound_2
        else:
            list[i] = bound_1
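
Note that the function above modifies the list in place and returns nothing. A short usage sketch with the numbers from the question; the parameter is renamed to `values` here only to avoid shadowing the built-in `list`:

```python
def round_the_list(values, bound_1, bound_2):
    mid = (bound_1 + bound_2) / 2
    for i in range(len(values)):
        if values[i] > mid:
            values[i] = bound_2
        else:
            values[i] = bound_1

the_list = [7.453511737983394, 8.10917072790058, 6.2377799380575,
            5.225853201122676, 4.067932296134156]

# Lower bound first, then upper bound
returned = round_the_list(the_list, 3, 9)
print(returned)  # None
print(the_list)  # [9, 9, 9, 3, 3]
```

If the original values are still needed afterwards, pass a copy such as `list(the_list)`.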
answered 2018-12-11T16:19:29.767
1

Maybe you could write a function and use it in a list comprehension.

def return_bound(x, l, h):
    low = abs(x - l)
    high = abs(x - h)
    if low < high:
        return l
    else:
        return h

Test:

>>> mylist = [7.453511737983394, 8.10917072790058, 6.2377799380575, 5.225853201122676, 4.067932296134156]
>>> [return_bound(x, 3, 9) for x in mylist]
[9, 9, 9, 3, 3]
answered 2018-12-11T16:22:54.390
1

A one-line list comprehension using the built-in min function, with its key argument set to compute the absolute difference:

upper_lower_bound_list=[3,9]
myNumberlist=[ 7.453511737983394, 
8.10917072790058, 
6.2377799380575, 
5.225853201122676, 
4.067932296134156 ]

List comprehension

[min(upper_lower_bound_list, key=lambda x:abs(x-myNumber)) for myNumber in myNumberlist]

Output

[9, 9, 9, 3, 3]
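
Because `min` with a `key` simply picks the candidate with the smallest absolute difference, the same expression also works for more than two target values. A minimal sketch, where the helper name `nearest` and the three-value candidate list are illustrative assumptions:

```python
def nearest(number, candidates):
    """Hypothetical helper: return the candidate closest to number."""
    # On a tie, min returns the first minimal candidate in the sequence
    return min(candidates, key=lambda c: abs(c - number))

numbers = [7.453511737983394, 8.10917072790058, 6.2377799380575,
           5.225853201122676, 4.067932296134156]

two_bounds = [nearest(n, [3, 9]) for n in numbers]
three_bounds = [nearest(n, [3, 6, 9]) for n in numbers]
print(two_bounds)    # [9, 9, 9, 3, 3]
print(three_bounds)  # [6, 9, 6, 6, 3]
```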
answered 2018-12-11T16:27:06.340
1

Another option, using a list comprehension and a lambda function:

round_the_num = lambda list, upper, lower: [upper if x > (upper + lower) / 2 else lower for x in list]

round_the_num(l, 9, 3)  # l is the list of numbers to round
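
A usage sketch with the numbers from the question, assuming `l` in the call above stands for that input list. Note the argument order: the upper bound comes before the lower one. The lambda is written as a `def` here, with the hypothetical parameter name `values`, to avoid shadowing the built-in `list`:

```python
def round_the_num(values, upper, lower):
    mid = (upper + lower) / 2
    # Values exactly at the midpoint go to the lower bound
    return [upper if x > mid else lower for x in values]

numbers = [7.453511737983394, 8.10917072790058, 6.2377799380575,
           5.225853201122676, 4.067932296134156]

rounded = round_the_num(numbers, 9, 3)
print(rounded)  # [9, 9, 9, 3, 3]
```

Unlike the in-place loop, this returns a new list and leaves the input untouched.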
answered 2018-12-11T16:27:20.683
1

You can write a custom function that performs a list comprehension, for example:

lst = [ 7.453511737983394, 
  8.10917072790058, 
  6.2377799380575, 
  5.225853201122676, 
  4.067932296134156 ]

def return_the_num(l, lst, h): 
    return [l if abs(l-x) < abs(h-x) else h for x in lst]

print(return_the_num(3, lst, 9))
# [9, 9, 9, 3, 3]
answered 2018-12-11T16:27:23.937
1

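I really like @AbhishekPatel's idea of comparing against the midpoint. But I would put it into a list comprehension (LC), using the result of the comparison as an index into a tuple of the bounds: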

the_list = [ 7.453511737983394,
8.10917072790058, 
6.2377799380575, 
5.225853201122676, 
4.067932296134156 ]

hi = 9
lo = 3
mid = (hi + lo) / 2

[(lo, hi)[mid < v] for v in the_list]
# [9, 9, 9, 3, 3]

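...but this is more than 15 times slower than the numpy approach.
However, this version can handle numbers greater than hi or less than lo.
...but then again, the slowdown only holds for a list of 100000 entries. For the original list the OP posted, the two variants are very close...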
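
The indexing works because `bool` is a subclass of `int` in Python, so `mid < v` evaluates to `False` (index 0, which is `lo`) or `True` (index 1, which is `hi`). A small sketch, with out-of-range sample values added for illustration:

```python
hi = 9
lo = 3
mid = (hi + lo) / 2

# Illustrative samples, including values above hi and below lo
samples = [12.0, 7.45, 6.0, 4.07, -1.0]

snapped = [(lo, hi)[mid < v] for v in samples]
print(snapped)  # [9, 9, 3, 3, 3]
```

A value exactly at the midpoint (6.0 here) maps to `lo`, since `mid < v` is then `False`.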

answered 2018-12-11T18:59:42.217