python - What is the equivalent of "nth_element" in Python?

Question

I want to implement a Vantage Point Tree in Python, but it relies on std::nth_element in C++.

So I am looking for an equivalent of nth_element in Python or numpy.

Note that nth_element only partially sorts the array, and it runs in O(N).

int the_array[10] = {4,5,7,3,6,0,1,2,9,8};
std::vector<int> the_v(the_array,the_array+10);
std::nth_element (the_v.begin()+0, the_v.begin()+5, the_v.begin()+10);

After the call, the vector might be:

3,0,2,1,4,5,6,7,9,8

I do not only want the nth element; I also want the two parts of the list rearranged around it, [3,0,2,1,4] and [6,7,9,8].
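
For comparison, here is a rough Python sketch of the same call, assuming numpy.partition is an acceptable stand-in for std::nth_element. The helper name nth_element_copy is hypothetical, and unlike the C++ call it returns a partitioned copy rather than rearranging the input. The order inside each half is unspecified, so it will not necessarily match the C++ output above.

```python
import numpy as np

def nth_element_copy(values, n):
    """Return a partitioned copy: index n holds the n-th smallest value,
    smaller-or-equal values come before it, larger-or-equal values after."""
    return np.partition(np.asarray(values), n)

the_array = np.array([4, 5, 7, 3, 6, 0, 1, 2, 9, 8])
n = 5
result = nth_element_copy(the_array, n)

nth = result[n]                            # the n-th smallest element
left, right = result[:n], result[n + 1:]   # the two halves, each unsorted
```

Slicing result gives views, so splitting into left and right costs no extra copies.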

nth_element also accepts a function that compares two elements. For example, in the code below the vector is a vector of DataPoint, and DistanceComparator compares two points by their distance to the_v.begin():

vector<DataPoint> the_v;
// push_back instead of the_v[n] = ..., because the vector starts out empty
for(int n = 0; n < N; n++) the_v.push_back(DataPoint(D, n, X + n * D));
std::nth_element (the_v.begin()+0, the_v.begin()+5, the_v.begin()+10,
    DistanceComparator(the_v.begin()));
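
One way to approximate this comparator in numpy, sketched under the assumption that the points are rows of a 2-D array and that plain Euclidean distance is wanted: compute every distance to the vantage point first, then use numpy.argpartition on those distances. The function name partition_by_distance and the random sample data are hypothetical.

```python
import numpy as np

def partition_by_distance(points, vantage, n):
    """Reorder `points` so the n-th closest point to `vantage` lands at index n."""
    distances = np.linalg.norm(points - vantage, axis=1)  # one distance per row
    order = np.argpartition(distances, n)                 # indices, partially sorted
    return points[order], distances[order]

rng = np.random.default_rng(0)
points = rng.random((10, 3))   # 10 sample points in 3 dimensions
vantage = points[0]            # use the first point as the vantage point
reordered, dists = partition_by_distance(points, vantage, 5)
```

Indexing with order creates new arrays, so this sketch trades memory for staying inside numpy's compiled code.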

Edit:

I followed bhuvan-venkatesh's answer and wrote some code to test it.

import timeit

partition_timer = timeit.Timer("numpy.partition(a, 10000)",
    "import numpy;numpy.random.seed(2);"+
    "a = numpy.random.rand(10000000)")
print(partition_timer.timeit(10))

sort_timer = timeit.Timer("numpy.sort(a)",
    "import numpy;numpy.random.seed(2);"+
    "a = numpy.random.rand(10000000)")
print(sort_timer.timeit(10))

sorted_timer = timeit.Timer("sorted(a)",
    "import numpy;numpy.random.seed(2);"+
    "a = numpy.random.rand(10000000)")
print(sorted_timer.timeit(10))

Results, in seconds, in the same order:

2.2217168808
17.0386350155
281.301710844

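Next, I will run more tests against the C++ code.

There is one problem, though: numpy.partition always returns a new array, which wastes a lot of memory when my array is large. How should I deal with that? Or do I simply have to write a C++ extension for Python?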

Edit 2:

@bhuvan-venkatesh Thanks for recommending the partition function.

I use partition like this:

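import numpy

# `profile` is injected by memory_profiler when run with `python -m memory_profiler`
@profile
def for_numpy():
    numpy.random.seed(2)
    a = numpy.random.rand(int(1e7))  # rand() needs integer dimensions
    for i in range(100):
        # ndarray.partition rearranges `a` in place instead of returning a copy
        a.partition(numpy.random.randint(int(1e6)))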

if __name__ == '__main__':
    for_numpy()

and run the profiler like this:

python -m memory_profiler profiler_test.py

The result is:

Line #    Mem usage    Increment   Line Contents
================================================
    25   23.613 MiB    0.000 MiB   @profile
    26                             def for_numpy():
    27   23.613 MiB    0.000 MiB       numpy.random.seed(2)
    28   99.934 MiB   76.320 MiB       a = numpy.random.rand(1e7)
    29  100.004 MiB    0.070 MiB       for i in range(100):
    30  100.004 MiB    0.000 MiB           a.partition(numpy.random.randint(1e6))

Unlike numpy.partition(a, 3), it does not copy the whole array.

Conclusion: numpy.ndarray.partition is what I was looking for.
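
A small sketch of the difference, using a much smaller array than the profiling run and numpy.shares_memory to check whether a copy was made. The variable names here are only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
a = rng.random(1_000)
k = 100

# numpy.partition leaves `a` alone and returns a new, partitioned array.
copy = np.partition(a, k)
copied = not np.shares_memory(a, copy)

# ndarray.partition rearranges `a` itself and returns None.
expected_kth = np.sort(a)[k]
a.partition(k)
```

After the in-place call, a[k] holds the k-th smallest value, and no second array of the same size was allocated for the result.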

score 1 · Accepted Answer

http://docs.scipy.org/doc/numpy/reference/generated/numpy.partition.html

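Just be aware that numpy.partition returns a new, partitioned copy of the array instead of rearranging it in place, so calling it repeatedly creates a lot of new arrays quickly. They are more efficient than Python lists, but this does not do exactly the same thing as the C++ version.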

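If you want the exact element and the two halves, you can read them straight off the partitioned array, because the k-th smallest element ends up at index k. The whole thing is still O(n):

import numpy as np

array = np.array(...)
k = 5
partition = np.partition(array, k)  # O(n), returns a partitioned copy
element = partition[k]  # the k-th smallest element sits at index k
left, right = partition[:k], partition[k+1:]  # views, no extra copy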

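So your new code is slower, but that is the python-y way.

Edit:

So you need a comparator? There is no way to do that in pure numpy, for example as a key argument, other than writing your own small function: every numpy operation is implemented in highly optimized C code, so passing in a Python function or lambda would force numpy to drop to the object level and evaluate it every time.

numpy.vectorize does go to the object level, but in the end you will have to write your own code; Rosetta Code has an implementation if you want to build a more "optimized algorithm". (I put that in quotes because with Python objects you will still be much slower than C or numpy code, due to the object-level access.) If speed is your real concern but you want Python's readability, consider writing an extension with Cython.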
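
As an illustration of what "writing your own small function" could look like, here is a minimal pure-Python quickselect sketch that accepts a key function. It is not the Rosetta Code version and not a numpy API; the name nth_element and its signature are assumptions made for this example. It runs in O(n) on average, but every comparison goes through Python objects, so expect it to be far slower than numpy.partition.

```python
import random

def nth_element(items, n, key=lambda x: x):
    """Rearrange `items` in place so items[n] is the n-th smallest by `key`,
    with smaller-or-equal items before it and larger-or-equal items after."""
    lo, hi = 0, len(items) - 1
    while lo < hi:
        pivot = key(items[random.randint(lo, hi)])  # random pivot from the range
        i, j = lo, hi
        while i <= j:                               # Hoare-style partition
            while key(items[i]) < pivot:
                i += 1
            while key(items[j]) > pivot:
                j -= 1
            if i <= j:
                items[i], items[j] = items[j], items[i]
                i += 1
                j -= 1
        if n <= j:        # target index is in the left part
            hi = j
        elif n >= i:      # target index is in the right part
            lo = i
        else:             # target sits between the parts, already in place
            return

def squared_distance(p):
    """Squared distance of a 2-D point from the origin."""
    return p[0] ** 2 + p[1] ** 2

points = [(3, 4), (1, 1), (0, 5), (2, 2), (6, 0), (1, 0)]
nth_element(points, 2, key=squared_distance)

data = [4, 5, 7, 3, 6, 0, 1, 2, 9, 8]
nth_element(data, 5)
```

Because the list is rearranged in place, no second list is allocated, which matches the behaviour of std::nth_element more closely than numpy.partition does.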
