I have some data in a DataFrame with an identifier column.

data = pd.DataFrame({'id' : [50,50,30,10,50,50,30]})

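For each unique id, I want to come up with a new unique identifier. I want the new ids to be consecutive integers starting from 0. Here is what I have so far: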

unique = data[['id']].drop_duplicates()   
unique['group'] = np.arange(len(unique))
unique.set_index('id')
data = data.merge(unique, 'inner', on = 'id')

This works, but it feels a bit hacky. Is there a better way?

1 Answer

That is what pandas.factorize does:

import pandas as pd
data = pd.DataFrame({'id' : [50,50,30,10,50,50,30]})
print(pd.factorize(data.id)[0])

The output:

[0 0 1 2 0 0 1]
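
To attach these codes to the original frame, which is what the question asks for, something like the following sketch should work. The column name `group` is taken from the question. The second value returned by factorize holds the distinct ids in order of first appearance.

```python
import pandas as pd

data = pd.DataFrame({'id': [50, 50, 30, 10, 50, 50, 30]})

# factorize returns (codes, uniques); codes are numbered
# in the order each id first appears in the column
codes, uniques = pd.factorize(data['id'])
data['group'] = codes

print(data['group'].tolist())  # [0, 0, 1, 2, 0, 0, 1]
print(uniques.tolist())        # [50, 30, 10]
```

This replaces the drop_duplicates / merge steps from the question with a single assignment, and `uniques[code]` maps a new id back to the original one.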

numpy.unique can also do this:

import numpy as np
print(np.unique([50,50,30,10,50,50,30], return_inverse=True)[1])

The output:

[2 2 1 0 2 2 1]

The indices returned by numpy.unique follow the sorted order of the values, so the smallest value, 10, is assigned index 0. If you want the same result from factorize, set the sort argument to True:

pd.factorize(data.id, sort=True)[0]
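
As a quick check of that claim, here is a small sketch that compares the two approaches on the same ids. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

ids = [50, 50, 30, 10, 50, 50, 30]

# sort=True numbers the ids by sorted value instead of first appearance
sorted_codes = pd.factorize(pd.Series(ids), sort=True)[0]

# return_inverse=True gives, for each element, its index in the sorted uniques
inverse = np.unique(ids, return_inverse=True)[1]

print(sorted_codes.tolist())  # [2, 2, 1, 0, 2, 2, 1]
print(inverse.tolist())       # [2, 2, 1, 0, 2, 2, 1]
```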
answered 2013-03-13T03:24:22.430