python - Automatically rename columns to ensure they are unique

Question

I have imported a spreadsheet into a DataFrame named df.

Let's take an example:

import numpy as np
import pandas as pd

df=pd.DataFrame({'a': np.random.rand(10), 'b': np.random.rand(10)})
df.columns=['a','a']

          a         a
0  0.973858  0.036459
1  0.835112  0.947461
2  0.520322  0.593110
3  0.480624  0.047711
4  0.643448  0.104433
5  0.961639  0.840359
6  0.848124  0.437380
7  0.579651  0.257770
8  0.919173  0.785614
9  0.505613  0.362737

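When I run df.columns.is_unique, I get False.

I would like to automatically rename the second column "a" to "a_2" (or something similar).

I am not looking for a solution like df.columns=['a','a_2'].

I am looking for a solution that works for multiple columns!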
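
As a side note on the is_unique check above, the names that actually collide can be listed with Index.duplicated. This is only a minimal sketch; the small frame and its values are made up for illustration and are not the data from the spreadsheet.

```python
import pandas as pd

# Illustrative frame with two columns that share the name 'a'
df = pd.DataFrame([[0.1, 0.2], [0.3, 0.4]], columns=['a', 'a'])

# is_unique is False as soon as any column name repeats
print(df.columns.is_unique)

# duplicated() flags every repeat after the first occurrence of a name
dupes = df.columns[df.columns.duplicated()].unique().tolist()
print(dupes)
```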

score 12 · Accepted Answer

You can uniquify the columns manually:

df_columns = ['a', 'b', 'a', 'a_2', 'a_2', 'a', 'a_2', 'a_2_2']

def uniquify(df_columns):
    seen = set()

    for item in df_columns:
        fudge = 1
        newitem = item

        while newitem in seen:
            fudge += 1
            newitem = "{}_{}".format(item, fudge)

        yield newitem
        seen.add(newitem)

list(uniquify(df_columns))
#>>> ['a', 'b', 'a_2', 'a_2_2', 'a_2_3', 'a_3', 'a_2_4', 'a_2_2_2']
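
To apply the generator above to an actual DataFrame, its result can be assigned back to df.columns. The sketch below repeats the function so that it runs on its own; the frame and its values are assumptions chosen for illustration.

```python
import pandas as pd

def uniquify(names):
    """Yield each name, appending _2, _3, ... until it has not been seen yet."""
    seen = set()
    for item in names:
        fudge = 1
        newitem = item
        while newitem in seen:
            fudge += 1
            newitem = "{}_{}".format(item, fudge)
        yield newitem
        seen.add(newitem)

# Illustrative frame with repeated column names
df = pd.DataFrame([[1, 2, 3, 4]], columns=['a', 'b', 'a', 'a'])

# Replace the column labels with the de-duplicated ones
df.columns = list(uniquify(df.columns))
print(df.columns.tolist())
```

Because each candidate is checked against everything already emitted, a generated name such as a_2 cannot clash with a column that was literally called a_2 earlier in the list.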

score 4 · Accepted Answer

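I have imported a spreadsheet into a Python DataFrame named df... I would like to automatically rename [duplicate] column [names].

Pandas does this for you automatically, without you having to do anything...

test.xls: [image]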

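import pandas as pd
import numpy as np

# pd.read_excel is the public entry point; duplicate header names
# are given .1, .2, ... suffixes while the sheet is parsed
df = pd.read_excel(
    "./test.xls",
    sheet_name="Sheet1",
    header=0,
    index_col=0,
)
print(df)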

--output:--
        a    b   c  b.1  a.1  a.2
index                            
0      10  100 -10 -100   10   21
1      20  200 -20 -200   11   22
2      30  300 -30 -300   12   23
3      40  400 -40 -400   13   24
4      50  500 -50 -500   14   25
5      60  600 -60 -600   15   26


print(df.columns.is_unique)

--output:--
True
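
The same automatic renaming can be observed without a spreadsheet file, since read_csv also adds .1, .2, ... suffixes to repeated header names. A minimal sketch, assuming an in-memory CSV whose contents are made up to resemble the sheet above:

```python
import io
import pandas as pd

# In-memory CSV whose header repeats 'a' and 'b' (illustrative data)
csv_text = "a,b,c,b,a,a\n10,100,-10,-100,10,21\n20,200,-20,-200,11,22\n"

# The parser de-duplicates the header names while reading
df = pd.read_csv(io.StringIO(csv_text))
print(df.columns.tolist())
print(df.columns.is_unique)
```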

If, for some reason, you end up with a DataFrame that has duplicate columns, you can do this:

import pandas as pd
import numpy as np
from collections import defaultdict 

df = pd.DataFrame(
    {
        'k': np.random.rand(10),
        'l': np.random.rand(10), 
        'm': np.random.rand(10),
        'n': np.random.rand(10),
        'o': np.random.rand(10),
        'p': np.random.rand(10),
    }
)

print(df)

--output:--
         k         l         m         n         o         p
0  0.566150  0.025225  0.744377  0.222350  0.800402  0.449897
1  0.701286  0.182459  0.661226  0.991143  0.793382  0.980042
2  0.383213  0.977222  0.404271  0.050061  0.839817  0.779233
3  0.428601  0.303425  0.144961  0.313716  0.244979  0.487191
4  0.187289  0.537962  0.669240  0.096126  0.242258  0.645199
5  0.508956  0.904390  0.838986  0.315681  0.359415  0.830092
6  0.007256  0.136114  0.775670  0.665000  0.840027  0.991058
7  0.719344  0.072410  0.378754  0.527760  0.205777  0.870234
8  0.255007  0.098893  0.079230  0.225225  0.490689  0.554835
9  0.481340  0.300319  0.649762  0.460897  0.488406  0.166047


df.columns = ['a', 'b', 'c', 'b', 'a', 'a']
print(df)

--output:--
          a         b         c         b         a         a
0  0.566150  0.025225  0.744377  0.222350  0.800402  0.449897
1  0.701286  0.182459  0.661226  0.991143  0.793382  0.980042
2  0.383213  0.977222  0.404271  0.050061  0.839817  0.779233
3  0.428601  0.303425  0.144961  0.313716  0.244979  0.487191
4  0.187289  0.537962  0.669240  0.096126  0.242258  0.645199
5  0.508956  0.904390  0.838986  0.315681  0.359415  0.830092
6  0.007256  0.136114  0.775670  0.665000  0.840027  0.991058
7  0.719344  0.072410  0.378754  0.527760  0.205777  0.870234
8  0.255007  0.098893  0.079230  0.225225  0.490689  0.554835
9  0.481340  0.300319  0.649762  0.460897  0.488406  0.166047


print(df.columns.is_unique)

--output:--
False  


name_counts = defaultdict(int)
new_col_names = []

for name in df.columns:
    new_count = name_counts[name] + 1
    new_col_names.append("{}{}".format(name, new_count))
    name_counts[name] = new_count 

print(new_col_names)


--output:--
['a1', 'b1', 'c1', 'b2', 'a2', 'a3']


df.columns = new_col_names
print(df)

--output:--
         a1        b1        c1        b2        a2        a3
0  0.264598  0.321378  0.466370  0.986725  0.580326  0.671168
1  0.938810  0.179999  0.403530  0.675112  0.279931  0.011046
2  0.935888  0.167405  0.733762  0.806580  0.392198  0.180401
3  0.218825  0.295763  0.174213  0.457533  0.234081  0.555525
4  0.891890  0.196245  0.425918  0.786676  0.791679  0.119826
5  0.721305  0.496182  0.236912  0.562977  0.249758  0.352434
6  0.433437  0.501975  0.088516  0.303067  0.916619  0.717283
7  0.026491  0.412164  0.787552  0.142190  0.665488  0.488059
8  0.729960  0.037055  0.546328  0.683137  0.134247  0.444709
9  0.391209  0.765251  0.507668  0.299963  0.348190  0.731980

print(df.columns.is_unique)

--output:--
True
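
The counting loop above numbers every column, including names that occur only once (c becomes c1). If that is not wanted, one possible variation is to count the totals first and add a suffix only to names that repeat. The function name number_repeated is hypothetical, and the sketch does not guard against a generated name clashing with an existing column such as a1.

```python
from collections import Counter, defaultdict

def number_repeated(names):
    """Append 1, 2, 3, ... only to names that occur more than once."""
    totals = Counter(names)      # how often each name occurs overall
    seen = defaultdict(int)      # running count per repeated name
    result = []
    for name in names:
        if totals[name] > 1:
            seen[name] += 1
            result.append("{}{}".format(name, seen[name]))
        else:
            result.append(name)  # unique names are left untouched
    return result

new_names = number_repeated(['a', 'b', 'c', 'b', 'a', 'a'])
print(new_names)
```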

score 1 · Accepted Answer

In case anyone needs this in Scala:

def renameDup(header: String): String = {
  val items: List[String] = header.split(",").toList
  var seen = Set[String]()
  var result = List[String]()

  for (item <- items) {
    var fudge = 1
    var newitem = item
    // keep bumping the suffix until the candidate has not been used yet
    while (seen.contains(newitem)) {
      fudge += 1
      newitem = item + "_" + fudge
    }
    seen += newitem
    result = result :+ newitem
  }
  result.mkString(",")
}

>>> ['a', 'b', 'a_2', 'a_2_2', 'a_2_3', 'a_3', 'a_2_4', 'a_2_2_2']

score 0 · Accepted Answer

Here is a solution that uses pandas all the way through.

import pandas as pd

# create data frame with duplicate column names
df = pd.DataFrame({'a': [1, 2, 3], 'b': [4, 5, 6]})
df.rename({'a': 'col', 'b': 'col'}, axis=1, inplace=True)
df

---output---

   col  col
0    1    4
1    2    5
2    3    6

# make a new data frame of column headers and number sequentially
dfcolumns = pd.DataFrame({'name': df.columns})
dfcolumns['counter'] = dfcolumns.groupby('name').cumcount().apply(str)

# remove counter for first case (optional) and combine suffixes
dfcolumns.loc[dfcolumns.counter=='0', 'counter'] = ''
df.columns = dfcolumns['name'] + dfcolumns['counter']
df

---output---

   col  col1
0    1     4
1    2     5
2    3     6
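
The same cumcount idea can be written more compactly on a Series of the column names, without building a helper DataFrame. This is a sketch under the assumption that all column names are strings; the frame is made up for illustration.

```python
import pandas as pd

# Illustrative frame with a name that appears three times
df = pd.DataFrame([[1, 4, 7, 9]], columns=['col', 'col', 'x', 'col'])

names = pd.Series(df.columns)
# running count of each name: 0 for the first occurrence, 1 for the next, ...
counts = names.groupby(names).cumcount()

# keep first occurrences as they are, append the running count to repeats
df.columns = names.where(counts == 0, names + counts.astype(str)).tolist()
print(df.columns.tolist())
```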

score -1 · Accepted Answer

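I ran into this problem when loading a DataFrame from an Oracle table. 7stud is right that pd.read_excel() automatically marks duplicate columns with a .1 suffix, but not all of the read functions do this. One workaround is to save the DataFrame to a csv (or excel) file and then reload it, so that the duplicate columns are renamed on the way back in.

data = pd.read_sql(SQL, connection)
# index=False keeps the row index from coming back as an extra column
data.to_csv(r'C:\temp\temp.csv', index=False)
data = pd.read_csv(r'C:\temp\temp.csv')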
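
The round trip does not necessarily need a file on disk. A minimal sketch of the same workaround using an in-memory buffer, where a small hand-built frame stands in for the result of the database query (an assumption for illustration):

```python
import io
import pandas as pd

# Stand-in for a frame loaded from a database (illustrative data)
data = pd.DataFrame([[1, 2, 3]], columns=['a', 'b', 'a'])

buffer = io.StringIO()
data.to_csv(buffer, index=False)  # index=False avoids an extra 'Unnamed: 0' column
buffer.seek(0)                    # rewind so the buffer can be read from the start

data = pd.read_csv(buffer)        # the parser renames the repeated 'a' to 'a.1'
print(data.columns.tolist())
```

Note that a CSV round trip re-infers the column types, so dtypes such as dates may need to be restored afterwards.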
