python - String concatenation of two pandas columns

Question

I have the following DataFrame:

from pandas import *
df = DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3]})

It looks like this:

    bar foo
0    1   a
1    2   b
2    3   c

Now I want to have something like:

     bar
0    1 is a
1    2 is b
2    3 is c

How can I achieve this? I tried the following:

df['foo'] = '%s is %s' % (df['bar'], df['foo'])

but it gives me a wrong result:

>>>print df.ix[0]

bar                                                    a
foo    0    a
1    b
2    c
Name: bar is 0    1
1    2
2
Name: 0

Sorry for a dumb question, but this one, "pandas: combine two columns in a DataFrame", wasn't helpful for me.

score 154 · Accepted Answer


df['bar'] = df.bar.map(str) + " is " + df.foo
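For reference, a self-contained sketch of the one-liner above, assuming the DataFrame from the question. It uses `astype(str)`, which for this integer column gives the same strings as `map(str)`.

```python
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'], 'bar': [1, 2, 3]})

# Convert the integer column to strings first; "+" then concatenates
# element by element, aligned on the index.
df['bar'] = df['bar'].astype(str) + ' is ' + df['foo']
```

The conversion is needed because adding an integer column to a string directly raises a TypeError.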

answered 2012-08-08T06:03:51.577

score 117 · Accepted Answer

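This question has already been answered, but I think it is worth adding some useful methods that have not been discussed yet, and comparing the performance of all methods proposed so far.

Here are some useful solutions to this problem, in increasing order of performance.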

`DataFrame.agg`

This is a simple str.format-based approach.

df['baz'] = df.agg('{0[bar]} is {0[foo]}'.format, axis=1)
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c

You can also use f-string formatting here:

df['baz'] = df.agg(lambda x: f"{x['bar']} is {x['foo']}", axis=1)
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c

`char.array`-based concatenation

Convert the columns to chararrays, then add them together.

a = np.char.array(df['bar'].values)
b = np.char.array(df['foo'].values)

df['baz'] = (a + b' is ' + b).astype(str)
df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c

List comprehension with `zip`

I cannot overstate how underrated list comprehensions are in pandas.

df['baz'] = [str(x) + ' is ' + y for x, y in zip(df['bar'], df['foo'])]

Alternatively, use str.join to concatenate (this also scales better):

df['baz'] = [
    ' '.join([str(x), 'is', y]) for x, y in zip(df['bar'], df['foo'])]

df
  foo  bar     baz
0   a    1  1 is a
1   b    2  2 is b
2   c    3  3 is c

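List comprehensions excel at string manipulation, because string operations are inherently hard to vectorize, and most pandas "vectorized" functions are basically wrappers around loops. I have written extensively about this topic in "For loops with pandas - When should I care?". In general, if you don't have to worry about index alignment, use a list comprehension when dealing with string and regex operations.

By default, the list comprehensions above do not handle NaNs. If you need to handle them, you can always write a function that wraps the concatenation in a try-except.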

def try_concat(x, y):
    try:
        return str(x) + ' is ' + y
    except (ValueError, TypeError):
        return np.nan


df['baz'] = [try_concat(x, y) for x, y in zip(df['bar'], df['foo'])]
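To see the effect, here is a small sketch that runs the same helper on a frame with one missing `foo` value. The sample data is an assumption made for illustration, not part of the original answer.

```python
import numpy as np
import pandas as pd

def try_concat(x, y):
    try:
        return str(x) + ' is ' + y
    except (ValueError, TypeError):
        # Concatenating a str with a float NaN raises TypeError.
        return np.nan

# Hypothetical data: the second 'foo' entry is missing.
df = pd.DataFrame({'foo': ['a', np.nan, 'c'], 'bar': [1, 2, 3]})
df['baz'] = [try_concat(x, y) for x, y in zip(df['bar'], df['foo'])]
```

Only a missing value in `foo` is caught this way. A missing value in `bar` would not raise, because `str(x)` happily turns NaN into the text 'nan'.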

`perfplot` performance measurements

Graph generated using perfplot. Here is the complete code listing.

Functions

def brenbarn(df):
    return df.assign(baz=df.bar.map(str) + " is " + df.foo)

def danielvelkov(df):
    return df.assign(baz=df.apply(
        lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1))

def chrimuelle(df):
    return df.assign(
        baz=df['bar'].astype(str).str.cat(df['foo'].values, sep=' is '))

def vladimiryashin(df):
    return df.assign(baz=df.astype(str).apply(lambda x: ' is '.join(x), axis=1))

def erickfis(df):
    return df.assign(
        baz=df.apply(lambda x: f"{x['bar']} is {x['foo']}", axis=1))

def cs1_format(df):
    return df.assign(baz=df.agg('{0[bar]} is {0[foo]}'.format, axis=1))

def cs1_fstrings(df):
    return df.assign(baz=df.agg(lambda x: f"{x['bar']} is {x['foo']}", axis=1))

def cs2(df):
    a = np.char.array(df['bar'].values)
    b = np.char.array(df['foo'].values)

    return df.assign(baz=(a + b' is ' + b).astype(str))

def cs3(df):
    return df.assign(
        baz=[str(x) + ' is ' + y for x, y in zip(df['bar'], df['foo'])])
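The listing above defines the candidate functions but not a timing harness. A minimal sketch of such a comparison follows. It uses the standard-library `timeit` in place of perfplot (a swapped-in tool, not what produced the original graph), and `make_df`, the frame size, and the repeat count are hypothetical choices.

```python
import timeit

import pandas as pd

def make_df(n):
    # Hypothetical helper: repeat the question's data n times.
    return pd.DataFrame({'foo': ['a', 'b', 'c'] * n, 'bar': [1, 2, 3] * n})

def brenbarn(df):
    return df.assign(baz=df.bar.map(str) + " is " + df.foo)

def cs3(df):
    return df.assign(
        baz=[str(x) + ' is ' + y for x, y in zip(df['bar'], df['foo'])])

df = make_df(1000)

# Total seconds for 5 calls of each function on the same frame.
timings = {}
for func in (brenbarn, cs3):
    timings[func.__name__] = timeit.timeit(lambda: func(df), number=5)
```

Absolute numbers depend on the machine and the pandas version, so the relative ranking for a given data size is the more useful result.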

score 45 · Accepted Answer

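The problem in your code is that you want to apply the operation to every row. The way you have written it, though, takes the whole 'bar' and 'foo' columns, converts them to strings, and gives you back one big string. You can write it like this: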

df.apply(lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1)

It is longer than the other answer, but more generic (it can be used with values that are not strings).
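The difference can be checked directly. A short sketch, assuming the DataFrame from the question:

```python
import pandas as pd

df = pd.DataFrame({'foo': ['a', 'b', 'c'], 'bar': [1, 2, 3]})

# "%" formatting sees two whole Series, so the result is one Python
# string built from the printed form of both columns.
whole = '%s is %s' % (df['bar'], df['foo'])

# With axis=1 the lambda receives one row at a time instead.
row_wise = df.apply(lambda x: '%s is %s' % (x['bar'], x['foo']), axis=1)
```

Assigning `whole` to a column broadcasts that single string to every row, which is the odd output shown in the question.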

score 13 · Accepted Answer


You can also use

df['bar'] = df['bar'].astype(str).str.cat(df['foo'].values.astype(str), sep=' is ')

answered 2014-03-28T17:56:36.183

score 11 · Accepted Answer

df[['bar', 'foo']].astype(str).apply(lambda x: ' is '.join(x), axis=1)

0    1 is a
1    2 is b
2    3 is c
dtype: object

score 6 · Accepted Answer

series.str.cat is the most flexible way to approach this problem:

For df = pd.DataFrame({'foo':['a','b','c'], 'bar':[1, 2, 3]})

df.foo.str.cat(df.bar.astype(str), sep=' is ')

>>>  0    a is 1
     1    b is 2
     2    c is 3
     Name: foo, dtype: object

OR

df.bar.astype(str).str.cat(df.foo, sep=' is ')

>>>  0    1 is a
     1    2 is b
     2    3 is c
     Name: bar, dtype: object

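Unlike .join() (which joins the lists contained in a single Series), this method joins two Series together. It also lets you ignore or replace NaN values as desired.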
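A brief sketch of the NaN handling, using the `na_rep` parameter of `Series.str.cat`. The sample frame with a missing `foo` value and the '?' placeholder are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical data: the second 'foo' entry is missing.
df = pd.DataFrame({'foo': ['a', np.nan, 'c'], 'bar': [1, 2, 3]})
bar = df['bar'].astype(str)

# Default: a missing value on either side makes the whole result missing.
propagated = bar.str.cat(df['foo'], sep=' is ')

# With na_rep, missing values are replaced before concatenation.
replaced = bar.str.cat(df['foo'], sep=' is ', na_rep='?')
```

Here `propagated` keeps a missing value in the second row, while `replaced` contains '2 is ?' there.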

score 4 · Accepted Answer

@DanielVelkov's answer is the proper one, but using formatted string literals (f-strings) is faster:

# Daniel's
%timeit df.apply(lambda x:'%s is %s' % (x['bar'],x['foo']),axis=1)
## 963 µs ± 157 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

# String literals - python 3
%timeit df.apply(lambda x: f"{x['bar']} is {x['foo']}", axis=1)
## 849 µs ± 4.28 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

score 0 · Accepted Answer

I think the most concise solution for an arbitrary number of columns is a short-form version of this answer:

df.astype(str).apply(' is '.join, axis=1)

You can shave off two more characters with df.agg(), but it is slower:

df.astype(str).agg(' is '.join, axis=1)
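A small sketch with three columns. The extra `qux` column and the explicit column list are assumptions added for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'foo': ['a', 'b', 'c'],
    'bar': [1, 2, 3],
    'qux': [0.5, 1.5, 2.5],  # hypothetical third column
})

# Selecting the columns explicitly fixes the order of the joined parts.
cols = ['bar', 'foo', 'qux']
joined = df[cols].astype(str).apply(' is '.join, axis=1)
```

The parts are joined in column order, so without the explicit selection the result follows whatever order the DataFrame happens to have. Recent pandas versions keep the insertion order of the dict, which would put `foo` first here.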

score 0 · Accepted Answer

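I ran into a specific case with 10^11 rows in my dataframe, and none of the proposed solutions was appropriate for it. I used categories instead, which should work fine whenever the number of unique strings is not too large. This is easily done in R with XxY and factors, but I could not find another way to do it in Python (I am new to Python). If anyone knows of a place where this is implemented, I would be glad to hear about it.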

import itertools
import pandas as pd

def Create_Interaction_var(df,Varnames):
    '''
    :df data frame
    :list of 2 column names, say "X" and "Y". 
    The two columns should be strings or categories
    convert strings columns to categories
    Add a column with the "interaction of X and Y" : X x Y, with name 
    "Interaction-X_Y"
    '''
    df.loc[:, Varnames[0]] = df.loc[:, Varnames[0]].astype("category")
    df.loc[:, Varnames[1]] = df.loc[:, Varnames[1]].astype("category")
    CatVar = "Interaction-" + "-".join(Varnames)
    Var0Levels = pd.DataFrame(enumerate(df.loc[:,Varnames[0]].cat.categories)).rename(columns={0 : "code0",1 : "name0"})
    Var1Levels = pd.DataFrame(enumerate(df.loc[:,Varnames[1]].cat.categories)).rename(columns={0 : "code1",1 : "name1"})
    NbLevels=len(Var0Levels)

    names = pd.DataFrame(list(itertools.product(dict(enumerate(df.loc[:,Varnames[0]].cat.categories)),
                                                dict(enumerate(df.loc[:,Varnames[1]].cat.categories)))),
                         columns=['code0', 'code1']).merge(Var0Levels,on="code0").merge(Var1Levels,on="code1")
    names=names.assign(Interaction=[str(x) + '_' + y for x, y in zip(names["name0"], names["name1"])])
    names["code01"]=names["code0"] + NbLevels*names["code1"]
    df.loc[:,CatVar]=df.loc[:,Varnames[0]].cat.codes+NbLevels*df.loc[:,Varnames[1]].cat.codes
    df.loc[:, CatVar]=  df[[CatVar]].replace(names.set_index("code01")[["Interaction"]].to_dict()['Interaction'])[CatVar]
    df.loc[:, CatVar] = df.loc[:, CatVar].astype("category")
    return df
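The same idea can be sketched more compactly: build each combined label once per pair of levels, then index into that small lookup table with the category codes instead of concatenating strings row by row. This is an illustrative sketch, not the author's implementation. The function name `interaction_category` and the sample data are hypothetical, and it assumes there are no missing values and that the combined labels are unique.

```python
import itertools

import pandas as pd

def interaction_category(df, x, y, sep="_"):
    cat_x = df[x].astype("category")
    cat_y = df[y].astype("category")
    levels_x = cat_x.cat.categories
    levels_y = cat_y.cat.categories

    # One label per pair of levels, ordered so that the pair with
    # x-code i and y-code j sits at position i + len(levels_x) * j.
    labels = [str(a) + sep + str(b)
              for b, a in itertools.product(levels_y, levels_x)]

    # Widen the codes to int64 so the multiplication cannot overflow.
    codes = (cat_x.cat.codes.to_numpy().astype("int64")
             + len(levels_x) * cat_y.cat.codes.to_numpy().astype("int64"))

    return pd.Series(pd.Categorical.from_codes(codes, categories=labels),
                     index=df.index)

demo = pd.DataFrame({"X": ["a", "b", "a"], "Y": ["u", "u", "v"]})
result = interaction_category(demo, "X", "Y")
```

String concatenation happens only once per pair of levels, so the per-row work is integer arithmetic on the codes.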
