I've been struggling with this problem and can't solve it. I have the following DataFrame:

import databricks.koalas as ks
from pandas import Timestamp

x = ks.DataFrame.from_records(
{'ds': {0: Timestamp('2018-10-06 00:00:00'),
  1: Timestamp('2017-06-08 00:00:00'),
  2: Timestamp('2018-10-22 00:00:00'),
  3: Timestamp('2017-02-08 00:00:00'),
  4: Timestamp('2019-02-03 00:00:00'),
  5: Timestamp('2019-02-26 00:00:00'),
  6: Timestamp('2017-04-15 00:00:00'),
  7: Timestamp('2017-07-02 00:00:00'),
  8: Timestamp('2017-04-04 00:00:00'),
  9: Timestamp('2017-03-20 00:00:00'),
  10: Timestamp('2018-06-09 00:00:00'),
  11: Timestamp('2017-01-15 00:00:00'),
  12: Timestamp('2018-05-07 00:00:00'),
  13: Timestamp('2018-01-17 00:00:00'),
  14: Timestamp('2017-07-11 00:00:00'),
  15: Timestamp('2018-12-17 00:00:00'),
  16: Timestamp('2018-12-05 00:00:00'),
  17: Timestamp('2017-05-22 00:00:00'),
  18: Timestamp('2017-08-13 00:00:00'),
  19: Timestamp('2018-05-21 00:00:00')},
 'store': {0: 81,
  1: 128,
  2: 81,
  3: 128,
  4: 25,
  5: 128,
  6: 11,
  7: 124,
  8: 43,
  9: 25,
  10: 25,
  11: 124,
  12: 124,
  13: 128,
  14: 81,
  15: 11,
  16: 124,
  17: 11,
  18: 167,
  19: 128},
 'stock': {0: 1,
  1: 236,
  2: 3,
  3: 9,
  4: 36,
  5: 78,
  6: 146,
  7: 20,
  8: 12,
  9: 12,
  10: 15,
  11: 25,
  12: 10,
  13: 7,
  14: 0,
  15: 230,
  16: 80,
  17: 6,
  18: 110,
  19: 8},
 'sells': {0: 1.0,
  1: 17.0,
  2: 1.0,
  3: 2.0,
  4: 1.0,
  5: 2.0,
  6: 7.0,
  7: 1.0,
  8: 1.0,
  9: 1.0,
  10: 2.0,
  11: 1.0,
  12: 1.0,
  13: 1.0,
  14: 1.0,
  15: 1.0,
  16: 1.0,
  17: 3.0,
  18: 2.0,
  19: 1.0}}
)

and this function, which I want to use in a groupby-apply:

import numpy as np

def compute_indicator(df):
  return (
    df.copy()
    .assign(
      indicator=lambda x: x['a'] < np.percentile(x['b'], 80)
    )
    .astype(int)
    .fillna(1)
  )

where df is a pandas DataFrame. If I do the groupby-apply with pandas, the code runs as expected:

import pandas as pd
# This runs
a = pd.DataFrame.from_dict(x.to_dict()).groupby('store').apply(compute_indicator)

But when I try to run the same thing in Koalas, I get the following error: ValueError: cannot insert store, already exists

x.groupby('store').apply(compute_indicator)
# ValueError: cannot insert store, already exists

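I can't add type annotations to compute_indicator because some columns are not fixed (they travel with the DataFrame and are meant to be used by other transformations).

What should I do to run this code in Koalas?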

1 Answer

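As of Koalas 0.29.0, when koalas.DataFrame.groupby(keys).apply(f) is run for the first time on an untyped function f, it has to infer the schema, and it does so by running pandas.DataFrame.head(n).groupby(keys).apply(f). The problem is that pandas apply passes f a DataFrame in which the groupby keys appear both as the index and as columns (see this issue).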

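The result of pandas.DataFrame.head(n).groupby(keys).apply(f) is then converted to a koalas.DataFrame, so if f does not drop the keys columns, this conversion raises an exception because of the duplicated column names (see issue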
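The name collision described above can be reproduced with pandas alone. The sketch below is not Koalas code: it assumes the conversion step does something equivalent to reset_index(), and it builds the colliding frame by hand rather than through groupby. The helper drop_keys is a hypothetical name introduced here for illustration.

```python
import pandas as pd

# Hand-built stand-in for one group's result: 'store' is both the
# index name and a regular column (an assumption made for illustration).
result = pd.DataFrame(
    {"store": [81, 81], "stock": [1, 3]},
    index=pd.Index([81, 81], name="store"),
)

# Moving the index back into the columns collides with the existing column.
try:
    result.reset_index()
    error_message = None
except ValueError as exc:
    error_message = str(exc)
print(error_message)


def drop_keys(f, keys):
    """Hypothetical helper: wrap f so its output drops the groupby key columns."""
    def wrapped(df):
        # errors="ignore" keeps this safe if a key column is already absent.
        return f(df).drop(columns=keys, errors="ignore")
    return wrapped


# With the key column dropped, the same reset_index() succeeds.
fixed = drop_keys(lambda df: df, ["store"])(result).reset_index()
print(list(fixed.columns))
```

If the explanation above holds, the Koalas call would become x.groupby('store').apply(drop_keys(compute_indicator, ['store'])). That call has not been tested here and should be treated as an assumption.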

answered 2020-03-28T16:02:16.357