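I am trying to build something that, for each record in a pandas DataFrame, shows the total of a given column, and also the total of that column over only those records that occurred before the date on the current record.
Note that the comparison should be the current record's STARTDATE against the ENDDATE of all records (only count profit from periods that ended before the current period started).
I need to clarify this because Diego Amicabile posted a very nice answer below that unfortunately did not get me where I need to be (my original question had only a single report-date field).
So in this DataFrame I want to end up with two columns: total profit (sumall) and company profit (sumco).
sumall is 0 for the first record, -500 for the second record (only the period ending 2017-01-01 precedes it), 300 for the third record (-500 + 800), and so on.
sumco stays 0 until we reach the second IBM record, where it is -500. It remains -500 on the third IBM record, because the second IBM record ends (2017-03-03) after the third record starts.
It should look like this: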
import io
import pandas as pd
text = """CO SECTOR PROFIT STARTMVYEAR TOTALPROFIT STARTDATE ENDDATE
IBM TECHNOLOGY -500 2500 500 2017-01-01 2017-01-01
APPLE TECHNOLOGY 800 4000 300 2017-01-02 2017-01-03
GM INDUSTRIAL 250 1000 0 2017-02-01 2017-02-03
IBM INDUSTRIAL 600 3000 100 2017-03-01 2017-03-03
IBM INDUSTRIAL 600 35000 100 2017-03-02 2017-06-01"""
df = pd.read_csv(io.StringIO(text), delim_whitespace=True, parse_dates=[0])
df['sumall'] = df.apply(lambda y: df[df['ENDDATE'] < y['STARTDATE'] ].PROFIT.sum())
df['sumco'] = df.apply(lambda y: df[(df['ENDDATE'] < y['STARTDATE'] )& (df.co==y.co)].PROFIT.sum())
The error is as follows:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4279)()
pandas\src\hashtable_class_helper.pxi in pandas.hashtable.Int64HashTable.get_item (pandas\hashtable.c:8543)()
TypeError: an integer is required
C:\Users\User\Anaconda3\lib\site-packages\pandas\core\frame.py in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
4150 if reduce is None:
4151 reduce = True
-> 4152 return self._apply_standard(f, axis, reduce=reduce)
4153 else:
4154 return self._apply_broadcast(f, axis)
C:\Users\User\Anaconda3\lib\site-packages\pandas\core\frame.py in _apply_standard(self, func, axis, ignore_failures, reduce)
4246 try:
4247 for i, v in enumerate(series_gen):
-> 4248 results[i] = func(v)
4249 keys.append(v.name)
4250 except Exception as e:
<ipython-input-13-92e1d7684747> in <lambda>(y)
KeyError Traceback (most recent call last)
<ipython-input-13-92e1d7684747> in <module>()
11 df = pd.read_csv(io.StringIO(text), delim_whitespace=True, parse_dates=[0])
12
---> 13 df['sumall'] = df.apply(lambda y: df[df['ENDDATE'] < y['STARTDATE'] ].PROFIT.sum())
14 df['sumco'] = df.apply(lambda y: df[(df['ENDDATE'] < y['STARTDATE'] )& (df.co==y.co)].PROFIT.sum())
C:\Users\User\Anaconda3\lib\site-packages\pandas\core\series.py in __getitem__(self, key)
601 key = com._apply_if_callable(key, self)
602 try:
--> 603 result = self.index.get_value(self, key)
604
605 if not is_scalar(result):
C:\Users\User\Anaconda3\lib\site-packages\pandas\indexes\base.py in get_value(self, series, key)
2167 try:
2168 return self._engine.get_value(s, k,
-> 2169 tz=getattr(series.dtype, 'tz', None))
2170 except KeyError as e1:
2171 if len(self) > 0 and self.inferred_type in ['integer', 'boolean']:
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3557)()
pandas\index.pyx in pandas.index.IndexEngine.get_value (pandas\index.c:3240)()
pandas\index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4363)()
KeyError: ('STARTDATE', 'occurred at index CO')
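The `occurred at index CO` part of the message points at the likely cause: `DataFrame.apply` defaults to `axis=0`, so the lambda receives each column in turn (starting with CO) rather than each row, and a column has no 'STARTDATE' label. Below is a minimal sketch of a row-wise version. It assumes three changes that are not in the code above: `axis=1` on both `apply` calls, `parse_dates` pointed at STARTDATE and ENDDATE instead of column 0, and the upper-case `CO` in place of `df.co`. `sep=r"\s+"` stands in for `delim_whitespace=True`, which the pandas documentation describes as equivalent.

```python
import io
import pandas as pd

text = """CO SECTOR PROFIT STARTMVYEAR TOTALPROFIT STARTDATE ENDDATE
IBM TECHNOLOGY -500 2500 500 2017-01-01 2017-01-01
APPLE TECHNOLOGY 800 4000 300 2017-01-02 2017-01-03
GM INDUSTRIAL 250 1000 0 2017-02-01 2017-02-03
IBM INDUSTRIAL 600 3000 100 2017-03-01 2017-03-03
IBM INDUSTRIAL 600 35000 100 2017-03-02 2017-06-01"""

# Assumption: parse the two date columns by name (the original parsed column 0, i.e. CO).
df = pd.read_csv(
    io.StringIO(text),
    sep=r"\s+",
    parse_dates=["STARTDATE", "ENDDATE"],
)

# Assumption: axis=1 so the lambda is called once per row, not once per column.
# For each row, sum PROFIT over all records whose ENDDATE is before this row's STARTDATE.
df["sumall"] = df.apply(
    lambda row: df.loc[df["ENDDATE"] < row["STARTDATE"], "PROFIT"].sum(),
    axis=1,
)

# Same rule, restricted to records of the same company (column name is CO, upper case).
df["sumco"] = df.apply(
    lambda row: df.loc[
        (df["ENDDATE"] < row["STARTDATE"]) & (df["CO"] == row["CO"]), "PROFIT"
    ].sum(),
    axis=1,
)

print(df[["CO", "STARTDATE", "ENDDATE", "PROFIT", "sumall", "sumco"]])
```

On the five sample rows this gives sumall = 0, -500, 300, 550, 550 and sumco = 0, 0, 0, -500, -500, which agrees with the values described in the question. Each row rescans the whole frame, so the cost grows quadratically with the number of rows; that is fine for small data but may be slow on large frames.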