python - Difference between the pandas aggregators .first() and .last()

Question

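I'm curious what last() and first() do in this specific instance (when chained to a resample). Correct me if I'm wrong, but my understanding is that if you pass an argument to first or last, e.g. 3, it returns the first 3 months or first 3 years.

In this case, since I'm not passing any arguments to first() and last(), what is it actually doing when I resample like this? I know that if I resample by chaining .mean(), I get yearly values that are the average of all the months in each year, but what happens when I use last()?

More importantly, why do first() and last() give me different answers in this context? I can see that numerically they are not equal.

I.e.: post2008.resample().first() != post2008.resample().last()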

TL;DR:

What do .first() and .last() do?
What do .first() and .last() do in this instance, when chained to a resample?
Why is .resample().first() != .resample().last()?

Here is the code before the aggregation:

import pandas as pd

# Read 'GDP.csv' into a DataFrame: gdp
gdp = pd.read_csv('GDP.csv', index_col='DATE', parse_dates=True)

# Slice all the gdp data from 2008 onward: post2008
post2008 = gdp.loc['2008-01-01':,:]

# Print the last 8 rows of post2008
print(post2008.tail(8))

Here is the output of print(post2008.tail(8)):

              VALUE
DATE               
2014-07-01  17569.4
2014-10-01  17692.2
2015-01-01  17783.6
2015-04-01  17998.3
2015-07-01  18141.9
2015-10-01  18222.8
2016-01-01  18281.6
2016-04-01  18436.5

Here is the code that resamples and aggregates with last():

# Resample post2008 by year, keeping last(): yearly
yearly = post2008.resample('A').last()
print(yearly)

This is what yearly looks like for post2008.resample('A').last():

              VALUE
DATE               
2008-12-31  14549.9
2009-12-31  14566.5
2010-12-31  15230.2
2011-12-31  15785.3
2012-12-31  16297.3
2013-12-31  16999.9
2014-12-31  17692.2
2015-12-31  18222.8
2016-12-31  18436.5

Here is the code that resamples and aggregates with first():

# Resample post2008 by year, keeping first(): yearly
yearly = post2008.resample('A').first()
print(yearly)

This is what yearly looks like for post2008.resample('A').first():

              VALUE
DATE               
2008-12-31  14668.4
2009-12-31  14383.9
2010-12-31  14681.1
2011-12-31  15238.4
2012-12-31  15973.9
2013-12-31  16475.4
2014-12-31  17025.2
2015-12-31  17783.6
2016-12-31  18281.6

score 0 · Accepted Answer

First, let's create a DataFrame with some sample data:

import pandas as pd
dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                            '2015-04-01', '2015-07-01', '2015-07-01',
                            '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)
print(df)

The output will be

            VALUE
2014-07-01   1000
2014-10-01   2000
2015-01-01   3000
2015-04-01   4000
2015-07-01   5000
2015-07-01   6000
2016-01-01   7000
2016-04-01   8000

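If we pass e.g. '6M' to df.first (which is not an aggregator but a DataFrame method), we select the first six months of data, which in the example above consists of just two rows: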

print(df.first('6M'))

            VALUE
2014-07-01   1000
2014-10-01   2000

Similarly, last returns only the rows that fall within the last six months of data:

print(df.last('6M'))

            VALUE
2016-01-01   7000
2016-04-01   8000
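
As a side note, DataFrame.first and DataFrame.last with an offset are marked as deprecated in recent pandas releases (2.1 and later, as far as I know), and the recommended replacement is a mask combined with .loc. Below is a minimal sketch of such a selection on the same sample data. The variable names are made up for illustration, and the calendar-month cutoffs are an assumption that reproduces the two results above; they are not guaranteed to match first('6M') / last('6M') in every edge case.

```python
import pandas as pd

dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                          '2015-04-01', '2015-07-01', '2015-07-01',
                          '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)

# Rows within six calendar months after the first timestamp.
head_cutoff = df.index[0] + pd.DateOffset(months=6)    # 2015-01-01
first_six_months = df.loc[df.index < head_cutoff]

# Rows within six calendar months before the last timestamp.
tail_cutoff = df.index[-1] - pd.DateOffset(months=6)   # 2015-10-01
last_six_months = df.loc[df.index > tail_cutoff]

print(first_six_months)   # the 2014-07-01 and 2014-10-01 rows
print(last_six_months)    # the 2016-01-01 and 2016-04-01 rows
```

Either way, this is plain row selection on the index: nothing is aggregated, which is the key difference from the resample case.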

In this context, not passing the required argument results in an error:

print(df.first())

TypeError: first() missing 1 required positional argument: 'offset'

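On the other hand, df.resample('Y') returns a Resampler object, which has the aggregation methods first, last, mean, etc. Here they take no offset argument: they simply keep the first (respectively, last) value within each year, instead of e.g. averaging all values or applying some other kind of aggregation. That is why the two results differ whenever a year contains more than one distinct value: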

print(df.resample('Y').first())

            VALUE
2014-12-31   1000
2015-12-31   3000  # this is the first of the 4 values in 2015
2016-12-31   7000

print(df.resample('Y').last())

            VALUE
2014-12-31   2000
2015-12-31   6000  # this is the last of the 4 values in 2015
2016-12-31   8000
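
To make the comparison with other aggregators concrete, here is a small sketch that puts first, last, mean and count side by side for each year of the sample data. The name summary is just for illustration. An offset object (pd.offsets.YearEnd()) is passed as the rule instead of a string because, to my knowledge, the accepted yearly alias differs between pandas versions ('A' or 'Y' in older releases, 'YE' in newer ones).

```python
import pandas as pd

dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                          '2015-04-01', '2015-07-01', '2015-07-01',
                          '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)

# Group the rows into calendar years; each bin is labelled by its year end.
by_year = df.resample(pd.offsets.YearEnd())['VALUE']

# One column per aggregator, one row per year.
summary = pd.DataFrame({
    'first': by_year.first(),   # earliest row in each year
    'last': by_year.last(),     # latest row in each year
    'mean': by_year.mean(),     # average of all rows in each year
    'count': by_year.count(),   # number of rows in each year
})
print(summary)
```

The first and last columns can only agree for a year that holds a single row (or rows with identical values), which is not the case for any year here.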

As an additional example, consider also the case of grouping by a smaller period:

print(df.resample('M').last().head())

             VALUE
2014-07-31  1000.0  # this is the last (and only) value for July 2014
2014-08-31     NaN  # no data for August 2014
2014-09-30     NaN  # no data for September 2014
2014-10-31  2000.0
2014-11-30     NaN  # no data for November 2014

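In this case, any period that has no values is filled with NaN. Also, for the months shown here, using first instead of last would return the same values, since each of those months contains at most one value.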
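
A quick way to check where first and last actually disagree is to count the rows in each bin. The sketch below does that for monthly bins on the sample data; the names monthly, empty_months and differing are illustrative only. Note that the sample index contains 2015-07-01 twice, so July 2015 is the one month that holds two rows.

```python
import pandas as pd

dates = pd.DatetimeIndex(['2014-07-01', '2014-10-01', '2015-01-01',
                          '2015-04-01', '2015-07-01', '2015-07-01',
                          '2016-01-01', '2016-04-01'])
df = pd.DataFrame({'VALUE': range(1000, 9000, 1000)}, index=dates)

# Monthly bins, given as an offset object rather than a string alias.
by_month = df.resample(pd.offsets.MonthEnd())['VALUE']

monthly = pd.DataFrame({
    'count': by_month.count(),
    'first': by_month.first(),
    'last': by_month.last(),
})

# Months without any rows: first() and last() are both NaN there.
empty_months = monthly[monthly['count'] == 0]

# Months where the two aggregators disagree. NaN != NaN evaluates to True,
# so the empty months are dropped afterwards with dropna().
differing = monthly[monthly['first'] != monthly['last']].dropna()
print(differing)
```

Only the month with more than one row shows up in differing; every other non-empty month has a single row, so first and last necessarily return the same value.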
