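The title says most of it: for each year, find the longest run of consecutive ones/1s (or Trues), and if a run of 1s at the end of a year continues into the next year, merge the two parts together. I have tried to implement this, but my solution feels like a bit of a hack, and I am wondering whether there is a better way to do it.
Reproducible example code: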
# Modules needed
import pandas as pd
import numpy as np
# Example Input array of Ones and Zeroes with a datetime-index (Original data is time-series)
InputArray = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
InputArray.index = (pd.date_range('2000-12-22', '2001-01-06'))
boolean_array = InputArray == 1 #convert to boolean
# Wanted Output
# Year MaxConsecutive-Ones
# 2000 9
# 2001 3
Below is my initial code for producing the wanted output:
# Function to get the maximum number of consecutive Trues in an array; it is applied to each year below (groupby)
def GetMaxConsecutive(boolean_array):
    distinct = boolean_array.ne(boolean_array.shift()).cumsum()  # give each run of Trues/Falses its own number
    distinct = distinct[boolean_array]  # keep only the run numbers that belong to Trues
    consect = distinct.value_counts().max()  # count the length of each run, then take the longest
    return consect
# Find the maximum consecutive 'Trues' for each year.
MaxConsecutive = boolean_array.groupby(lambda x: x.year).apply(GetMaxConsecutive)
print(MaxConsecutive)
# Year MaxConsecutive-Ones
# 2000 7
# 2001 3
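However, the output above is still not what we want, because groupby cuts the data at each year boundary.
So, in the code below, we try to "fix" this by computing the number of consecutive ones at the boundaries (i.e. current_year-01-01 and previous_year-12-31). If that boundary count is larger than the original MaxConsecutive-Ones from the output above, we replace the original value with it.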
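To see where the 7 comes from, here is a minimal sketch that labels the runs over the whole series before any grouping and then counts how many days of each run fall in each year. The names `flags`, `run_id`, `runs` and `split` are only illustrative and are not used in the rest of the code.

```python
import pandas as pd

# Same example data as above, rebuilt here so the sketch is self-contained
flags = pd.Series([0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
                  index=pd.date_range('2000-12-22', '2001-01-06')).eq(1)

# Label runs over the WHOLE series, i.e. before any grouping by year
run_id = flags.ne(flags.shift()).cumsum()
true_runs = run_id[flags]  # keep only the labels of the True runs

# Count how many days of each run fall in each calendar year
runs = pd.DataFrame({'run': true_runs.values, 'year': true_runs.index.year})
split = runs.groupby(['run', 'year']).size()
print(split)
# Run 4 is the long run: 7 of its days are in 2000 and 2 are in 2001,
# so a per-year groupby only ever sees 7 of its 9 days.
```

The two pieces of run 4 add up to the 9 that is wanted for the year 2000.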
# First) we acquire all start_of_year and end_of_year data
start_of_year = boolean_array.loc[(boolean_array.index.month==1) & (boolean_array.index.day==1)]
end_of_year = boolean_array.loc[(boolean_array.index.month==12) & (boolean_array.index.day==31)]
# Second) we mask above start_of_year and end_of_year data: to only have elements that are "True"
start_of_year = start_of_year[start_of_year]
end_of_year = end_of_year[end_of_year]
# Third) Change index to only contain years (rather than datetime index)
# For the "start_of_year" array, subtract 1 from the years when setting the index,
# so that start_of_year entries line up with the end_of_year entries of the previous year.
start_of_year = pd.Series(start_of_year)
start_of_year.index = start_of_year.index.year - 1
end_of_year = pd.Series(end_of_year)
end_of_year.index = end_of_year.index.year
# Combine index-years that are 'matched'
matched_years = pd.concat([end_of_year, start_of_year], axis = 1)
matched_years = matched_years.dropna()
matched_years = matched_years.index
# Copy the per-year result once, before the loop, so that updates for several years are all kept
Modify_MaxConsecutive = MaxConsecutive.copy()
# Finally) Compute the consecutive 1s/Trues at the boundaries
# for each matched year
for year in matched_years:
    # Compute the number of consecutive 1s/Trues at the start of the following year
    start = boolean_array.loc[boolean_array.index.year == (year + 1)]
    distinct = start.ne(start.shift()).cumsum()  # give each run of Trues/Falses its own number
    distinct_masked = distinct[start]  # keep only the run numbers that belong to Trues
    count_distincts = distinct_masked.value_counts()  # index: run number, value: length of that run
    start_consecutive = count_distincts.loc[distinct_masked.min()]  # length of the first True run, i.e. the one starting on Jan 1
    # Compute the number of consecutive 1s/Trues at the end of the current year
    end = boolean_array.loc[boolean_array.index.year == year]
    distinct = end.ne(end.shift()).cumsum()  # give each run of Trues/Falses its own number
    distinct_masked = distinct[end]  # keep only the run numbers that belong to Trues
    count_distincts = distinct_masked.value_counts()  # index: run number, value: length of that run
    end_consecutive = count_distincts.loc[distinct_masked.max()]  # length of the last True run, i.e. the one ending on Dec 31
    # Merge/add the consecutives at the boundaries (start-of-year and previous-end-of-year)
    ConsecutiveAtBoundaries = start_consecutive + end_consecutive
    # Modify the original MaxConsecutive if ConsecutiveAtBoundaries is larger
    if Modify_MaxConsecutive.loc[year] < ConsecutiveAtBoundaries:
        Modify_MaxConsecutive.loc[year] = ConsecutiveAtBoundaries
# Wanted Output is achieved!
print(Modify_MaxConsecutive)
# Year MaxConsecutive-Ones
# 2000 9
# 2001 3
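For comparison, here is a rough sketch of one possible simplification. It is a different technique from the boundary fix above: runs are labelled once over the whole series, and each run is credited in full to the year in which it starts. The function name `max_consecutive_by_start_year` and the column names are hypothetical. Note two assumptions that differ from my code above: the part of a run that spills into the next year is not counted for that next year, and a year in which no run starts does not appear in the result.

```python
import pandas as pd

def max_consecutive_by_start_year(flags):
    """Longest run of Trues per year, crediting each run to the year it starts in."""
    # Label runs over the whole series so that year boundaries do not split them
    run_id = flags.ne(flags.shift()).cumsum()
    true_runs = run_id[flags]
    runs = pd.DataFrame({'run': true_runs.values, 'year': true_runs.index.year})
    # One row per run: its total length and the year of its first day
    summary = runs.groupby('run').agg(length=('year', 'count'),
                                      start_year=('year', 'first'))
    # Longest run among those that start in each year
    return summary.groupby('start_year')['length'].max()

# Same example data as in the question
flags = pd.Series([0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1],
                  index=pd.date_range('2000-12-22', '2001-01-06')).eq(1)
result = max_consecutive_by_start_year(flags)
print(result)
# 2000 -> 9, 2001 -> 3 for this example
```

Whether crediting a run to its start year is acceptable depends on the use case; it matches the wanted output for this example, but I have not checked it against runs that span more than two years.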