python - Maximum consecutive ones/Trues per year, taking the boundaries into account (start of year and end of year)

Question

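The title says most of it: find the maximum number of consecutive ones/1s (or Trues) for each year, and if a run of consecutive 1s at the end of a year continues into the next year, merge the two parts together. I have tried to implement this, but it feels a bit "hacky", and I wonder whether there is a better way to do it.

Reproducible example code: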

# Modules needed
import pandas as pd
import numpy as np

# Example Input array of Ones and Zeroes with a datetime-index (Original data is time-series)

InputArray = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
InputArray.index = (pd.date_range('2000-12-22', '2001-01-06'))
boolean_array = InputArray == 1 #convert to boolean

# Wanted Output
# Year    MaxConsecutive-Ones
# 2000    9
# 2001    3

Below is my initial code for achieving the wanted output:

# function to get max consecutive for a particular array. i.e. will be done for each year below (groupby)
def GetMaxConsecutive(boolean_array):
    distinct = boolean_array.ne(boolean_array.shift()).cumsum() # associate each trues/false to a number 
    distinct = distinct[boolean_array] # only consider trues from the distinct values
    consect = distinct.value_counts().max() # find the number of consecutives of distincts values then find the maximum value
    return consect

# Find the maximum consecutive 'Trues' for each year.
MaxConsecutive = boolean_array.groupby(lambda x: x.year).apply(GetMaxConsecutive)
print(MaxConsecutive)
# Year    MaxConsecutive-Ones
# 2000    7
# 2001    3

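However, the output above is still not what we want, because the groupby function cuts the data at each year boundary, so a streak that crosses from one year into the next is split in two.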
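To make the truncation concrete, here is a minimal sketch (the names `s`, `run_id` and `days_per_run_and_year` are illustrative and not part of the original code). It labels every run once on the whole series and then counts how many True days of each run fall in each calendar year. The long streak contributes 7 days to 2000 and 2 days to 2001, which is why the per-year groupby reports 7 rather than 9.

```python
import pandas as pd

s = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1],
              index=pd.date_range('2000-12-22', '2001-01-06')).astype(bool)

# Label runs on the whole series (before any grouping) and keep only the True days
run_id = s.ne(s.shift()).cumsum()[s]

# Count the True days of each run that fall in each calendar year
days_per_run_and_year = run_id.groupby(
    [run_id.to_numpy(), run_id.index.year.to_numpy()]
).size()
print(days_per_run_and_year)
```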

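Therefore, in the code below, we try to "fix" this by computing the MaxConsecutive-Ones at the boundaries (i.e. current_year-01-01 and previous_year-12-31). If the MaxConsecutive-Ones at the boundaries is larger than the original MaxConsecutive-Ones from the output above, we replace the original value with it.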

# First) we acquire all start_of_year and end_of_year data
start_of_year = boolean_array.loc[(boolean_array.index.month==1) & (boolean_array.index.day==1)]
end_of_year = boolean_array.loc[(boolean_array.index.month==12) & (boolean_array.index.day==31)]

# Second) we mask above start_of_year and end_of_year data: to only have elements that are "True"
start_of_year = start_of_year[start_of_year]
end_of_year = end_of_year[end_of_year]

# Third) Change index to only contain years (rather than datetime index)
# Also for "start_of_year" array include -1 to the years when setting the index. 
# So we can match end_of_year to start_of_year arrays!
start_of_year = pd.Series(start_of_year)
start_of_year.index = start_of_year.index.year - 1
end_of_year = pd.Series(end_of_year)
end_of_year.index = end_of_year.index.year

# Combine index-years that are 'matched'
matched_years = pd.concat([end_of_year, start_of_year], axis = 1)
matched_years = matched_years.dropna()
matched_years = matched_years.index

# Finally) Compute the consecutive 1s/trues at the boundaries 
# for each matched years
for year in matched_years:
    # Compute the amount of consecutive 1s/trues at the start-of-year
    start = boolean_array.loc[boolean_array.index.year == (year + 1)]
    distinct = start.ne(start.shift()).cumsum() # associate each consecutive trues/false to a number 
    distinct_masked = distinct[start] # only consider trues from the distinct values i.e. remove elements within "distinct" that are "False" within the boolean array. 
    count_distincts = distinct_masked.value_counts() # the index of this array is the associated distinct_value and its actual value/element is the amount of consecutives.
    start_consecutive = count_distincts.loc[distinct_masked.min()] # Find the number of consecutives at the start of year (or where distinct_masked is minimum)

    # Compute the amount of consecutive 1s/trues at the previous-end-of-year
    end = boolean_array.loc[boolean_array.index.year == year]
    distinct = end.ne(end.shift()).cumsum() # associate each trues/false to a number 
    distinct_masked = distinct[end] # only consider trues from the distinct values i.e. remove elements within "distinct" that are "False" within the boolean array.
    count_distincts = distinct_masked.value_counts() # the index of this array is the associated distinct_value and its actual value/element is the amount of consecutives.
    end_consecutive = count_distincts.loc[distinct_masked.max()] # Find the number of consecutives at the end of year (or where distinct_masked is maximum)


    # Merge/add the consecutives at the boundaries (start-of-year and previous-end-of-year)
    ConsecutiveAtBoundaries = start_consecutive + end_consecutive

    # Now we modify the original MaxConsecutive if ConsecutiveAtBoundaries is larger.
    # The copy is made only on the first matched year, so updates from earlier iterations are kept.
    if year == matched_years[0]:
        Modify_MaxConsecutive = MaxConsecutive.copy()
    if Modify_MaxConsecutive.loc[year] < ConsecutiveAtBoundaries:
        Modify_MaxConsecutive.loc[year] = ConsecutiveAtBoundaries

# Wanted Output is achieved!
print(Modify_MaxConsecutive)
# Year    MaxConsecutive-Ones
# 2000    9
# 2001    3

score 1 · Accepted Answer

Not sure if this is the most efficient approach, but it is one solution:

arr = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1])
arr.index = (pd.date_range('2000-12-22', '2001-01-06'))
arr = arr.astype(bool)
df = arr.reset_index()  # convert to df
df['adj_year'] = df['index'].dt.year  # adj_year will be adjusted for streaks

mask = (df[0].eq(True)) & (df[0].shift().eq(True))
df['adj_year'] = df['adj_year'].mask(mask)  # we mark streaks as NaN and fill from above
df.adj_year = df.adj_year.ffill().astype('int')
df.groupby('adj_year').apply(lambda x: ((x[0] == x[0].shift()).cumsum() + 1).max())
# find max streak for each adjusted year

Output:

adj_year
2000    9
2001    3
dtype: int64

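Notes:

By convention, variable names in Python (other than classes) are lowercase, hence arr rather than InputArray
1 and 0 are equivalent to True and False, so you can convert them to booleans without an explicit comparison
cumsum effectively counts from zero here (the first element of a streak never equals its predecessor), so we add 1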
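One caveat about the last line of the code above: the cumulative sum of `x[0] == x[0].shift()` is never reset between streaks and also counts repeated Falses. It therefore matches the longest streak only when each group contains a single run of repeated values, as happens in this example. Below is a minimal sketch of a count that resets for every run. The helper name `longest_true_run` is hypothetical, and the `adj_year` labels are rebuilt as a standalone Series rather than as a DataFrame column.

```python
import pandas as pd

def longest_true_run(values: pd.Series) -> int:
    """Length of the longest run of consecutive True values."""
    run_id = values.ne(values.shift()).cumsum()      # new id whenever the value changes
    return int(values.groupby(run_id).sum().max())   # Trues per run, then the longest

arr = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1],
                index=pd.date_range('2000-12-22', '2001-01-06')).astype(bool)

# Days that continue a streak inherit the year in which the streak began
continues_streak = arr & arr.shift(fill_value=False)
adj_year = (pd.Series(arr.index.year, index=arr.index)
            .mask(continues_streak)
            .ffill()
            .astype(int))

result = arr.groupby(adj_year).apply(longest_true_run)
print(result)
```

For a group such as [True, True, False, True, True], this helper returns 2, whereas the cumulative count plus one would return 3.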

score 1 · Accepted Answer

Now that I have some time, here is my solution:

# Modules needed
import pandas as pd
import numpy as np

input_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1], dtype=bool)
input_dates = pd.date_range('2000-12-22', '2001-01-06')
df = pd.DataFrame({"input": input_array, "dates": input_dates})

streak_starts = df.index[~df.input.shift(1, fill_value=False) & df.input]
streak_ends = df.index[~df.input.shift(-1, fill_value=False) & df.input] + 1
streak_lengths = streak_ends - streak_starts

streak_df = df.iloc[streak_starts].copy()
streak_df["streak_length"] = streak_lengths

longest_streak_per_year = streak_df.groupby(streak_df.dates.dt.year).streak_length.max()

Output:

dates
2000    9
2001    3
Name: streak_length, dtype: int64

score 0 · Accepted Answer

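This solution does not answer the question exactly, so it will not be the final answer: a streak that crosses a year boundary is counted in full for both the current year and the next year.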

boolean_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1]).astype(bool)
boolean_array.index = (pd.date_range('2000-12-22', '2001-01-06'))

distinct = boolean_array.ne(boolean_array.shift()).cumsum() 
distinct_masked = distinct[boolean_array] 
streak_sum = distinct_masked.value_counts() 
streak_sum_series =  pd.Series(streak_sum.loc[distinct_masked].values, index = distinct_masked.index.copy())
max_consect = streak_sum_series.groupby(lambda x: x.year).max()

Output:

max_consect 
2000    9
2001    9
dtype: int64
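Building on the same `distinct` labels, a streak can instead be counted only in the year in which it starts, which gives the output asked for in the question. The following is a sketch under that assumption (each streak belongs to the year of its first day). The names `dates`, `runs`, `start` and `length` are illustrative.

```python
import pandas as pd

boolean_array = pd.Series([0,1,0,1,1,1,1,1,1,1,1,1,0,1,1,1]).astype(bool)
boolean_array.index = pd.date_range('2000-12-22', '2001-01-06')

distinct = boolean_array.ne(boolean_array.shift()).cumsum()
distinct_masked = distinct[boolean_array]

# One row per streak: its first date and its number of days
dates = distinct_masked.index.to_series()
runs = dates.groupby(distinct_masked.to_numpy()).agg(start='min', length='count')

# Assign each streak to the year of its first day, then take the longest per year
max_consect = runs.groupby(runs['start'].dt.year)['length'].max()
print(max_consect)
```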
