regex - Replace four-digit numbers in pandas

Question

import pandas as pd
dataframe = pd.DataFrame({'Data' : ['The **ALI**1929 for 90 days but not 77731929 ', 
                                       'For all **ALI**1952  28A 177945 ', 
                                       'But the **ALI**1914 and **ALI**1903 1912',],
                          'ID': [1,2,3]

                         })

Data    ID
0   The **ALI**1929 for 90 days but not 77731929    1
1   For all **ALI**1952 28A 177945                  2
2   But the **ALI**1914 and **ALI**1903 1912        3

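My dataframe looks like the one above. My goal is to replace any number of 1929 or lower that is associated with **ALI** with the word OLDER. **ALI**1929 will become **ALI**OLDER, ALI**1903 will also become **ALI**OLDER, but **ALI**1952 will remain unchanged. Based on "How to extract numbers with a certain length from a string in python?", I tried: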

dataframe['older'] = dataframe['Data'].str.replace(r'(?<!\d)(\d{3})(?!\d)', 'OLDER')

But this does not quite do what I want. I would like output like this:

 Data        ID     older
0                 The ALI**OLDER for 90 days but not 77731929
1                 For all ALI**1952 28A 177945
2                 But the ALI**OLDER and ALI**OLDER 1912

How can I change my regex str.replace(r'(?<!\d)(\d{3})(?!\d)' to do this?

score 1 · Accepted Answer

You can use this:

(?<=\*)(?:0\d{3}|1[0-8]\d{2}|19[0-2]\d)(?!\d)

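(?<=\*) - must be preceded by *
(?:0\d{3}|1[0-8]\d{2}|19[0-2]\d)
- 0\d{3} - matches any 4-digit number from 0000 to 0999
- | - alternation
- 1[0-8]\d{2} - matches any 4-digit number from 1000 to 1899
- | - alternation
- 19[0-2]\d - matches any 4-digit number from 1900 to 1929
(?!\d) - must not be followed by a digit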
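
As a minimal sketch, this is how the pattern could be applied to the question's data. It assumes a pandas version where str.replace accepts regex=True, and the variable name pattern is only illustrative:

```python
import pandas as pd

dataframe = pd.DataFrame({
    'Data': ['The **ALI**1929 for 90 days but not 77731929 ',
             'For all **ALI**1952  28A 177945 ',
             'But the **ALI**1914 and **ALI**1903 1912'],
    'ID': [1, 2, 3],
})

# Four digits from 0000 to 1929, directly after '*' and not followed by a digit.
pattern = r'(?<=\*)(?:0\d{3}|1[0-8]\d{2}|19[0-2]\d)(?!\d)'

# regex=True matters on recent pandas, where str.replace is literal by default.
dataframe['older'] = dataframe['Data'].str.replace(pattern, 'OLDER', regex=True)
print(dataframe['older'].tolist())
```

The standalone 1912 and 77731929 are left alone because neither is preceded by an asterisk.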


score 0 · Accepted Answer

Use str.extractall together with np.where and str.replace:

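import numpy as np

# Collect every number that follows **ALI**, one column per match.
nums = dataframe['Data'].str.extractall(r'(?<=\*\*ALI\*\*)(\d+)').astype(int).unstack()

# regex=True is required on pandas 2.x, where str.replace is literal by default.
dataframe['older'] = np.where(nums.le(1929).any(axis=1), 
                              dataframe['Data'].str.replace(r'(?<=\*\*ALI\*\*)(\d+)', 'OLDER', regex=True), 
                              dataframe['Data'])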

Output

                                            Data  ID                                           older
0  The **ALI**1929 for 90 days but not 77731929    1  The **ALI**OLDER for 90 days but not 77731929 
1               For all **ALI**1952  28A 177945    2                For all **ALI**1952  28A 177945 
2       But the **ALI**1914 and **ALI**1903 1912   3      But the **ALI**OLDER and **ALI**OLDER 1912
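
One thing worth checking with this approach is that np.where chooses per row, not per match. The sketch below uses a hypothetical row (not in the question's data) that mixes an old and a newer **ALI** number. The names s, row_has_old and result are illustrative only:

```python
import numpy as np
import pandas as pd

# Hypothetical row mixing a year <= 1929 with a later one.
s = pd.Series(['**ALI**1903 and **ALI**1952'])

nums = s.str.extractall(r'(?<=\*\*ALI\*\*)(\d+)').astype(int).unstack()
row_has_old = nums.le(1929).any(axis=1)

# The whole row is swapped for its fully replaced version when any match is old.
result = np.where(row_has_old,
                  s.str.replace(r'(?<=\*\*ALI\*\*)(\d+)', 'OLDER', regex=True),
                  s)
print(result[0])
```

Here both numbers come back as OLDER, including 1952, so this approach fits data where each row holds only old or only newer **ALI** numbers.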

score 0 · Accepted Answer

Define a custom callable repl and pass it to str.replace:

repl = lambda m: m.group(1) if int(m.group(1)) > 1929 else 'OLDER'
df.Data.str.replace(r'(?<=\*\*ALI\*\*)(\d+)', repl, regex=True)

Out[662]:
0    The **ALI**OLDER for 90 days but not 77731929
1                  For all **ALI**1952  28A 177945
2        But the **ALI**OLDER and **ALI**OLDER 1912
Name: Data, dtype: object

score 0 · Accepted Answer

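As I see it, the regex should match **ALI**nnnn (nnnn being 4 digits), and:

The initial ** - should be removed (always).
ALI** - should remain unchanged.
nnnn - should be replaced with OLDER when appropriate.

In this case there is no need for a complex regex. The whole logic can live in the "replace" function.

Define it as follows: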

def repl(mtch):
    g1, g2 = mtch.group(1), mtch.group(2)
    return g1 + (g2 if int(g2) > 1929 else 'OLDER')

Then use str.replace with this function:

df.Data = df.Data.str.replace(r'\*\*(ALI\*\*)(\d{4})(?!\d)', repl, regex=True)

Note that I also changed the regex, defining 2 capturing groups.
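
A self-contained sketch of this approach on the question's data, assuming a pandas version that accepts a callable replacement with regex=True. The result is kept in a separate variable out (an illustrative name) rather than overwriting the column:

```python
import pandas as pd

df = pd.DataFrame({
    'Data': ['The **ALI**1929 for 90 days but not 77731929 ',
             'For all **ALI**1952  28A 177945 ',
             'But the **ALI**1914 and **ALI**1903 1912'],
    'ID': [1, 2, 3],
})

def repl(mtch):
    # group(1) is 'ALI**', group(2) is the four-digit number after it.
    g1, g2 = mtch.group(1), mtch.group(2)
    return g1 + (g2 if int(g2) > 1929 else 'OLDER')

# The leading '**' sits outside both groups, so it is dropped from every match.
out = df['Data'].str.replace(r'\*\*(ALI\*\*)(\d{4})(?!\d)', repl, regex=True)
print(out.tolist())
```

Because the comparison is done with int() inside the function, changing the cutoff year only means editing one number, not rewriting the pattern.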

score 0 · Accepted Answer

dataframe.Data.str.replace(r"(?<=\*ALI[*]{2})1[0-9](?:(?:[0-4][0-9])|5[0-1])\b","OLDER", regex=True)
Out[364]: 
0    The **ALI**OLDER for 90 days but not 77731929 
1                  For all **ALI**1952  28A 177945 
2        But the **ALI**OLDER and **ALI**OLDER 1912
Name: Data, dtype: object

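(?<=\*ALI[*]{2}) - preceded by *ALI**
1[0-9] - i.e. 10-19
(?: - start of the outer non-capturing group
- (?:[0-4][0-9]) - i.e. 00-49, not captured
- |5[01] - i.e. 50-51
) - end of the non-capturing group
\b - word boundary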
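
A quick way to check which years a pattern like this accepts is to run it over every value from 1000 to 1999. The sketch below uses only the standard re module and the **ALI** prefix from the question. The names pattern and matched are illustrative:

```python
import re

pattern = re.compile(r"(?<=\*ALI[*]{2})1[0-9](?:(?:[0-4][0-9])|5[0-1])\b")

# Try every four-digit year from 1000 to 1999 behind the **ALI** prefix.
matched = [y for y in range(1000, 2000) if pattern.search(f"**ALI**{y}")]

print(1929 in matched, 1951 in matched, 1952 in matched, 1060 in matched)
# → True True False False
```

The pattern accepts any year whose last two digits are 00 to 51, so 1930 to 1951 are also replaced while values such as 1060 are not. If the cutoff must be exactly 1929, the alternation from the first answer or a replacement function with an int() comparison is the safer choice.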
