python - How to extract the day of the month using regex in Pandas?

Question

I have strings like these in a DataFrame:

140 "14 Feb 1995 Primary Care Doctor:
"
141 "30 May 2016 SOS-10 Total Score:
"
142 "22 January 1996 @ 11 AMCommunication with referring physician?: Done
"

I want to extract the days and the months separately, so I made a list of month names:

list=['Jan','Feb','Mar','Apr','May','Jun','Jul','Aug','Sep','Oct','Nov','Dec']        
for i in range(500):
     
    for month in list:
       a= 'r(\d\d) '+month+'[a-z]{,8}'
       b=df[0].str.findall(a)[i]
       df['day'][i]=b

When I look at df['day'] I only get [], but I want to get [14] [30] [22].

score 1 · Accepted Answer

Try using this regex:

...
    a = r"(\d{1,2}) \w+ \d{4}"
    b = df[0].str.findall(a)[i]
    df['day'][i] = b
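
For context on why the loop in the question ends up with [], here is a minimal sketch using only the standard re module. It assumes the three sample strings from the question; the names samples, broken, fixed and any_month are illustrative, and the alternation at the end is one possible fix rather than something taken from the answers.

```python
import re

samples = [
    "14 Feb 1995 Primary Care Doctor:",
    "30 May 2016 SOS-10 Total Score:",
    "22 January 1996 @ 11 AMCommunication with referring physician?: Done",
]
months = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

# Issue 1: in the question the r sits inside the quotes, so the regex
# requires a literal letter "r" right before the two digits.
broken = r'r(\d\d) ' + 'Feb' + '[a-z]{,8}'
fixed = r'(\d\d) ' + 'Feb' + '[a-z]{,8}'
broken_hits = re.findall(broken, samples[0])
fixed_hits = re.findall(fixed, samples[0])

# Issue 2: the inner loop overwrites the result for every month, so only
# the attempt for 'Dec' (the last list entry) survives.
last = None
for month in months:
    last = re.findall(r'(\d\d) ' + month + '[a-z]{,8}', samples[0])

# One possible fix (an assumption, not from the answers): try all month
# names at once with an alternation.
any_month = r'(\d{1,2}) (?:' + '|'.join(months) + r')[a-z]*'
days = [re.findall(any_month, s) for s in samples]

print(broken_hits, fixed_hits, last, days)
```

Both the misplaced r and the overwriting loop leave an empty list behind, while the alternation returns one day per sample string.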

score 0 · Accepted Answer

Try this pattern:

import re
pattern = re.compile(r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2,}) (?P<year>\d{2,4})")

Named capture groups such as (?P<day>\d{1,2}) mean you can access the three captured fields by name and pull out only the one you need.

Then you can do something like this:

>>> for match in re.finditer(pattern, text):  # text is any input string
...     print(match.group("day"))

I would also use apply instead of a for loop to access your DataFrame:

>>> import pandas as pd
>>> data = {"string": ["14 Feb 1995 Primary Care Doctor:",
...                    "30 May 2016 SOS-10 Total Score:",
...                    "22 January 1996 @ 11 AMCommunication with referring physician?: Done"]}

>>> df = pd.DataFrame.from_dict(data)

>>> df.string.apply(lambda x: re.search(pattern, x).group("day"))

0    14
1    30
2    22
Name: string, dtype: object

Then, if you like, you can conveniently save these values in separate columns:

>>> df["day"] = df.string.apply(lambda x: re.search(pattern, x).group("day"))

>>> df["month"] = df.string.apply(lambda x: re.search(pattern, x).group("month"))

>>> df
    string                                              day month
0   14 Feb 1995 Primary Care Doctor:                    14  Feb
1   30 May 2016 SOS-10 Total Score:                     30  May
2   22 January 1996 @ 11 AMCommunication with refe...   22  January
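
As a possible alternative to calling apply once per field, the sketch below uses pandas' Series.str.extract, which turns named groups into columns in a single pass. It assumes the same sample strings; the fourth row is a hypothetical addition to show that rows without a date come back as NaN, whereas re.search(...).group(...) would raise an AttributeError there because re.search returns None.

```python
import pandas as pd

df = pd.DataFrame({"string": [
    "14 Feb 1995 Primary Care Doctor:",
    "30 May 2016 SOS-10 Total Score:",
    "22 January 1996 @ 11 AMCommunication with referring physician?: Done",
    "no date in this row",  # hypothetical row, not from the question
]})

# Same pattern as above, passed as a plain string; each named group
# becomes a column named after it.
pattern = r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2,}) (?P<year>\d{2,4})"
parts = df["string"].str.extract(pattern)

# Attach the extracted columns to the original frame.
df = df.join(parts)
print(df[["day", "month", "year"]])
```

The extracted values are strings, so convert day and year with pd.to_numeric if you need numbers.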

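ETA: If you want to adjust it so that it extracts only the abbreviated month, whether or not the month is spelled out in full, try replacing the regex pattern above with: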

pattern = re.compile(r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2})[a-z]*? (?P<year>\d{2,4})")

This captures only the first 3 characters of the month name, but it still finds the date when the month name is longer.
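
A quick check of that behaviour, as a sketch with the standard re module on the sample strings from the question (the names abbrev and samples are illustrative):

```python
import re

# The lazy [a-z]*? sits outside the month group, so it consumes the rest
# of a long month name ("uary" in "January") without capturing it.
abbrev = re.compile(
    r"(?P<day>\d{1,2}) (?P<month>[A-Z][a-z]{2})[a-z]*? (?P<year>\d{2,4})"
)

samples = [
    "14 Feb 1995 Primary Care Doctor:",
    "30 May 2016 SOS-10 Total Score:",
    "22 January 1996 @ 11 AMCommunication with referring physician?: Done",
]

found = [abbrev.search(s).groupdict() for s in samples]
months = [f["month"] for f in found]
print(months)
```

For the third string the month comes back as Jan while the day and year are still 22 and 1996.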
