python - Extract a regex substring from a Pandas column when it meets given criteria

Question

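I need a regular expression that extracts every substring meeting the following criteria:

The first 4 characters are digits, and the substring ends with a digit or a letter
It is 15 or 18 characters long

If 2 substrings meet the criteria, only the first one should be returned.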

import pandas as pd

df1 = pd.DataFrame(data={"Messy_IDS":["Looking for ID : 7010M000002N8c5T7A","5634M000002N8c5T7A,7010M000002N8c5T7A","https://website.com/12340000000f5F5"], "Desired_Output":["7010M000002N8c5T7A","5634M000002N8c5T7A","12340000000f5F5"]})

df1

        Messy_IDS                                Desired_Output
   0    Looking for ID : 7010M000002N8c5T7A      7010M000002N8c5T7A
   1    5634M000002N8c5T7A,7010M000002N8c5T7A    5634M000002N8c5T7A
   2    https://website.com/12340000000f5F5      12340000000f5F5

score 0 · Accepted Answer

You can do this with the following code:

df1["Messy_IDS"].str.extract(r"(\d{4}\w+)")
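Note that this pattern puts no limit on the length of the match, so it only happens to work on the sample data. A minimal sketch of that behavior follows; the two input strings are hypothetical and do not come from the question.

```python
import pandas as pd

# Hypothetical inputs (not from the question): a short number and an overlong token.
s = pd.Series(["order 12345", "7010M000002N8c5T7A_extra"])

# expand=False returns a Series rather than a one-column DataFrame.
loose = s.str.extract(r"(\d{4}\w+)", expand=False)

# \w+ takes every following word character, including the underscore.
print(loose.tolist())  # ['12345', '7010M000002N8c5T7A_extra']
```

Neither result is 15 or 18 characters long, so a length constraint is needed if the input can contain such strings.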

Answered 2020-10-15T05:53:56.733

score 0 · Accepted Answer

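Use Series.str.extract with a regex that matches 4 leading digits followed by 11 to 14 digits or letters. The range {11,14} covers the 15- and 18-character IDs, though it also accepts lengths of 16 and 17: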

df['new'] = df['Messy_IDS'].str.extract('([0-9]{4}[0-9A-Za-z]{11,14})')

Or, more compactly (note that \w also matches the underscore):

df['new'] = df['Messy_IDS'].str.extract(r'(\d{4}\w{11,14})')

print(df)
                               Messy_IDS      Desired_Output  \
0    Looking for ID : 7010M000002N8c5T7A  7010M000002N8c5T7A   
1  5634M000002N8c5T7A,7010M000002N8c5T7A  5634M000002N8c5T7A   
2    https://website.com/12340000000f5F5     12340000000f5F5   

                  new  
0  7010M000002N8c5T7A  
1  5634M000002N8c5T7A  
2     12340000000f5F5
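If the match must be exactly 15 or 18 characters, as the question states, one possible variant is to spell out the two allowed lengths and use lookarounds so that a longer alphanumeric run is not cut short. This is only a sketch: the `strict` column name and the fourth, 16-character row are assumptions added for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({"Messy_IDS": [
    "Looking for ID : 7010M000002N8c5T7A",
    "5634M000002N8c5T7A,7010M000002N8c5T7A",
    "https://website.com/12340000000f5F5",
    "1234ABCDEFGHIJKL",  # hypothetical 16-character token, should not match
]})

# 4 digits, then exactly 14 or 11 letters/digits,
# with no letter or digit directly before or after the match.
strict = r"(?<![0-9A-Za-z])(\d{4}(?:[0-9A-Za-z]{14}|[0-9A-Za-z]{11}))(?![0-9A-Za-z])"

# str.extract returns only the first match per row; expand=False gives a Series.
df1["strict"] = df1["Messy_IDS"].str.extract(strict, expand=False)
print(df1["strict"].tolist())
```

The first three rows yield the desired IDs, while the 16-character row yields NaN because it matches neither allowed length.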
