
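Regular expression to extract all substrings that meet the following conditions:

  • The first 4 characters are digits, and the substring ends with a digit or a letter

  • It is 15 or 18 characters long

  • If 2 substrings match, return only the first one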

    import pandas as pd

    df1 = pd.DataFrame(data={
        "Messy_IDS": ["Looking for ID : 7010M000002N8c5T7A",
                      "5634M000002N8c5T7A,7010M000002N8c5T7A",
                      "https://website.com/12340000000f5F5"],
        "Desired_Output": ["7010M000002N8c5T7A", "5634M000002N8c5T7A", "12340000000f5F5"],
    })
    

df1

        Messy_IDS                                Desired_Output
   0    Looking for ID : 7010M000002N8c5T7A      7010M000002N8c5T7A
   1    5634M000002N8c5T7A,7010M000002N8c5T7A    5634M000002N8c5T7A
   2    https://website.com/12340000000f5F5      12340000000f5F5

2 Answers


You can do this with the following code:

df1["Messy_IDS"].str.extract(r"(\d{4}\w+)")
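
As a quick check, here is a minimal sketch of what this pattern returns on the sample strings from the question. It uses Python's re module rather than pandas, on the assumption that taking the first match per string is what str.extract does as well. The names samples, first_matches, and too_long are my own, and too_long is a made-up input that is not in the question. Because \w+ is unbounded, the pattern does not enforce the 15 or 18 character length by itself.

```python
import re

# Pattern from the answer above, written as a raw string
pattern = re.compile(r"(\d{4}\w+)")

samples = [
    "Looking for ID : 7010M000002N8c5T7A",
    "5634M000002N8c5T7A,7010M000002N8c5T7A",
    "https://website.com/12340000000f5F5",
]

# re.search returns the first match in each string
first_matches = [pattern.search(s).group(1) for s in samples]
print(first_matches)

# Hypothetical input, not from the question: 22 word characters in a row
too_long = "7010M000002N8c5T7AXXXX"
print(pattern.search(too_long).group(1))  # the whole run is captured
```

On the three sample strings this gives the desired IDs, but only because each ID happens to be followed by a comma or the end of the string.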
answered 2020-10-15T05:53:56.733

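Use Series.str.extract with a regular expression that matches the first 4 digits, followed by 11 to 14 digits or letters (15 to 18 characters in total):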

df['new'] = df['Messy_IDS'].str.extract('([0-9]{4}[0-9A-Za-z]{11,14})')

Or:

df['new'] = df['Messy_IDS'].str.extract(r'(\d{4}\w{11,14})')

print(df)
                               Messy_IDS      Desired_Output  \
0    Looking for ID : 7010M000002N8c5T7A  7010M000002N8c5T7A   
1  5634M000002N8c5T7A,7010M000002N8c5T7A  5634M000002N8c5T7A   
2    https://website.com/12340000000f5F5     12340000000f5F5   

                  new  
0  7010M000002N8c5T7A  
1  5634M000002N8c5T7A  
2     12340000000f5F5  
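
The quantifier {11,14} accepts any length from 15 to 18 characters, so a 16 or 17 character run would also match. If exactly 15 or 18 characters is required, one possible variation (my own sketch, not part of the answer above) uses an alternation plus lookarounds so that an ID embedded in a longer alphanumeric run is rejected. The names strict, loose, first_id, and the 16-character test string are hypothetical.

```python
import re

# Pattern from the answer above: 4 digits plus 11 to 14 alphanumerics
loose = re.compile(r"([0-9]{4}[0-9A-Za-z]{11,14})")

# Hypothetical stricter variant: 4 digits plus exactly 14 or exactly 11
# alphanumerics, not preceded or followed by another alphanumeric
strict = re.compile(
    r"(?<![0-9A-Za-z])"
    r"(\d{4}(?:[0-9A-Za-z]{14}|[0-9A-Za-z]{11}))"
    r"(?![0-9A-Za-z])"
)


def first_id(text):
    """Return the first 15 or 18 character ID in text, or None."""
    match = strict.search(text)
    return match.group(1) if match else None


samples = [
    "Looking for ID : 7010M000002N8c5T7A",
    "5634M000002N8c5T7A,7010M000002N8c5T7A",
    "https://website.com/12340000000f5F5",
]
strict_results = [first_id(s) for s in samples]
print(strict_results)

# A made-up 16-character run: accepted by loose, rejected by strict
sixteen = "1234567890abcdef"
print(loose.search(sixteen).group(1), first_id(sixteen))
```

With pandas, the same pattern string (strict.pattern) should be usable as the argument to Series.str.extract, since it contains exactly one capture group.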
answered 2020-10-15T05:50:20.973