-1

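I have a dataset with a string column that needs cleaning. The information I need sits either before or after the hyphen, depending on the row: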

AFZ - CTO
AFZ - Data Scientist
AFZ - Machine Learning Senior Manager
Agile Delivery Senior Manager - Data & Analytics
AGILE Master - Business Relationship Management

What I want to end up with is:

CTO
Data Scientist
Machine Learning Senior Manager
Agile Delivery Senior Manager
Business Relationship Management

I tried a few regular expressions, but depending on where the hyphen is they remove the information I need. Any pointers on how to achieve this?

Thanks!

4

3 Answers

0

To extract the left part:

>>> df['your_column'].str.extract('(.+)-')
                                0
0                            AFZ 
1                            AFZ 
2                            AFZ 
3  Agile Delivery Senior Manager 
4                   AGILE Master 

To extract the right part:

>>> df['your_column'].str.extract('-(.+)')
                                   0
0                                CTO
1                     Data Scientist
2    Machine Learning Senior Manager
3                   Data & Analytics
4   Business Relationship Management
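
Both patterns above keep the spaces around the hyphen in the captured text. As a minimal sketch (the column names `left` and `right` are placeholders I chose, not part of the question), a single `str.extract` call with two named groups can return both sides already trimmed:

```python
import pandas as pd

df = pd.DataFrame({"your_column": [
    "AFZ - CTO",
    "AFZ - Data Scientist",
    "AFZ - Machine Learning Senior Manager",
    "Agile Delivery Senior Manager - Data & Analytics",
    "AGILE Master - Business Relationship Management",
]})

# Two named groups become two columns. The lazy quantifiers plus \s*
# keep the whitespace around the first hyphen out of both captures.
parts = df["your_column"].str.extract(
    r"^\s*(?P<left>.+?)\s*-\s*(?P<right>.+?)\s*$"
)
print(parts)
```

This only separates the two sides; deciding which side to keep for each row still needs a rule of its own.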
answered 2021-08-11T13:07:40.127
0

Following your example, I defined this DataFrame:

import pandas as pd

df = pd.DataFrame(['AFZ - CTO',
                   'AFZ - Data Scientist',
                   'AFZ - Machine Learning Senior Manager',
                   'Agile Delivery Senior Manager - Data & Analytics',
                   'AGILE Master - Business Relationship Management'], columns=['0'])

Using apply from pandas together with re:

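import re

# Note: the 'AFZ - ' and 'AGILE Master - ' prefixes are hard-coded for this sample data.
patterns = re.compile('AFZ - |AGILE Master - | -.*|\n', flags=re.IGNORECASE)

def split_it(text):
    # Replace every matched pattern with a space, then trim the leftover whitespace.
    return patterns.sub(' ', text).strip()

df['0'].apply(split_it)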

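First, define a variable that holds all the patterns to remove, compiled into one regular expression. Then create a function (split_it in my case) that uses that variable to remove the patterns from the input text. The only remaining step is to call apply from pandas, passing it the function you created.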

These are the results:

0                                  CTO
1                       Data Scientist
2      Machine Learning Senior Manager
3       Agile Delivery Senior Manager 
4     Business Relationship Management
Name: 0, dtype: object 
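
The regular expression above spells out each prefix by hand. A minimal sketch of the same idea without a regex, assuming the rule is "keep the right side when the left side is a known prefix, otherwise keep the left side" (the set `PREFIXES_TO_DROP` and the function `pick_side` are hypothetical names, and the rule itself is inferred from the five sample rows):

```python
import pandas as pd

# Hypothetical rule inferred from the sample rows: when the left side is one
# of these values (compared case-insensitively), the right side is kept.
PREFIXES_TO_DROP = {"afz", "agile master"}

def pick_side(text):
    # partition splits on the first " - " only.
    left, sep, right = text.partition(" - ")
    if not sep:
        # No separator found: return the text unchanged apart from trimming.
        return text.strip()
    left, right = left.strip(), right.strip()
    return right if left.lower() in PREFIXES_TO_DROP else left

df = pd.DataFrame({"0": [
    "AFZ - CTO",
    "AFZ - Data Scientist",
    "AFZ - Machine Learning Senior Manager",
    "Agile Delivery Senior Manager - Data & Analytics",
    "AGILE Master - Business Relationship Management",
]})

cleaned = df["0"].apply(pick_side)
print(cleaned.tolist())
```

Keeping the prefixes in a set means new ones can be added without touching a pattern, though the list still has to be maintained by hand.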
answered 2021-08-11T13:27:54.913
0
a = ["AFZ - CTO",
     "AFZ - Data Scientist",
     "AFZ - Machine Learning Senior Manager",
     "Agile Delivery Senior Manager - Data & Analytics",
     "AGILE Master - Business Relationship Management"]

One-liner time :D

import re
list(i.split(" - ")[0] if re.search(r".{5,}- .*", i) else re.search(r"(?<=- ).*", i).group() for i in a)

Result:

[
 'CTO',
 'Data Scientist', 
 'Machine Learning Senior Manager',
 'Agile Delivery Senior Manager',
 'AGILE Master'
]
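
The one-liner keeps the left side unless it is very short, which is why the last row yields 'AGILE Master' rather than the 'Business Relationship Management' asked for in the question. A rough, more readable sketch of that length rule (the function name `left_unless_short` and the threshold `MIN_LEFT_LEN` are my own, chosen to approximate the `.{5,}` check, and edge cases may differ from the regex):

```python
# Assumed threshold: a left side shorter than this is treated as a prefix.
MIN_LEFT_LEN = 4

def left_unless_short(text, min_left_len=MIN_LEFT_LEN):
    # Split on the first " - " only.
    left, sep, right = text.partition(" - ")
    if not sep:
        return text
    # Keep the left side unless it is too short to be a job title.
    return left if len(left) >= min_left_len else right

a = ["AFZ - CTO",
     "AFZ - Data Scientist",
     "AFZ - Machine Learning Senior Manager",
     "Agile Delivery Senior Manager - Data & Analytics",
     "AGILE Master - Business Relationship Management"]

result = [left_unless_short(s) for s in a]
print(result)
```

A pure length rule cannot tell 'AGILE Master' apart from a real job title, so that row would need an explicit exception.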
answered 2021-08-11T13:09:44.363