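I have 2 data frames: one contains a column of strings that I need to categorise (df = data), and the other contains the possible categories and their search terms (df = categories). I want to add a column to the "data" data frame that returns a category based on the search terms. For example: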

data:

**RepairName**
A/C is not cold
flat tyre is c
the tyre needs a repair on left side
the aircon is not cold

categories:

**Category**      **SearchTerm**
A/C               aircon
A/C               A/C
Tyre              repair
Tyre              flat

Desired result:

**RepairName**                        **Category**
A/C is not cold                         A/C
flat tyre is c                          Tyre
the tyre needs a repair on left side    Tyre
the aircon is not cold                  A/C

I have tried the following lambda function with apply. I am not sure whether my column references are in the right place:

data['Category'] = data['RepairName'].apply(lambda x: categories['Category'] if categories['SearchTerm'] in x else "")
data['Category'] = [categories['Category'] if categories['SearchTerm'] in data['RepairName'] else 0]

But I keep getting this error message:

TypeError: 'in <string>' requires string as left operand, not Series
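
The error comes from putting a whole Series on the left of `in` while the right side is a plain string. A minimal sketch that reproduces it with the sample data from the question, then tests one search term at a time instead (the variable names `text`, `raised` and `matches` are only for illustration):

```python
import pandas as pd

categories = pd.DataFrame({
    "Category": ["A/C", "A/C", "Tyre", "Tyre"],
    "SearchTerm": ["aircon", "A/C", "repair", "flat"],
})

text = "the aircon is not cold"

# This is what the lambda evaluates: a Series on the left of `in`, a str on the right.
# Python's substring test only accepts a str on the left, so it raises TypeError.
try:
    categories["SearchTerm"] in text
    raised = False
except TypeError:
    raised = True

# Looping over the rows works, because each `term` is a plain string.
matches = [
    cat
    for cat, term in zip(categories["Category"], categories["SearchTerm"])
    if term in text
]
print(raised, matches)
```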

This gives me True/False for whether any SearchTerm is present, but I cannot get it to return the category associated with the search term:

data['containName']=data['RepairName'].str.contains('|'.join(categories['SearchTerm']),case=False)

Both of the following work sometimes, but not consistently (perhaps because some of my search terms are more than one word?):

data['Category'] = [
    next((c for c, k in categories.values if k in s), None) for s in data['RepairName']] 

d = dict(zip(categories['SearchTerm'], categories['Category']))
data['CategoryCheck'] = [next((d[y] for y in x.split() if y in d), None) for x in data['RepairName']]
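
One possible cause, sketched under assumptions: the `split()`-based lookup compares whole tokens, so a multi-word search term can never match, and both snippets are case-sensitive. The search term "flat tyre" and the sample text below are hypothetical and do not come from the question's data.

```python
# Hypothetical search terms, one of them made of two words.
search_terms = {"flat tyre": "Tyre", "aircon": "A/C"}
text = "Flat tyre on left side"

# Token lookup: no single token equals "flat tyre", so nothing is found.
by_token = next((search_terms[t] for t in text.split() if t in search_terms), None)

# Substring lookup on lowercased text finds the multi-word term.
by_substring = next((c for k, c in search_terms.items() if k in text.lower()), None)

print(by_token, by_substring)
```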

2 Answers

We can use str.findall and map:

s = df.RepairName.str.findall('|'.join(cat.SearchTerm.tolist())).str[0].\
    map(cat.set_index('SearchTerm').Category)
# s
# 0     A/C
# 1    Tyre
# 2    Tyre
# 3     A/C
# Name: RepairName, dtype: object
df['Category'] = s
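
If the search terms may contain regex metacharacters (brackets, dots) or differ in case from the text, a variation along these lines may be more robust. It is only a sketch: it swaps `str.findall(...).str[0]` for `str.extract`, escapes each term with `re.escape`, and maps through a lowercased dictionary. The names `pattern`, `lookup` and `found` are illustrative.

```python
import re
import pandas as pd

df = pd.DataFrame({"RepairName": [
    "A/C is not cold",
    "flat tyre is c",
    "the tyre needs a repair on left side",
    "the aircon is not cold",
]})
cat = pd.DataFrame({
    "Category": ["A/C", "A/C", "Tyre", "Tyre"],
    "SearchTerm": ["aircon", "A/C", "repair", "flat"],
})

# Escape each term so characters such as "(" or "." are matched literally.
pattern = "(" + "|".join(re.escape(t) for t in cat["SearchTerm"]) + ")"

# Lowercased lookup table, so a match in any case maps back to its category.
lookup = dict(zip(cat["SearchTerm"].str.lower(), cat["Category"]))

# Take the first search term found in each string, ignoring case.
found = df["RepairName"].str.extract(pattern, flags=re.IGNORECASE, expand=False)
df["Category"] = found.str.lower().map(lookup)
print(df)
```

Rows that contain none of the search terms end up with a missing value in `Category`.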
answered 2020-06-22T01:44:42.413

This works once I make sure all my columns are lowercase (I also removed hyphens and brackets for good measure):

print("All lowercase")
data = data.apply(lambda x: x.astype(str).str.lower())
categories = categories.apply(lambda x: x.astype(str).str.lower())

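print("Remove double spacing")
# Raw string avoids an invalid-escape warning for \s
data = data.replace(r'\s+', ' ', regex=True)

print('Remove hyphens')
data["RepairName"] = data["RepairName"].str.replace('-', '', regex=False)

print('Remove brackets')
# regex=False treats "(" and ")" as literal characters, not regex syntax
data["RepairName"] = data["RepairName"].str.replace('(', '', regex=False)
data["RepairName"] = data["RepairName"].str.replace(')', '', regex=False)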

data['Category'] = [
    next((c for c, k in categories.values if k in s), None) for s in data['RepairName']]
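
The same idea can be sketched as one self-contained script. This variant is an assumption-laden rewrite, not the code above: `normalise` is a hypothetical helper, the sample rows are made up, and unlike the original it leaves `RepairName` and the category labels in their original case by normalising only the copies used for matching.

```python
import pandas as pd

def normalise(s: pd.Series) -> pd.Series:
    """Hypothetical helper: lowercase, drop hyphens/brackets, collapse whitespace."""
    return (
        s.astype(str)
        .str.lower()
        .str.replace(r"[-()]", "", regex=True)
        .str.replace(r"\s+", " ", regex=True)
        .str.strip()
    )

# Made-up sample rows with mixed case, double spaces, hyphens and brackets.
data = pd.DataFrame({"RepairName": ["A/C  is NOT cold", "Flat-tyre (left)", "oil change"]})
categories = pd.DataFrame({
    "Category": ["A/C", "A/C", "Tyre", "Tyre"],
    "SearchTerm": ["aircon", "A/C", "repair", "flat"],
})

# Pair each original category label with its normalised search term.
terms = list(zip(categories["Category"], normalise(categories["SearchTerm"])))

# First search term found as a substring wins; None when nothing matches.
result = [
    next((c for c, k in terms if k in s), None)
    for s in normalise(data["RepairName"])
]
data["Category"] = result
print(data)
```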
answered 2020-06-22T09:02:49.370