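I have a pandas DataFrame with 3 columns: path, tags and column1, where tags is a list of strings and column1 holds boolean values. Now I want to group the rows using a regex that depends on part of each row's path value. That is, each row has a unique path value, and I want to match it with the other rows that contain a similar path. Then, if all rows in that group meet some condition (in this example, there cannot be more than 5 such items), I want to modify the columns of all rows in the group.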
import re
import pandas as pd

def has_to_change(df):
    # True when the group holds more than 5 rows
    return len(df) > 5

def add_tag_to_tags(row):
    # append 'tag' to the row's tag list unless it is already there
    if 'tag' not in row['tags']:
        row['tags'].append('tag')
    return row

if __name__ == '__main__':
    pattern = r'some regex'
    regex = re.compile(pattern)
    df = pd.read_csv(df_path)  # df_path is the path to the CSV file (not defined in this snippet)
    for index, row in df.iterrows():
        file_name = row['path']
        matches = regex.search(file_name)
        org_path = matches.group('some regex group')  # get a match from this row's path
        # find all rows that contain this file name but with some difference, say, another extension: xml or txt
        matching_rows = df[df['path'].str.contains(re.escape(org_path) + r'(?:\.xml|\.txt)')]
        if has_to_change(matching_rows):  # if the condition is met, change the values and save them back to the dataframe
            # I keep a loop here because I want to overwrite the row with the same index (it was originally a bit more complex)
            for inner_index, augmented_row in matching_rows.iterrows():
                augmented_row['column1'] = True
                augmented_row = add_tag_to_tags(augmented_row)
                df.loc[inner_index] = augmented_row
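Can code like this be vectorized somehow? It is super slow, but I cannot find any way to:
- create groups by regex
- check the values of each such group as a whole
- and only then update those groups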
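One possible direction for those three steps, sketched under assumptions rather than offered as a definitive answer: derive a grouping key per row with Series.str.extract, map each row to the size of its group, and update through a boolean mask. The function name tag_large_groups, the STEM pattern (everything before the final extension) and the min_size parameter are hypothetical stand-ins for the real regex and condition. Unlike the loop above, this groups every row sharing a stem, whatever its extension. The demo uses min_size=2 only so that the three-row group in the sample data qualifies; the code in the question uses 5.

```python
import pandas as pd


def tag_large_groups(df, key_pattern, min_size, tag="tag"):
    """Return a copy of df where rows in groups larger than min_size are flagged."""
    out = df.copy()
    # 1. Create groups: one regex-derived key per row (NaN where nothing matches).
    #    key_pattern is assumed to contain exactly one capture group.
    key = out["path"].str.extract(key_pattern, expand=False)
    # 2. Check each group as a whole: map every row to the size of its group.
    group_size = key.map(key.value_counts())
    mask = group_size > min_size  # rows without a key compare as False
    # 3. Update only the rows of qualifying groups.
    out.loc[mask, "column1"] = True
    out["tags"] = pd.Series(
        [tags + [tag] if hit and tag not in tags else tags
         for tags, hit in zip(out["tags"], mask)],
        index=out.index,
        dtype=object,
    )
    return out


sample = pd.DataFrame({
    "path": [
        "/mnt/000000386703_aug_13237_0.jpg",
        "/mnt/000000386703_aug_13237_0.xml",
        "/mnt/000000386703_aug_13237_0.txt",
        "/mnt/train_image_png_1221_aug_1245_5.jpg",
        "/mnt/000000306488_aug_9203_1.jpg",
        "/mnt/000000391768_aug_20250_1.jpg",
        "/mnt/1561887652.9493463_aug_1462_0.jpg",
    ],
    "tags": [["tag1"], ["tag1"], ["tag1", "tag1"], ["tag1"],
             ["tag1"], ["tag1"], ["tag1"]],
    "column1": [False, False, False, False, False, False, True],
})

# Hypothetical key: the path with its final extension removed.
STEM = r"^(.+)\.[^./]+$"
result = tag_large_groups(sample, STEM, min_size=2)
print(result)
```

The tag lists are rebuilt rather than appended to in place, so the input frame is left untouched. The tags column still goes through a Python-level loop, since list-valued cells have no vectorized append in pandas, but the grouping and the per-group check no longer rescan the whole frame for every row.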
Sample data:
path, tags, column1
/mnt/000000386703_aug_13237_0.jpg, ['tag1'], False
/mnt/000000386703_aug_13237_0.xml, ['tag1'], False
/mnt/000000386703_aug_13237_0.txt, ['tag1', 'tag1'], False
/mnt/train_image_png_1221_aug_1245_5.jpg, ['tag1'], False
/mnt/000000306488_aug_9203_1.jpg, ['tag1'], False
/mnt/000000391768_aug_20250_1.jpg, ['tag1'], False
/mnt/1561887652.9493463_aug_1462_0.jpg, ['tag1'], True
After the update:
path, tags, column1
/mnt/000000386703_aug_13237_0.jpg, ['tag1','tag'], True
/mnt/000000386703_aug_13237_0.xml, ['tag1','tag'], True
/mnt/000000386703_aug_13237_0.txt, ['tag1','tag1', 'tag'], True
/mnt/train_image_png_1221_aug_1245_5.jpg, ['tag1'], False
/mnt/000000306488_aug_9203_1.jpg, ['tag1'], False
/mnt/000000391768_aug_20250_1.jpg, ['tag1'], False
/mnt/1561887652.9493463_aug_1462_0.jpg, ['tag1'], True
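The first 3 rows had the tag value added to their tags column and their column1 value changed to True, because they share the path value captured by the regex in the "find all rows that have similar path" step (so the grouping depends on each row's path).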
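To make the "shared path" idea concrete, here is a small sketch that checks which of the sample paths share a stem. It assumes the captured value is simply the path with its final extension removed; that pattern is a guess standing in for the real regex, and the names stem and shared are only illustrative.

```python
import pandas as pd

paths = pd.Series([
    "/mnt/000000386703_aug_13237_0.jpg",
    "/mnt/000000386703_aug_13237_0.xml",
    "/mnt/000000386703_aug_13237_0.txt",
    "/mnt/train_image_png_1221_aug_1245_5.jpg",
    "/mnt/000000306488_aug_9203_1.jpg",
    "/mnt/000000391768_aug_20250_1.jpg",
    "/mnt/1561887652.9493463_aug_1462_0.jpg",
], name="path")

# Assumed key: drop the final ".ext" so .jpg/.xml/.txt variants collapse together.
stem = paths.str.replace(r"\.[^./]+$", "", regex=True)

# True for every row whose stem also appears on at least one other row.
shared = stem.duplicated(keep=False)
print(pd.DataFrame({"path": paths, "stem": stem, "shared": shared}))
```

Only the first three rows are flagged as sharing a stem, which matches the rows that change in the updated table.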