
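I have a pandas dataframe with 3 columns: tags, column1 and path, where tags is a list of strings and column1 holds boolean values. Now I want to group the rows using a regex that depends on part of each row's path value. That is, every row has a unique path, and I want to group it with the other rows whose path is similar. Then, if the group as a whole meets some condition (in this example, there can't be more than 5 such items), I want to modify the columns of all rows in that group.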

import re

import pandas as pd

def has_to_change(df):
  # True when the group contains more than 5 rows
  return len(df) > 5

def add_tag_to_tags(row):
  if 'tag' not in row['tags']:
      row['tags'].append('tag')
  return row

if __name__ == '__main__':

  pattern =  r'some regex'
  regex =  re.compile(pattern)

  df = pd.read_csv(df_path)

  for index, row in df.iterrows():
      file_name = row['path']
      matches = regex.search(file_name)
      org_path = matches.group('some regex group') #get a match from this row's path

      # find all rows that contain this file name but with some difference, say, another extension (xml or txt)
      matching_rows = df[df['path'].str.contains(re.escape(org_path) + r'(?:\.xml|\.txt)')]

      if has_to_change(matching_rows): # if the condition is met, change the values and save them back to the dataframe
          # I keep a loop here because I want to overwrite the row with the same index (it was originally a bit more complex)
          for inner_index, augmented_row in matching_rows.iterrows():
              augmented_row['column1'] = True
              augmented_row = add_tag_to_tags(augmented_row)
              df.loc[inner_index] = augmented_row  # inner_index is an index label, so use .loc

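Can code like this be vectorized somehow? It is super slow, but I can't find any way to:

  1. create groups by regex
  2. check a condition on each such group as a whole
  3. and only then update those groups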
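
The three steps above could be sketched roughly as follows. This is only a sketch under my own assumptions, not something I have benchmarked: the group key is taken to be the path without its extension (STEM_PATTERN is a hypothetical regex standing in for the real one), the condition is assumed to depend only on the group size, and tag_large_groups, min_size and new_tag are made-up names.

```python
import pandas as pd

# Hypothetical pattern: everything before the final extension is the group key.
STEM_PATTERN = r'^(?P<stem>.+)\.[^./]+$'


def tag_large_groups(df, min_size, new_tag='tag'):
    """Flag and tag every row whose path stem occurs at least `min_size` times."""
    out = df.copy()
    # 1. Build the group key once for all rows instead of re-scanning per row.
    stem = out['path'].str.extract(STEM_PATTERN)['stem']
    # 2. Compute the size of each group and broadcast it back to the rows.
    mask = out.groupby(stem)['path'].transform('size') >= min_size
    # 3. Update only the rows that belong to qualifying groups.
    out.loc[mask, 'column1'] = True
    # Build new lists instead of appending in place, so the input stays untouched.
    out['tags'] = [
        tags + [new_tag] if (hit and new_tag not in tags) else tags
        for tags, hit in zip(out['tags'], mask)
    ]
    return out
```

With min_size=6 this mirrors len(df) > 5 from has_to_change. On the sample data, min_size=3 would update only the three 000000386703_aug_13237_0 rows.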

Sample data

      path,                              tags,             column1
/mnt/000000386703_aug_13237_0.jpg,       ['tag1'],         False
/mnt/000000386703_aug_13237_0.xml,       ['tag1'],         False
/mnt/000000386703_aug_13237_0.txt,       ['tag1', 'tag1'], False
/mnt/train_image_png_1221_aug_1245_5.jpg,['tag1'],         False
/mnt/000000306488_aug_9203_1.jpg,        ['tag1'],         False
/mnt/000000391768_aug_20250_1.jpg,       ['tag1'],         False
/mnt/1561887652.9493463_aug_1462_0.jpg,  ['tag1'],         True

After update:

      path,                              tags,                   column1
/mnt/000000386703_aug_13237_0.jpg,       ['tag1','tag'],         True
/mnt/000000386703_aug_13237_0.xml,       ['tag1','tag'],         True
/mnt/000000386703_aug_13237_0.txt,       ['tag1','tag1', 'tag'], True
/mnt/train_image_png_1221_aug_1245_5.jpg,['tag1'],               False
/mnt/000000306488_aug_9203_1.jpg,        ['tag1'],               False
/mnt/000000391768_aug_20250_1.jpg,       ['tag1'],               False
/mnt/1561887652.9493463_aug_1462_0.jpg,  ['tag1'],               True

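The first 3 rows get the 'tag' value added to the tags column and their column1 value changed to True, because they share the value captured by the regex from path (find all rows that have similar path), so the grouping depends on each row's path.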
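
One caveat when reproducing this: pd.read_csv returns the tags column as plain strings such as "['tag1']", not as lists, so row['tags'].append(...) would fail on freshly loaded data. Below is a minimal sketch of loading the sample rows with real lists. The semicolon-separated layout is my own assumption (so the commas inside the lists do not clash with the separator), and SAMPLE is just an in-memory stand-in for the real file.

```python
import ast
import io

import pandas as pd

# In-memory stand-in for the CSV file; semicolons separate the fields.
SAMPLE = """path;tags;column1
/mnt/000000386703_aug_13237_0.jpg;['tag1'];False
/mnt/000000386703_aug_13237_0.xml;['tag1'];False
/mnt/000000386703_aug_13237_0.txt;['tag1', 'tag1'];False
/mnt/train_image_png_1221_aug_1245_5.jpg;['tag1'];False
/mnt/000000306488_aug_9203_1.jpg;['tag1'];False
/mnt/000000391768_aug_20250_1.jpg;['tag1'];False
/mnt/1561887652.9493463_aug_1462_0.jpg;['tag1'];True
"""

df = pd.read_csv(
    io.StringIO(SAMPLE),
    sep=';',
    # Parse the list literal in each cell into an actual Python list.
    converters={'tags': ast.literal_eval},
)
```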
