I have 12 files:
['01_2021.csv', '02_2021.csv', '03_2021.csv', '04_2021.csv', '05_2021.csv', '06_2021.csv', '07_2021.csv', '08_2021.csv', '09_2021.csv', '10_2021.csv', '11_2020.csv', '12_2020.csv']
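One thing worth noting about these names: sorted as plain strings, '11_2020.csv' and '12_2020.csv' land after '10_2021.csv', even though they are the oldest months. Below is a minimal sketch of a chronological sort, assuming every name follows the MM_YYYY.csv pattern. The helper name `month_key` is made up for illustration.

```python
from datetime import datetime

# File names copied from the list above.
names = ['01_2021.csv', '02_2021.csv', '03_2021.csv', '04_2021.csv',
         '05_2021.csv', '06_2021.csv', '07_2021.csv', '08_2021.csv',
         '09_2021.csv', '10_2021.csv', '11_2020.csv', '12_2020.csv']


def month_key(name):
    """Hypothetical helper: parse 'MM_YYYY.csv' into a datetime for sorting."""
    stem = name.rsplit('.', 1)[0]          # drop the '.csv' extension
    return datetime.strptime(stem, '%m_%Y')


# Oldest month first, newest month last.
chronological = sorted(names, key=month_key)
print(chronological[0], chronological[-1])  # 11_2020.csv 10_2021.csv
```

With this ordering, "last seen" means the most recent month rather than the last name alphabetically.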
My CSV file structure:
sampleCSVFile in my path:
id itemName NonImportantEntries Entries SomeOtherEntries
1 item1 27 111 163
2 item2 16 22 98
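Each file has an "ID" column with unique values. I want to scan all the files to find the name of the file in which a given ID was last seen. Can anyone help?
My code so far: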
import os
import pandas as pd

# Get the working directory and the target folder that contains all the files
path = os.path.join(os.getcwd(), 'folder')
# Keep only regular CSV files; this also skips macOS '.DS_Store' without a fragile remove() call
files = [os.path.join(path, i) for i in os.listdir(path)
         if os.path.isfile(os.path.join(path, i)) and i.endswith('.csv')]
files.sort()

# I'm stuck here: the code below adds a 'fileName' column (my intended 'lastSeen') to the output, but it puts rows from all the files into one data frame. How should I approach it?
# The idea: for every file in the folder, read it into a separate data frame and add the file name as a column. Then scan the unique IDs across all data frames to find the file in which each ID was seen last. In this example the files are months in 2020 and 2021.
frames = []
for file in files:
    _df = pd.read_csv(file)
    _df['fileName'] = os.path.split(file)[-1]
    frames.append(_df)
# DataFrame.append was removed in pandas 2.0; pd.concat does the same job
df = pd.concat(frames)
Expected finalFile.csv format:
id lastSeen
1 06_2021
2 12_2020
3 10_2021
...
45000 07_2021
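One possible way to get from the monthly files to this format is sketched below, under some assumptions: the files are comma-separated, the ID column header is literally `id` (as in the sample), every CSV in the folder is named MM_YYYY.csv, and the output file is written somewhere other than that folder. The function names `month_key` and `last_seen` are made up for illustration.

```python
import os
from datetime import datetime

import pandas as pd


def month_key(name):
    """Hypothetical helper: turn 'MM_YYYY.csv' into a datetime so files sort by month."""
    return datetime.strptime(os.path.splitext(name)[0], '%m_%Y')


def last_seen(folder):
    """Return one row per id with the most recent file (month) it appears in."""
    # Assumption: every .csv in the folder is named MM_YYYY.csv.
    names = sorted((n for n in os.listdir(folder) if n.endswith('.csv')),
                   key=month_key)
    frames = []
    for name in names:
        # Only the id column is needed to answer "where was this id last seen".
        frame = pd.read_csv(os.path.join(folder, name), usecols=['id'])
        frame['lastSeen'] = os.path.splitext(name)[0]
        frames.append(frame)
    combined = pd.concat(frames, ignore_index=True)
    # Frames are stacked oldest first, so keep='last' keeps the newest file per id.
    result = combined.drop_duplicates(subset='id', keep='last')
    return result.sort_values('id').reset_index(drop=True)
```

It could then be used as `last_seen(path).to_csv('finalFile.csv', index=False)`. Stacking all rows into one frame is fine here, because `drop_duplicates` reduces it to a single row per id afterwards.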
Thanks in advance for any help with this!