python - Pandas: check which month's file name a unique ID was last seen in

Question

I have 12 files:

['01_2021.csv', '02_2021.csv', '03_2021.csv', '04_2021.csv', '05_2021.csv', '06_2021.csv', '07_2021.csv', '08_2021.csv', '09_2021.csv', '10_2021.csv', '11_2020.csv', '12_2020.csv']

My CSV file structure:

sampleCSVFile in my path:

id    itemName    NonImportantEntries    Entries    SomeOtherEntries
1      item1              27              111             163
2      item2              16               22              98

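Every file has an "id" column with unique values. I want to scan all the files and find the name of the file in which a given ID was last seen. Can anyone help?

My code so far: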

import os
import pandas as pd

#get your working directory and target folder that contains all your files
path = os.path.join(os.getcwd(),'folder')

files = [os.path.join(path,i) for i in os.listdir(path) if os.path.isfile(os.path.join(path,i))]
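# path + '.DS_Store' lacks a path separator, and remove() raises ValueError if the entry is absent
ds_store = os.path.join(path, '.DS_Store')
if ds_store in files:
    files.remove(ds_store)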
files.sort()

# I'm stuck here: the code below adds a file-name column ('fileName') to my output,
# but it puts the rows from all the files into one data frame. How should I approach it?

# Goal: read every file in the folder into its own data frame, add the file name
# as a 'lastSeen' column, then scan the unique IDs across all data frames to find
# the file in which each ID was seen last. In this example the months span 2020 and 2021.
frames = []
for file in files:
    _df = pd.read_csv(file)
    _df['fileName'] = os.path.split(file)[-1]
    frames.append(_df)
# DataFrame.append was removed in pandas 2.0; pd.concat builds the same combined frame
df = pd.concat(frames)

Expected finalFile.csv format:

    id      lastSeen
    1       06_2021
    2       12_2020 
    3       10_2021
    ...
    45000   07_2021

Thanks in advance for any help with this!

score 1 · Accepted Answer

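Try:

Read only the necessary ("id") column with pd.read_csv and insert a column ("lastSeen") holding the file name.
Combine the per-file DataFrames into one master DataFrame.
Use pd.to_datetime to convert the file names to dates.
groupby "id" and keep only the rows whose date is the maximum for that "id".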

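import os
import pandas as pd

path = os.path.join(".", "folder")
files = [f for f in os.listdir(path) if f.endswith(".csv")]

# Read only the first ("id") column of each file and tag its rows with the file's month
frames = []
for file in files:
    temp = pd.read_csv(os.path.join(path, file), usecols=[0])
    temp["lastSeen"] = file.replace(".csv", "")
    frames.append(temp)
# DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
master = pd.concat(frames, ignore_index=True)

# Parse "MM_YYYY" into dates, then keep the rows whose date is the latest for their id
master["date"] = pd.to_datetime(master["lastSeen"], format="%m_%Y")
output = master[master["date"] == master.groupby("id")["date"].transform("max")].drop("date", axis=1)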

>>> output
   id lastSeen
0   1  01_2021
2   2  02_2021
3   3  02_2021
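
As a possible alternative to the groupby/transform filter, the same result can be reached by sorting on the parsed date and dropping duplicate ids. The sketch below is only an illustration: the small `master` frame is made-up data standing in for the rows read from the files, and the name `last_seen` is hypothetical. It also shows why the date conversion matters, since sorting the raw "MM_YYYY" strings would place "01_2021" before "11_2020".

```python
import pandas as pd

# Hypothetical in-memory stand-in for the rows read from three monthly files.
master = pd.DataFrame(
    {
        "id": [1, 2, 1, 3, 2],
        "lastSeen": ["11_2020", "11_2020", "01_2021", "01_2021", "12_2020"],
    }
)

# Parse "MM_YYYY" into real dates; sorting the raw strings would put
# "01_2021" before "11_2020", which is chronologically wrong.
master["date"] = pd.to_datetime(master["lastSeen"], format="%m_%Y")

# Sort oldest to newest, then keep only the last (most recent) row per id.
last_seen = (
    master.sort_values("date")
    .drop_duplicates("id", keep="last")
    .drop(columns="date")
    .sort_values("id")
    .reset_index(drop=True)
)

print(last_seen)
```

To get the finalFile.csv from the question, something like last_seen.to_csv("finalFile.csv", index=False) should write the two columns without the index.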
