0

我正在使用此代码提取两个具有相似命名约定的 CSV,将它们的文件名放在“文件”列中,并将数据帧连接到一个名为 NatHrs 的数据帧中。

import glob
from pathlib import Path

path = r'C:\Users\ThisUser\Desktop\AC Mbr Analysis'
all_files = glob.glob(path + '\\Natl_hours_YTD_OC_*.csv')

Nat_dfs = []
for file in all_files:
    df = pd.read_csv(file, index_col=None, encoding='windows-1252', header=1 )
    df['File'] = file
    Nat_dfs.append(df)

NatHrs = pd.concat(Nat_dfs)

现在,我想使用“文件”列,它返回一个文件名对象,其中的条目看起来像“C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019”,只提取文件名的结尾——在本例“2018-2019”——并将这些字符放入一个新列“程序年”,反映条目“2018-2019”。我在操作字符串或系列方面没有成功——我应该使用 path.replace 吗?我搞不清楚了。当我描述我要解析的列时......

NatHrs['File'].describe

...我明白了:

Name: File, dtype: object>
4

3 回答 3

0

您可以使用正则表达式来查找字符串中的子字符串:

import re

string = r"C:\Users\ThisUser\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019"
short = string.split('\\')[-1]

substring = re.search('\d+[-]*\d+',short).group()
print(substring)

可能想详细说明模式如何变化。它会永远是“年”吗?可以只是“年”吗?那么可能必须更改正则表达式。

编辑:

这一切都相当不方便,所以我制作了自己的虚拟文件,我可以用它来做你想做的事。对我来说很好,但请自己寻找:

import glob
import pandas as pd
import re
import os

all_files = glob.glob('Natl_hours_YTD_OC_*.csv')
full_paths = [os.path.abspath(file) for file in all_files]

print(full_paths)
>>> Out:
['C:\\Users\\Chris\\Desktop\\Natl_hours_YTD_OC_2018-2019.csv',
'C:\\Users\\Chris\\Desktop\\Natl_hours_YTD_OC_2019-2020.csv',
'C:\\Users\\Chris\\Desktop\\Natl_hours_YTD_OC_2020.csv']
Nat_dfs = []

for file in all_files:
    df = pd.read_csv(file,delim_whitespace=True)
    print(df,'\n')
    df['File'] = file
    df['Year'] = re.search('\d+[-]*\d*',file).group()

    Nat_dfs.append(df)
>>> Out:
   A  B
0  7  7
1  8  8
2  9  9 

   A  B
0  4  4
1  5  5
2  6  6 

   A  B
0  1  1
1  2  2
2  3  3 
NatHrs = pd.concat(Nat_dfs)
print(NatHrs)
>>> Out:
   A  B                             File       Year
0  7  7  Natl_hours_YTD_OC_2018-2019.csv  2018-2019
1  8  8  Natl_hours_YTD_OC_2018-2019.csv  2018-2019
2  9  9  Natl_hours_YTD_OC_2018-2019.csv  2018-2019
0  4  4  Natl_hours_YTD_OC_2019-2020.csv  2019-2020
1  5  5  Natl_hours_YTD_OC_2019-2020.csv  2019-2020
2  6  6  Natl_hours_YTD_OC_2019-2020.csv  2019-2020
0  1  1       Natl_hours_YTD_OC_2020.csv       2020
1  2  2       Natl_hours_YTD_OC_2020.csv       2020
2  3  3       Natl_hours_YTD_OC_2020.csv       2020

我不知道你做错了什么,但这绝对有效。希望这就是你所需要的。

于 2020-02-03T00:26:20.020 回答
0

我试过这个:

import re

string = NatHrs['File']
short = string.split('\\')[-1]

substring = re.search('\d+[-]*\d+',short).group()
print(substring)

NatHrs['Program Year'] = substring
NatHrs

我得到了这个:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-257-f8c37bb604e2> in <module>
      3 
      4 string = NatHrs['File']
----> 5 short = string.split('\\')[-1]
      6 
      7 substring = re.search('\d+[-]*\d+',short).group()

~\anaconda3\envs\PythonData\lib\site-packages\pandas\core\generic.py in __getattr__(self, name)
   5177             if self._info_axis._can_hold_identifiers_and_holds_name(name):
   5178                 return self[name]
-> 5179             return object.__getattribute__(self, name)
   5180 
   5181     def __setattr__(self, name, value):

AttributeError: 'Series' object has no attribute 'split'

这返回了一个文件,该文件读取了 File 和 Program Year 年份之间的这种类型的不一致:

File    Program Year
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2018-2019.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019
C:\Users\HHeatley\Desktop\AC Mbr Analysis\Natl_hours_YTD_OC_2019-2020.csv   2018-2019

我也试过这个:

import re

string = file
short = string.split('\\')[-1]

substring = re.search('\d+[-]*\d+',short).group()
print(substring)

NatHrs['Program Year'] = substring
NatHrs

并得到一个仅反映 2019-2020 年的列“计划年”,尽管我希望 2018-2019 年和 2019-2020 年都出现。

于 2020-02-03T16:47:35.130 回答
0

这最终奏效了。非常感谢您帮助我!

globbed_files = glob.glob("Natl_hours_YTD_OC_*.csv")
globbed_files

data = []
for csv in globbed_files:
    frame = pd.read_csv(csv, encoding='windows-1252', header=1)
    frame['filename'] = os.path.basename(csv)
    file = os.path.basename(csv)
#create a new column to store the portion of the file name that denotes the Program Year to which the data belongs
    frame['Program Year'] = re.search('\d+[-]*\d*',file).group()
    data.append(frame)

NatHrs = pd.concat(data, ignore_index=True) #dont want pandas to try an align row indexes
NatHrs.copy().head()
于 2020-02-14T13:49:17.793 回答