I have a dataframe like the following:
BankNum | FirstName | LastName | ID |
00987772 | Michael | Brown | 123 |
00987772 | Bob | Brown | 123 |
00987772 | Michael | Mooney | 123 |
00987772 | Raven | Mallik | 245 |
00982122 | Karim | Hareche | 564 |
I am doing the following to get two dictionaries:
import pandas as pd

# Per-ID aggregates: BankNum usage counts, distinct first names, distinct last names
cols = [
    {'col': 'BankNum', 'func': lambda x: x.value_counts().to_dict()},
    {'col': 'FirstName', 'func': pd.Series.nunique},
    {'col': 'LastName', 'func': pd.Series.nunique}]
d = df.groupby('ID').apply(lambda x: tuple(c['func'](x[c['col']]) for c in cols)).to_dict()
# Per-BankNum aggregate: number of distinct IDs using that bank number
cols1 = ['ID']
df2 = df.groupby('BankNum').apply(lambda x: tuple(x[c].nunique() for c in cols1))
d1 = df2.to_dict()
where
d ={ 123 : ({00987772: 3}, 2,2), 245: ({00987772: 1}, 1,1), 564: ({00982122: 1}, 1,1)}
d1 = {00987772: (2,), 00982122:(1,)}
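For reference, here is a minimal self-contained sketch that reproduces the two dictionaries from the sample table. It assumes BankNum is stored as a string (so the leading zeros survive) and uses a plain loop over the groups instead of apply; both choices are assumptions made for illustration.

```python
import pandas as pd

# Sample data from the table above; BankNum kept as strings (assumption)
df = pd.DataFrame({
    "BankNum": ["00987772", "00987772", "00987772", "00987772", "00982122"],
    "FirstName": ["Michael", "Bob", "Michael", "Raven", "Karim"],
    "LastName": ["Brown", "Brown", "Mooney", "Mallik", "Hareche"],
    "ID": [123, 123, 123, 245, 564],
})

# d: per ID -> (BankNum usage counts, distinct first names, distinct last names)
d = {}
for id_value, group in df.groupby("ID"):
    d[int(id_value)] = (
        {bank: int(n) for bank, n in group["BankNum"].value_counts().items()},
        int(group["FirstName"].nunique()),
        int(group["LastName"].nunique()),
    )

# d1: per BankNum -> (number of distinct IDs using it,)
d1 = {
    bank: (int(n),)
    for bank, n in df.groupby("BankNum")["ID"].nunique().items()
}
```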
Next, I am doing the following (this is only the relevant code; there are other things I am doing that I have removed from the code below):
same_banknum={}
l=[]
w=[]
m = v[2].values()
h2 = sum(i > 6 for i in m)
mod2 = sum(i in [5,6] for i in m)
l2 = sum(i in [3,4] for i in m)
if h2 != 0:
    for k2, v2 in v[2].items():
        if v2 > 6:
            l.append(k2)
            w.append(v2)
    new_l=[]
    for i in l:
        v3 = d1.get(i)
        new_l.append(v3[0])
    h3 = sum(i > 8 for i in new_l)
    m3 = sum(i in [5,6,7,8] for i in new_l)
    l3 = sum(i in [3,4] for i in new_l)
    c=[]
    if h3 != 0:
        for g in new_l:
            if g > 8:
                c.append(g)
        wt = sum(c)
        same_banknum[k]= ("High", wt)
    elif m3 != 0:
        for g in new_l:
            if g in [5,6,7,8]:
                c.append(g)
        wt = sum(c)
        same_banknum[k]= ("Moderate", wt)
    elif l3 != 0:
        for g in new_l:
            if g in [3,4]:
                c.append(g)
        wt = sum(c)
        same_banknum[k]= ("Low", wt)
    else:
        same_banknum[k]= ("Low", 0.0)
elif mod2 != 0:
    for k2, v2 in v[2].items():
        if v2 in [5,6]:
            l.append(k2)
            w.append(v2)
    new_l=[]
    for i in l:
        v3 = d1.get(i)
        new_l.append(v3[0])
    h3 = sum(i > 8 for i in new_l)
    m3 = sum(i in [5,6,7,8] for i in new_l)
    l3 = sum(i in [3,4] for i in new_l)
    c=[]
    if h3 != 0:
        for g in new_l:
            if g > 8:
                c.append(g)
        wt = sum(c)
        same_banknum[k]= ("High", wt)
    elif m3 != 0:
        for g in new_l:
            if g in [5,6,7,8]:
                c.append(g)
        wt = sum(c)
        same_banknum[k]= ("Moderate", wt)
    elif l3 != 0:
        for g in new_l:
            if g in [3,4]:
                c.append(g)
        wt = sum(c)
        same_banknum[k]= ("Low", wt)
    else:
        same_banknum[k]= ("Low", 0.0)
elif l2 != 0:
    for k2, v2 in v[2].items():
        if v2 in [3,4]:
            l.append(k2)
            w.append(v2)
    new_l=[]
    for i in l:
        v3 = d1.get(i)
        new_l.append(v3[0])
    h3 = sum(i > 8 for i in new_l)
    m3 = sum(i in [5,6,7,8] for i in new_l)
    l3 = sum(i in [3,4] for i in new_l)
    c=[]
    if h3 != 0:
        for g in new_l:
            if g > 8:
                c.append(g)
        wt = sum(c)
        same_banknum[k]= ("High", wt)
    elif m3 != 0:
        for g in new_l:
            if g in [5,6,7,8]:
                c.append(g)
        wt = sum(c)
        same_banknum[k]= ("Moderate", wt)
    elif l3 != 0:
        for g in new_l:
            if g in [3,4]:
                c.append(g)
        wt = sum(c)
        same_banknum[k]= ("Low", wt)
    else:
        same_banknum[k]= ("Low", 0.0)
else:
    for k2, v2 in v[2].items():
        if v2 in [1,2]:
            l.append(k2)
    new_l=[]
    for i in l:
        v3 = d1.get(i)
        new_l.append(v3[0])
    h3 = sum(i > 8 for i in new_l)
    m3 = sum(i in [5,6,7,8] for i in new_l)
    l3 = sum(i in [3,4] for i in new_l)
    c=[]
    if h3 != 0:
        for g in new_l:
            if g > 8:
                c.append(g)
        wt = sum(c)
        same_banknum[k]= ("High", wt)
    elif m3 != 0:
        for g in new_l:
            if g in [5,6,7,8]:
                c.append(g)
        wt = sum(c)
        same_banknum[k]= ("Moderate", wt)
    elif l3 != 0:
        for g in new_l:
            if g in [3,4]:
                c.append(g)
        wt = sum(c)
        same_banknum[k]= ("Low", wt)
    else:
        same_banknum[k]= ("Low", 0.0)
to get a dictionary like this:
same_banknum = {123: ('Low', 0.6), 245: ('Low', 0.6), 564: ('Low', 0.0)}
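The same_banknum dictionary is the result of the computation above: it checks whether the same BankNum exists for multiple IDs, then assigns each ID a High, Low, or Moderate value along with a weight.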
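As a minimal sketch, the four near-identical branches of that computation could be condensed into a few helpers. The names select_banks, rate_counts, and rate_id are hypothetical, and the sketch assumes the usage argument is the per-ID dictionary of BankNum counts and ids_per_bank has the same shape as d1. It covers only the logic shown in this post, not the parts I removed.

```python
# Usage tiers checked from highest to lowest, mirroring the if/elif chain
USAGE_TIERS = (
    lambda n: n > 6,
    lambda n: n in (5, 6),
    lambda n: n in (3, 4),
    lambda n: n in (1, 2),
)


def select_banks(usage):
    """Return the bank numbers that fall in the highest non-empty usage tier."""
    for in_tier in USAGE_TIERS:
        picked = [bank for bank, n in usage.items() if in_tier(n)]
        if picked:
            return picked
    return []


def rate_counts(counts):
    """Return (label, weight) for a list of distinct-ID counts."""
    high = [c for c in counts if c > 8]
    moderate = [c for c in counts if c in (5, 6, 7, 8)]
    low = [c for c in counts if c in (3, 4)]
    if high:
        return ("High", sum(high))
    if moderate:
        return ("Moderate", sum(moderate))
    if low:
        return ("Low", sum(low))
    return ("Low", 0.0)


def rate_id(usage, ids_per_bank):
    """Rate one ID from its BankNum usage counts and a d1-style lookup."""
    counts = [ids_per_bank[bank][0] for bank in select_banks(usage)]
    return rate_counts(counts)
```

With these helpers, the per-ID step would reduce to a single call such as rate_id(usage, d1), stored under the ID as its key.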
I can convert same_banknum into a dataframe like this:
df1 = pd.DataFrame.from_dict(same_banknum, orient='index').reset_index()
df1.columns = ['ID','SameBankNum_Val','SameBankNum_Wt']
This gives:
ID | SameBankNum_Val | SameBankNum_Wt
123 | Low | 0.6
245 | Low | 0.6
564 | Low | 0.0
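What I want to do is this: instead of performing this computation again and again for every new dataset that comes in, I want to use machine learning to build a predictive model that predicts the above SameBankNum_Val and SameBankNum_Wt for new IDs (test data).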
I can add the SameBankNum_Val and SameBankNum_Wt columns to the training dataframe above. However, what I want to know is:
How do I use multiple columns (BankNum, FirstName, LastName, ID) from Dataframe 1 above as training data, and multiple columns (SameBankNum_Val, SameBankNum_Wt) from Dataframe 2 above as training labels, in a machine learning model?
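To make the question concrete, here is a rough sketch of one possible setup. Since the labels are per ID while Dataframe 1 has one row per record, it first aggregates the rows to one row per ID. The feature names, the use of scikit-learn decision trees, and the split into two separate models (a classifier for SameBankNum_Val, a regressor for SameBankNum_Wt) are all assumptions for illustration, not something I have validated.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Dataframe 1 (records) and Dataframe 2 (labels), copied from the tables above
df = pd.DataFrame({
    "BankNum": ["00987772", "00987772", "00987772", "00987772", "00982122"],
    "FirstName": ["Michael", "Bob", "Michael", "Raven", "Karim"],
    "LastName": ["Brown", "Brown", "Mooney", "Mallik", "Hareche"],
    "ID": [123, 123, 123, 245, 564],
})
labels = pd.DataFrame({
    "ID": [123, 245, 564],
    "SameBankNum_Val": ["Low", "Low", "Low"],
    "SameBankNum_Wt": [0.6, 0.6, 0.0],
})

# Attach to every record the number of distinct IDs sharing its BankNum
ids_on_bank = df.groupby("BankNum")["ID"].nunique().rename("ids_on_bank")
per_row = df.join(ids_on_bank, on="BankNum")

# Aggregate to one feature row per ID (hypothetical feature set)
features = per_row.groupby("ID").agg(
    n_rows=("BankNum", "size"),
    n_banknums=("BankNum", "nunique"),
    n_first_names=("FirstName", "nunique"),
    n_last_names=("LastName", "nunique"),
    max_ids_on_bank=("ids_on_bank", "max"),
).reset_index()

# Join features with labels, then split into inputs and the two targets
train = features.merge(labels, on="ID")
X = train.drop(columns=["ID", "SameBankNum_Val", "SameBankNum_Wt"])
y_val = train["SameBankNum_Val"]
y_wt = train["SameBankNum_Wt"]

# One model per target: class label and numeric weight
clf = DecisionTreeClassifier(random_state=0).fit(X, y_val)
reg = DecisionTreeRegressor(random_state=0).fit(X, y_wt)
```

For new IDs, the same aggregation would have to be applied before calling clf.predict and reg.predict, so some per-dataset computation remains even with a model.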
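Also, would a machine learning model be accurate enough to determine when to give a High, Low, or Moderate value and what weight, without performing that long computation again and again? For this question, I guess I just need to test with multiple models first.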
Please help! Thanks!