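I'm trying to solve an analytics problem from the recent LTFS Hackathon (bank data), and I've run into a problem that seemed unique to me, though it probably isn't. Let me explain.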
Problem
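There are a few columns in the Bureau dataset, named REPORTED DATE - HIST, CUR BAL - HIST, AMT OVERDUE - HIST & AMT PAID - HIST, whose cells hold comma-separated values, and sometimes nothing but the separators (,,).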
Here is a part of the dataset (not the original data, because the rows are very long):
**REPORTED DATE - HIST**
20180430,20180331,
20191231,20191130,20191031,20190930,20190831,20190731,20190630,20190531,20190430,20190331
,
20121031,20120930,20120831,20120731,20120630,20120531,20120430,
----------------x-----------2nd column------------x-----------------------------------
**AMT OVERDUE**
37873,,
,,,,,,,,,,,,,,,,,,,,1452,,
0,0,0,
,,
0,,0,0,0,0,3064,3064,3064,2972,0,2802,0,0,0,0,0,2350,2278,2216,2151,2087,2028,1968,1914,1663,1128,1097,1064,1034,1001,976,947,918,893,866
-----x--other columns are similar---x---------------------
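To make the shape of these cells concrete, here is a minimal sketch of parsing one cell into a list of numbers while keeping empty slots as NaN. The helper name `parse_hist` is hypothetical, and it assumes a trailing comma terminates the last value rather than marking one more empty slot; that assumption should be checked against the real data.

```python
import math

def parse_hist(cell):
    """Split one comma-separated history cell into a list of floats.

    Empty slots become NaN so positions stay aligned across the HIST columns.
    """
    if not isinstance(cell, str) or cell == "":
        return []
    parts = cell.split(",")
    # Assumption: a trailing comma closes the last value, so the final
    # empty piece produced by split() is dropped.
    if parts[-1] == "":
        parts.pop()
    return [float(p) if p.strip() else math.nan for p in parts]

print(parse_hist("37873,,"))  # [37873.0, nan]
print(parse_hist("0,0,0,"))   # [0.0, 0.0, 0.0]
```

Keeping the empty slots (instead of dropping them) means the n-th amount can still be matched with the n-th reported date if the columns turn out to be position-aligned.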
I'm looking for a better option, if possible.
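The last time I solved this kind of problem, it was the genre column in the Movielens project. I used dummy columns there, and it worked because the genre column did not have many distinct values and the same values were repeated across many rows, so it was easy. Here it seems much harder, for two reasons.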
1st reason
A cell can contain a lot of values, and it can also contain no value at all.
2nd reason
How do I create a column (or a row) for each unique value, as in the Movielens genre case?
**genre**
action|adventure|comedy
carton|scifi|action
biopic|adventure|comedy
Thrill|action
# so here I extracted all the unique values and created a column for each
**genre** | **action** | **adventure**| **comedy**| **carton**| **scifi**| and so on...
action|adventure|comedy | 1 | 1 | 1 | 0 | 0 |
carton|scifi|action | 1 | 0 | 0 | 1 | 1 |
biopic|adventure|comedy | 0 | 1 | 1 | 0 | 0 |
Thrill|action | 1 | 0 | 0 | 0 | 0 |
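For reference, the dummy-column step shown in the table above can be sketched with pandas `Series.str.get_dummies`. The frame name `movies` is made up for illustration, and this is one possible way to do it, not necessarily how it was done in the original project.

```python
import pandas as pd

# Toy copy of the genre column from the example above.
movies = pd.DataFrame({"genre": [
    "action|adventure|comedy",
    "carton|scifi|action",
    "biopic|adventure|comedy",
    "Thrill|action",
]})

# One 0/1 column per unique token found between the "|" separators.
dummies = movies["genre"].str.get_dummies(sep="|")

# Put the indicator columns next to the original column.
result = pd.concat([movies, dummies], axis=1)
print(result)
```

This works well when the set of unique tokens is small, which is exactly what the numeric history columns lack.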
# but here it's different. How can I deal with this? I have no clue.
**AMT OVERDUE**
37873,,
,,,,,,,,,,,,,,,,,,,,1452,,
0,0,0,
,,
0,,0,0,0,0,3064,3064,3064,2972,0,2802,0,0,0,0,0,2350,2278,2216,2151,2087,2028,1968,1914,1663,1128,1097,1064,1034,1001,976,947,918,893,866
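Since the values here are amounts rather than categories, one option worth considering (a sketch, not a definitive answer) is to skip dummy columns and instead summarize each cell into a few numeric features. The example below assumes pandas, uses a shortened toy version of the column, and the feature names (`overdue_count`, `overdue_max`, and so on) are made up for illustration.

```python
import pandas as pd

# Shortened toy rows in the same shape as AMT OVERDUE.
bureau = pd.DataFrame({"AMT OVERDUE": ["37873,,", "0,0,0,", ",,", "0,,0,3064,2972"]})

# Long format: one entry per history slot, indexed by the original row.
long_values = bureau["AMT OVERDUE"].str.split(",").explode()

# Empty strings cannot be parsed and become NaN.
long_values = pd.to_numeric(long_values, errors="coerce")

# Collapse each row's history into fixed-width features.
features = long_values.groupby(level=0).agg(
    overdue_count="count",  # number of non-empty entries
    overdue_max="max",
    overdue_mean="mean",
)
# Number of slots with a positive overdue amount.
features["overdue_nonzero"] = (long_values > 0).groupby(level=0).sum()

print(features)
```

Every row ends up with the same number of columns no matter how long or empty its history is; a row with no values gets a count of 0 and NaN for the max and mean, which can then be filled or flagged as needed.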