I am trying to solve an analytics problem from the recent LTFS (Bank Data) Hackathon, but I have run into a problem that felt unique, though it probably isn't. Let me explain.

Problem

The Bureau dataset has a few columns named REPORTED DATE - HIST, CUR BAL - HIST, AMT OVERDUE - HIST & AMT PAID - HIST.

Here is a portion of the dataset (it is not the raw data, since the rows are very large):

**Requested Date - Hist**                                                                   
20180430,20180331,
20191231,20191130,20191031,20190930,20190831,20190731,20190630,20190531,20190430,20190331
,
20121031,20120930,20120831,20120731,20120630,20120531,20120430,

----------------x-----------2nd column------------x-----------------------------------

**AMT OVERDUE**
37873,,
,,,,,,,,,,,,,,,,,,,,1452,,
0,0,0,
,,
0,,0,0,0,0,3064,3064,3064,2972,0,2802,0,0,0,0,0,2350,2278,2216,2151,2087,2028,1968,1914,1663,1128,1097,1064,1034,1001,976,947,918,893,866

-----x--other columns are similar---x---------------------

Seeking a better option, if possible.
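So far the only thing I can think of is splitting each cell into a list; here is a minimal sketch of what I mean (the frame and values are toy stand-ins for the real Bureau rows):

import pandas as pd

# toy stand-in for the Bureau data; column name as in my dataset
bureau = pd.DataFrame({
    'AMT OVERDUE - HIST': ["37873,,", "0,0,0,", ",,"],
})

# split each history string into fields, one row per (account, month)
long = bureau['AMT OVERDUE - HIST'].str.split(',').explode()
# blanks become NaN, numbers become floats
long = pd.to_numeric(long, errors='coerce')
print(long)

But that still leaves me with a huge, mostly empty structure, hence the question.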

Previously, when I solved this kind of problem, it was the genre column in a MovieLens project. There I used the concept of dummy columns, and it worked because the genre column did not have many distinct values and some values were repeated across many rows, so it was easy. But here it seems hard, for two reasons:

1st reason: a cell can contain a lot of values, and at the same time it may contain no value at all.

2nd reason: how do I create a column (or row) for each unique value, as in the MovieLens genre case?

**genre**
action|adventure|comedy
carton|scifi|action
biopic|adventure|comedy
Thrill|action

# so here I extracted all the unique values and created columns (a pandas sketch follows the table)

**genre**                 | **action** | **adventure**| **comedy**| **carton**| **scifi**| and so on...
action|adventure|comedy   |   1        |     1        |      1    |     0     |      0    |    
carton|scifi|action       |   1        |     0        |      0    |     1     |      1    |
biopic|adventure|comedy   |   0        |     1        |      1    |     0     |      0    |
Thrill|action             |   1        |     0        |      0    |     0     |      0    |
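By the way, the dummy-column step above can be done in one line with pandas str.get_dummies (a minimal sketch; the dataframe is a toy reconstruction of the genre column):

import pandas as pd

# toy reconstruction of the genre column above
movies = pd.DataFrame({
    'genre': ["action|adventure|comedy", "carton|scifi|action",
              "biopic|adventure|comedy", "Thrill|action"],
})

# one 0/1 column per unique genre, split on '|'
dummies = movies['genre'].str.get_dummies('|')
print(dummies)
#    Thrill  action  adventure  biopic  carton  comedy  scifi
# 0       0       1          1       0       0       1      0
# 1       0       1          0       0       1       0      1
# 2       0       0          1       1       0       1      0
# 3       1       1          0       0       0       0      0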

# but here it's different; how can I deal with this? I have no clue
**AMT OVERDUE**
37873,,
,,,,,,,,,,,,,,,,,,,,1452,,
0,0,0,
,,
0,,0,0,0,0,3064,3064,3064,2972,0,2802,0,0,0,0,0,2350,2278,2216,2151,2087,2028,1968,1914,1663,1128,1097,1064,1034,1001,976,947,918,893,866

1 Answer


When working with recommenders, you typically have sparse matrices. These can be very space-consuming (too many zeros or blanks) and can usefully be moved to the scipy sparse-matrix representation, as here. As mentioned, this is common with recommenders; you can find an excellent example here.

Unfortunately, I cannot use the original data; a smaller example in a csv might have helped. So I will use the recommender example, since it is also very common.

import pandas as pd
import numpy as np
from scipy.sparse import csr_matrix

df = pd.DataFrame({
    'genres' : ["action|adventure|comedy", "carton|scifi|action","biopic|adventure|comedy","Thrill|action"],
})
print(df)
                    genres
0  action|adventure|comedy
1      carton|scifi|action
2  biopic|adventure|comedy
3            Thrill|action

Let's see what it looks like as a matrix:

# To identify the genres so we can create our columns
genres = []
for G in df['genres'].unique():
    for i in G.split("|"):
        genres.append(i)
# To remove duplicates
genres = list(set(genres))

# Create a column for each genre
for g in genres:
    # compare against the split list, not the raw string,
    # so one genre name cannot match inside another
    df[g] = df.genres.transform(lambda x: int(g in x.split("|")))

# This is the sparse matrix with many 0
movie_genres = df.drop(columns=['genres'])
print(movie_genres)
   comedy  carton  adventure  Thrill  biopic  action  scifi
0       1       0          1       0       0       1      0
1       0       1          0       0       0       1      1
2       1       0          1       0       1       0      0
3       0       0          0       1       0       1      0

We do not need to create that matrix; in fact it is better to avoid it, since it can be very resource-consuming.

We should convert it to a csr_matrix instead, which takes only a fraction of the size:

from scipy.sparse import csr_matrix

M = len(df.index)   # number of rows
N = len(genres)     # number of genres

# map each row index / genre to a position in the matrix
user_mapper = dict(zip(np.unique(df.index), range(M)))
genres_mapper = dict(zip(genres, range(N)))

# inverse mappers, to go from matrix position back to label
user_inv_mapper = {v: k for k, v in user_mapper.items()}
genres_inv_mapper = {v: k for k, v in genres_mapper.items()}

# collect the (row, column) coordinates of every 1
user_index = []
genre_index = []
for user in df.index:
    for genre in df.loc[user, 'genres'].split('|'):
        genre_index.append(genres_mapper[genre])
        user_index.append(user_mapper[user])

X = csr_matrix((np.ones(len(genre_index)),
                (user_index, genre_index)), shape=(M, N))

It looks like:

print(X)
  (0, 0)    1.0
  (0, 2)    1.0
  (0, 5)    1.0
  (1, 1)    1.0
  (1, 5)    1.0
  (1, 6)    1.0
  (2, 0)    1.0
  (2, 2)    1.0
  (2, 4)    1.0
  (3, 3)    1.0
  (3, 5)    1.0
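To double-check the result you can densify it (only safe on small data like this) and compare the memory actually stored; a quick sketch:

# sanity check: the dense view holds the same 1s as the dummy table,
# though the column order follows genres_mapper, not the frame
print(X.toarray())

# bytes held by the dense frame vs. by the csr arrays; the gap grows
# with sparsity, so it is modest here but large on real bank data
print(movie_genres.to_numpy().nbytes)
print(X.data.nbytes + X.indices.nbytes + X.indptr.nbytes)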

With the above you can see the whole process on a smaller dataset.
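The same construction carries over to your AMT OVERDUE - HIST column: split each row on ',', skip the blank months, and feed the (row, position, value) triplets to csr_matrix. A sketch under that assumption (toy values taken from your sample; in the real data you would probably map the REPORTED DATE - HIST entries to a common month index instead of using the raw position):

import pandas as pd
from scipy.sparse import csr_matrix

# toy rows copied from the AMT OVERDUE - HIST sample in the question
hist = pd.Series(["37873,,", "0,0,0,", ",,0,,0,3064"])

rows, cols, vals = [], [], []
for r, cell in enumerate(hist):
    for c, field in enumerate(cell.split(',')):
        if field:                  # skip months with no report
            rows.append(r)
            cols.append(c)         # column = position in the history
            vals.append(float(field))

A = csr_matrix((vals, (rows, cols)), shape=(len(hist), max(cols) + 1))
print(A)

Note that an explicit 0 amount is stored too, which keeps "reported zero overdue" distinct from "no report at all".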

answered 2021-02-14T13:39:49.663