
I have a dataset df for a multi-class classification problem, with a huge class imbalance, namely for grade_F and grade_G:

>>> percentage = df['grade'].value_counts(normalize=True)
>>> print(percentage)

B    0.295436
C    0.295362
A    0.204064
D    0.136386
E    0.048788
F    0.014684
G    0.005279

At the same time, my predictions for the under-represented classes are very inaccurate, as can be seen here.

I have a neural network with an output dimension of 7, i.e. the target array I want to predict looks like this:

>>> print(y_train.head())
        grade_A  grade_B  grade_C  grade_D  grade_E  grade_F  grade_G
689526        0        1        0        0        0        0        0
523913        1        0        0        0        0        0        0
266122        0        0        1        0        0        0        0
362552        0        0        0        1        0        0        0
484987        1        0        0        0        0        0        0
...

So I tried the following neural network:

from keras.models import Sequential
from keras.layers import Dense, Dropout
from keras.constraints import maxnorm

def create_model(input_dim, output_dim):
    print(output_dim)
    # create model
    model = Sequential()
    # input layer
    model.add(Dense(100, input_dim=input_dim, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))

    # hidden layer
    model.add(Dense(60, activation='relu', kernel_constraint=maxnorm(3)))
    model.add(Dropout(0.2))

    # output layer
    model.add(Dense(output_dim, activation='softmax'))

    # Compile model
    model.compile(loss='categorical_crossentropy', loss_weights=lossWeights, optimizer='adam', metrics=['accuracy'])
    return model

from keras.callbacks import ModelCheckpoint
from keras.models import load_model

model = create_model(x_train.shape[1], y_train.shape[1])

epochs =  35
batch_sz = 64

print("Beginning model training with batch size {} and {} epochs".format(batch_sz, epochs))

checkpoint = ModelCheckpoint("lc_model.h5", monitor='val_acc', verbose=0, save_best_only=True, mode='auto', period=1)
# train the model
history = model.fit(x_train.as_matrix(),
                y_train.as_matrix(),
                validation_split=0.2,
                epochs=epochs,  
                batch_size=batch_sz, # Can I tweak the batch here to get evenly distributed data ?
                verbose=2,
                callbacks=[checkpoint])

# revert to the best model encountered during training
model = load_model("lc_model.h5")

and I fed it a weight vector inversely proportional to the class frequencies:

lossWeights = 1. / df['grade'].value_counts(normalize=True)
lossWeights = lossWeights.sort_index().tolist()

But it tells me that the output has size 1:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-66-bf262c22c9dc> in <module>
      2 from keras.models import load_model
      3 
----> 4 model = create_model(x_train.shape[1], y_train.shape[1])
      5 
      6 epochs =  35

<ipython-input-65-9290b177bace> in create_model(input_dim, output_dim)
     19 
     20     # Compile model
---> 21     model.compile(loss='categorical_crossentropy', loss_weights=lossWeights, optimizer='adam', metrics=['accuracy'])
     22     return model

C:\ProgramData\Anaconda3\lib\site-packages\keras\engine\training.py in compile(self, optimizer, loss, metrics, loss_weights, sample_weight_mode, weighted_metrics, target_tensors, **kwargs)
    178                                  'The model has ' + str(len(self.outputs)) +
    179                                  ' outputs, but you passed loss_weights=' +
--> 180                                  str(loss_weights))
    181             loss_weights_list = loss_weights
    182         else:

ValueError: When passing a list as loss_weights, it should have one entry per model output. The model has 1 outputs, but you passed loss_weights=[4.9004224502112255, 3.3848266392035704, 3.385677583130476, 7.33212052000478, 20.49667767920116, 68.10064134188455, 189.42024013722127]

2 Answers


What you are looking for is the class_weight argument of fit:

weights = {0: 1 / 0.204064,
           1: 1 / 0.295436, 
           2: 1 / 0.295362,
           3: 1 / 0.136386, 
           4: 1 / 0.048788,
           5: 1 / 0.014684,
           6: 1 / 0.005279}

You will probably want to scale these down, since they range from roughly 3 to 200; it is the ratio between them that matters most.
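For instance, one way to build and rescale such a dictionary directly from the data could look like the sketch below (an assumption here is that the integer keys 0–6 follow the alphabetical column order grade_A … grade_G of y_train):

counts = df['grade'].value_counts(normalize=True).sort_index()   # A ... G
inv_freq = 1. / counts                     # inverse class frequencies
inv_freq = inv_freq / inv_freq.min()       # rescale so the most frequent class has weight 1
weights = dict(enumerate(inv_freq))        # e.g. {0: 1.45, 1: 1.0, 2: 1.0, 3: 2.17, ...}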

Then:

model.fit(....
          class_weight = weights, 
         )
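
As a rough sanity check that the weighting actually helps the rare classes, you could look at per-class precision and recall on a hold-out set; a sketch, assuming hypothetical x_val / y_val arrays with y_val one-hot encoded like y_train:

import numpy as np
from sklearn.metrics import classification_report

y_pred = model.predict(x_val)
print(classification_report(np.argmax(y_val, axis=1),
                            np.argmax(y_pred, axis=1),
                            target_names=list(y_train.columns)))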
answered 2019-09-17T18:11:48.230

loss_weights does not weight different classes, it weights different outputs. Your model has only one output. Yes, that output is a list of seven values, but Keras still treats it as a single entity.

A model built with the functional API can have several outputs, each with its own loss function. When such a model is trained, the total loss is defined as the sum of all the loss functions applied to their respective outputs, and in that case loss_weights can be used to weight the different outputs.
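For illustration, a minimal sketch (with a made-up input size and made-up output names) of the kind of multi-output model that loss_weights is meant for:

from keras.layers import Input, Dense
from keras.models import Model

inputs = Input(shape=(20,))                                      # hypothetical input size
shared = Dense(64, activation='relu')(inputs)
grade = Dense(7, activation='softmax', name='grade')(shared)     # classification head
amount = Dense(1, activation='linear', name='amount')(shared)    # regression head

model = Model(inputs=inputs, outputs=[grade, amount])
model.compile(optimizer='adam',
              loss={'grade': 'categorical_crossentropy', 'amount': 'mse'},
              loss_weights={'grade': 1.0, 'amount': 0.5})        # one weight per *output*, not per class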

However, I don't think that is useful for what you are trying to do here.

answered 2019-09-17T17:51:01.530