python - TensorFlow multi-class ML model problem

Question

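I have been trying to get TensorFlow to work on a multi-class Kaggle problem. The data consists of 6 features, all of which I have converted to numeric values. The goal is to use these 6 features to predict the trip type, of which there are 38 different classes. The code below is what I have so far, including the code I use to format the CSV file. The code runs, but the first run produces reasonable output and every run after that produces the same, very poor result. Here is a sample of the output: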

Run 0,0.268728911877
Run 1,0.0108088823035
Run 2,0.0108088823035
Run 3,0.0108088823035
Run 4,0.0108088823035
Run 5,0.0108088823035
Run 6,0.0108088823035
Run 7,0.0108088823035
Run 8,0.0108088823035
Run 9,0.0108088823035
Run 10,0.0108088823035
Run 11,0.0108088823035
Run 12,0.0108088823035
Run 13,0.0108088823035
Run 14,0.0108088823035

And the code:

import tensorflow as tf
import numpy as np
from numpy import genfromtxt
import sklearn
import pandas as pd
from sklearn.cross_validation import train_test_split
# function buildWalmartData takes in a csv file, converts to numpy array, splits into training
# and testing, then saves the files to the specified target directory
def buildWalmartData():
    df = pd.read_csv('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/full_train_complete.csv')
    df = df.drop('Unnamed: 0', 1) # 1 specifies axis to remove
    df_data = np.array(df.drop('TripType', 1).values) # convert to numpy array
    df_label = np.array(df['TripType'].values) # convert to numpy array
    X_train, X_test, y_train, y_test = train_test_split(df_data, df_label, test_size=0.25, random_state=50)
    f = open('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/wm-training.csv', 'w')
    for i,j in enumerate(X_train):
        k = np.append(np.array(y_train[i]), j)
        f.write(','.join([str(s) for s in k]) + '\n')
    f.close()
    f = open('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/wm-testing.csv', 'w')
    for i,j in enumerate(X_test):
        k=np.append(np.array(y_test[i]), j)
        f.write(','.join([str(s) for s in k]) + '\n')
    f.close() 
buildWalmartData()
# function convertOneHot takes in data and converts the labels to one-hot vectors
# The corresponding labels in Walmart TripType are numbers between 1 and 38, describing
# which trip is taken. Each label is converted to a one-hot vector, which is a
# vector that is 0 in most dimensions, and 1 in a single dimension. In this case, the nth trip type
# will be represented as a vector which is 1 in the nth dimension.
def convertOneHot(data):
    y = np.array([int(i[0]) for i in data])
    y_onehot = [0]*len(y)
    for i,j in enumerate(y):
        y_onehot[i]=[0]*(y.max()+1)
        y_onehot[i][j] = 1
    return (y, y_onehot)

# import training data
data = genfromtxt('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/wm-training.csv', delimiter=',') 

# import testing data
test_data = genfromtxt('/Users/analyticsmachine/Desktop/Kaggle/WallMart_Kaggle/Data/wm-testing.csv', delimiter=',')

x_train = np.array([i[1::] for i in data])

# example output for x_train:
#array([[  7.06940000e+04,   5.00000000e+00,   7.91005185e+09,
#          1.00000000e+00,   8.00000000e+00,   2.15000000e+02],
#       [  1.54653000e+05,   4.00000000e+00,   5.20001225e+09,
#          1.00000000e+00,   5.00000000e+00,   4.60700000e+03],
#       [  1.86178000e+05,   3.00000000e+00,   4.32136106e+09,
#         -1.00000000e+00,   5.00000000e+01,   1.90000000e+03],

y_train, y_train_onehot = convertOneHot(data)

x_test = np.array([ i[1::] for i in test_data])
y_test, y_test_onehot = convertOneHot(test_data)
# example y_test output
#array([ 5, 32, 24, ..., 31, 28,  5])

# and example y_test_onehot:
#[0,...
# 0,
# 0,
# 0,
# 0,
# 0,
# 0,
# 1,
# 0,
# 0,
# 0,
# 0,
# 0]


# A is the number of features, 6 in the Walmart data
# B=38, which is the number of trip types 
A = data.shape[1]-1
B = len(y_train_onehot[0])
tf_in = tf.placeholder('float', [None, A]) # features
tf_weight = tf.Variable(tf.zeros([A,B]))
tf_bias = tf.Variable(tf.zeros([B]))
tf_softmax = tf.nn.softmax(tf.matmul(tf_in, tf_weight) + tf_bias)

# training via backpropagation
tf_softmax_correct = tf.placeholder('float', [None, B])
tf_cross_entropy = - tf.reduce_sum(tf_softmax_correct*tf.log(tf_softmax))

# training using tf.train.GradientDescentOptimizer
tf_train_step = tf.train.GradientDescentOptimizer(0.01).minimize(tf_cross_entropy)

# add accuracy nodes
tf_correct_prediction = tf.equal(tf.argmax(tf_softmax, 1), tf.argmax(tf_softmax_correct, 1))
tf_accuracy = tf.reduce_mean(tf.cast(tf_correct_prediction, 'float'))


# initialize and run
init = tf.initialize_all_variables()
sess = tf.Session()
sess.run(init)


# running the training
for i in range(20):
    sess.run(tf_train_step, feed_dict={tf_in: x_train, tf_softmax_correct: y_train_onehot})
    # print accuracy
    result = sess.run(tf_accuracy, feed_dict={tf_in: x_test, tf_softmax_correct: y_test_onehot})
    print "run {},{}".format(i,result)

Any ideas as to why the runs degrade like this would be greatly appreciated. Thanks!

score 1 · Accepted Answer

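If you just want to get up and running quickly for a Kaggle competition, I suggest trying the examples in TFLearn first. There are examples for embedding_ops for one-hot encoding, for early stopping, for custom decay and, more importantly, for the multi-class classification/regression problem you are running into. Once you are more familiar with TensorFlow, it is fairly easy to plug in TensorFlow code to build the custom model you want (there are examples of this too).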
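
One plausible cause of the collapse in the question (not confirmed by the asker) is numerical: the sample x_train rows contain raw values as large as about 7.9e9, and the loss is written as -sum(y * log(softmax(...))). With inputs of that magnitude the logits can overflow after a single gradient step, the loss turns into NaN, and every later run reports the same accuracy. The sketch below uses NumPy only and hypothetical toy data (the feature scales, the 3 classes and the random weights are assumptions, not the Walmart data) to show the naive formula failing and a standardized, log-sum-exp version staying finite.

```python
import numpy as np


def naive_cross_entropy(logits, onehot):
    # Same formula as in the question: -sum(y * log(softmax(logits))).
    # Warnings are silenced so the overflow shows up as a NaN result.
    with np.errstate(over="ignore", invalid="ignore", divide="ignore"):
        exp = np.exp(logits)
        softmax = exp / exp.sum(axis=1, keepdims=True)
        return -np.sum(onehot * np.log(softmax))


def stable_cross_entropy(logits, onehot):
    # Log-sum-exp trick: subtract each row's max before exponentiating,
    # and compute log-softmax directly instead of log(softmax).
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_softmax = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -np.sum(onehot * log_softmax)


def standardize(x, mean, std):
    # Zero mean, unit variance per column; constant columns are left at 0.
    return (x - mean) / np.where(std == 0, 1.0, std)


# Hypothetical toy data with feature scales loosely resembling the question's.
rng = np.random.default_rng(0)
scales = np.array([1e5, 7.0, 1e10, 1.0, 50.0, 5e3])
x = rng.uniform(0, 1, size=(8, 6)) * scales
labels = rng.integers(0, 3, size=8)
onehot = np.eye(3)[labels]
weights = rng.normal(size=(6, 3))  # stand-in for weights after one update

# Raw features: logits are on the order of 1e10, so exp() overflows.
naive_loss = naive_cross_entropy(x @ weights, onehot)

# Standardized features plus the stable loss: the result stays finite.
mean, std = x.mean(axis=0), x.std(axis=0)
stable_loss = stable_cross_entropy(standardize(x, mean, std) @ weights, onehot)

print("naive loss on raw features:", naive_loss)
print("stable loss on standardized features:", stable_loss)
```

In a real pipeline the mean and standard deviation would be computed on the training split and reused for the test split. In TensorFlow itself, tf.nn.softmax_cross_entropy_with_logits computes the loss from the logits in a numerically stable way, so it may be a safer choice than a hand-written tf.log(tf.nn.softmax(...)). Because the loss in the question is summed over the whole training set rather than averaged, a smaller learning rate may also be needed.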
