tensorflow - 如何在 Tensorflow 中通过 Softmax 回归读取 csv 文件并训练数据

Question

我刚开始学习Tensorflow，在训练数据时遇到了一个问题。我的问题是读取 csv 文件，然后使用 softmax 分类根据学生的学习时间和上课时间来估计学生（A、B 或 C）的成绩。

我定义，然后将 csv 文件加载为

COLUMNS = ["studytime", "attendance", "A", "B", "C"]
FEATURES = ["studytime", "attendance"]
LABEL = ["A", "B", "C"]
training_set = pd.read_csv("hw1.csv", skipinitialspace=True,
                       skiprows=1, names=COLUMNS)

之后，我为特征和标签定义张量，如下所示

feature_cols = [tf.contrib.layers.real_valued_column(k) for k in FEATURES]
labels = [tf.contrib.layers.real_valued_column(k) for k in LABEL]

然后我按照 MNIST 的 Tensorflow 中使用 MNIST 数据训练 softmax 的方法

但我不知道如何定义batch_xs和batch_ys在这个循环中训练

for _ in range(1000):
batch_xs=????
batch_ys=????
sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})

以及如何定义函数来估计三个学生的分数，如果他们的学习和出席时间，例如，[11,7]，[3,4]，[1,0]

你能帮我解决这个问题吗？

提前致谢，

score 0 · Accepted Answer

看起来您正在将 CSV 读入 DataFrame？您当然可以通过这种方式手动实现批处理过程，但是在 TF 中有一种有效的内置方式来构建队列和批处理。这有点令人费解，但它适用于按顺序或随机洗牌提供行，这非常方便。只需确保您的行都等长，这样您就可以轻松指定哪些销售代表 Xes，哪些代表 Ys。

你需要的两个函数是tf.decode_csvand tf.train.shuffle_batch（或者tf.train.batch如果你不需要随机洗牌）。

我们在这篇文章中详细讨论了这个问题，其中包括一个完整的工作代码示例： TF CSV Batching Example

看起来您的数据都是数字的，并且 Y 是 one-hot 格式，因此 MNIST 示例应该有助于实现您的估计功能。

***更新：这大致是操作的顺序： 1. 定义两个函数，如链接示例中所示 - 一个用于逐行读取 CSV 文件，另一个将这些行中的每一行打包成批次N（随机或顺序） 2. 通过while not coord.should_stop():此循环启动读取循环将一直运行，直到它耗尽您提供给队列的所有 CSV 文件的内容 3. 在循环的每次迭代中，执行sess.run这些variables 为您提供了一批 X 和 Y，以及您可能希望从 CSV 文件的每一行中获得的任何额外元类型内容，例如本示例中的日期标签（在您的情况下，它可能是学生的姓名或其他任何内容：

dateLbl_batch, feature_batch, label_batch = sess.run([dateLbl, features, labels])

当 TF 到达你文件的末尾时，它会抛出一个异常，这就是为什么上面所有的代码都在 try/catch 块中——通过捕获那个异常你知道你已经完成了。

上述功能为您提供了对 CSV 文件的逐个单元级别的非常精细的访问，并允许您将它们批处理成 N 个批次、您想要的 epoch 数量等。

***** 更新 2**

这是应该以您拥有的格式分批读取您的 CSV 文件的完整代码。它只是打印每批的内容。从这里，您可以轻松地将此代码与实际执行培训/等的代码连接起来。

import tensorflow as tf

fileName = 'data/study.csv'

try_epochs = 1
batch_size = 3

S = 1 # this is your Student label
F = 2 # this is the list of your features
L = 3 # this is one-hot vector of 3 representing the label

# set defaults to something (TF requires defaults for the number of cells you are going to read)
rDefaults = [['a'] for row in range((S+F+L))]

# function that reads the input file, line-by-line
def read_from_csv(filename_queue):
    reader = tf.TextLineReader(skip_header_lines=True) # skipt the header line
    _, csv_row = reader.read(filename_queue) # read one line
    data = tf.decode_csv(csv_row, record_defaults=rDefaults) # use defaults for this line (in case of missing data)
    studentLbl = tf.slice(data, [0], [S]) # first cell is my 'date-label' for internal pruposes
    features = tf.string_to_number(tf.slice(data, [S], [F]), tf.float32) # cells 2-480 is the list of features
    label = tf.string_to_number(tf.slice(data, [S+F], [L]), tf.float32) # the remainin 3 cells is the list for one-hot label
    return studentLbl, features, label

# function that packs each read line into batches of specified size
def input_pipeline(fName, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
        [fName],
        num_epochs=num_epochs,
        shuffle=True)  # this refers to multiple files, not line items within files
    dateLbl, features, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000 # min of where to start loading into memory
    capacity = min_after_dequeue + 3 * batch_size # max of how much to load into memory
    # this packs the above lines into a batch of size you specify:
    dateLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
        [dateLbl, features, label],
        batch_size=batch_size,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue)
    return dateLbl_batch, feature_batch, label_batch

# these are the student label, features, and label:
studentLbl, features, labels = input_pipeline(fileName, batch_size, try_epochs)

with tf.Session() as sess:

    gInit = tf.global_variables_initializer().run()
    lInit = tf.local_variables_initializer().run()

    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(coord=coord)

    try:
        while not coord.should_stop():
            # load student-label, features, and label as a batch:
            studentLbl_batch, feature_batch, label_batch = sess.run([studentLbl, features, labels])

            print(studentLbl_batch);
            print(feature_batch);
            print(label_batch);
            print('----------');

    except tf.errors.OutOfRangeError:
        print("Done looping through the file")

    finally:
        coord.request_stop()

    coord.join(threads)

假设您的 CSV 文件如下所示：

name    studytime   attendance  A   B   C
S1  2   1   0   1   0
S2  3   2   1   0   0
S3  4   3   0   0   1
S4  3   5   0   0   1
S5  4   4   0   1   0
S6  2   1   1   0   0

上面的代码应该打印以下输出：

[[b'S5']
 [b'S6']
 [b'S3']]
[[ 4.  4.]
 [ 2.  1.]
 [ 4.  3.]]
[[ 0.  1.  0.]
 [ 1.  0.  0.]
 [ 0.  0.  1.]]
----------
[[b'S2']
 [b'S1']
 [b'S4']]
[[ 3.  2.]
 [ 2.  1.]
 [ 3.  5.]]
[[ 1.  0.  0.]
 [ 0.  1.  0.]
 [ 0.  0.  1.]]
----------
Done looping through the file

因此，不要打印批次的内容，只需将它们用作 X 和 Y 来进行培训feed_dict

score 0 · Accepted Answer

这是我的尝试。但是准确率并没有我想象的那么高。

import tensorflow as tf

fileName = 'hw1.csv'

try_epochs = 1
batch_size = 8

S = 1 # this is your Student label
F = 2 # this is the list of your features
L = 3 # this is one-hot vector of 3 representing the label

# set defaults to something (TF requires defaults for the number of cells you are going to read)
rDefaults = [['a'] for row in range((S+F+L))]

# function that reads the input file, line-by-line
def read_from_csv(filename_queue):
     reader = tf.TextLineReader(skip_header_lines=True) # skipt the header line
     _, csv_row = reader.read(filename_queue) # read one line
     data = tf.decode_csv(csv_row, record_defaults=rDefaults) # use defaults for this line (in case of missing data)
     studentLbl = tf.slice(data, [0], [S]) # first cell is my 'date-label' for internal pruposes
     features = tf.string_to_number(tf.slice(data, [S], [F]), tf.float32) # cells 2-480 is the list of features
     label = tf.string_to_number(tf.slice(data, [S+F], [L]), tf.float32) # the remainin 3 cells is the list for one-hot label
     return studentLbl, features, label

# function that packs each read line into batches of specified size
def input_pipeline(fName, batch_size, num_epochs=None):
    filename_queue = tf.train.string_input_producer(
       [fName],
       num_epochs=num_epochs,
       shuffle=True)  # this refers to multiple files, not line items within files
    dateLbl, features, label = read_from_csv(filename_queue)
    min_after_dequeue = 10000 # min of where to start loading into memory
    capacity = min_after_dequeue + 3 * batch_size # max of how much to load into memory
    # this packs the above lines into a batch of size you specify:
    dateLbl_batch, feature_batch, label_batch = tf.train.shuffle_batch(
       [dateLbl, features, label],
       batch_size=batch_size,
       capacity=capacity,
       min_after_dequeue=min_after_dequeue)
    return dateLbl_batch, feature_batch, label_batch

# these are the student label, features, and label:
studentLbl, features, labels = input_pipeline(fileName, batch_size, 
 try_epochs)

x = tf.placeholder(tf.float32, [None, 2])

W = tf.Variable(tf.zeros([2, 3]))

b = tf.Variable(tf.zeros([3]))

y = tf.nn.softmax(tf.matmul(x, W) + b)

y_ = tf.placeholder(tf.float32, [None, 3])

cross_entropy = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y_,logits=y))

train_step = tf.train.GradientDescentOptimizer(0.1).minimize(cross_entropy)


sess = tf.InteractiveSession()

tf.global_variables_initializer().run()


with tf.Session() as sess:

   gInit = tf.global_variables_initializer().run()
   lInit = tf.local_variables_initializer().run()

   coord = tf.train.Coordinator()
   threads = tf.train.start_queue_runners(coord=coord)

   try:
      while not coord.should_stop():
        # load student-label, features, and label as a batch:
        studentLbl_batch, feature_batch, label_batch = sess.run([studentLbl, features, labels])

        print(studentLbl_batch);
        print(feature_batch);
        print(label_batch);
        print('----------');
        batch_xs = feature_batch
        batch_ys = label_batch
        sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})  # feeding data

  except tf.errors.OutOfRangeError:
     print("Done looping through the file")

  finally:
     coord.request_stop()

  coord.join(threads)


  correct_prediction = tf.equal(tf.argmax(y, 1), tf.argmax(y_, 1))

  accuracy = tf.reduce_mean(tf.cast(correct_prediction, tf.float32))

  print(sess.run(accuracy, feed_dict={x: feature_batch, y_: label_batch}))

  print(sess.run(W))
  print(sess.run(b))

准确度

  0.375

W,b

    [[ 0.00555556  0.00972222 -0.01527778] [ 0.00555556  0.01388889 -0.01944444]]
    [-0.00277778  0.00138889  0.00138889]

tensorflow - 如何在 Tensorflow 中通过 Softmax 回归读取 csv 文件并训练数据

2 回答 2

Related

Reference