I have a data file that I read with:
dataset = tf.data.TextLineDataset('BigDatabase/dihedraldescriptors_large.txt').map(parse_func)
I would like to split this dataset into train, validation, and test sets.
Each line of the data file contains a feature vector and the corresponding label, separated by a colon. I process the text with parse_func:
def parse_func(x):
    # Split the line on ':' into the feature part and the label part
    vals = tf.strings.split([x], ':').values
    # Features: whitespace-separated numbers before the ':'
    x = tf.strings.split([vals[0]]).values
    # Label: the first whitespace-separated token after the ':'
    y = tf.strings.split([vals[1]]).values[:1]
    x = tf.strings.to_number(x)
    y = tf.strings.to_number(y)
    # Map the float label to its integer class index
    z = tf.map_fn(fn=maplabels, elems=y, fn_output_signature=tf.int64)[0]
    return x, z
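For illustration, this is roughly how the string handling above treats a single dummy line (the numbers and the exact layout below are just placeholders; the real lines follow the same "features : label" pattern):

import tensorflow as tf

line = tf.constant("0.12 3.45 -0.67:1.0")
vals = tf.strings.split([line], ':').values
print(tf.strings.to_number(tf.strings.split([vals[0]]).values))      # features, e.g. [ 0.12  3.45 -0.67]
print(tf.strings.to_number(tf.strings.split([vals[1]]).values[:1]))  # raw float label, e.g. [1.]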
Finally, I have another function that maps the labels (which are float values) to integers:
def maplabels(x):
    # dp1_unique_values holds the distinct label values;
    # return the position of x in that list
    keys = dp1_unique_values
    return tf.where(tf.equal(x, keys))[[0]]
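For example, with a made-up dp1_unique_values (the real one is built from the distinct labels in the data), maplabels returns the index of the label in that list:

import tensorflow as tf

dp1_unique_values = tf.constant([0.0, 0.5, 1.0])  # placeholder label values
print(maplabels(tf.constant(0.5)))                # -> [1], the index of 0.5 in the keys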
I am trying to grasp the concepts so far. dataset is an iterable Dataset object, and my map function returns a tuple: x is the feature vector and z is the label. I know roughly how many feature vectors the data file contains, about 900,000 lines. I could read all the data into train, test, and validation sets, say 600,000, 200,000, and 100,000 lines respectively, but this might affect performance, and I am also planning to add more data to the set in the future.

What would be the correct strategy to build and feed the model here? How do I split the data while streaming it to the model?
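What I have in mind so far is something along these lines: shuffle once with a fixed seed and then carve out the splits with take/skip, using the rough 600k/200k/100k sizes above (just a sketch; the buffer size and batch size are arbitrary):

import tensorflow as tf

dataset = tf.data.TextLineDataset('BigDatabase/dihedraldescriptors_large.txt').map(parse_func)

# Shuffle once with a fixed seed so the three splits stay disjoint
dataset = dataset.shuffle(buffer_size=10_000, seed=42, reshuffle_each_iteration=False)

train_size, test_size = 600_000, 200_000          # rough sizes mentioned above
train_ds = dataset.take(train_size)
test_ds  = dataset.skip(train_size).take(test_size)
val_ds   = dataset.skip(train_size + test_size)   # remaining ~100,000 lines

train_ds = train_ds.batch(64).prefetch(tf.data.AUTOTUNE)
val_ds   = val_ds.batch(64).prefetch(tf.data.AUTOTUNE)

The reshuffle_each_iteration=False part is there because, as far as I understand, reshuffling every epoch would mix examples between the take/skip splits. Is this the right pattern for a file that keeps growing, or is there a better way?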