
I have a datafile that I read with:

dataset = tf.data.TextLineDataset('BigDatabase/dihedraldescriptors_large.txt').map(parse_func)

I would like to split the dataset into train, test, and validation sets.

Each line of the data file contains a feature vector and the corresponding label values. I process the text with parse_func:

def parse_func(x):
    # each line is "<feature values> : <label values>", so split on ':' first
    vals = tf.strings.split([x], ':').values
    # features: the whitespace-separated numbers left of ':'
    x = tf.strings.split([vals[0]]).values
    # labels: keep only the first value right of ':'
    y = tf.strings.split([vals[1]]).values[:1]
    x = tf.strings.to_number(x)
    y = tf.strings.to_number(y)
    # replace the float label by its integer index in dp1_unique_values
    z = tf.map_fn(fn=maplabels, elems=y, fn_output_signature=tf.int64)[0]
    return x, z
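
For reference, this is roughly what I expect one parsed element to look like; a quick way to check is to take a single element from the dataset:

for features, label in dataset.take(1):
    print(features)  # 1-D float32 tensor with the feature values of one line
    print(label)     # int64 index of that line's label in dp1_unique_values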

Finally, I have another function that maps the labels (which are float values) to integers.

def maplabels(x):
    # dp1_unique_values holds the distinct float label values;
    # the label becomes its index in that tensor
    keys = dp1_unique_values
    return tf.where(tf.equal(x, keys))[[0]]
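
Here dp1_unique_values is a tensor that already holds the distinct float label values. The values below are made up purely to illustrate what maplabels does:

dp1_unique_values = tf.constant([0.0, 0.5, 1.0, 1.5])  # made-up values, the real ones come from the data
# maplabels(tf.constant(1.0)) would then return [2], the index of 1.0 in dp1_unique_values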

I am trying to grasp the concept so far. "dataset" is an iterator object, and my map function returns a tuple: x is the feature vector and z is the label. I know exactly how many feature vectors are in the data file, roughly 900 000 lines. I could read all the data into train, test and validation sets, maybe 600 000, 200 000 and 100 000 respectively, but this might affect performance, and in the future I am planning to add more data to the set. What would be the correct strategy to build and feed the model here? How do I split the data while streaming it to the model?
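
To make the question concrete, this is the kind of take/skip split I have in mind for the counts above (a minimal sketch; the shuffle buffer size and batch size are just placeholders):

# fix the shuffle order so the take/skip splits stay disjoint across epochs
dataset = dataset.shuffle(10000, seed=42, reshuffle_each_iteration=False)

train_ds = dataset.take(600000)
test_ds = dataset.skip(600000).take(200000)
val_ds = dataset.skip(800000).take(100000)

train_ds = train_ds.batch(64).prefetch(tf.data.AUTOTUNE)

But I am not sure this scales well once more data is added, which is why I am asking about the right streaming strategy.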
