I have a data file that I read with:
dataset = tf.data.TextLineDataset('BigDatabase/dihedraldescriptors_large.txt').map(parse_func)
I would like to split this dataset into train, validation, and test sets.
Each line of the data file contains a feature vector and the corresponding label, separated by a colon. I process the text with parse_func:
def parse_func(x):
    # Split the line on ':' into the feature part and the label part
    vals = tf.strings.split([x], ':').values
    # Features: whitespace-separated numbers before the ':'
    x = tf.strings.split([vals[0]]).values
    # Label: the first whitespace-separated token after the ':'
    y = tf.strings.split([vals[1]]).values[:1]
    x = tf.strings.to_number(x)
    y = tf.strings.to_number(y)
    # Map the float label to its integer class index
    z = tf.map_fn(fn=maplabels, elems=y, fn_output_signature=tf.int64)[0]
    return x, z
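For illustration, this is roughly how the string handling above treats a single dummy line (the numbers and the exact layout below are just placeholders; the real lines follow the same "features : label" pattern):

import tensorflow as tf

line = tf.constant("0.12 3.45 -0.67:1.0")
vals = tf.strings.split([line], ':').values
print(tf.strings.to_number(tf.strings.split([vals[0]]).values))      # features, e.g. [ 0.12  3.45 -0.67]
print(tf.strings.to_number(tf.strings.split([vals[1]]).values[:1]))  # raw float label, e.g. [1.]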
Finally, I have another function that maps the labels (which are float values) to integers:
def maplabels(x):
    # dp1_unique_values holds the distinct label values;
    # return the position of x in that list
    keys = dp1_unique_values
    return tf.where(tf.equal(x, keys))[[0]]
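For example, with a made-up dp1_unique_values (the real one is built from the distinct labels in the data), maplabels returns the index of the label in that list:

import tensorflow as tf

dp1_unique_values = tf.constant([0.0, 0.5, 1.0])  # placeholder label values
print(maplabels(tf.constant(0.5)))                # -> [1], the index of 0.5 in the keys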
I am trying to grasp the concepts so far. dataset is an iterable Dataset object, and my map function returns a tuple: x is the feature vector and z is the label. I know roughly how many feature vectors the data file contains, about 900,000 lines. I could read all the data into train, test, and validation sets, say 600,000, 200,000, and 100,000 lines respectively, but this might affect performance, and I am also planning to add more data to the set in the future.

What would be the correct strategy to build and feed the model here? How do I split the data while streaming it to the model?
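What I have in mind so far is something along these lines: shuffle once with a fixed seed and then carve out the splits with take/skip, using the rough 600k/200k/100k sizes above (just a sketch; the buffer size and batch size are arbitrary):

import tensorflow as tf

dataset = tf.data.TextLineDataset('BigDatabase/dihedraldescriptors_large.txt').map(parse_func)

# Shuffle once with a fixed seed so the three splits stay disjoint
dataset = dataset.shuffle(buffer_size=10_000, seed=42, reshuffle_each_iteration=False)

train_size, test_size = 600_000, 200_000          # rough sizes mentioned above
train_ds = dataset.take(train_size)
test_ds  = dataset.skip(train_size).take(test_size)
val_ds   = dataset.skip(train_size + test_size)   # remaining ~100,000 lines

train_ds = train_ds.batch(64).prefetch(tf.data.AUTOTUNE)
val_ds   = val_ds.batch(64).prefetch(tf.data.AUTOTUNE)

The reshuffle_each_iteration=False part is there because, as far as I understand, reshuffling every epoch would mix examples between the take/skip splits. Is this the right pattern for a file that keeps growing, or is there a better way?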