
I have data on where and when a cab customer entered the vehicle. Now I want to predict which street they want to be driven to. My dataset looks like this:

Example

Day, Hour, Minute, Entrance, Destination (Label)

Monday, 10, 45, ExampleStreet, StackOverflowCorner (not preprocessed)

0, 10, 45, 0, 1 (preprocessed)

I preprocessed my dataset like this:

Day -> Number from 0-6 (0 = Monday, 1 = Tuesday, ...)

Hour -> 24-hour format (0-23)

Minute -> No preprocessing

Entrance -> Encoded with LabelEncoder (0 = ExampleStreet, 1 = ExampleCorner, ...)

Destination -> Same as Entrance, with LabelEncoder
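
For readers who want to reproduce the preprocessing described above, here is a minimal plain-Python sketch. It is a stand-in for LabelEncoder, not the asker's actual code: the helper `fit_label_map`, the second row, and the street name `StationRoad` are hypothetical and exist only for illustration. Ids are assigned in sorted order, which is also how scikit-learn's LabelEncoder orders its classes.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

def fit_label_map(values):
    """Map each distinct value to an integer id, assigned in sorted order."""
    return {value: idx for idx, value in enumerate(sorted(set(values)))}

raw_rows = [
    ("Monday", 10, 45, "ExampleStreet", "StackOverflowCorner"),  # row from the question
    ("Tuesday", 18, 5, "StationRoad", "ExampleStreet"),          # hypothetical extra row
]

# Entrance and Destination get separate, independent id spaces.
entrance_ids = fit_label_map(row[3] for row in raw_rows)
destination_ids = fit_label_map(row[4] for row in raw_rows)

encoded = [
    (DAYS.index(day), hour, minute, entrance_ids[ent], destination_ids[dest])
    for day, hour, minute, ent, dest in raw_rows
]
print(encoded[0])  # (0, 10, 45, 0, 1), the preprocessed row shown in the question
```

Note that the integers produced this way are arbitrary identifiers: an entrance with id 1 is not "larger" than one with id 0 in any meaningful sense.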

I have 98 possible destinations, the same number of entrances, and around 700 samples. I already tried TensorFlow, but I only get a validation accuracy near 0.

    import tensorflow as tf
    from tensorflow import keras

    # Two hidden layers with dropout, then one softmax output per destination.
    model = keras.Sequential([
        keras.layers.Dense(100, activation='relu'),
        keras.layers.BatchNormalization(),
        keras.layers.Dropout(0.4),
        keras.layers.Dense(100, activation='relu'),
        keras.layers.Dropout(0.3),
        keras.layers.Dense(98, activation='softmax'),
    ])
    optimizer = keras.optimizers.RMSprop()
    # The sparse loss expects integer class ids as labels, not one-hot vectors.
    model.compile(optimizer=optimizer,
                  loss=keras.losses.sparse_categorical_crossentropy,
                  metrics=['accuracy'])

Questions

Did I preprocess my data correctly? Do I need one-hot encoding, or do I need to gather more samples? Would another algorithm maybe be more effective (a tree?)?

Thanks in advance...


2 Answers


You need one-hot encoding for Entrance and Day, and potentially for Hour.

You also need more samples: the number of samples should be at least on the order of the number of variables in your model. But try one-hot encoding first and see.
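
As a rough illustration of what one-hot encoding does to the preprocessed row from the question, here is a minimal plain-Python sketch. The helper names `one_hot` and `encode_row` are hypothetical, and the sizes (7 days, 98 entrances) are taken from the question. In practice `pandas.get_dummies` or scikit-learn's `OneHotEncoder` would normally do this job.

```python
N_DAYS = 7        # Monday=0 ... Sunday=6, as in the question
N_ENTRANCES = 98  # number of distinct entrances reported in the question

def one_hot(index, size):
    """Return a list of `size` zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError(f"index {index} out of range for size {size}")
    vec = [0] * size
    vec[index] = 1
    return vec

def encode_row(day, hour, minute, entrance):
    """Feature vector: one-hot day + one-hot entrance + raw hour and minute."""
    return one_hot(day, N_DAYS) + one_hot(entrance, N_ENTRANCES) + [hour, minute]

# Preprocessed example row from the question: 0, 10, 45, 0 (label 1 kept separate).
features = encode_row(day=0, hour=10, minute=45, entrance=0)
label = 1  # remains an integer class id

print(len(features))  # 107 = 7 + 98 + 2
```

The destination label can stay an integer here, because `sparse_categorical_crossentropy` expects integer class ids rather than one-hot vectors.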

Answered 2021-01-15T09:31:28.747

At a minimum you should one-hot encode Entrance and Destination. Label encoding assigns integers to these features, which the model will interpret as ordinal values with a numerical relationship, and clearly there is no ordering among entrances or destinations. I would leave Day encoded as you have it, because the days of the week do have a sequential order; the same goes for Hour and Minute. I doubt Minute is of much use as a feature, so you may want to drop it. With 98 classes and only 700 samples, it is doubtful that your model will reach very high accuracy.
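
To put the sample-size concern in numbers, here is a small back-of-the-envelope sketch using the figures from the question. It assumes perfectly balanced classes, and the 20% validation split is a hypothetical choice, since the question does not state one.

```python
n_samples = 700   # from the question
n_classes = 98    # from the question
val_fraction = 0.2  # assumption: the question does not state the split

# Average examples per destination if the classes were perfectly balanced.
samples_per_class = n_samples / n_classes

# Accuracy expected from guessing a destination uniformly at random.
chance_accuracy = 1 / n_classes

# Examples per destination left for training after the hold-out split.
n_train = round(n_samples * (1 - val_fraction))
train_per_class = n_train / n_classes

print(f"{samples_per_class:.1f} samples per class overall")   # 7.1
print(f"{train_per_class:.1f} samples per class for training")  # 5.7
print(f"{chance_accuracy:.2%} accuracy from uniform guessing")  # 1.02%
```

With only a handful of examples per destination, a validation accuracy close to the roughly 1% chance level is not surprising.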

Answered 2021-01-15T14:50:27.583