What should you do?

You are developing an ML model using a dataset with categorical input variables. You have randomly split half of the data into training and test sets. After applying one-hot encoding on the categorical variables in the training set, you discover that one categorical variable is missing from the test set.

What should you do?
A. Randomly redistribute the data, with 70% for the training set and 30% for the test set
B. Use sparse representation in the test set
C. Apply one-hot encoding on the categorical variables in the test data.
D. Collect more data representing all categories

Answer: C

Explanation:

Applying one-hot encoding to the test data keeps it in the same format as the training data. Because the model was trained on one-hot encoded inputs, the test set must go through the same encoding, using the categories learned from the training set, so that its columns match the ones the model expects. A category that never appears in the test set then simply becomes a column of zeros.

References:
https://machinelearningmastery.com/how-to-one-hot-encode-sequence-data-in-python/
https://machinelearningmastery.com/how-to-use-one-hot-encoding-for-categorical-data/

When working with categorical input variables, the same preprocessing steps must be applied to both the training and test sets. One-hot encoding is a common way to convert a categorical variable into binary indicator columns, one per category, which can then be used as inputs to machine learning models. Applying the encoding learned from the training set to the test set gives the test data the same columns, in the same order, as the training data, so the model can be evaluated on inputs in the format it was trained on.
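As a rough illustration of the idea, here is a minimal sketch in plain Python. The helper names `fit_categories` and `one_hot` and the toy `color` data are hypothetical and not part of the original question; in practice a library encoder (for example scikit-learn's `OneHotEncoder`) would usually play this role. The point it shows is that the categories are learned from the training set only and then reused for the test set.

```python
def fit_categories(rows, columns):
    """Learn the sorted category values of each column from the training rows."""
    return {col: sorted({row[col] for row in rows}) for col in columns}


def one_hot(rows, categories):
    """Encode rows with the categories learned from the training data.

    Every row gets one 0/1 entry per learned category, so the training and
    test encodings always have the same width and column order.
    """
    encoded = []
    for row in rows:
        vector = []
        for col, values in categories.items():
            vector.extend(1 if row[col] == v else 0 for v in values)
        encoded.append(vector)
    return encoded


# Hypothetical toy data: "green" appears in training but not in the test set.
train = [{"color": "red"}, {"color": "green"}, {"color": "blue"}]
test = [{"color": "red"}, {"color": "blue"}]

categories = fit_categories(train, ["color"])  # learned from training only
train_encoded = one_hot(train, categories)
test_encoded = one_hot(test, categories)       # same columns as training

print(categories)     # {'color': ['blue', 'green', 'red']}
print(test_encoded)   # [[0, 0, 1], [1, 0, 0]]
```

The "green" column is still present in the encoded test set; it is just zero in every row, which is what keeps the test matrix aligned with the training matrix.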
