In this class activity, you will learn to tune a model. You may work alone or with your group.
In the machine-learning/data/sms folder in the class Dropbox, you will find six files; three are scaled (meaning the feature values have been adjusted so that most fall between 0 and 1) and the other three contain the original values. All of them list features extracted from SMS text messages, some of which have been labeled as spam and the rest as ham.
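The usual way to squeeze feature values into the 0-to-1 range is min-max scaling. A minimal sketch (the function name and sample values are my own, for illustration):

```python
def min_max_scale(column):
    """Scale a list of numeric feature values to the [0, 1] range."""
    lo, hi = min(column), max(column)
    if hi == lo:                      # constant column: avoid divide-by-zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

print(min_max_scale([2, 4, 6, 10]))  # → [0.0, 0.25, 0.5, 1.0]
```

Note that scaling matters for kNN precisely because distance calculations let large-valued features dominate small-valued ones.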
Original, unscaled features: train.csv, dev.csv, test.csv. Scaled features: train_scaled.csv, dev_scaled.csv, test_scaled.csv.
As I outlined in one of my earlier videos, the development set (dev.csv and dev_scaled.csv) is used to tune a model. Every model has parameters that can be adjusted to increase its predictive power (some of these are feature-based and therefore separate from the machine-learning model itself, e.g., scaling the features). Here are the ones I want you to consider:
- feature type (scaled, unscaled)
- number of neighbors (k)
- distance metric (euclidean, manhattan)
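For reference, the two distance metrics differ only in how they combine per-feature differences; a quick sketch:

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

a, b = [0.0, 0.0], [3.0, 4.0]
print(euclidean(a, b))  # → 5.0
print(manhattan(a, b))  # → 7.0
```

Because Manhattan distance never squares the differences, it is less sensitive to a single feature with a large difference, which is one reason the two metrics can rank neighbors differently.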
Using your kNN implementation, or mine (see the gi14-solutions folder in the class Dropbox), use train.csv (or train_scaled.csv) as the "training set" and dev.csv (or dev_scaled.csv) as the "testing set". Try the scaled features as well as the unscaled; try different values of k on each, and try both distance measures. If you choose 5 different values of k, there are 5x2x2 = 20 different parameter combinations if you want to be exhaustive, so either choose fewer values of k or don't be exhaustive in your search for the best model. I suggest redirecting the output of your kNN program to files with descriptive names; e.g., "dev_scaled_knn_k4_euclidean.csv" would mean you used scaled features, set k=4, and used Euclidean distance. Try at least 10 different parameter combinations.
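One way to keep the search organized is to enumerate the combinations up front and loop over them. This is only a sketch: the run_knn helper in the commented lines is hypothetical and stands in for however your own kNN program is invoked.

```python
from itertools import product

# Pair each training file with its matching dev file so scaled and
# unscaled data are never mixed.
feature_sets = [("train.csv", "dev.csv"),                 # unscaled
                ("train_scaled.csv", "dev_scaled.csv")]   # scaled
ks = [1, 3, 5]                                            # sample k values
metrics = ["euclidean", "manhattan"]

combos = list(product(feature_sets, ks, metrics))
print(len(combos))  # → 12  (2 feature types x 3 ks x 2 metrics)

# Hypothetical driver loop -- replace run_knn with your own program:
# for (train_f, dev_f), k, metric in combos:
#     accuracy = run_knn(train_f, dev_f, k, metric)
#     print(train_f, k, metric, accuracy)
```

With 3 values of k this already satisfies the "at least 10 combinations" requirement.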
Keep in mind that if you use the scaled version of the training data, you must also use the scaled version of the development data.
Finally, once you arrive at a combination of parameters that outperforms your other attempts, you're ready to use those settings on test.csv (or test_scaled.csv). Run your kNN program with train.csv (or train_scaled.csv) as the "training set" and test.csv (or test_scaled.csv) as the "testing set". Then evaluate the results.
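Since the write-up asks for statistics on both classes, you'll want per-class precision, recall, and F1 rather than a single accuracy number. A minimal sketch, assuming your program's output gives you parallel lists of gold and predicted labels:

```python
def per_class_stats(gold, pred, label):
    """Precision, recall, and F1 for one class (e.g., 'spam' or 'ham')."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1

# Toy example (made-up labels): one spam message was missed.
gold = ["spam", "ham", "spam", "ham", "spam"]
pred = ["spam", "ham", "ham", "ham", "spam"]
print(per_class_stats(gold, pred, "spam"))  # precision 1.0, recall 2/3, F1 0.8
```

Run it once with label="spam" and once with label="ham" to get both rows of your report.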
Upload a document (I don't care about the format, as long as it's easy to read) that includes the following information:
- list the parameter combinations you tried and evaluated before arriving at your final model
- for your final model, report the evaluation statistics for both spam and ham on:
  - the development set
  - the testing set