Class Activity 15: Model tuning

Due Apr 14, 2020 by 2pm
Points 1
Submitting a file upload
Available after Apr 13, 2020 at 2pm

In this class activity, you will learn to tune a model. You may work alone or with your group.

In the machine-learning/data/sms folder in the class Dropbox, you will find six files: three are scaled (meaning the feature values have been adjusted to mostly fall between 0 and 1), and the other three contain the original values. All of them list features extracted from SMS text messages, each of which has been labeled as either spam or ham.

Original, unscaled features:

train.csv
dev.csv
test.csv

Scaled features:

train_scaled.csv
dev_scaled.csv
test_scaled.csv
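
To make "scaled" concrete, here is a minimal sketch of min-max scaling, one common way to bring feature values into the 0 to 1 range. This is an assumption for illustration only: the scaled files may have been produced differently, and the function names and numbers below are made up. Because the ranges come from the training rows alone, a development or test value can land slightly outside 0 to 1, which would explain the word "mostly".

```python
def fit_min_max(rows):
    """Return per-column (min, max) pairs computed from the training rows."""
    columns = list(zip(*rows))
    return [(min(col), max(col)) for col in columns]


def apply_min_max(rows, ranges):
    """Scale each value with the given ranges; constant columns map to 0.0."""
    scaled = []
    for row in rows:
        scaled.append([
            (value - lo) / (hi - lo) if hi > lo else 0.0
            for value, (lo, hi) in zip(row, ranges)
        ])
    return scaled


# Made-up feature rows (not taken from the SMS data).
train_rows = [[10.0, 0.0], [50.0, 4.0], [90.0, 2.0]]
dev_rows = [[100.0, 1.0]]

ranges = fit_min_max(train_rows)              # learned from training data only
train_scaled = apply_min_max(train_rows, ranges)
dev_scaled = apply_min_max(dev_rows, ranges)  # first value exceeds 1.0

print(train_scaled)
print(dev_scaled)
```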

As I outlined in one of my earlier videos, the development set (dev.csv or dev_scaled.csv) is used to tune a model. Every model has parameters that can be adjusted to increase its predictive power (some of these are feature-based and therefore separate from the machine learning model itself, e.g., whether the features are scaled). Here are the ones I want you to consider:

feature type (scaled, unscaled)
k
distance metric (euclidean, manhattan)
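
For reference, the two distance metrics can be sketched as follows. The function names are illustrative and need not match those in your kNN implementation or the provided solution.

```python
import math


def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


# Two made-up feature vectors.
a = [0.0, 0.0]
b = [3.0, 4.0]

print(euclidean(a, b))  # 5.0
print(manhattan(a, b))  # 7.0
```

Both metrics add up differences across features, so a feature with a much larger range than the others can dominate the distance. That is one reason scaled and unscaled features may give different results.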

Using your kNN implementation, or mine (see the gi14-solutions folder in the class Dropbox), use train.csv (or train_scaled.csv) as the "training set" and dev.csv (or dev_scaled.csv) as the "testing set". Try the original (unscaled) features as well as the scaled ones; try different values of k on each, and try both distance metrics. If you choose 5 different values of k to try, there are 5x2x2 = 20 different combinations of parameters if you want to be exhaustive, so either choose fewer values of k or don't be exhaustive in your search for the best model. I suggest redirecting the output of your kNN program to files with descriptive names; e.g., "dev_scaled_knn_k4_euclidean.csv" would hold the output of a run that used scaled features, k=4, and Euclidean distance. Try at least 10 different parameter combinations.
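
One way to keep track of the combinations is to enumerate them and derive the file names up front. The sketch below is only a planning aid: the k values are hypothetical, the "unscaled" file-name pattern is an assumed extension of the example name above, and it does not run any kNN program.

```python
from itertools import product

feature_types = ["scaled", "unscaled"]
k_values = [1, 3, 5, 7, 9]            # hypothetical choices; pick your own
metrics = ["euclidean", "manhattan"]


def input_files(feature_type):
    """Return the matching (training, development) file names."""
    suffix = "_scaled" if feature_type == "scaled" else ""
    return f"train{suffix}.csv", f"dev{suffix}.csv"


def output_name(feature_type, k, metric):
    """Build a descriptive output name in the style suggested above."""
    return f"dev_{feature_type}_knn_k{k}_{metric}.csv"


combinations = list(product(feature_types, k_values, metrics))
print(len(combinations))  # 5 x 2 x 2 = 20

for feature_type, k, metric in combinations:
    train_file, dev_file = input_files(feature_type)
    print(train_file, dev_file, output_name(feature_type, k, metric))
```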

Keep in mind that if you use the scaled version of the training data, you must also use the scaled version of the development data (and likewise for the unscaled versions).

To evaluate your models, you may use your solutions to Guided Inquiry 15: Classification evaluation algorithms, or use my solutions (see the gi15-solutions folder in the class Dropbox).
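
As a reminder of what per-class evaluation looks like, here is a minimal sketch that assumes the statistics of interest are precision, recall, and F1. The statistics covered in Guided Inquiry 15 may differ, and the labels below are made up rather than taken from the SMS data.

```python
def class_stats(gold, predicted, label):
    """Compute precision, recall, and F1, treating `label` as the positive class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == label and p == label)
    fp = sum(1 for g, p in pairs if g != label and p == label)
    fn = sum(1 for g, p in pairs if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Made-up gold labels and predictions.
gold = ["spam", "ham", "ham", "spam", "ham"]
predicted = ["spam", "ham", "spam", "ham", "ham"]

spam_stats = class_stats(gold, predicted, "spam")
ham_stats = class_stats(gold, predicted, "ham")
print("spam", spam_stats)
print("ham", ham_stats)
```

Reporting both classes matters because a model can score well on ham while still missing much of the spam.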

Finally, once you arrive at a combination of parameters that outperforms your other attempts, you're ready to use those settings on test.csv (or test_scaled.csv). Run your kNN program using train.csv (or train_scaled.csv) as the "training set" and test.csv (or test_scaled.csv) as the "testing set". Then evaluate the results.
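
The selection step can be sketched as picking the combination with the best development-set score and then deriving the files for the final run. The scores below are placeholders, not real results, and using a single number per combination is an assumption; you may prefer to compare several statistics.

```python
# Placeholder development-set scores keyed by (feature type, k, metric).
# These numbers are invented for illustration only.
dev_scores = {
    ("scaled", 3, "euclidean"): 0.81,
    ("scaled", 5, "manhattan"): 0.86,
    ("unscaled", 5, "euclidean"): 0.74,
}

# Choose the combination with the highest development-set score.
best_combo = max(dev_scores, key=dev_scores.get)
feature_type, k, metric = best_combo

# The final run trains on the matching training file and tests on the test file.
suffix = "_scaled" if feature_type == "scaled" else ""
final_train = f"train{suffix}.csv"
final_test = f"test{suffix}.csv"

print(best_combo, final_train, final_test)
```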

Upload a document (any format is fine, as long as it's easy to read) that includes the following information:

list the parameter combinations you tried and evaluated before arriving at your final model
for your final model, report the evaluation statistics for both spam and ham on:
- the development set
- the testing set

Rubric

Title: Class Activity 15

Criteria (each rated Pass or Fail):

Your submitted document lists the parameter combinations you tried.
Your submitted document includes the evaluation statistics of your final model run over the development set.
Your submitted document includes the evaluation statistics of your final model run over the testing set.
