COMPSCI 2120/9642/DIGIHUM 2220 1.0 documentation

Assignment 4: A Taste of Machine Learning

_images/ml.jpg

The idea

Machine learning is a big topic, but in general it can refer to the idea of “teaching something” to a machine and “testing” whether the machine has “learned” that thing.

A simple example: teaching a machine to classify something into 2 groups…

_images/binary_classifier.png

A more complex example: teaching a self-driving car to “self-drive”…

Another complex example: teaching an MRI machine to model blood flow in the heart (and identify anomalies)…

Another complex example: teaching a computer to beat a human in Go…

For this assignment I am hoping to give you a taste of machine learning in Python. By “taste” I mean that the assignment will focus on implementation rather than deep understanding. If we had more time, I’d love to do a deep dive into some of this stuff, but we don’t, and in an intro course we can only skim these topics. As with assignment 3, you are free to pick a dataset which you find interesting and to pick your own machine learning approaches. You will not need to use any of my code.

Anyways… here is the idea. You will need to complete the following:

  1. Pick a suitable dataset from a repository (UCI, Princeton, or another which you’ll need to cite [maybe from your own research if you have the necessary permission])

Suitable??

What I mean is: pick a dataset which has enough samples to train a classifier, and which can be used for classification (see below).

  2. Pick two machine learning approaches in order to classify the data (i.e. supervised approaches). Some examples include: Support Vector Machines, Neural Networks, k-Nearest Neighbours, Decision Trees, and Logistic Regression (among others).

Classify???

Classification means categorizing inputs into classes. Say you have a set of 6 inputs in a csv, like this:

_images/inputs_csv.png

Note that that’s a tiny number of samples, and when training your machine you’ll probably need (at least) hundreds of samples. Anyways, your machine, which you want to train, needs to classify or predict the class of these samples, based on their attributes (number_of_legs, run_speed, sound_it_makes).

Try to guess the output classes (answer right below):

_images/input_output.png
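To make the idea concrete, here is a minimal sketch of a classifier written from scratch: a k-nearest-neighbours vote over the three attributes above. Everything in it is an assumption made for illustration. The rows and class labels are invented (they are not the values from the images), and the names `TRAINING`, `distance`, and `predict` are hypothetical. In the assignment itself you would more likely rely on a library implementation.

```python
from collections import Counter

# Hypothetical training samples: (number_of_legs, run_speed, sound_it_makes) -> class.
# These rows are invented for illustration only.
TRAINING = [
    ((4, 50.0, "bark"), "dog"),
    ((4, 45.0, "bark"), "dog"),
    ((4, 48.0, "meow"), "cat"),
    ((4, 40.0, "meow"), "cat"),
    ((2, 15.0, "quack"), "duck"),
    ((2, 12.0, "quack"), "duck"),
]


def distance(a, b):
    """Distance between two samples: scaled numeric gaps plus a sound mismatch penalty."""
    legs_gap = abs(a[0] - b[0]) / 4.0      # assumes leg counts of roughly 0 to 4
    speed_gap = abs(a[1] - b[1]) / 50.0    # assumes speeds of roughly 0 to 50
    sound_gap = 0.0 if a[2] == b[2] else 1.0
    return legs_gap + speed_gap + sound_gap


def predict(sample, training=TRAINING, k=3):
    """Return the majority class among the k training samples closest to `sample`."""
    nearest = sorted(training, key=lambda row: distance(sample, row[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(predict((4, 47.0, "bark")))    # dog
    print(predict((2, 14.0, "quack")))   # duck
```

With only six samples this works because the made-up classes are easy to tell apart; real data is rarely that tidy, which is why you will want many more samples.
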
  3. Use pandas (or whatever else you’d like) to import your data and prepare it for your classifier. You should have a function which loads and prepares your data for your classifier. Call it prepData(…), where … refers to whichever arguments you use.

  4. Apply each of your classifiers to your prepared data and output your results with some sort of visualization (use as many visualizations as you need for your discussion [see below]). One example of a visualization you could use is a confusion matrix.

  5. Discuss your work. Describe the dataset (make sure to include a reference to it, or share it if it’s one of your own), the machine learning approaches you considered, and your results. Make sure to compare and contrast the results of both approaches, and talk about which one you think is “better” and why (e.g. does one have higher accuracy for a smaller sample size but lower accuracy for another?). Note that this discussion will be unique to the specific machine learning approaches you considered.
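The loading, classifying, and confusion matrix steps above could look roughly like the sketch below. It is one possible shape, not a required template. It assumes pandas and scikit-learn (scikit-learn is a choice made here, not something the assignment mandates), the arguments to `prepData` are guesses at what you might need, and `make_toy_csv` and `evaluate` are hypothetical helpers. The toy data is made up so that the sketch runs without downloading anything.

```python
import io

import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier


def make_toy_csv():
    """Build a small made-up CSV in memory (hypothetical data, illustration only)."""
    lines = ["number_of_legs,run_speed,sound_it_makes,animal"]
    for i in range(20):
        lines.append(f"4,{45 + i % 5},bark,dog")
        lines.append(f"4,{40 + i % 5},meow,cat")
        lines.append(f"2,{10 + i % 5},quack,duck")
    return io.StringIO("\n".join(lines))


def prepData(source, label_column, test_size=0.25, seed=0):
    """Load a CSV and return X_train, X_test, y_train, y_test.

    `source` is a path or file-like object and `label_column` names the class
    column. These arguments and their defaults are assumptions of this sketch.
    """
    df = pd.read_csv(source).dropna()    # drop rows with missing values
    labels = df[label_column]
    # One-hot encode text columns so every feature is numeric.
    features = pd.get_dummies(df.drop(columns=[label_column]), dtype=float)
    return train_test_split(
        features, labels, test_size=test_size, random_state=seed, stratify=labels
    )


def evaluate(classifiers, X_train, X_test, y_train, y_test):
    """Fit each classifier and return {name: (accuracy, confusion matrix)}."""
    class_names = sorted(set(y_train) | set(y_test))
    results = {}
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        predicted = clf.predict(X_test)
        results[name] = (
            accuracy_score(y_test, predicted),
            confusion_matrix(y_test, predicted, labels=class_names),
        )
    return results


if __name__ == "__main__":
    X_train, X_test, y_train, y_test = prepData(make_toy_csv(), "animal")
    classifiers = {
        "k-Nearest Neighbours": KNeighborsClassifier(n_neighbors=5),
        "Decision Tree": DecisionTreeClassifier(random_state=0),
    }
    results = evaluate(classifiers, X_train, X_test, y_train, y_test)
    for name, (accuracy, matrix) in results.items():
        print(f"{name}: accuracy = {accuracy:.2f}")
        print(matrix)    # rows are true classes, columns are predicted classes
```

Holding out a test set matters here: accuracy measured on the same samples the classifier was trained on tells you very little. If you want a plot rather than printed numbers, scikit-learn’s `ConfusionMatrixDisplay` can draw a confusion matrix with matplotlib.
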

If you want more freedom

Note that if you are interested in machine learning but not the restrictions imposed by this assignment, you can send me an email (jmorra6@uwo.ca) or request a Zoom with your proposed idea. Note that your idea should be at least as difficult as the assignment and should be under the umbrella of “machine learning”.

Maybe you want to create a medical image segmenter with U-Net, or code a Super-Mario-playing AI, or … Again, as long as it’s at least as difficult as the proposed assignment you can feel free to reach out to me with your idea and you can (potentially) do it in lieu of this assignment.

How you’ll be marked

a4_marking_scheme

What you’ll submit (in a zipped folder)

  • a4.py, containing all of your code

  • a4.pdf, containing parts 4 and 5 above (i.e. your visualizations and discussion).