CS 2120: Topic 11
=================

.. image:: ../img/lim_robot.jpg
   :width: 850px

Videos for this week:
^^^^^^^^^^^^^^^^^^^^^

Getting started with Machine Learning (ML)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Learning to play Super Mario Bros:

Learning to scan for tumours:

Learning to understand speech:

.. admonition:: What is ML?
   :class: Note

   The field of study pertaining to **algorithms** which *learn* and *improve* through "experience".

* Today we'll be using a *toy dataset* (i.e. a small dataset for showing an example) in order to go over (at a surface level) **two types** of ML:

  1. ``supervised learning``
  2. ``unsupervised learning``

* We'll consider one example of each (i.e. Support Vector Machines and k-means clustering, respectively).
* We'll use ``scikit-learn`` to implement these examples.

"scikit-learn"
^^^^^^^^^^^^^^

* Has many, many ML libraries: `click here if you want to take a look `_

Supervised Learning
^^^^^^^^^^^^^^^^^^^

* I show you photos of ducks and cats, and tell you which are ducks and which are cats. Then I show you new photos (of different ducks and cats) and ask you to tell me whether each one is a duck or a cat.

**Here's a "duck"**:

.. image:: ../img/duck1.jpg
   :width: 300px

**Here's a "cat"**:

.. image:: ../img/cat1.jpg
   :width: 450px

**What's this?**

.. image:: ../img/duck2.jpg
   :width: 300px

* One of the main methods for supervised learning is "**classification**": putting data into various ``classes`` based on their ``attributes``. This is what we'll focus on here.
* For classification, when you ``train`` your system, you provide both the ``input data`` and the ``output data`` ("class labels"); when you ``test`` your system, you provide **only the input data** and check whether your system correctly predicts the outputs.
* The goal: you want your system to ``generalize`` from the training data so that it can ``classify`` new (i.e. test) inputs.
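The train/test idea above can be sketched in a few lines of Python. The attribute values and labels below are made up for illustration (they are not the class's toy dataset), and the use of scikit-learn's ``train_test_split`` helper is an assumption about one convenient way to hold data back for testing:

```python
# A sketch of the train/test split, with made-up samples.
# Each row of `inputs` is one sample (its attributes); each entry
# of `labels` is that sample's class (Class 1 or Class 2).
from sklearn.model_selection import train_test_split

inputs = [[5.1, 3.5, 1.4],   # Attr1, Attr2, Attr3 for sample 1
          [4.9, 3.0, 1.3],
          [6.3, 3.3, 4.7],
          [6.5, 2.8, 4.6],
          [5.0, 3.4, 1.5],
          [5.9, 3.2, 4.8]]
labels = [1, 1, 2, 2, 1, 2]

# Use most of the data for training and hold some back for testing.
train_in, test_in, train_out, test_out = train_test_split(
    inputs, labels, test_size=0.33, random_state=0)

# During training the machine sees BOTH train_in and train_out.
# At test time it sees only test_in; we keep test_out to ourselves
# and compare it against the machine's predictions.
```

Note that ``random_state`` just makes the (otherwise random) split reproducible.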
.. admonition:: Let's review some terminology
   :class: Note

   Consider this sample data (from our toy dataset).

   .. image:: ../img/sample_data.png

   This data has ``3 Attributes`` (Attr1, Attr2, Attr3) and ``2 Class Labels`` (Class 1 or Class 2).

   One ``sample`` is essentially one **set of attributes**. If it is labelled, it includes the class; if it is unlabelled, it does not. Here is one (labelled) sample:

   .. image:: ../img/sample.png

   If we trained our machine in a ``supervised`` way, we would provide it with each of these ``labelled samples``. Note that in general we use **most** of our dataset for training.

   To feed in our inputs and labels, we could choose to have two arrays (or dataframes, or ...) --- call them ``input`` and ``output``. Here is what ``input`` could look like:

   .. image:: ../img/input.png

   Here is what ``output`` could look like:

   .. image:: ../img/output.png

   If we then ``tested`` our machine, we could provide it with new (unlabelled) inputs and test whether or not the machine **correctly predicts** the labels.

   .. image:: ../img/test_data.png

   Since we know the corresponding labels (but our machine doesn't), we can check how many correct predictions there were for each class, and represent the results using a ``confusion matrix``. Below is an example:

   .. image:: ../img/conf.png

   Note that in the confusion matrix above there are 5 classes. The entry at Row 3, Col 1 (i.e. '2') --- for example --- indicates that the machine ``predicted class 1`` when **the correct answer was class 3** a total of 2 times.

Supervised learning: Support Vector Machines (SVM)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The "basic" idea (in 2D, 2 classes):

* draw a *hyperplane* (in 2D, this is a line) which ``best separates the data from each class``
* the best *hyperplane* will maximize the distance between the hyperplane and the "support vectors" (see below)

Graphically (in 2D, 2 classes):
.. image:: ../img/svm.png

.. admonition:: Looking at this graph...
   :class: Note

   * We see 2 classes of data (grey dots, black x's).
   * The **support vectors** are the 3 red arrows (**v1, v2, v3**) --- these correspond to data points which are at the boundary of the **margin**; for SVMs we want to maximize this margin.
   * Side note: a *soft margin* allows for some mis-classifications in order to improve classification accuracy.

Note that we want to be able to classify data with more than 2 attributes, so our vectors will be higher-dimensional.

* This is a lot harder to visualize, but the idea is exactly the same: *partition* the space by separating the classes via hyperplanes.

This idea leads to the *Linear Support Vector Machine*.

Do we have to code this from scratch?

* No --- with ``sklearn`` we can create and use one in 3 lines of code:

>>> from sklearn import svm
>>> svc = svm.SVC(kernel='linear')
>>> svc.fit(input, output)

Unsupervised learning
^^^^^^^^^^^^^^^^^^^^^

* i.e. the "duck-cat" example again, but ``without provided labels``.
* One of the main methods for unsupervised learning is "clustering". This is what we'll focus on.
* For clustering, given a bunch of *unlabelled* data (or ignoring any labels you have), you want to see if any of ``these items`` look like any of ``those other items``. You want a program that will divide your dataset into *clusters*, where all of the data items in the same cluster are similar to each other in some way.

Unsupervised learning: k-means clustering
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. image:: ../img/kmeans.png
   :width: 700px

* The idea:

  * Plot all of our datapoints.
  * Guess the number of clusters we're looking for, e.g. 3 (using the image above).
  * Randomly place 3 "means" on the plane.
  * Repeat the following until convergence:

    * Associate each data point with the nearest "mean".
    * Compute the centroid of all of the points attached to each "mean".
    * Move the "mean" to this centroid.
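The steps above can be sketched from scratch in a few lines of Python. This is only a bare-bones illustration of the loop (fixed iteration count instead of a proper convergence check, no handling of unlucky initializations); the real ``cluster.KMeans`` is far more robust:

```python
# A minimal from-scratch sketch of the k-means steps listed above.
import random

def kmeans(points, k, iterations=100):
    """Cluster a list of coordinate tuples into k groups."""
    # Randomly place the k "means" (here: on k of the data points).
    means = random.sample(points, k)
    for _ in range(iterations):
        # Associate each data point with its nearest mean.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2
                                            for a, b in zip(p, means[i])))
            clusters[nearest].append(p)
        # Move each mean to the centroid of its associated points.
        for i, c in enumerate(clusters):
            if c:  # leave a mean in place if no points were assigned to it
                means[i] = tuple(sum(coord) / len(c) for coord in zip(*c))
    return means
```

For example, ``kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], 2)`` should end up with one mean near each of the two obvious clumps of points.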
* Let's try it for our dataset (note that we ignore the ``labels``!):

>>> from sklearn import cluster
>>> k_means = cluster.KMeans(n_clusters=2)
>>> k_means.fit(data)

For next class
^^^^^^^^^^^^^^

* If you want to get deeper into ML, Andrew Ng offers a `ML course on Coursera `_.
* Work on Assignment 4.
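To tie the supervised half of this topic together, here is an end-to-end sketch: train a linear SVM, predict on held-out inputs, and summarize the results in a confusion matrix. The data values and variable names are invented for illustration; the ``confusion_matrix`` helper from ``sklearn.metrics`` is one standard way to build the matrix shown earlier:

```python
# End-to-end sketch: train, test, and score a linear SVM
# on a tiny made-up 2-attribute dataset.
from sklearn import svm
from sklearn.metrics import confusion_matrix

# Training data: inputs (attributes) plus outputs (class labels).
train_input = [[1, 1], [2, 1], [1, 2], [8, 8], [9, 8], [8, 9]]
train_output = [1, 1, 1, 2, 2, 2]

svc = svm.SVC(kernel='linear')
svc.fit(train_input, train_output)

# Test data: the machine sees ONLY the inputs.
test_input = [[0, 1], [9, 9]]
predictions = svc.predict(test_input)

# We (but not the machine) know the true labels, so we can score it.
true_labels = [1, 2]
print(confusion_matrix(true_labels, predictions))
```

Since the two classes here are cleanly separated, both test samples should be classified correctly, giving a confusion matrix with all of its counts on the diagonal.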