Due: at the beginning of the lecture on Tuesday, March 8.
Problem 1 In this problem, you will use the WEKA system to analyze two data sets. You will apply three learning algorithms to each data set and compare their performance.
These instructions describe how to apply the learning algorithms to a data set. Please refer to the WEKA Explorer user guide for more information.
When you start up Weka, click on the Explorer button in the WEKA GUI Chooser. This opens a large panel with several tabs, and the Preprocess tab will already be selected.
Click on "Open file...", and then select the data file you want to use (an .arff file). The Explorer window should now show the number of instances, the number of attributes, and a list of attributes. Click on the different attributes in the list on the left and examine the corresponding table and bar plot on the right-hand side of the window. Make sure you understand what they mean.
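For reference, an .arff file is a plain-text file with a header that declares the relation and its attributes, followed by the data rows. A minimal, hypothetical example (the relation and attribute names here are illustrative, not one of the assignment's data sets):

```
@relation weather

@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}

@data
sunny, 85, no
overcast, 83, yes
rainy, 70, yes
```

Nominal attributes list their possible values in braces; numeric attributes are declared with the keyword numeric. The last attribute is conventionally the class to be predicted.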
Now click on the "Classify" tab of the Explorer window and examine the "Test options" panel. For this assignment we will use two test options: "Use training set" and "Percentage split".
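To see why the two test options can give different results, here is a toy illustration (not Weka itself) of the distinction: "Use training set" evaluates the classifier on the same data it was trained on, which tends to be optimistic, while "Percentage split" trains on a fraction of the data and evaluates on the held-out rest. The data and the trivial majority-class rule below are hypothetical.

```python
def majority_class(labels):
    # A ZeroR-style rule: always predict the most frequent class.
    return max(set(labels), key=labels.count)

def accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

labels = ["yes"] * 7 + ["no"] * 3   # ten toy instances

# Test option 1: "Use training set" -- train and evaluate on the same data.
pred = majority_class(labels)
print(accuracy([pred] * len(labels), labels))   # 0.7

# Test option 2: "Percentage split" -- train on 66%, evaluate on the rest.
split = int(len(labels) * 0.66)
train, test = labels[:split], labels[split:]
pred = majority_class(train)
print(accuracy([pred] * len(test), test))       # 0.25
```

Note that on this tiny unshuffled sample the held-out estimate differs sharply from the training-set estimate; Weka shuffles the data before splitting, which makes the estimate more reliable.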
Now we need to select the learning algorithm to apply. Go to the "Classifier" panel (near the top), which initially shows two buttons: "Choose" and "ZeroR". (ZeroR is a very simple rule-learning algorithm.) The general idea of this user interface is that clicking on "Choose" lets you select a different algorithm, while clicking on "ZeroR" (or whatever algorithm name is displayed there) lets you set the parameters for that algorithm.
Click on "Choose", and you will see a hierarchical display whose top level is "weka", whose second level is "classifiers", and whose third level contains seven general kinds of classifiers: "bayes", "functions", "lazy", "meta", "misc", "trees", and "rules". To choose NaiveBayesSimple, click on the "bayes" entry and then select "NaiveBayesSimple". To select 1R, choose "rules" and then "OneR".

The process for selecting LMS is a bit more involved. First, choose "meta" and then "ClassificationViaRegression". This is a "meta-level" classifier that converts a classification problem into a regression problem, so we must also tell it which regression algorithm to employ. Click on "ClassificationViaRegression" and you will see two entries, "classifier" and "debug". The "classifier" entry has two buttons: "Choose" and "M5P -M 4.0". This is where the regression algorithm is chosen. Select "Choose", then "functions", then "LinearRegression". Back at the "classifier" entry, now click on "LinearRegression", and you will see a GenericObjectEditor with four options. For "attributeSelectionMethod" select "No attribute selection", and for "eliminateColinearAttributes" select "False". Now click "OK" twice, and we are ready to run the classifier.
Now we are ready to run the algorithm. Click on the "Start" button, and the Classifier Output window will show the output from the classifier. For Naive Bayes, this output consists of several sections.
Problem 3
To reduce the cost of the nearest-neighbor algorithm, a data structure called a kd-tree may be used. A kd-tree is similar to a decision tree, except that we split at the median value along the dimension having the highest variance, and points are stored in every internal node (the leaves are empty).
The following figure shows an example of a kd-tree and how it splits the x,y plane.
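As a concrete illustration, the construction rule described above (split at the median along the highest-variance dimension, store a point in every internal node, leave the leaves empty) might be sketched as follows. The nested-dict representation and the function names are just one illustrative choice.

```python
def variance(values):
    # Population variance of a list of numbers.
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def build_kdtree(points):
    """points: a list of equal-length tuples. Returns a node dict, or
    None for an empty leaf."""
    if not points:
        return None  # leaves are empty
    k = len(points[0])
    # Split along the dimension with the highest variance.
    dim = max(range(k), key=lambda d: variance([p[d] for p in points]))
    points = sorted(points, key=lambda p: p[dim])
    mid = len(points) // 2
    return {
        "point": points[mid],              # median point stored at this node
        "dim": dim,                        # splitting dimension
        "left": build_kdtree(points[:mid]),
        "right": build_kdtree(points[mid + 1:]),
    }
```

Every internal node thus defines an axis-aligned splitting plane through its stored point, which is what produces the partition of the x,y plane shown in the figure.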
Propose an algorithm for finding the nearest neighbor of a point
given a kd-tree.