CS579 Machine Learning: Assignment #1.

Due: at the beginning of the lecture on Thursday, January 27.

Assignment: Follow the instructions below.
Submit: your answers to Exercises 1, 3, 4, 5 for the weather dataset, Exercises 4, 5 for the census data, and Exercises 4, 5 for the Market-basket data.

Learning Association Rules

For this assignment you will need to use Weka - Data Mining Software in Java. You may download and install your own version of Weka (for Linux, Windows or Mac OS X) from this site: http://www.cs.waikato.ac.nz/ml/weka/ .
You may also use the Weka software (for Linux) that I installed in my directory at /home/faculty5/ipivkina/weka-3-4/
The site http://www.cs.waikato.ac.nz/ml/weka/ provides a lot of information and documentation on Weka. Please use it. In order to run, Weka needs Java to be installed. I installed a more recent version of Java at /home/faculty5/ipivkina/jdk1.5.0_01/bin/java
Feel free to use it.

To run Weka you may type

java -jar weka.jar
(add the paths to java and weka.jar in the above command if needed).

Weka software contains an implementation of the Apriori algorithm for learning association rules.

Association rules are of the form LHS ==> RHS, where LHS and RHS are sets of attribute-value pairs. Such sets are called item sets; a single attribute-value pair is called an item. For example:

rule 1 : outlook=sunny	==> play=no
rule 2 : temperature=cool windy=FALSE ==> humidity=normal play=yes
Essentially, Apriori attempts to associate item sets on the LHS with item sets on the RHS.

Weka's Apriori association rule algorithm

Apriori works with categorical values only. Therefore, if a dataset contains numeric attributes, they need to be converted to nominal before applying the Apriori algorithm.

For this part of the assignment we will use a version of the weather dataset, weather.nominal.arff. The datasets are in the weka-3-4/data directory (in /home/faculty5/ipivkina/weka-3-4/data/ if you are using my version). Make sure you work with copies of the datasets if you are asked to modify them.

Apply the Apriori algorithm to the nominal weather dataset using Weka's command line interface (CLI):
java weka.associations.Apriori -t data/weather.nominal.arff
You should see output like the following:
Apriori
=======

Minimum support: 0.15
Minimum metric : 0.9
Number of cycles performed: 17

Generated sets of large itemsets:

Size of set of large itemsets L(1): 12

Size of set of large itemsets L(2): 47

Size of set of large itemsets L(3): 39

Size of set of large itemsets L(4): 6

Best rules found:

 1. outlook=overcast 4 ==> play=yes 4    conf:(1)
 2. temperature=cool 4 ==> humidity=normal 4    conf:(1)
 3. humidity=normal windy=FALSE 4 ==> play=yes 4    conf:(1)
 4. outlook=sunny play=no 3 ==> humidity=high 3    conf:(1)
 5. outlook=sunny humidity=high 3 ==> play=no 3    conf:(1)
 6. outlook=rainy play=yes 3 ==> windy=FALSE 3    conf:(1)
 7. outlook=rainy windy=FALSE 3 ==> play=yes 3    conf:(1)
 8. temperature=cool play=yes 3 ==> humidity=normal 3    conf:(1)
 9. outlook=sunny temperature=hot 2 ==> humidity=high 2    conf:(1)
10. temperature=hot play=no 2 ==> outlook=sunny 2    conf:(1)

Description of Output

The default values for the number of rules, the decrease for minimum support (the delta factor), and minimum confidence are 10, 0.05, and 0.9, respectively.

A rule's support is the proportion of examples covered by both the LHS and RHS, while its confidence is the proportion of examples covered by the LHS that are also covered by the RHS. So if a rule's LHS and RHS together cover 50% of the cases, the rule has support 0.5; if the LHS of a rule covers 200 cases and of these the RHS covers 50 cases, then the confidence is 0.25.

With default settings, Apriori tries to generate 10 rules by starting with a minimum support of 100% and iteratively decreasing support by the delta factor until the minimum non-zero support is reached or the required number of rules with at least minimum confidence has been generated. In Weka's output above, a minimum support of 0.15 indicates the support that had to be reached in order to generate the 10 rules with the specified minimum metric, here a confidence of 0.9. The sizes of the generated item sets are also displayed; e.g., there are 6 four-item sets having the required minimum support.

By default, rules are sorted by confidence, and any ties are broken based on support. In each rule, the number preceding ==> is the number of cases covered by the LHS, and the number following the rule is the number of those cases that are also covered by the RHS (i.e., covered by both the LHS and RHS). The value in parentheses is the rule's confidence.

These default settings can be modified using the following options:
-N Specify required number of rules
-C Specify minimum confidence of a rule
-D Specify delta for decrease in minimum support
-M Specify lower bound for minimum support
-I If set, the item sets found are also output (default: no)
-T Specify the metric used to rank rules, as described below:
Confidence (0, the default), Lift (1), Leverage (2), Conviction (3)
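
For example, to request 20 rules with a minimum confidence of 0.8 and to also output the generated item sets (the option values here are illustrative, using the options documented above):

java weka.associations.Apriori -t data/weather.nominal.arff -N 20 -C 0.8 -I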
Rules can be sorted according to different metrics, as specified by the -T option. Suppose we have the rule L ==> R, and let p(X) denote the proportion of instances covered by the terms in X. We shall express the various metrics in terms of L, R and p.
Lift indicates the degree to which the rule improves the accuracy of the default prediction of its RHS. Lift is confidence divided by the proportion of all examples that are covered by the RHS; i.e.

\begin{displaymath}Lift = \frac{\frac{p(L \wedge R)}{p(L)}}{p(R)} = \frac{p(L \wedge R)}{p(L)p(R)}\end{displaymath}

If the RHS covers 250 cases out of a dataset of 1000, then p(R) = 0.25 and the Lift is confidence/0.25.
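Combining this with the earlier confidence example (the LHS covers 200 of the 1000 cases, and 50 of those are also covered by the RHS) gives:

\begin{displaymath}Lift = \frac{50/1000}{(200/1000)(250/1000)} = \frac{0.05}{0.05} = 1\end{displaymath}

A Lift of 1 means the rule predicts its RHS no better than the default frequency of the RHS; values above 1 indicate that the LHS raises the likelihood of the RHS.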
Leverage is the proportion of additional examples covered by both the LHS and RHS above those expected if the LHS and RHS were independent of each other; i.e.

\begin{displaymath}Leverage = p(L \wedge R)-p(R)p(L) \end{displaymath}

For example, suppose that there are 1000 examples, the LHS covers 200 examples, the RHS covers 100 examples, and the RHS covers 50 of the examples covered by the LHS. The proportion of examples covered by both the LHS and RHS is 50/1000 = 0.05. The proportion of examples that would be expected to be covered by both the LHS and RHS if they were independent of each other is (200/1000) * (100/1000) = 0.02. Leverage = 0.05 - 0.02 = 0.03. The total number of examples that this represents is 30.
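Written as a single computation:

\begin{displaymath}Leverage = \frac{50}{1000} - \frac{200}{1000} \cdot \frac{100}{1000} = 0.05 - 0.02 = 0.03\end{displaymath}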
Conviction is similar to Lift but considers the effect when the RHS is not true, and the ratio is inverted.

\begin{displaymath}Conviction = \frac{p(L)p(not R)}{p(L \wedge not R)}\end{displaymath}

In the above example p(L) is 200/1000, p(not R) is 900/1000, and p(L $\wedge$ not R) is 150/1000 (the 200 LHS cases minus the 50 that are also covered by the RHS). Thus Conviction is 0.2*0.9/0.15 = 1.2.
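
If you prefer to drive Weka from your own Java code instead of the command line, the sketch below shows one way to run Apriori. It is a minimal example assuming the Weka 3.4 API (weka.core.Instances and weka.associations.Apriori); the setter methods mirror the -N, -C and -M options described above, the option values are illustrative, and the class name AprioriDemo is just for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import weka.associations.Apriori;
import weka.core.Instances;

public class AprioriDemo {
    public static void main(String[] args) throws Exception {
        // Load the nominal weather dataset (adjust the path if needed).
        Instances data = new Instances(
                new BufferedReader(new FileReader("data/weather.nominal.arff")));

        Apriori apriori = new Apriori();
        apriori.setNumRules(20);              // corresponds to -N
        apriori.setMinMetric(0.8);            // corresponds to -C (minimum confidence)
        apriori.setLowerBoundMinSupport(0.1); // corresponds to -M

        apriori.buildAssociations(data);      // run the algorithm
        System.out.println(apriori);          // print the generated rules
    }
}

Printing the Apriori object should produce essentially the same report as the command-line run shown earlier.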

Exercises

  1. How might you change the dataset so that an item set of size 5 can be generated?
  2. Use the -I option to view details of the generated item sets.
  3. How might you change the values for -N, -C, -D and -M to increase the number of generated rules?
  4. Using the default values for -N, -D and -C, identify the maximum value for -M that would enable at least one rule to qualify (when sorted by confidence).
  5. Specify different values for -T and compare rule rankings by Confidence, Lift and Leverage. Are the top-ranked rules affected by different -T values?

Exercises with census data

We will now use the adult.arff dataset, which contains census data collected from 48842 US adults. The goal of this dataset is to predict whether income exceeds $50000; however, for association rule learning this is irrelevant. The original dataset is taken from the UCI Machine Learning Repository; more information about it is available in the original UCI documentation at
http://www.ics.uci.edu/~mlearn/MLRepository.html.
Attributes 3 and 5 in this dataset are numeric; therefore, before applying the Apriori algorithm you will need to preprocess the dataset with the Discretize filter in order to create a dataset with only nominal attributes. You can do this by typing the following (specify paths for java, adult.arff and adult-disc.arff if needed):
java weka.filters.unsupervised.attribute.Discretize -R 3,5 -B 10 -i adult.arff -o adult-disc.arff
Here -R specifies the list of attributes to discretize and -B specifies the (maximum) number of bins to divide the numeric attributes into. The new file will be saved as adult-disc.arff with attributes 3 and 5 converted to nominal. (More information on the Discretize class can be found in the Weka API documentation.)
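
The same filtering can also be done from Java. The following is a minimal sketch assuming the Weka 3.4 API (weka.filters.Filter and weka.filters.unsupervised.attribute.Discretize); the setters mirror the -R and -B options above, the file names match the command given earlier, and the class name DiscretizeDemo is just for illustration.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.FileWriter;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Discretize;

public class DiscretizeDemo {
    public static void main(String[] args) throws Exception {
        // Load the raw census data (adjust the path if needed).
        Instances data = new Instances(
                new BufferedReader(new FileReader("adult.arff")));

        Discretize disc = new Discretize();
        disc.setAttributeIndices("3,5"); // corresponds to -R
        disc.setBins(10);                // corresponds to -B
        disc.setInputFormat(data);       // must be called before filtering

        Instances nominal = Filter.useFilter(data, disc);

        // Write the discretized dataset in ARFF format.
        FileWriter out = new FileWriter("adult-disc.arff");
        out.write(nominal.toString());
        out.close();
    }
}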

Generate association rules for the discretized Adult dataset. Repeat the exercises carried out with the nominal weather dataset, trying out different option values and sorting metrics.

Exercises with Market-basket data

In this part we will use the market-basket.arff dataset.