Assignments


Assignment 2



Deadline: 11-12-2014 (at 23:59). 
Weight: 3 Marks.


Objectives:

  • Reinforce concepts related to decision tree learning, nearest-neighbor classification, Bayesian learning, and model evaluation.
  • Practice using WEKA.


Question 1: The Parity Problem [50%]


X1  X2  X3  X4    C
-------------------
 0   0   0   0  YES
 0   0   0   1  NO
 0   0   1   0  NO
 0   0   1   1  YES
 0   1   0   0  NO
 0   1   0   1  YES
 0   1   1   0  YES
 0   1   1   1  NO
 1   0   0   0  NO
 1   0   0   1  YES
 1   0   1   0  YES
 1   0   1   1  NO
 1   1   0   0  YES
 1   1   0   1  NO
 1   1   1   0  NO
 1   1   1   1  YES


Given the dataset shown above:

  1. Draw the full decision tree. Use the gain in misclassification error to choose the best next attribute for splitting.
  2. Discuss the effect of pre-pruning and post-pruning (using misclassification-error gain) on this tree and show your work. Explain what you conclude from this exercise.

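As a sanity check for part 1, the gain of every candidate split can be computed with a short script (Python; not required by the assignment, just a sketch). For parity data, splitting on any single attribute leaves each branch with a 4/4 class mix, so the misclassification-error gain is zero for every attribute and the "best" first split must be chosen arbitrarily:

```python
from itertools import product

# Parity dataset from the table: class is YES when the number of 1s is even.
data = [(bits, "YES" if sum(bits) % 2 == 0 else "NO")
        for bits in product([0, 1], repeat=4)]

def misclass_error(rows):
    """Misclassification error = 1 - fraction of the majority class."""
    if not rows:
        return 0.0
    yes = sum(1 for _, c in rows if c == "YES")
    return 1 - max(yes, len(rows) - yes) / len(rows)

parent = misclass_error(data)  # 0.5: the root has 8 YES vs 8 NO
gains = []
for i in range(4):  # candidate split attributes X1..X4
    branches = [[r for r in data if r[0][i] == v] for v in (0, 1)]
    weighted = sum(len(b) / len(data) * misclass_error(b) for b in branches)
    gains.append(parent - weighted)
print(gains)  # [0.0, 0.0, 0.0, 0.0] -- no single split looks useful
```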

Question 2: WEKA [50%]


(a) Download the following extended dataset for the parity problem.

  1. Load the dataset in WEKA.
  2. Using the training set as the test set, run each of the following classifiers and record its accuracy:
    1. J48 (Decision Tree).
    2. IBK (Nearest Neighbor Classifier). Test with K=1 and K=8.
    3. Naive Bayes.
  3. Repeat the previous step using 10-fold cross-validation and record the results.
  4. Discuss the results and explain the reason behind the performance of each method in the two cases. To understand the results of J48, you may need to read about how it performs pruning.
(b) WEKA has many sample datasets. Which of them can be used with IBK but not with J48 or Naive Bayes? Explain why.
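The contrast between the two evaluation modes in part (a) is the crux of this question. A plain-Python sketch (not WEKA) of 1-NN on the 4-bit parity data illustrates it, assuming Hamming distance between bit vectors: on the training set each point is its own nearest neighbor, while under leave-one-out every remaining neighbor at distance 1 has the opposite parity.

```python
from itertools import product

# 4-bit parity data: class is True (even parity) when the number of 1s is even.
data = [(bits, sum(bits) % 2 == 0) for bits in product([0, 1], repeat=4)]

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def nn_predict(query, train):
    """1-NN: return the class of the closest training instance."""
    return min(train, key=lambda row: hamming(row[0], query))[1]

# Training set used as test set: each point matches itself -> perfect accuracy.
train_acc = sum(nn_predict(x, data) == c for x, c in data) / len(data)

# Leave-one-out: all closest neighbors differ in exactly one bit, which
# flips the parity, so every prediction is wrong.
loo_acc = sum(nn_predict(x, [r for r in data if r[0] != x]) == c
              for x, c in data) / len(data)
print(train_acc, loo_acc)  # 1.0 0.0
```

The same mechanism explains why IBK's cross-validation accuracy on the parity data can be far below its training-set accuracy.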



How to Submit The Assignment:


  • Zip all of your work and name the file "HW2_YourName.zip".
  • For Q1, you can either scan your work and include it in the email submission, or submit it as a hard copy before the deadline.
  • For Q2, include a few screenshots from WEKA for each of parts (a) and (b).

Good Luck!







Assignment 1 (Mini Project)





Deadline: 3-11-2014 (at 23:59). 
Weight: 5 Marks. 


Objectives:
  • Practice cleaning and preprocessing datasets.
  • Practice analyzing datasets without using any data mining algorithm.
  • Experiment with a toy data mining technique.


Question 1: Titanic [60%]


Background


The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships. One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others.

As a Data Miner, you are given the task of analyzing the Titanic passenger records in order to predict (for future voyages) which types of passengers are more likely to survive.
[adapted from Kaggle.com]

Dataset Files


Dataset: titanic.csv
Dataset Description: titanic_info.rtf

Question Requirements


Take a thorough look at the dataset files and then write a report that includes the following elements:
  1. A technical description of the dataset in terms of: 
    • The dataset type, and the type of each of the attributes.
    • Are there any problems with the dataset? Describe each problem (if any), propose a solution and give a brief justification.
    • What transformations and data reduction steps could be useful for improving the accuracy of the predictions and/or the performance (time efficiency)? Justify your answer.
  2. Which passenger groups seem to have been more likely to survive? Justify your answer with as many visualizations/statistical measures as necessary. [Examples of useful statistics/graphs include: the average age of survivors vs. non-survivors, the percentage of male vs. female survivors, the relationship between social class and survival, etc.]
Note: Just in case you are not comfortable with MS Excel, Kaggle.com provides a quick tutorial on how to start using Excel for analyzing data and performing predictions. The tutorial can be found here
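For the summary statistics in part 2, a spreadsheet works fine, but a short script is another option. The sketch below uses pandas on a tiny made-up sample standing in for titanic.csv; the column names (Survived, Sex, Pclass, Age) follow the standard Kaggle Titanic file and are an assumption about the provided dataset:

```python
import pandas as pd

# Tiny made-up sample in place of titanic.csv (real analysis would use
# pd.read_csv("titanic.csv") after the cleaning steps from part 1).
df = pd.DataFrame({
    "Survived": [1, 0, 1, 0, 1, 0],
    "Sex":      ["female", "male", "female", "male", "female", "male"],
    "Pclass":   [1, 3, 2, 3, 1, 2],
    "Age":      [29, 35, 24, 40, 19, 31],
})

# Survival rate by sex and by passenger class -- the kind of group
# comparisons Q1.2 asks for.
by_sex = df.groupby("Sex")["Survived"].mean()
by_class = df.groupby("Pclass")["Survived"].mean()
print(by_sex)
print(by_class)
```

The same groupby pattern extends to any attribute (e.g. binned Age) once the dataset is cleaned.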


Question 2: OneR [40%]

Background
[Adapted from: http://www.saedsayad.com/oner.htm]


OneR, short for "One Rule", is a simple yet surprisingly accurate classification algorithm that generates one rule for each predictor in the data, then selects the rule with the smallest total error as its "one rule". To create a rule for a predictor, construct a frequency table of that predictor against the target. OneR has been shown to produce rules only slightly less accurate than those of state-of-the-art classification algorithms, while remaining simple for humans to interpret.
The following is the OneR Algorithm:
For each attribute,
     For each value of that attribute, make a rule as follows:
           Count how often each class value appears.
           Find the most frequent class value.
           Make the rule assign that class to this value of the attribute.
     Calculate the total error of the rules of each predictor.
Choose the predictor with the smallest total error.
  
Example:
Finding the best predictor (attribute) with the smallest total error using the OneR algorithm, based on a frequency table for each attribute against the target (the data here is the classic "play golf" weather dataset, with attributes Outlook, Temperature, Humidity and Windy).

The rules for the "Outlook" attribute will be:

  If Outlook = Sunny    then Play = No     (2 errors out of 5)
  If Outlook = Overcast then Play = Yes    (0 errors out of 4)
  If Outlook = Rainy    then Play = Yes    (2 errors out of 5)

The total error for the rules of "Outlook" is: 2 + 0 + 2 = 4.
Similarly, the total error for "Temperature" = 5, for "Humidity" = 4 and for "Windy" = 5.

Therefore, we choose the rules of either "Outlook" or "Humidity" since they have the lowest total error.
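The pseudocode above can be sketched in Python (a minimal version for categorical attributes; the function and variable names are mine, not part of the assignment). Run on the Outlook column of the weather data, it reproduces the total error of 4 from the example:

```python
from collections import Counter, defaultdict

def one_r(rows, target):
    """rows: list of dicts mapping attribute -> value; target: class attribute.
    Returns (best_attribute, rules, total_error), following the pseudocode."""
    best = None
    for attr in rows[0]:
        if attr == target:
            continue
        # Frequency table: attribute value -> class-value counts.
        freq = defaultdict(Counter)
        for row in rows:
            freq[row[attr]][row[target]] += 1
        # One rule per value: predict the most frequent class for that value.
        rules = {v: counts.most_common(1)[0][0] for v, counts in freq.items()}
        # Total error = instances not covered by the majority class.
        errors = sum(sum(counts.values()) - counts.most_common(1)[0][1]
                     for counts in freq.values())
        if best is None or errors < best[2]:
            best = (attr, rules, errors)
    return best

# The weather data behind the "Outlook" example above (Outlook column only).
weather = [{"Outlook": o, "Play": p} for o, p in
           [("sunny", "no"), ("sunny", "no"), ("sunny", "no"),
            ("sunny", "yes"), ("sunny", "yes"),
            ("overcast", "yes"), ("overcast", "yes"),
            ("overcast", "yes"), ("overcast", "yes"),
            ("rainy", "yes"), ("rainy", "yes"), ("rainy", "yes"),
            ("rainy", "no"), ("rainy", "no")]]

attr, rules, total_error = one_r(weather, "Play")
print(attr, rules, total_error)
```

The same function, fed the cleaned Titanic rows as dicts, gives the rules asked for in the requirements below.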

Question Requirements:

  • Prepare the titanic dataset by performing any necessary data preprocessing steps (cleaning, transformations, data reduction, etc.). This should comply with the report you have prepared for Q1.
  • Use Excel (or write a program) to apply OneR to the titanic dataset. The result should be the set of rules produced by OneR.
  • If you write a program, submit the code and the final set of rules. If you use Excel, submit the frequency tables and the final set of rules.
  • Compare and contrast the rules with the conclusions you have reached after exploring the data in Q1.


How to Submit The Assignment:

  • Zip all of your work and name the file "HW1_YourName.zip".
  • Keep the work of each question in a separate folder.
  • Send the file to "ialbluwi.hws@gmail.com" with the subject "DM-HW1".
  • Submission check list:
    • A Report for Q1.
    • The modified dataset for Q2.
    • The OneR rules.
    • The OneR model: either the source code of the program you have written, or the Excel frequency tables you have created and the error rates you have computed.
    • A short comparison between the OneR rules and the conclusions from Q1. This can be in a separate file or in the same report as Q1.

Good Luck!
