Monday, December 29, 2014

Wednesday, December 17, 2014

Most Popular Data Mining Algorithms

http://www2.cs.uh.edu/~ceick/DM/10Algorithms-08.pdf

This paper presents the top 10 data mining algorithms identified by the IEEE International Conference on Data Mining (ICDM) in December 2006.

Does anyone know of anything similar that is more recent?

Tuesday, December 16, 2014

Saturday, December 13, 2014

Boosting vs Bagging

This is a paper that compares bagging and boosting on decision trees:

http://home.eng.iastate.edu/~julied/classes/ee547/Handouts/q.aaai96.pdf

The paper shows that both bagging and boosting improve over individual trees and that boosting usually gives better results than bagging, although in some cases boosting fails, probably due to its tendency to get distracted by noisy records.
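To see why boosting can get distracted by a noisy record, here is a minimal sketch of one AdaBoost-style reweighting round. This is the textbook weight update, not necessarily the exact variant used in the paper, and the function name `adaboost_reweight` is made up for illustration. Bagging, by contrast, gives every record the same chance of being drawn in every round.

```python
import math

def adaboost_reweight(weights, misclassified):
    """One AdaBoost-style round: raise the weight of misclassified records.

    weights       -- current record weights, assumed to sum to 1
    misclassified -- booleans, True where the weak learner was wrong
    """
    # Weighted error of the current weak learner.
    err = sum(w for w, m in zip(weights, misclassified) if m)
    if not 0 < err < 0.5:
        raise ValueError("weak learner needs a weighted error in (0, 0.5)")
    # The learner's vote: larger when its error is smaller.
    alpha = 0.5 * math.log((1 - err) / err)
    # Misclassified records are scaled up, the rest are scaled down.
    scaled = [w * math.exp(alpha if m else -alpha)
              for w, m in zip(weights, misclassified)]
    total = sum(scaled)
    return [w / total for w in scaled]

# Ten records with equal weights; record 0 is mislabeled (noisy),
# so the weak learner gets it "wrong".
weights = [0.1] * 10
misclassified = [True] + [False] * 9
new_weights = adaboost_reweight(weights, misclassified)
print(new_weights[0])  # weight of the noisy record after one round
```

After a single round the noisy record carries about half of the total weight, so the next tree is pushed hard to fit it. In bagging the same record would still be just one of ten equally likely draws.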

Note that Quinlan, the author of this paper, is an important name in the field of machine learning. He is the one who introduced C4.5.

Thursday, December 11, 2014

NetFlix

Here are a few articles related to the Netflix prize:

Lessons from the Netflix prize challenge (By the contest winners).
The BellKor Solution (2007)
The BellKor Solution (2009)
The Pragmatic Theory Solution (2009)
The Big Chaos Solution (2009)

De-anonymization of the Netflix Dataset (It turned out to be much easier than I thought!)



On Bootstrapping Vs Cross Validation

Here is a link from Jan about the difference between cross validation and bootstrapping:
http://www.r-bloggers.com/comparing-the-bootstrap-and-cross-validation/

Here is also a link to a very famous paper (published in 1995) that compares bootstrapping and cross validation:
http://www.cs.iastate.edu/~jtian/cs573/Papers/Kohavi-IJCAI-95.pdf
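As a quick reminder of how the two schemes split the data, here is a minimal sketch that only generates record indices. The function names `k_fold_splits` and `bootstrap_splits` are made up for illustration, and the bootstrap version uses the simple "test on the out-of-bag records" idea rather than any specific estimator from the links above.

```python
import random

def k_fold_splits(n, k, seed=0):
    """Yield (train, test) index lists; each record is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of (almost) equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def bootstrap_splits(n, rounds, seed=0):
    """Yield (train, test) index lists; train is drawn with replacement,
    test holds the out-of-bag records that were never drawn."""
    rng = random.Random(seed)
    for _ in range(rounds):
        train = [rng.randrange(n) for _ in range(n)]
        drawn = set(train)
        test = [i for i in range(n) if i not in drawn]
        yield train, test

for train, test in k_fold_splits(10, 5):
    print("cv       ", sorted(train), sorted(test))
for train, test in bootstrap_splits(10, 3):
    print("bootstrap", sorted(train), sorted(test))
```

Notice that a cross validation training set never repeats a record, while a bootstrap training set has the same size as the original data but contains duplicates. For large datasets, roughly 63% of the distinct records end up in each bootstrap sample.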

Monday, December 8, 2014

Model Ensembles Again!

Here is a good (though a bit old) reference and experimental comparison of ensemble methods.

Wednesday, December 3, 2014

Model Ensembles

I have added some "borrowed" slides on model ensembles. These slides are a good summary of much of what we have covered in class, but not everything.

Make sure also to review what I write on the board.

Sunday, November 30, 2014

Assignment 2

Assignment 2 is ready ... check it out!

This assignment should not take much time or effort. However, it is a little bit tricky and requires careful analysis.

I hope to see good submissions.

Friday, November 21, 2014

Saturday, November 8, 2014

About The Exam

Our exam will be on Monday 10/11/2014 inshaAllah.

I understand that this date is difficult for some of us. However, postponing the exam will also be difficult.

You can find the first exam from Fall 2013 here. There are a few differences between the material covered this year and the material covered last year. Nevertheless, this exam should be a good practice exercise for our exam.

Study well, sleep well and don't forget your calculators!

Bittawfeeq (good luck),

Tuesday, November 4, 2014

Email Correspondence

Please note that (ialbluwi.hws@gmail.com) is an email for homework submissions only, which means that I do not check it frequently.

If you have any questions, please use my other email address (i.albluwi@psut.edu.jo).

Sunday, November 2, 2014

Assignment 1 Deadline Extension

Since four students in our class were busy last weekend with the ACM programming contest, I have decided to extend the deadline for Assignment 1 until Wednesday 5/11/2014.

I hope to see good submissions!

Sunday, October 19, 2014

Assignment 1

Assignment 1 is out! Check out the assignments tab.

Start early and never hesitate to ask if you face any difficulties.

Enjoy!

Thursday, October 16, 2014

Slides and Material Updated

I have just added the lecture slides: Know Your Data, Data Quality, and Data Exploration. I have also added an online resource that summarizes much of what we have said about the data preprocessing phase.

Enjoy!

The Power of Statistics and Visualization

The following is an interesting documentary about statistics:





If you happen to be too busy to watch this, then don't miss the following video by the same person. It shows how effective visualization of data can summarize information that would otherwise take many pages to describe.




Have a nice weekend!

Friday, October 3, 2014

What is not Data Mining?

In the last lecture, we discussed what Data Mining is and how it differs from other fields. Six of you have volunteered to help summarize that discussion by defining a "field" and giving an example of it from social media applications.

The 6 fields we would like to define are: Data Mining, Data Science, Statistics, Information Retrieval, Database Management Systems, and Artificial Intelligence.

The 6 volunteers are: Taysir, Osama, Mays, Muath, Basel and (was it Richard or Ziad? I forgot!).

You can write your contributions here: http://piratepad.net/pZhmljaqmR. Make sure to enter your name at the top right TextBox before contributing so that we can tell who wrote what.

Everyone is welcome to contribute even if he was not assigned a topic or if he wishes to add something to what another student said.

Have a nice holiday!

Update:
- Remember: the task is to 1) define and 2) give an example from social media.
- Keep the definition and the examples as short as possible.
- Avoid "copy and paste", especially in the example.

Tuesday, September 30, 2014

Bonus Videos


In the first lecture, we heard numbers in tera-, peta- and exabytes ... but what does it take to store and manage such Big Data?

Have a look at how the data giant Facebook does it:



Here is another video to watch during the holidays, a documentary about the use of data mining by the police, by scientists and in the stock market: Horizon BBC: The Age of Big Data. Don't miss it!

Enjoy...

Big Data Infographic

Here is the infographic we saw together in the last lecture:
http://thumbnails.visually.netdna-cdn.com/the-big-data-explosion_510f8efa1ee3a.jpg

Friday, September 26, 2014

Welcome!

Assalamu Alaikum All, and welcome to this course on Data Mining!

Did you know that the amount of data we humans have produced in the last few years surpasses the amount produced in all of human history before that?

What can we make out of all of this data?

This will be our principal question in this course. We will try to find patterns, associations and meaning in large datasets and see how to make decisions based on them.

This blog will be our interaction arena, so feel free to browse the available tabs and subscribe to the posts by entering your email in the subscription box just above my profile pic.


For now ... whet your appetite by watching the following video:

What has Data Mining to do with the Obama campaign?