If you’re involved in the tech world, you’ll know that data mining has been creating a buzz for years. But, how exactly is it done? And what tools do data engineers actually use to ‘mine’ useful information from large databases?
The main tools in a data miner’s arsenal are algorithms. Today, I’m going to look at the top 10 data mining algorithms, and make a comparison of how they work and what each can be used for.
What Are Data Mining Algorithms?
Algorithms are a set of instructions that a computer can run. They aren’t specific to one programming language and can even be written down in plain English.
In data mining, clever algorithms are used to find patterns in large sets of data, and help classify new information. The applications for these are limitless – from predicting if a patient has cancer to complex genetic applications. Let’s take a look at some examples of data mining algorithms.
You’ll need to know some jargon words to learn how to use data mining algorithms. Here are some of the main ones:
- Classifier – A program that sorts data entries into different classes. E.g. a classifier might sort cars into classes such as sedans, SUVs, hatchbacks, etc.
- Supervised learning – Algorithms that need a ‘training’ set of data to learn from. E.g. a set of patients and symptoms that have already been marked ‘sick’ or ‘well’ by a doctor.
- Unsupervised learning – Algorithms that don’t need any training data to work properly.
- Outliers – Data points that are out of the usual range. E.g. in a test with most scores between 40-45, a score of 100 would be an outlier.
- Noisy data – Data with lots of outliers
The first on this list of data mining algorithms is C4.5. It is a classifier, meaning it takes in data and attempts to guess which class it belongs to.
C4.5 is also a supervised learning algorithm and needs training data. Data scientists run C4.5 on the training data to build a decision tree. This can then be used to classify new information.
Let’s take a look at an example. Say we had a sample of data showing who survived the sinking of the Titanic. We also know their sex, age, and the number of siblings they had on board. Running C4.5 might generate a decision tree like the one below.
Now we can use this decision tree on new data to predict whether a passenger survived or not.
The main problem with C4.5 is overfitting when using noisy data. This means the decision tree it generates pays too much attention to outliers and will make mistakes if there are too many in the dataset.
K-means is very different type of data mining algorithm than C4.5. It is an unsupervised learning algorithm, meaning it doesn’t need training data, and works even your data isn’t already marked or classified.
Rather than classifying data, the goal of k-means is to group data points together based on how similar they are. These groups are called “Clusters”. A very simple example might be to group a set of people together by age and blood pressure. The ‘K’ simply tells us there can be any number of groups.
It’s a popular data mining algorithm because it’s simple to implement and works quite well. Here’s an example of a data set that has been grouped using this techinque.
K-means works by finding points called “centroids”, then assigns each point in the data to a cluster based on its closest centroid. As you can guess, the trick here is to find the optimum number and position of centroids to group the data properly.
This is done in a very clever way, here’s a basic version of the algorithm:
- Choose the number K of centroids you want
- Randomly select positions for the centroids in multidimensional space
- Assign each data point to its closest centroid, creating K clusters
- Readjust the positions of the centroids to be the average (or mean) position of these new clusters (this is why it’s called K-“means”)
- Repeat steps 3 and 4 until the centroids no longer move!
Once the centroids stop moving, the algorithm has finished and the data has been clustered. For best results, k-means is usually run multiple times with different random starting points. The one with the tightest clusters is probably the best result.
K-medoids is a similar version of this algorithm but is more resilient to outliers.
Support Vector Machines (SVM)
SVM is another supervised classifier algorithm like C4.5. The difference is, it only classifies data into two groups. This can be thought of as a line separating data points on a graph.
The thing that makes the Support Vector Machine algorithm really cool is that it can find complex curved lines separating points more accurately than a straight line. It does this with a clever trick using dimensions.
SVM takes a dataset and projects the points into a higher dimension. by doing this, it can more easily separate the points into two groups. Here’s a great example.
The two-dimensional data on the left can’t be separated with a straight line. To fix this, SVM projects points into the 3-dimensional space shown on the right. Now, the data can easily be split into two classes with a plane that SVM can find automatically.
Of course, this won’t always work, so you’ll need to experiment with different classifier algorithms to see which one is best for a particular problem.
Apriori is an algorithm that learns which items in a data are commonly associated with each other. This can be very useful for grouping similar items together in tables.
For example, your company might have a database of customers and transactions. By running the Apriori algorithm on this data, you might find that coffee beans are commonly bought together with coffee machines. You could use this information to display product recommendations at the right time to increase sales.
The algorithm itself is quite complicated. It is usually considered unsupervised learning, as it can be used to explore sets of unlabelled data. But, it can be modified to give classifications too.
EM is a clustering algorithm that uses statistical models to group data.
You might know that the results of test scores usually follow a bell curve. That is, most of the test results will fall somewhere in the middle with few getting very high scores and few getting very low scores. Bell curves are used all the time in statistics and show up in all kinds of data. On paper it looks something like this:
EM attempts to guess the curve that represents a certain data set. For example, given a sample set of test scores, guess the bell curve that represents the whole set. Once it’s got this curve, it can be used to group new data into ‘grades’. This works quite well in practice and has many different applications to k-means.
PageRank is one of my favorite algorithms. It actually affects your life every day, and mine. You probably know it better as the main algorithm that powers Google searches.
Back before PageRank was invented by the guys at Google, internet searches were terrible. They basically counted the number of times a keyword appeared on a web page to determine how relevant it was to that keyword. This lead to lots of spam and bad results. So how do you fix that?
You can think of PageRank as a kind of voting algorithm. It counts the number of times a web page is linked to by other pages. The more links pointing to a page, the more important that web page is considered. Also, if a web page is found to be important, its links will also be more important, and carry more weight.
PageRank can be used for more than just ranking web pages. It can be extended to any kind of graph where things are linked together. Here’s an example of a linked graph that has been run through page rank. The sizes and percentages show how ‘important’ the nodes were found to be based on links to them.
AdaBoost is a data mining algorithm that builds a good classifier out of lots of bad classifiers.
In machine learning, A bad learner is a classifier that doesn’t work very well, barely better than random chance. However, if you have a few of them, you might be able to combine them in a way that ends up being pretty good!
AdaBoost does this by using a training set of data, so it is supervised learning. It iterates through the training data on each bad learner, working out which is the best at each step. Each new round, the misclassified points are more important. Eventually, a more complex decision tree is built from the simpler parts. This new tree should be better at classifying new data than any of the original learning algorithms.
The cool thing about AdaBoost is that it can be used with other algorithms. The ‘bad learner’ building blocks can be the simple decision trees build by other algorithms.
K-Nearest Neighbours (KNN)
K-nearest-neighbours is one of the more simple data mining algorithms. Like most of the list, it is also a classifier (yes these are very important in data mining). However, it does things a little differently from the ones I’ve already described.
K-means, AdaBoost, and SVM are all eager learners. They take training data and build a classifier that can them be used on new data.
KNN is a lazy learner. It doesn’t actually do anything until it is given new, unlabeled data. When this happens, it finds the ‘k’ nearest neighbors to the new data point (i.e. the ones that are most similar) and assigns a class similar to those. Here’s an example:
The new point X is given as new input. Of its five closest neighbors, four of them are in the red class and one is green. Based on this, X is classified as red, as most of its neighbors are red. Pretty simple!
This actually refers to a group of algorithms that share a common – naive – assumption.
That assumption is that each feature in a set of data is completely uncorrelated with all other features in the set. For example, say we have data about hospital patients. The features of the data include zip code, height, and age. Those three things probably aren’t correlated – your zip code doesn’t determine your height or age very much. Here the Naive Bayes assumption is fine.
Using this assumption, some very simple math can give us a classifier that can predict the class of new data.
But, this assumption isn’t always true. Let’s say we extend our hospital patient data set to include things like blood pressure, cholesterol, and weight. These probably are correlated, so will the algorithm still work?
The answer is usually yes. Naive Bayes classifiers work surprisingly well, even on data with correlated features.
The last data mining algorithm on the list is CART, or Classification And Regression Trees. It’s an algorithm used to build decision trees, just like many of the other algorithms we’ve discussed.
CART can be thought of as a more statistically grounded version of C4.5, and can lead to more robust decision trees. One of its defining features is that it constructs binary trees, meaning each internal node has exactly two edges.
The reason to use this algorithm over C4.5 is that it is less susceptible to outliers. That means it will work better on a noisy data set where not all data points are correctly classified.