Top-10 machine-learning and data-mining algorithms
Machine learning comprises hundreds of algorithms, many of which exist in several variants. When selecting a class of algorithms and a specific algorithm within that class, consider your problem closely: define what you need to measure or predict and which tools you are going to use for this purpose.
A possible set of the best machine-learning algorithms you can study and try to implement is as follows:
Decision trees: C4.5. There are many different tree-based algorithms. C4.5 is worth trying because it has some advantages over the classical CART algorithm. For example, CART uses only binary tests (YES/NO), whereas C4.5 allows tests with more than two outcomes.
Decision trees: CART. This tree-based algorithm splits cases recursively in a binary way and handles both numerical and categorical values. It is applied in medical research, biology, electrical engineering, marketing and other fields.
The k-means algorithm divides a given set of cases into k clusters. It starts by picking initial cluster representatives, then iteratively reassigns each case to its nearest representative and recomputes each representative as the mean of its cluster, until the assignments stop changing (convergence).
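The iteration described for k-means can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name kmeans, its parameters and the choice of random input points as initial representatives are assumptions made for this example.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster a list of equal-length numeric tuples into k groups."""
    rng = random.Random(seed)
    # Assumption: use k distinct input points as the initial representatives.
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # assignments stopped changing: converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centroids, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(points, k=2)
```

On this toy dataset the two well-separated groups end up in different clusters. In general the result depends on the initial representatives, so practical implementations usually restart several times.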
Support vector machines (SVMs). This class of algorithms offers a robust approach to classification problems. SVMs do not require a lot of training cases and work well with high-dimensional data. Note that training a kernel SVM becomes expensive as the number of cases grows, so very large datasets usually call for linear or approximate variants.
The Expectation-Maximization (EM) algorithm. It finds maximum-likelihood estimates of the parameters of statistical models that depend on variables that cannot be observed directly (latent variables). EM is used in a variety of fields, for example, for the reconstruction of medical images.
AdaBoost. It belongs to the ensemble learning methods, which combine a set of models to achieve better performance than any single model. AdaBoost is simple to implement and often gives accurate predictions. It is applied to many kinds of classification problems, in particular face detection.
Naïve Bayes. The main advantage of this long-established method is its simplicity: it does not require complicated iterative procedures, since training amounts to estimating simple per-class statistics in a single pass over the data. This makes it efficient for working with big data.
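To show why Naïve Bayes needs no iterative procedure, here is a minimal sketch for categorical features. The function names train_naive_bayes and predict, the toy data and the use of add-one (Laplace) smoothing are assumptions made for this example.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Count class and per-class feature-value frequencies in one pass."""
    class_counts = Counter(labels)
    # feature_counts[label][feature_index][value] -> number of occurrences
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    values = defaultdict(set)  # distinct values seen for each feature
    for features, label in zip(samples, labels):
        for i, v in enumerate(features):
            feature_counts[label][i][v] += 1
            values[i].add(v)
    return class_counts, feature_counts, values

def predict(model, features):
    """Return the label with the highest log-posterior score."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, count in class_counts.items():
        score = math.log(count / total)  # log prior
        for i, v in enumerate(features):
            # Add-one smoothing so unseen values do not zero out the product.
            numerator = feature_counts[label][i][v] + 1
            denominator = count + len(values[i])
            score += math.log(numerator / denominator)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

samples = [("sunny", "hot"), ("sunny", "mild"), ("rainy", "mild"), ("rainy", "cool")]
labels = ["no", "no", "yes", "yes"]
model = train_naive_bayes(samples, labels)
print(predict(model, ("sunny", "hot")))  # prints "no"
```

Training is a single counting pass, and prediction only sums logarithms of the stored frequencies.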
The Apriori algorithm has been designed for mining associations in transaction datasets. It discovers frequent itemsets (for example, sets of products often purchased together in a supermarket) and then derives association rules from these itemsets. For example, “if a customer visits a certain webpage, he/she is likely to visit the conversion page”.
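The two stages of Apriori, finding frequent itemsets and then scoring rules, can be sketched as follows. The names frequent_itemsets and confidence, the absolute min_support threshold and the toy shopping baskets are assumptions made for this example.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Return {itemset: count} for itemsets found in >= min_support transactions."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1 candidates: every single item.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while candidates:
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: unions of frequent sets that contain exactly k items.
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: keep a candidate only if all its (k-1)-subsets are frequent.
        candidates = {
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent

def confidence(frequent, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]

baskets = [{"bread", "milk"}, {"bread", "butter"},
           {"bread", "milk", "butter"}, {"milk"}]
freq = frequent_itemsets(baskets, min_support=2)
print(confidence(freq, {"butter"}, {"bread"}))  # prints 1.0
```

In this toy data every basket with butter also contains bread, so the rule "butter implies bread" has confidence 1.0. The pair {milk, butter} occurs only once and is dropped, which also prunes the three-item candidate.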
Genetic algorithms. These search heuristics imitate natural selection, using mutation and crossover operations on a population of candidate solutions. They are especially useful for problems with complicated non-linear dependencies in data.
The PageRank algorithm. It ranks webpages for search results based on the hyperlinks between them. The main idea is to measure numerically the significance of each webpage: a page is more significant when many significant pages link to it.
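The PageRank idea can be sketched with a simple power iteration. The function name pagerank, the damping factor of 0.85, the fixed iteration count and the four-page toy graph are assumptions made for this example. The sketch also assumes that every link target appears as a key in the links dictionary.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page gets a small base rank (the "random jump" share).
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                # A page passes its rank in equal shares along its links.
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page without links spreads its rank over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # prints "c"
```

Page "c" receives links from three pages and ends up with the highest rank, while "d", which nobody links to, keeps only the base rank.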