The wrong tool for the right project?
Machine learning might look like the solution to all the world’s problems. It is used to predict people’s online shopping habits, to identify faulty machinery for manufacturers, and even to work out how many houses you should look at before you buy one. But surely not every problem can be solved with the same tool?
Machine learning (ML) encompasses a vast number of algorithms. What they have in common is the ability to learn from data: either from examples with known answers (supervised learning) or by finding structure in unlabelled data, for example by separating it into groups (unsupervised learning). With a multitude of algorithms at our fingertips, which one should be used for which application?
In my experience in machine learning, certain algorithms work better for certain problems. For image analysis, such as facial recognition or text recognition, Artificial Neural Networks (ANNs) lead the way. But a problem like predicting who perished on the Titanic may not be best solved by a straightforward ANN.
“Kaggle is a platform for data science competitions.” – www.kaggle.com. This is where I obtained the data on the passengers who sailed on the Titanic. There is currently a competition to find the algorithm that best predicts who in the test set of passengers perished. According to the stats online, a handful of people can already predict the outcome for the passengers with 100% accuracy. Despite this, I wanted to give it a try.
To begin with, I followed an online tutorial (http://trevorstephens.com/kaggle-titanic-tutorial) and used the programming language R with decision trees (an ML algorithm) to predict who would perish. Sadly, I was unable to get any better than about 78% accuracy with some minor feature engineering. So I thought, “What could be better than decision trees?”
My career in ML started with deep learning (ANNs), so I turned to Python to see which algorithms I could easily implement to test whether deep learning would do any better. Python has an amazingly simple (once you’ve got it installed) deep learning package called Keras. With Keras you can quickly create an ANN with only a few lines of code. (I have previously written ANNs from scratch, which gives a very in-depth idea of how they work but is more effort than needed here.)
I used several features of the data to train ANNs with different topologies, but found that changing the topology rarely increased the accuracy of the model. What did increase the accuracy was including different features from the training set. The features I ended up using were gender, whether the passenger was a child or an adult, the cost of the fare, the class of the ticket, family size, and the embark code. And despite all of these features, and using deep learning, I obtained 77% accuracy.
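As a rough sketch of what that feature set could look like in code, the function below turns one raw passenger record into a list of numbers. The column names (Sex, Age, Fare, Pclass, SibSp, Parch, Embarked) are those of the Kaggle Titanic CSV. Everything else is an assumption made for illustration and not taken from my original code: the age cut-off of 18, treating a missing age as "adult", the numeric embark codes, and the function name itself.

```python
# Assumed threshold for "child"; the exact cut-off is a modelling choice.
CHILD_AGE_LIMIT = 18

# Assumed numeric encoding of the embark code (S, C, Q in the Kaggle data).
EMBARK_CODES = {"S": 0, "C": 1, "Q": 2}


def build_features(row):
    """Turn one raw passenger record (a dict of strings) into numeric features."""
    age = float(row["Age"]) if row["Age"] else None  # age is sometimes missing
    return [
        1.0 if row["Sex"] == "female" else 0.0,                     # gender
        1.0 if age is not None and age < CHILD_AGE_LIMIT else 0.0,  # child or adult
        float(row["Fare"]) if row["Fare"] else 0.0,                 # cost of the fare
        float(row["Pclass"]),                                       # class of the ticket
        float(int(row["SibSp"]) + int(row["Parch"]) + 1),           # family size, incl. self
        float(EMBARK_CODES.get(row["Embarked"], -1)),               # embark code, -1 if missing
    ]


# An illustrative record in the Kaggle column layout.
passenger = {"Sex": "male", "Age": "22", "Fare": "7.25", "Pclass": "3",
             "SibSp": "1", "Parch": "0", "Embarked": "S"}
print(build_features(passenger))  # [0.0, 0.0, 7.25, 3.0, 2.0, 0.0]
```

Each passenger becomes one such list, and the lists together form the input to whichever model is being tried.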
So deep learning didn’t achieve better accuracy than decision trees at predicting who perished on the Titanic. Why? Because deep learning isn’t any more powerful than other algorithms when the features can’t differentiate between the classes. Deep learning is not necessarily the answer to everything.
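One way to make this point concrete: if two passengers have identical features but different outcomes, no model that sees only those features can get both right. The sketch below computes the resulting ceiling on accuracy for a data set by taking the majority outcome for every distinct feature combination. The function name and the toy data are invented for illustration; this is not a calculation I ran on the real Titanic data.

```python
from collections import Counter, defaultdict


def accuracy_ceiling(features, labels):
    """Best accuracy any deterministic model can reach on this data
    when it only sees the given features."""
    outcomes_by_features = defaultdict(Counter)
    for x, y in zip(features, labels):
        outcomes_by_features[tuple(x)][y] += 1
    # For each distinct feature vector, the best possible guess is the majority label.
    best_correct = sum(max(c.values()) for c in outcomes_by_features.values())
    return best_correct / len(labels)


# Toy data, invented for illustration: (is_female, ticket_class) -> survived.
X = [(1, 1), (1, 1), (0, 3), (0, 3), (0, 3), (0, 3)]
y = [1, 1, 0, 0, 1, 0]
print(accuracy_ceiling(X, y))  # about 0.83: one of the four identical men survived
```

However clever the algorithm, it cannot beat this number on this data without new features that separate the conflicting rows.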
If two algorithms can result in similar accuracy, how do you choose the right one? Mostly, it comes down to testing a few different ones and going with the one that shows the most promise. It also depends on your data. If you are training and testing on thousands to millions of examples, you need to look at quicker algorithms. In my experience, decision trees are quite slow in comparison to ANNs and Support Vector Machines (SVMs). For a larger problem you may want to stay away from decision trees unless you have access to parallel programming or GPUs.
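Testing a few candidates and keeping the most promising one can be sketched as a small harness that scores each candidate on held-out data. The two "models" here are deliberately tiny stand-ins written in plain Python (a majority-class baseline and a one-level decision tree), not the R decision trees or the Keras network described above, and the data is made up. In practice you would plug real library models into the same loop.

```python
from collections import Counter


def fit_majority(X, y):
    """Baseline: always predict the most common training label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label


def fit_stump(X, y):
    """One-level decision tree: split on the single feature whose per-value
    majority vote gets the most training rows right."""
    best = None
    for j in range(len(X[0])):
        votes = {}
        for x, label in zip(X, y):
            votes.setdefault(x[j], Counter())[label] += 1
        correct = sum(max(c.values()) for c in votes.values())
        if best is None or correct > best[0]:
            rule = {value: c.most_common(1)[0][0] for value, c in votes.items()}
            best = (correct, j, rule)
    _, feature, rule = best
    fallback = Counter(y).most_common(1)[0][0]  # for values unseen in training
    return lambda x: rule.get(x[feature], fallback)


def cross_val_accuracy(fit, X, y, folds=3):
    """Mean held-out accuracy over `folds` interleaved train/test splits."""
    scores = []
    for k in range(folds):
        train = [i for i in range(len(X)) if i % folds != k]
        test = [i for i in range(len(X)) if i % folds == k]
        model = fit([X[i] for i in train], [y[i] for i in train])
        scores.append(sum(model(X[i]) == y[i] for i in test) / len(test))
    return sum(scores) / folds


# Toy data, invented for illustration: (is_female, is_child) -> survived.
X = [(0, 0), (1, 0), (0, 0), (1, 0), (0, 0), (0, 1),
     (0, 0), (1, 1), (0, 0), (0, 0), (1, 0), (0, 0)]
y = [0, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0]

candidates = {"majority": fit_majority, "stump": fit_stump}
results = {name: cross_val_accuracy(fit, X, y) for name, fit in candidates.items()}
for name, score in results.items():
    print(f"{name}: {score:.2f}")  # majority: 0.67, stump: 0.83
best = max(results, key=results.get)
print("most promising:", best)
```

Scoring on rows the model was not trained on matters: an algorithm that merely memorises the training set would otherwise always look like the winner.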
Another major point in deciding on your algorithm is what kind of problem you are solving. Are you classifying data? Are you predicting how late planes will be, in minutes? Or are you trying to predict where people give up on their online shopping cart? Each of these cases requires a different kind of output, and not every ML algorithm can return every kind. Classification can be done with decision trees, ANNs, SVMs, and some others. Predicting how late planes will be requires a regression output, which can be done with ANNs and SVMs. But identifying at what point someone stops their online shopping requires path analysis.
For each problem there may be more than one possible algorithm, and it’s up to us to determine which is the right tool for the right project and to quickly throw out the wrong tools. If we can fail fast and often, we can quickly narrow down the solutions that will work and make better predictions, whilst still getting to have lots of fun with ML.