First, the seemingly blind decision to implement an SVM for improved performance. An SVM isn't magical; in fact, an SVM and a neural network are in many ways equivalent, with the SVM being the more general case. SVMs suffer from the same problem as neural networks: choosing your kernel function is just as hard as choosing your number of hidden nodes and activation function. When looking at your problem you have to ask yourself:
1) Is the model time-varying? -> NN
2) Very large N-dimensional search space? -> SVM
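To make the kernel-choice point concrete, here is a minimal stdlib-only sketch (function names are my own) of two common kernels. Swapping one for the other changes the decision surface an SVM can learn, much like swapping activation functions changes what a network can fit:

```python
import math

def linear_kernel(x, y):
    """Dot product: gives a linear decision boundary."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    """RBF (Gaussian) kernel exp(-gamma * ||x - y||^2): nonlinear boundaries."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 2.0], [1.5, 2.5]
print(linear_kernel(x, y))  # 6.5
print(rbf_kernel(x, y))     # exp(-0.25)
```

The `gamma` hyperparameter here plays the same tuning role as a network's hidden-layer size: too small and the model is nearly linear, too large and it memorizes the training set.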
Second, even without changing algorithms you can get a significant accuracy improvement by examining your features. Features are the most important part of machine learning (garbage in, garbage out). Even simple classifiers such as Naive Bayes can do well if given the right feature set. There are multiple methods to evaluate your features, such as ReliefF and ANOVA. If you find your features are not good enough, try unsupervised feature learning, or learn more about the problem domain and engineer your own features.
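As a sketch of the ANOVA approach mentioned above: it scores a feature by comparing how far apart the class means are versus how noisy the feature is within each class. A stdlib-only version (function name and data are my own, for illustration):

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one feature, split by class label.

    groups: list of lists, one list of feature values per class.
    Higher F means the feature separates the classes better.
    """
    k = len(groups)                                  # number of classes
    n = sum(len(g) for g in groups)                  # total samples
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ssb / (k - 1)) / (ssw / (n - k))

# A feature whose values cluster tightly by class scores high...
print(anova_f([[1.0, 1.1, 0.9], [5.0, 5.1, 4.9]]))
# ...while one whose classes overlap completely scores near zero.
print(anova_f([[1.0, 5.0, 3.0], [2.0, 4.0, 3.0]]))
```

Ranking all features by this score and dropping the lowest ones is a cheap first pass before reaching for anything fancier.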
The final issue, specific to HPC and machine learning, is that even given 100 cores your algorithms may not speed up. Many machine learning algorithms are iterative by nature and do not lend themselves to parallelism: MapReduce must be invoked at each iteration. As you scale up the number of available cores, the overhead of starting up and shutting down the cluster at each iteration overwhelms your gain in performance, since many nodes finish faster than others and then just sit and wait.
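A back-of-the-envelope model makes this concrete (the numbers below are invented for illustration, not measurements): if every iteration pays a fixed cluster start/stop cost, adding cores only shrinks the work term, not the overhead term.

```python
def iteration_time(work, cores, overhead):
    """One MapReduce round: perfectly parallel work plus fixed start/stop cost."""
    return work / cores + overhead

def speedup(work, cores, overhead, iterations=50):
    """Speedup over a single local core that pays no cluster overhead."""
    serial = iterations * work
    parallel = iterations * iteration_time(work, cores, overhead)
    return serial / parallel

# 100s of work per iteration, 5s of cluster overhead per iteration
for p in (1, 10, 100):
    print(p, round(speedup(100.0, p, 5.0), 1))
```

With these made-up numbers, 100 cores buys under a 17x speedup (5000s of work done in 300s), and a 1-core cluster is actually slower than a laptop. That plateau is exactly the "nodes sit and wait" effect described above.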
The solution to all of this is simple:
1) Get your features correct
2) Try new algorithms
- Try an online learning algorithm first, like Vowpal Wabbit: https://github.com/JohnLangford/vowpal_wabbit/
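The appeal of the online approach is that the model updates on one example at a time, so it never needs the whole dataset in memory or a cluster at all. The flavor can be sketched with a plain perceptron (an illustration of online updates, not Vowpal Wabbit's actual algorithm):

```python
def train_online(stream, dim, lr=0.1, passes=1):
    """Minimal online learner (perceptron): one update per example seen."""
    w, b = [0.0] * dim, 0.0
    for _ in range(passes):
        for x, y in stream:                     # labels y are -1 or +1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if (1 if score >= 0 else -1) != y:  # update only on mistakes
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

data = [([2.0, 2.0], 1), ([3.0, 1.0], 1),
        ([-2.0, -1.0], -1), ([-1.0, -3.0], -1)]
w, b = train_online(data, dim=2, passes=5)
```

Vowpal Wabbit streams examples the same way (and can re-read the data with its `--passes` flag), which is why it handles datasets far larger than RAM.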
3) HPC - Apache Spark http://spark.incubator.apache.org/
or GraphLab http://graphlab.org/
or (if on a personal computer only) GraphChi http://graphlab.org/graphchi/
- Both Spark and GraphLab support HPC with a graph-centric framework
- Orders of magnitude faster than Hadoop
- Both are built on top of Hadoop HDFS, so connect and go
Hope this helps everyone out there. Have fun and try to solve some cool problems.
One thing you might want to try is cross-validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%...). Cross-validation should help you determine whether your model is overfitting: an overfit model will perform significantly better on its training set than on the held-out data.
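The mechanics of k-fold cross-validation are simple enough to sketch in a few lines of stdlib Python (function name mine): split the data into k folds, and for each fold train on the rest and test on it.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Every sample appears in exactly one test fold; fold sizes differ
    by at most one when k does not divide n evenly.
    """
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

for train, test in k_fold_splits(10, 3):
    print(len(train), len(test))
```

For each split, fit on the train indices and score on the test indices, then compare the average test accuracy with the training accuracy; a large gap between the two is the overfitting signal described above.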
We don't have any published performance numbers at the moment, as we just deployed our cluster in our production environment. We're looking to do a post-facto write-up on that in a bit.