First, the seemingly blind decision to implement an SVM for improved performance. An SVM isn't magical; in fact, an SVM and a neural network are in many ways equivalent, with the SVM being the more general case. SVMs suffer from the same problem as neural networks: choosing your kernel function is just as hard as choosing your number of hidden nodes and activation function. When looking at your problem you have to ask yourself:
1) Is the model time-varying? -> NN
2) Very large N-dimensional search space? -> SVM
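To make the kernel-choice point concrete, here is a minimal stdlib-only sketch (function names are my own) of two common kernels. Swapping one for the other changes the decision surface an SVM can learn, much like swapping activation functions changes what a network can fit:

```python
import math

def linear_kernel(x, y):
    """Dot product: gives a linear decision boundary."""
    return sum(a * b for a, b in zip(x, y))

def rbf_kernel(x, y, gamma=0.5):
    """RBF (Gaussian) kernel exp(-gamma * ||x - y||^2): nonlinear boundaries."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

x, y = [1.0, 2.0], [1.5, 2.5]
print(linear_kernel(x, y))  # 6.5
print(rbf_kernel(x, y))     # exp(-0.25)
```

The `gamma` hyperparameter here plays the same tuning role as a network's hidden-layer size: too small and the model is nearly linear, too large and it memorizes the training set.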
Second, even without changing algorithms you can get a significant accuracy improvement by examining your features. Features are the most important part of machine learning (garbage in, garbage out). Even simple classifiers such as Naive Bayes can do well if given the right feature set. There are multiple methods to evaluate your features, such as ReliefF and ANOVA. If you find your features are not good enough, try unsupervised feature learning, or learn more about the problem domain and engineer your own features.
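As a sketch of the ANOVA approach mentioned above: it scores a feature by comparing how far apart the class means are versus how noisy the feature is within each class. A stdlib-only version (function name and data are my own, for illustration):

```python
def anova_f(groups):
    """One-way ANOVA F statistic for one feature, split by class label.

    groups: list of lists, one list of feature values per class.
    Higher F means the feature separates the classes better.
    """
    k = len(groups)                                  # number of classes
    n = sum(len(g) for g in groups)                  # total samples
    grand = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ssb = sum(len(g) * (m - grand) ** 2 for g, m in zip(groups, means))
    ssw = sum(sum((x - m) ** 2 for x in g) for g, m in zip(groups, means))
    return (ssb / (k - 1)) / (ssw / (n - k))

# A feature whose values cluster tightly by class scores high...
print(anova_f([[1.0, 1.1, 0.9], [5.0, 5.1, 4.9]]))
# ...while one whose classes overlap completely scores near zero.
print(anova_f([[1.0, 5.0, 3.0], [2.0, 4.0, 3.0]]))
```

Ranking all features by this score and dropping the lowest ones is a cheap first pass before reaching for anything fancier.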
The final issue, specific to HPC and machine learning, is that even given 100 cores your algorithms may not speed up. Many machine learning algorithms are iterative by nature and do not lend themselves to parallelism: MapReduce must be invoked at each iteration. As you scale up the number of available cores, the overhead of starting up and shutting down the cluster at each iteration overwhelms your gain in performance, since many nodes finish faster than others and then just sit and wait.
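A back-of-the-envelope model makes this concrete (the numbers below are invented for illustration, not measurements): if every iteration pays a fixed cluster start/stop cost, adding cores only shrinks the work term, not the overhead term.

```python
def iteration_time(work, cores, overhead):
    """One MapReduce round: perfectly parallel work plus fixed start/stop cost."""
    return work / cores + overhead

def speedup(work, cores, overhead, iterations=50):
    """Speedup over a single local core that pays no cluster overhead."""
    serial = iterations * work
    parallel = iterations * iteration_time(work, cores, overhead)
    return serial / parallel

# 100s of work per iteration, 5s of cluster overhead per iteration
for p in (1, 10, 100):
    print(p, round(speedup(100.0, p, 5.0), 1))
```

With these made-up numbers, 100 cores buys under a 17x speedup (5000s of work done in 300s), and a 1-core cluster is actually slower than a laptop. That plateau is exactly the "nodes sit and wait" effect described above.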
The solution to all of this is simple:
1) Get your features correct
2) Try new algorithms
- Try an online learning algorithm first, like Vowpal Wabbit: https://github.com/JohnLangford/vowpal_wabbit/
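The appeal of the online approach is that the model updates on one example at a time, so it never needs the whole dataset in memory or a cluster at all. The flavor can be sketched with a plain perceptron (an illustration of online updates, not Vowpal Wabbit's actual algorithm):

```python
def train_online(stream, dim, lr=0.1, passes=1):
    """Minimal online learner (perceptron): one update per example seen."""
    w, b = [0.0] * dim, 0.0
    for _ in range(passes):
        for x, y in stream:                     # labels y are -1 or +1
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if (1 if score >= 0 else -1) != y:  # update only on mistakes
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

data = [([2.0, 2.0], 1), ([3.0, 1.0], 1),
        ([-2.0, -1.0], -1), ([-1.0, -3.0], -1)]
w, b = train_online(data, dim=2, passes=5)
```

Vowpal Wabbit streams examples the same way (and can re-read the data with its `--passes` flag), which is why it handles datasets far larger than RAM.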
3) HPC - Apache Spark http://spark.incubator.apache.org/
or GraphLab http://graphlab.org/
or (if on a personal computer only) GraphChi http://graphlab.org/graphchi/
- Both Spark and GraphLab support HPC with a graph-centric framework
- Orders of magnitude faster than Hadoop
- Both are built on top of Hadoop HDFS, so connect and go
Hope this helps everyone out there. Have fun and try to solve some cool problems.
One thing you might want to try is cross-validation (http://en.wikipedia.org/wiki/Cross-validation_%28statistics%...). Cross-validation should help you determine whether your model is overfitting: an overfit model will perform significantly better on its training set than on the held-out data.
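The mechanics of k-fold cross-validation are simple enough to sketch in a few lines of stdlib Python (function name mine): split the data into k folds, and for each fold train on the rest and test on it.

```python
def k_fold_splits(n, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Every sample appears in exactly one test fold; fold sizes differ
    by at most one when k does not divide n evenly.
    """
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

for train, test in k_fold_splits(10, 3):
    print(len(train), len(test))
```

For each split, fit on the train indices and score on the test indices, then compare the average test accuracy with the training accuracy; a large gap between the two is the overfitting signal described above.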
We don't have any published performance numbers at the moment, as we just deployed our cluster in our production environment. We're looking to do a post-facto write-up on that in a bit.