Funny how SVMs are just max-margin loss functions, and we took for granted that you needed domain expertise to craft features like HOG/SIFT by hand.
By 2018, we use ConvNets to learn BOTH the features and the classifier. In fact, it’s hard to separate where the features end and the classifier begins (in a modern CNN).
The performance is still better in most cases, but I often have to wonder: are people just doing feature engineering once removed, and is the better performance just the result of having WAY more parameters in the model?
Is there anything like that in the ANN world?
Not really, or sort of, depending on how you think.
A deep neural network does work - at least to some extent - because of the large number of parameters. However, it is practical because it can be trained in a reasonable amount of time.
Things like ResNets are useful because they allow us to train deeper networks.
You can create an SVM with the same number of parameters[1], and in theory it could be as accurate (this is basically the no free lunch theorem[2]). But you won't be able to train it to the same accuracy.
[1] Of course there are practical concerns about what you do for features, since hand-crafted features just aren't as good as neural network ones. One thing people do now is use the lower layers of a deep neural network as a feature extractor and then put an SVM on top of them as the classifier. This works quite well and is reasonably fast to train.
See slide 7:
http://www.cs.rpi.edu/~magdon/courses/LFD-Slides/SlidesLect2...
I ask because when we're training humans to understand things, there are a variety of benefits to separating feature understanding from classification. In particular, you get gains in flexibility, extensibility, and debuggability.
I get why people are happy to take the ConvNet gains and run with them for now. But have you seen any interesting work to get the benefits of separation in the new paradigm? (Or, alternately, is there a reason why those concerns are outmoded?)
This was because we didn't understand how to train a deep model end-to-end until later. When we learned how to make that end-to-end training work it tended to perform better because the learned features were task specific.
You can still learn general features in a bunch of ways, in addition to the older method of using autoencoders. For one example, multiple supervised heads with auxiliary losses can learn more generalizable features.
Otherwise, why would we need a gazillion architectures for different problems (or even the same exact problem)?
Disclaimer: written by me.
[1] https://blog.statsbot.co/support-vector-machines-tutorial-c1...
In primal form, an SVM is literally just a linear model with hinge loss instead of log loss (logistic regression) or squared loss (ordinary linear regression). For apparently historical reasons, it is usually derived from the "hard-margin" SVM in dual form, motivated by maximizing the margin. This is complicated and not very intuitive.
This also causes people to conflate the kernel trick with the dual form, when in fact they have nothing to do with each other. You can use the kernel trick in a primal SVM just fine.
Stochastic gradient descent can also be used for primal methods, while it doesn't work in the dual. That makes it much faster for large problems than the dual.
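As a concrete illustration of the primal approach (my own toy sketch, not from the thread): a linear SVM trained entirely in the primal with plain SGD on the regularized hinge loss, in a few lines of numpy.

```python
import numpy as np

def sgd_primal_svm(X, y, lam=0.01, lr=0.1, epochs=200, seed=0):
    """Linear SVM trained in the primal with plain SGD on the
    regularized hinge loss:  mean(max(0, 1 - y_i * w.x_i)) + lam * ||w||^2.
    Labels y must be in {-1, +1}; the bias is folded in as an extra feature."""
    rng = np.random.default_rng(seed)
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append constant-1 column for bias
    w = np.zeros(Xb.shape[1])
    for _ in range(epochs):
        for i in rng.permutation(len(Xb)):
            margin = y[i] * (Xb[i] @ w)
            # subgradient of the hinge term is -y_i * x_i when margin < 1, else 0
            grad = 2 * lam * w
            if margin < 1:
                grad = grad - y[i] * Xb[i]
            w = w - lr * grad
    return w

# Toy linearly separable data: label is the sign of the first coordinate.
X = np.array([[2.0, 1.0], [1.5, -0.5], [-2.0, 0.3], [-1.0, -1.2]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = sgd_primal_svm(X, y)
preds = np.sign(np.hstack([X, np.ones((len(X), 1))]) @ w)
```

No Gram matrix, no QP solver: each SGD step touches one row of the data, which is why the primal formulation scales to large N.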
Once you realize that a linear SVM isn’t very different from logistic regression, it starts to all make sense (at least it did for me).
Key insight of the hinge loss: once something is classified correctly beyond the margin, it incurs a loss of zero.
Now, something fun to think about. Draw the hinge loss. Now draw the ReLU (which is found all over the place in CNNs). Now think about L1 regularization (which was used to induce sparsity in compressed sensing). They are more similar in form than you would think.
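To make that concrete: the hinge loss is literally a shifted, flipped ReLU, i.e. hinge(m) = relu(1 - m) on the margin m = y * f(x). A two-line check:

```python
def relu(z):
    """ReLU activation: max(0, z)."""
    return max(0.0, z)

def hinge(margin):
    """Hinge loss on the margin m = y * f(x): zero once m >= 1."""
    return max(0.0, 1.0 - margin)

# The hinge loss is just the ReLU applied to (1 - margin).
for m in [-2.0, 0.0, 0.5, 1.0, 3.0]:
    assert hinge(m) == relu(1.0 - m)
```

That zero region past the margin is what produces sparsity in the "active" examples (the support vectors), loosely analogous to how L1 regularization zeroes out coefficients.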
Some people have had good luck with hinge or multi-hinge loss for neural networks instead of the almost universal log loss, since of course the hinge loss can be used in things other than linear models. It doesn't care how you get the y output.
The QP-dual formulation never seemed like something that could scale, and linear SVMs never seemed all that much better than just lasso/elasticnet regression. (hmmmmm :) )
The same thing happened with data storage. As soon as big data appeared everyone stopped doing just data and started doing "big data". Now the term is kind of a joke even.
I predict in a few years "deep learning" term will become mostly used in an ironic sense as well.
I may be a bit behind the times, but I'm also mystified by "deep learning's" popularity. Both giant neural nets and kernel methods have overfitting problems: torture a billion-parameter model long enough, and it will tell you what you want to hear.
SVMs address this by finding a large margin for error, which will hopefully improve generalization. DNNs (I think) do this by throwing more ("big") data at the problem and hoping that the training set covers all possible inputs. Work on adversarial learning suggests that DNNs go completely off the rails when presented with anything slightly unexpected.
I don't think your scenario is likely to occur unless something else starts outperforming deep learning (in the broadest sense) _and_ there's an approachable alternative to solve the same problems at least as well.
SVMs have been in active use since the early nineties and were formulated long before that.
I think the author may be working from a very different definition of the word "idiot".
In the pdf, it said that the optimization problem in SVMs has a nice property: it is quadratic, which means there's a single global minimum to go towards, not lots of local minima like in NNs. So it seems SVMs won't get stuck at a suboptimal solution.
Is that not a problem in DNNs now? Or is it that it's such high dimensionality that local minima don't stop the optimizer, because there's always another way around the local minimum?
With deep learning, the problem is non-convex, so the results are locally optimal but not necessarily globally optimal. We get around this with different techniques such as randomized initial weights, but with few exceptions, non-convex optimization problems are NP-hard. Therefore there is no polynomial-time algorithm for finding the global optimum in deep learning unless P==NP.
I occasionally hear people claim deep learning finds the globally optimal solution, but that’s just not correct (or they mean they are doing an exhaustive search e.g. branch and bound, which has exponential worst case run time and isn’t practical).
That said, locally optimal solutions are often “good enough” for practical problems.
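A toy one-dimensional illustration of the convex/non-convex difference (my sketch, not specific to any real network): gradient descent on a convex quadratic lands on the same global minimum from any start, while on a simple non-convex quartic the answer depends entirely on the initialization.

```python
def gradient_descent(grad, w0, lr=0.05, steps=2000):
    """Plain gradient descent from a given starting point."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)
    return w

# Convex case: f(w) = (w - 3)^2 has a single global minimum at w = 3.
convex_grad = lambda w: 2 * (w - 3)
a = gradient_descent(convex_grad, w0=-10.0)   # -> 3, regardless of start
b = gradient_descent(convex_grad, w0=+10.0)   # -> 3, regardless of start

# Non-convex case: f(w) = (w^2 - 1)^2 has two minima, at w = -1 and w = +1.
nonconvex_grad = lambda w: 4 * w * (w * w - 1)
c = gradient_descent(nonconvex_grad, w0=-0.5)  # converges to -1
d = gradient_descent(nonconvex_grad, w0=+0.5)  # converges to +1
```

This is exactly why people randomize (or re-randomize) initial weights: in the non-convex case, which minimum you reach is decided by where you start.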
With DNNs, people learn everything (features + decision boundary), and this problem is not convex. Surprisingly DNNs work quite well in practice even though we were taught to be afraid of non-convex problems in grad school around 2005.
If, back in the early 2000s, we had stopped worrying about theoretical issues and explored more approaches like ConvNets, we might have had the deep learning revolution 10 years earlier.
I once ran a small neural net 100 times on simple three-dimensional data, reselecting the initial weights to be small and random on each run. I found 32 distinct minima, each of which gave a different picture, and having about equal test set error.
– Leo Breiman
[1]: https://projecteuclid.org/download/pdf_1/euclid.ss/100921372...
[2]: https://stats.stackexchange.com/questions/203288/understandi...
Note also that "optimal" in each case means "relative to other models in the same family." But a large neural net and an SVM do not exist in the same hypothesis space. As an analogy, recall that Ordinary Least Squares is the BLUE (Best Linear Unbiased Estimator.) But that's only relative to other unbiased (which in particular means no regularization: not ridge/lasso/elasticnet) linear models over the same feature set. Just because OLS provably gives the best performance in this family doesn't mean it is going to do well compared to models outside that family. In the same way, just because SVM offers a convergence guarantee doesn't mean its performance will always be the same or better than an NN.
The real problem with SVMs is that when you use a kernel (which is the whole point; linear SVMs are just logistic regression with hinge loss) you introduce one landmark (feature) for every data point. So if your original data set was a manageable 1e6x1e3, the SVM will view it as 1e6x1e6. It doesn't actually have to store that in memory thanks to the kernel trick[3], but training time still scales as O(N^2), where N is the number of observations (rows) in the original data. In practice, even with a highly optimized library like LIBSVM[4], you're not going to get good performance past N=1e6. (I've personally had the most success with SVMs on small datasets of only a few thousand rows.) NNs, by contrast, can easily accommodate much larger training sets, with training time only growing as O(N), more or less. You can always sample a manageable training set from a larger data set, but if you're only using 1% of your data to train a model, that's a problem.
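To see the blow-up concretely, here's a minimal numpy sketch of the full RBF Gram matrix that a kernel SVM implicitly works over; its N x N shape is where the O(N^2) comes from.

```python
import numpy as np

def rbf_gram(X, gamma=1.0):
    """Full RBF kernel (Gram) matrix: K[i, j] = exp(-gamma * ||x_i - x_j||^2).
    For N rows of X this is an N x N matrix, so memory and time grow as
    O(N^2) in the number of observations, independent of the original
    feature count."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    return np.exp(-gamma * np.clip(d2, 0, None))   # clip tiny negative rounding

X = np.random.default_rng(0).normal(size=(100, 10))  # N=100 rows, 10 features
K = rbf_gram(X)
# K is 100 x 100 even though X is only 100 x 10; at N = 1e6 the Gram
# matrix would have 1e12 entries.
```

Solvers like LIBSVM avoid materializing all of K at once via caching and decomposition, but the quadratic dependence on N is still what bites on large datasets.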
Deep NN is also a far more modular approach: if you know your data has a 2D structure for example, you can add some CNN layers and get translational invariance. Deep NN comes with a large toolbox for specializing the model to the data, while the options for tuning an SVM are much more limited: you can change out the kernel function (not that anything is likely to beat RBF) and play around with the regularization parameter, and that's it. Once you've tuned those hyper-parameters with a small grid search there's not much else left to try with SVMs. If they work, they work, otherwise you have to change approaches.
[3]: https://en.wikipedia.org/wiki/Kernel_method
[4]: https://www.csie.ntu.edu.tw/~cjlin/libsvm/
I have a minor nitpick regarding this point you make: "not that anything is likely to beat RBF". Depending on the data, specialized kernels can help immensely. An easy example is sequence classification, where something like a string kernel might work really well. Or image classification, where histogram-based kernels might prove superior.
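For the curious, the histogram intersection kernel mentioned for images is about as simple as specialized kernels get. A minimal numpy sketch (toy histograms of my own, not real image data):

```python
import numpy as np

def histogram_intersection_kernel(h1, h2):
    """Histogram intersection kernel: k(h1, h2) = sum_i min(h1[i], h2[i]).
    A classic kernel for comparing normalized image histograms; identical
    histograms score 1.0 and fully disjoint ones score 0.0."""
    return float(np.minimum(h1, h2).sum())

# Two toy normalized 4-bin histograms (stand-ins for image color histograms).
a = np.array([0.5, 0.3, 0.2, 0.0])
b = np.array([0.4, 0.4, 0.1, 0.1])
overlap = histogram_intersection_kernel(a, b)  # bin-wise overlap, here 0.8
```

Unlike RBF, the similarity here directly reflects how much mass the two histograms share, which is often a better fit for bag-of-features image representations.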
Note that sometimes you might want to measure how good a kernel is for a problem not by its prediction accuracy alone but also by the number of support vectors it needs - if the final model retains ~100% of the training data as support vectors, it is not a great model in some (subjective) sense since it is memorizing a lot. Depending on the data, you might "beat" the RBF kernel on this aspect too.
Regarding the training time, there are some interesting tricks I've come across (but have not tried out yet): [1], [2].
[1] Ensemble SVMs http://www.jmlr.org/papers/volume15/claesen14a/claesen14a.pd...
[2] SVMPath - algorithm to fit the entire path of SVM solutions for every value of the cost parameter, with essentially the same computational cost as fitting one SVM model. http://www.jmlr.org/papers/volume5/hastie04a/hastie04a.pdf