This site is the bread and butter of every Research Engineer and Scientist working in Deep Learning. You use the site almost every day.
Advanced learners also use the site regularly.
You just assume that "everyone knows" and never think of sharing the site on HN.
Knowledge about a field transfers best by hands-on association with people who practice it. Before widespread IT, communities of practice were local and relatively homogeneous, so it was easy to share the essentials of a field quickly and get newcomers up and running with best practices.
Nowadays, however, communities of practice are dispersed: practitioners come from all over the world with very different backgrounds and communicate through low-bandwidth channels, and we're so flooded with information that it's difficult to ascertain what is essential and what is accessory.
It is much more difficult for an outsider to grasp the essential qualities of a field they want to enter, as there are usually no guides comprehensive enough to detail everything you need to know.
I have never found a subject that needed, say, more than 10 minutes of internet searching to decide whether it's worth pursuing.
It was much harder before the web. I remember as a kid seeing books about C++ in the local shop and, even after looking inside, not understanding what C++ was. Nowadays I would get my answer almost instantly.
You couldn't possibly believe this if you were old enough to remember what preceded the internet.
Good lord, no, today is not worse than microfiche and card catalogs.
“Bread and butter” for me is http://arxiv-sanity.com
https://news.ycombinator.com/item?id=19054501 (Feb 1, 2019) 411 points, 23 comments
https://news.ycombinator.com/item?id=23391934 (June 2, 2020) 304 points, 21 comments
- Checking the state of the art (SotA) for a given problem. For some problems, 2-year-old solutions are still close to SotA; for others, there is a huge difference. And if there is a huge difference, is it because of architecture and parameter tuning, or because of totally different architectures and training modes?
- Running code - to be used somewhere, or as a reference. Papers never include all the details, and prose doesn't compile.
Context: I used to work in the field as a consultant. I do cite Papers with Code in one overview paper, though.
Pre-parenthesis part is dead serious, parenthesis part is slightly hyperbolic due to accumulated trauma with bad reviewers
"Ten Thousand"
I get that your run-of-the-mill paper saying "Here we present a novel algorithm for xyz" will usually have the algorithm defined in simple pseudo-code, maybe with an implementation in a "real" language as a proof of concept.
But for the many papers describing novel ML models, how does that work? They seem to use images that diagram out the different layers of the model. But is that truly "universal" the way that a pseudo-code algorithm is universal? As in, if the authors use PyTorch (or whatever), can I take the exact model they describe in their paper and apply it in MyFavoriteMLToolkit and achieve similar results?
I guess my question is, what are the "primitives" of papers describing ML models? Is saying "convolutional layer" enough, or do they also describe the dozens of hyper-parameters, etc?
The longer answer is that the rest of the Conv2D configuration can easily be overlooked unless it is changed from the defaults. And those defaults can differ across frameworks and potentially break things, if they even exist in your preferred framework. You can always create custom layers, though, if needed.
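To make the "defaults" point concrete, here's a toy single-channel convolution (illustrative only, not any framework's implementation) where the stride and padding that papers often leave implicit visibly change the output shape:

```python
import numpy as np

def conv2d(x, k, stride=1, padding=0):
    """Toy single-channel 2D convolution (cross-correlation), with the
    stride/padding defaults that papers often leave implicit."""
    if padding:
        x = np.pad(x, padding)
    kh, kw = k.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            window = x[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(window * k)
    return out

x = np.arange(25.0).reshape(5, 5)
k = np.ones((3, 3))
print(conv2d(x, k).shape)              # "valid"-style defaults: (3, 3)
print(conv2d(x, k, padding=1).shape)   # "same"-style padding: (5, 5)
```

If a paper just says "a 3x3 convolution", the output shape (and thus every downstream layer) depends on which of these conventions the authors' framework defaulted to.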
But many papers also seem to do a bad job of describing the actual structure of their own network. They can be vague, confusing, or simply inaccurate. That can be because the model is a general concept with flexible details, or because the authors struggle to put it into clear words and diagrams. Or simply because they know the code is going to do the heavy lifting.
To keep things simple, I'd say the true "primitives" of ML models can be reduced to mathematical formulas. For example, a plain old feed-forward network is implemented as matrix multiplication. Sprinkle in a bit of calculus to analytically derive the formula for back-propagating errors (aka training), and you have the basic building blocks of modern deep learning. Convolutions, Transformers, etc. are just slightly fancier spins on the same mathematical foundations.
Hyper-parameters are essentially tunable variables in a formula. I'd say your instinct is spot on - they are absolutely necessary to capture for reproducible results.
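To make that concrete, here is a toy one-hidden-layer network (illustrative, with made-up data and arbitrary hyper-parameters) written as nothing but matrix multiplies plus hand-derived gradients:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))                       # 8 samples, 3 features
y = (X.sum(axis=1, keepdims=True) > 0).astype(float)

# Hyper-parameters: hidden width, init scale, learning rate, steps.
W1 = rng.normal(size=(3, 4)) * 0.5
W2 = rng.normal(size=(4, 1)) * 0.5
lr = 0.5

for _ in range(500):
    h = np.tanh(X @ W1)                 # forward pass: matrix multiplies
    p = 1.0 / (1.0 + np.exp(-(h @ W2))) # sigmoid output
    # Backward pass: chain rule derived by hand.
    # For sigmoid + cross-entropy, d(loss)/d(logits) = p - y.
    d_logits = (p - y) / len(X)
    dW2 = h.T @ d_logits
    dh = d_logits @ W2.T * (1.0 - h**2) # tanh'(z) = 1 - tanh(z)^2
    dW1 = X.T @ dh
    W1 -= lr * dW1
    W2 -= lr * dW2

acc = ((p > 0.5) == y).mean()
print(acc)
```

Every piece here is a formula; the framework's job is mostly to derive the backward pass automatically and run the matrix multiplies fast.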
If you have the code and the data the answer should be yes. You should be able to take that PyTorch code and translate it to MyFavoriteMLToolkit to obtain numerically identical results.
In practice, we face the same universal difficulties as other computer-science-based research: fighting inconsistencies in software and hardware, all the way down to the physics of the universe with cosmic-ray-induced bit flips, etc.
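Even summation order matters: floating-point addition is not associative, so the reduction order a framework happens to use changes the last bits of a result. A minimal Python illustration:

```python
import math

# Accumulating ten 0.1's left to right loses a bit to rounding;
# math.fsum tracks the exact sum and rounds once at the end.
xs = [0.1] * 10
print(sum(xs))        # 0.9999999999999999
print(math.fsum(xs))  # 1.0
```

Now imagine the same effect across parallel GPU reductions with nondeterministic ordering, and "numerically identical" becomes "identical up to a tolerance" in practice.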
Generally, yes.
If they are standard, well-known layers that exist in both PyTorch and TF, you can take a paper that was implemented in one, implement it in the other, and expect similar results (assuming you know a reasonable number of details[1]).
If they are non-standard layers it can be hard. There are lots of details that you need to port and even with access to the source code it can be easy to miss things.
[1] Here's an example of how things are implemented differently - you can still get the same result, but you need to know what you are doing: https://stackoverflow.com/questions/60079783/difference-betw...
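One classic gotcha in this family (not necessarily the one in the linked question) is the batch-normalization momentum convention: in Keras the momentum weights the old running statistic, while in PyTorch it weights the new batch statistic, so the same number means opposite things. A toy sketch of the two update rules (illustrative, not either framework's actual code):

```python
def update_keras_style(running, batch_mean, momentum=0.99):
    # Keras convention: momentum weights the OLD running statistic.
    return momentum * running + (1 - momentum) * batch_mean

def update_torch_style(running, batch_mean, momentum=0.1):
    # PyTorch convention: momentum weights the NEW batch statistic.
    return (1 - momentum) * running + momentum * batch_mean

# The same value means opposite things: Keras momentum=0.75 matches
# PyTorch momentum=0.25 (values chosen to be exact in binary floats).
r_keras = r_torch = 0.0
for m in [1.0, 2.0, 3.0]:
    r_keras = update_keras_style(r_keras, m, momentum=0.75)
    r_torch = update_torch_style(r_torch, m, momentum=0.25)
assert r_keras == r_torch
```

Copy a momentum value verbatim from one framework's config into the other and your running statistics drift in opposite directions, even though every layer "exists" in both.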
Nothing you can't figure out by reading source code of the two frameworks or by reading the documentation closely.
Generally, people don't seem to care about reproducing exact metrics - as long as it is close enough they're happy. You need to dig a bit deeper if you want the full quality.
My experience has been that pseudo-code is anything but universal.
In fact, having many times had to implement actual working code from research papers' pseudo-code, I would posit that pseudo-code is nothing but a license for academics to provide the reader with stuff that simply doesn't work and get away with it. Thanks to pseudo-code, they get to gently skip over the hard bits and get the paper out the door as quickly as possible.
Papers with actual, git-clonable, working code, should be the standard for CS academic publishing.
That's why a large number of journals now have requirements for publishing code and/or pretrained models (if applicable).
An annoying trend I've noticed: a number of SotA ML papers in video classification present multiple models but only publish the exact architecture & weights for the smaller models, which are merely as-good-as SotA (see Tiny Video Networks and X3D for examples).
Arxiv.org won't accept a PDF with attachments, though, so only a stripped-down version will end up there (once/if I get an endorsement, fingers crossed).
I copied this concept from Joe Armstrong, who suggested distributing Erlang modules as PDFs with the code files (*.erl) as attachments: "Documentation comes first, and the distribution should prioritize humans".
[1]: See Section A.1 of https://github.com/motiejus/wm/blob/main/mj-msc-full.pdf
I've stumbled upon a number of scientific papers from 2000s that include links to sourceforge for code listings. Most of those are dead now.
GitHub will not be there forever.
https://jeffhuang.com/best_paper_awards/
And here's the PapersWeLove repo with similar sauce
[1] Attention Is All You Need (2017) https://paperswithcode.com/paper/attention-is-all-you-need
Introduced the Transformer architecture and applied it to NLP tasks.
[2] The Annotated Transformer (2018) https://nlp.seas.harvard.edu/2018/04/03/attention.html
An “annotated” version of [1] in the form of a line-by-line PyTorch implementation. Super helpful for learning how to implement Transformers in practice!
[3] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding (2018) https://paperswithcode.com/paper/bert-pre-training-of-deep-b...
One of the most highly cited papers in machine learning! Proposed an unsupervised pre-training objective called masked language modeling; learned bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers.
Bonus: https://nlp.stanford.edu/seminar/details/jdevlin.pdf
See the above slideshow from the primary author, noting the remarkably prescient conclusion: "With [unsupervised] pre-training, bigger == better, without clear limits (so far)"
[4] Conformer: Convolution-augmented Transformer for Speech Recognition (2020) https://paperswithcode.com/paper/conformer-convolution-augme...
Proposed an architecture combining aspects of CNNs and Transformers; performed data augmentation in frequency domain (spectral augmentation).
[5] Scaling Laws for Neural Language Models (2020) https://paperswithcode.com/paper/scaling-laws-for-neural-lan...
Arguably one of the most important papers published in the last 5 years! Studies empirical scaling laws for (Transformer) language models; performance scales as a power-law with model size, dataset size, and amount of compute used for training; trends span more than seven orders of magnitude.
[6] Language Models are Few-Shot Learners (May 2020, NeurIPS 2020 Best Paper) https://paperswithcode.com/paper/language-models-are-few-sho...
Introduced GPT-3, a Transformer model with 175 billion parameters, 10x more than any previous non-sparse language model. Trained on Azure's AI supercomputer, with training costs rumored to be over 12 million USD. Presented evidence that the average person cannot distinguish between real and GPT-3-generated news articles that are ~500 words long.
[7] CvT: Introducing Convolutions to Vision Transformers (May 2020) https://paperswithcode.com/paper/cvt-introducing-convolution...
Introduced the Convolutional vision Transformer (CvT) which has alternating layers of convolution and attention; used supervised pre-training on ImageNet-22k.
[8] Pushing the Limits of Semi-Supervised Learning for Automatic Speech Recognition (Oct 2020) https://paperswithcode.com/paper/pushing-the-limits-of-semi-...
Scaled up the Conformer architecture to 1B parameters; used both unsupervised pre-training and iterative self-training. Observed through ablative analysis that unsupervised pre-training is the key to enabling growth in model size to transfer to model performance.
[9] Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity (Jan 2021) https://paperswithcode.com/paper/switch-transformers-scaling...
Introduced the Switch Transformer architecture, a sparse Mixture of Experts model advancing the scale of language models by pre-training up to 1-trillion-parameter models. The sparsely activated model has an outrageous number of parameters but a constant computational cost. The 1T-parameter model was distilled (shrunk) by 99% while retaining 30% of the performance benefit of the larger model. Findings were consistent with [5].
[10] ProtTrans: Towards Cracking the Language of Life's Code Through Self-Supervised Deep Learning and High Performance Computing (August 2021) https://paperswithcode.com/paper/prottrans-towards-cracking-...
Applied Transformer-based NLP models to classify & predict properties of protein structure for a given amino acid sequence, using supercomputers at Oak Ridge National Laboratory. Proved that unsupervised pre-training captured useful features; used the learned representation as input to small CNN/FNN models, yielding results challenging state-of-the-art methods, notably without using multiple sequence alignment (MSA) and evolutionary information (EI) as input. Highlighted a remarkable trend across an immense diversity of protein LMs and corpora: performance on downstream supervised tasks increased with the number of samples presented during unsupervised pre-training.
[11] CoAtNet: Marrying Convolution and Attention for All Data Sizes (December 2021) https://paperswithcode.com/paper/coatnet-marrying-convolutio...
Current state of the art Top-1 Accuracy on ImageNet.
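The masked-language-modeling objective from BERT [3] is easy to sketch. Below is a toy version (made-up token ids, a hypothetical MASK_ID, and a flat 15% masking rate; the actual BERT recipe also keeps some selected tokens unchanged or replaces them with random tokens):

```python
import numpy as np

MASK_ID = 103  # hypothetical id for the [MASK] token

def mask_tokens(token_ids, mask_prob=0.15, seed=0):
    """Toy BERT-style masking: select ~15% of positions, replace them
    with [MASK], and emit labels that are -100 (i.e. ignored by the
    loss) everywhere except at the masked positions."""
    rng = np.random.default_rng(seed)
    ids = np.asarray(token_ids)
    picked = rng.random(ids.shape) < mask_prob
    labels = np.where(picked, ids, -100)
    inputs = np.where(picked, MASK_ID, ids)
    return inputs, labels

ids = np.arange(1000, 1064)   # made-up token ids for one sequence
inputs, labels = mask_tokens(ids)
```

The model then sees `inputs` (with both left and right context intact around each [MASK]) and is trained to recover the original ids at exactly the masked positions, which is what makes the learned representations bidirectional.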
According to the latest ImageNet standings [2], ViT appears to have slipped to second place in Top-1 Accuracy. CoAtNet-7 is the new leader, but only by a slight margin and at the cost of what appears to be a significantly larger model.
[1] Scaling Vision Transformers https://paperswithcode.com/paper/scaling-vision-transformers
[2] https://paperswithcode.com/sota/image-classification-on-imag...