For instance, never train a model end-to-end in FP16. Use mixed precision, either via the native TF/PyTorch support or as a freebie when using TF32 on A100s. That ensures only suitable ops run at lower precision; no need to fiddle with anything. Also, PyTorch DDP in multi-node setups hasn't been slower or less efficient than Horovod in ages.
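For anyone who hasn't set this up before, the PyTorch side really is just a few lines. A minimal sketch of the autocast + GradScaler pattern (the tiny model and random data are placeholders; on a box without a GPU, `enabled=False` makes the whole thing fall back to plain FP32):

```python
import torch

use_amp = torch.cuda.is_available()
device = "cuda" if use_amp else "cpu"

model = torch.nn.Linear(16, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

x = torch.randn(32, 16, device=device)
y = torch.randn(32, 1, device=device)

for _ in range(3):
    optimizer.zero_grad()
    with torch.cuda.amp.autocast(enabled=use_amp):
        # Autocast runs matmuls and convs in FP16 where it's numerically safe,
        # and keeps precision-sensitive ops (losses, reductions) in FP32.
        loss = torch.nn.functional.mse_loss(model(x), y)
    scaler.scale(loss).backward()  # scale loss to avoid FP16 gradient underflow
    scaler.step(optimizer)         # unscales grads, then calls optimizer.step()
    scaler.update()

final_loss = float(loss)
```

That's the entire "fiddling" required; the framework decides per-op which precision to use.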
Finally, buying a local cluster of TITAN Xs is an outright weird recommendation for massive models. VRAM limitations alone make this a losing proposition.
This blog is more of an intro to a few high-level concepts (multi-GPU and multi-node training, FP32 vs FP16, buying hardware and dedicated machines vs AWS/GCP, etc.) for startups that are early in their deep learning journey and might need a nudge in the right direction.
If you're looking for a deep dive into the best GPUs to buy (cost/perf, etc.), the link in the comment below gives a pretty good overview.
PS - I can send you some benchmarks we did that show (at least for us) Horovod is ~10% faster than DDP for multi-node training FWIW. Email is in my profile!
Do you have an alternative recommendation?
It provides some modern, real-life deep learning benchmarks using the mixed precision (TF32) that the GP was referring to.
Did you play around with any AI-specific accelerators (eg TPUs?)
Looking at some basic cost analysis from a stranger on the Internet - https://medium.com/bigdatarepublic/cost-comparison-of-deep-l... - you can probably get a decent price reduction in training, especially using preemptible instances (and perhaps a better pricing contract with Google/AWS)
It's kind of crazy how the GPU shortage is affecting pricing on physical devices. My RTX Titan, which I bought in 2019 for $2,499, runs almost $5k on Amazon and is in short supply. The Titan V you linked (although I think there's a typo, since you referred to it as a Titan X) is an option - but it's still super overpriced for its performance. Of course, this will probably settle down in the next year or two, and by then there will be new GPUs that are ~2-4x flop/$ compared to the V100/A100.
How to train large deep learning models at a well-funded startup*
Everything described here is simply not affordable for bootstrappers and startups with little funding, unless the model to train is not that deep.
Other tips not mentioned in the article:
1. Tune your hyperparameters on a subset of the data.
2. Validate new methods with smaller models on public datasets.
3. Tune models instead of training from scratch (either public models or your previously trained ones).
1. If you choose the wrong subset, you'll end up in a non-optimal local minimum.
2. You still risk dead ends when expanding the model, and it lengthens the time to find that out.
3. A lot of public models are built from inaccurate datasets, so beware.
Overall you have to start somewhere though, and your points are still valid.
Full disclosure: I'm a founder of the project.
Concretely, the system is regularly taking checkpoints (which include model weights and optimizer state), so if the spot instances disappear (as they do), the system has enough information to resume from where things were last checkpointed when resources become available again.
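As a toy illustration of the idea - pure Python, with pickle standing in for `torch.save`/`torch.load`, and all names hypothetical rather than the project's actual API:

```python
import os
import pickle

def save_checkpoint(path, step, weights, opt_state):
    # Write atomically: dump to a temp file, then rename, so a preemption
    # mid-write never leaves a corrupt checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump({"step": step, "weights": weights, "opt_state": opt_state}, f)
    os.replace(tmp, path)

def load_checkpoint(path):
    if not os.path.exists(path):
        return None
    with open(path, "rb") as f:
        return pickle.load(f)

def train(path, total_steps, checkpoint_every=10, die_at=None):
    # Resume from the last checkpoint if one exists, else start fresh.
    ckpt = load_checkpoint(path)
    step = ckpt["step"] if ckpt else 0
    weights = ckpt["weights"] if ckpt else 0.0
    while step < total_steps:
        if die_at is not None and step == die_at:
            return step  # simulate the spot instance disappearing
        weights -= 0.01  # stand-in for a real optimizer step
        step += 1
        if step % checkpoint_every == 0:
            save_checkpoint(path, step, weights, {"lr": 0.01})
    return step
```

Losing a spot instance then only costs you the work done since the last checkpoint, not the whole run.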
Of course, some development practices, such as ensuring that your loss function works in a basic sense, are covered in many places. But I'd love to see more in-depth coverage of architecture development & development best practices. Does anyone know of any particularly good resources / discussions there?
It'd be cool if there were a way to train a phrase locally on your own premises and then use that to trigger the real transcription.
This probably wouldn't be super difficult to build, but I was wondering if it was already available (I didn't see anything at a glance)
There are some open source libraries that make this relatively easy:
- https://github.com/Kitt-AI/snowboy (looks to be shut down now)
- https://github.com/cmusphinx/pocketsphinx
This avoids having to stream audio 24x7 to a cloud model, which would be super expensive. That said, I'm pretty sure what Alexa does, for example, is send any positive wake-word detection to a cloud model (that is bigger and more accurate) to verify the local model's prediction, AFAIK.
Once you're confident the wake word was really detected - that's when you start streaming to an accurate cloud-based transcription model like Assembly, to minimize costs!
Here's an example repo that might be interesting (from initial impressions, though there are many more out there) : https://github.com/vineeths96/Spoken-Keyword-Spotting
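The cascade described above - a cheap always-on local detector gating an expensive cloud check - can be sketched in a few lines. Everything here is a stand-in (the "models" are toy functions, not any particular library's API):

```python
def local_wake_score(audio_frame):
    # Stand-in for a tiny on-device model (e.g. a small NN over MFCCs):
    # here we just use mean frame energy as a toy score.
    return sum(x * x for x in audio_frame) / len(audio_frame)

def cloud_verify(audio_frame):
    # Stand-in for the bigger, more accurate cloud model that double-checks
    # a local detection before you start streaming to a transcription API.
    return local_wake_score(audio_frame) > 0.5

def process_stream(frames, local_threshold=0.2):
    cloud_calls = 0
    detections = []
    for i, frame in enumerate(frames):
        if local_wake_score(frame) >= local_threshold:  # cheap gate, runs 24x7
            cloud_calls += 1
            if cloud_verify(frame):  # expensive check, runs rarely
                detections.append(i)  # start streaming transcription here
    return detections, cloud_calls

silence = [0.01] * 160
speech = [0.9] * 160
hits, calls = process_stream([silence, silence, speech, silence])
```

The point is the economics: the cloud model only ever sees the handful of frames the local gate lets through.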
uMusic patent: https://patents.google.com/patent/CN1637743A/en
Further reading: http://products.bose.com/pdf/customer_service/owners/uMusic_...
The best do-it-yourself instructions are in a book called TinyML.
Compared to super-deep transformers, you'll find that deployed WW detectors are as simple as SVMs or two-layer NNs.
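For a sense of scale, a two-layer detector really is just this. A NumPy sketch - the input/hidden sizes are made up, and real weights would of course come from training rather than random init:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical input: 13 MFCC coefficients x 40 frames, flattened.
n_in, n_hidden = 13 * 40, 64
W1, b1 = rng.normal(0, 0.01, (n_in, n_hidden)), np.zeros(n_hidden)
W2, b2 = rng.normal(0, 0.01, (n_hidden, 1)), np.zeros(1)

def wake_word_prob(features):
    # Two-layer NN: linear -> ReLU -> linear -> sigmoid.
    h = np.maximum(0.0, features @ W1 + b1)
    logit = h @ W2 + b2
    return 1.0 / (1.0 + np.exp(-logit))

x = rng.normal(size=n_in)  # stand-in for real MFCC features
p = wake_word_prob(x)
```

A model this size runs comfortably on a microcontroller, which is exactly why it can afford to be always-on.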
Salary costs are probably even higher than compute costs. Automatic Speech Recognition is an industrial scale application, it costs a lot to train, but so do many other projects in different fields. How expensive is a plane or a ship? How much can a single building cost? A rocket launch?
I mean, that's a pretty good principled approach to a lot of ML problems.
Unless you're Google. Who even trains models from scratch these days? At most you do some fine-tuning
https://news.ycombinator.com/item?id=26251322
My email is in my profile if you want to reach out to chat more!
Another advantage is that you can do more custom things - add words to vocabulary, detect speakers with biometric features, detect emotions.
For example, if you take the WER of "I live in New York" vs "i live in new york", you get 60%, because three of the five words differ when a capitalized version is compared against an uncapitalized one.
This is why public WER results vary so widely.
We publish our own WER results and normalize the human and automatic transcription text as much as possible to get as close to "true" numbers as possible. But in reality, we see a lot of people comparing ASR services simply by doing diffs of transcripts.
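Word-level WER is just Levenshtein distance over tokens divided by the reference length, which makes the capitalization effect easy to reproduce:

```python
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # Word-level Levenshtein distance via dynamic programming.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

raw = wer("I live in New York", "i live in new york")        # 3 of 5 words differ
normalized = wer("i live in new york", "i live in new york") # after lowercasing both
```

Same audio, same transcript quality - 60% WER or 0% WER depending entirely on normalization, which is why raw diffs of transcripts are so misleading.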
First, at my company Milk Video, we are huge fans of Assembly AI. The quality, speed and cost of their transcription is galaxies beyond the competition.
Having worked in machine learning focused companies for a few years, I have been researching this exact question. I'm curious how I can better forecast the amount of ML talent I should expect to build into our team (we are a seed stage company), and how much I can confidently outsource to best-in-class.
A lot of the ML services we use now are utilities that we don't want to manage (speech-to-text, video content processing, etc), and also want to see improve. We took a lot of time to decide who we outsource these things to, like working with AssemblyAI, because we were very conscious of the pace of improvement in speech-to-text quality.
When we were comparing products, the most important questions were:
1. How accurate is the speech-to-text API
1.a Word error rate
1.b Time attributed to start/end word
2. How fast does it process our content
3. How much does it cost
AssemblyAI was the only tool that used modern web patterns (i.e. not Google's horrible API, or other non-tech companies trying to provide transcription services), which made it easy to integrate in a short Sunday morning. The API is also surprisingly better than other speech-to-text services, because it's trained on the kind of audio/video content being produced today (instead of old call center data, or perfect audio from studio-grade media).
Google's API forced you to manage your asset hosting in GCP, handle tons of unnecessary configuration around auth/file access/identity, and it's insanely slow/inaccurate. Some other transcription services we used were embarrassingly bad from a developer experience perspective, in that they also required you to actually talk to a person before giving you access.
The reason Assembly is so great is that you can literally make an API request with a media file URL (video or audio), and boom - you get a nice, intuitive JSON-formatted transcript response. You can also add params to get speaker labels, topic analysis, and personal information detection - it's just a matter of changing the payload in the first API request.
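Roughly, the flow looks like this. Sketch only - the endpoint and field names follow AssemblyAI's public v2 docs as I remember them, so double-check their reference before relying on this:

```python
API_BASE = "https://api.assemblyai.com/v2"

def build_transcript_request(api_key, media_url, speaker_labels=False):
    # Assemble the one request that kicks off a transcript job.
    headers = {"authorization": api_key, "content-type": "application/json"}
    payload = {"audio_url": media_url}
    if speaker_labels:
        payload["speaker_labels"] = True  # extra features are just more params
    return f"{API_BASE}/transcript", headers, payload

url, headers, payload = build_transcript_request(
    "MY_KEY", "https://example.com/talk.mp4", speaker_labels=True)

# You'd then POST this with e.g. requests.post(url, json=payload, headers=headers),
# grab the returned job id, and poll GET {API_BASE}/transcript/{id}
# until its status is "completed".
```

Compare that to standing up GCP asset hosting and IAM just to submit one file.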
I'm very passionate about this because I spent so much time fighting previously implemented transcript services, and want to help anyone avoid the pain because Assembly really does it correctly.
Also, most of the models there are undertrained.
How much does it ultimately cost to train a model at this size, and is it feasible to do without VS funding (and cloud credits)?
In general - this is expensive stuff. Training big, accurate models just requires a lot of compute, and there is a "barrier to entry" wrt costs, even if you're able to get those costs down. I think it's similar to startups not really being able to get into the aerospace industry unless they raise lots of funding (e.g., Boom Supersonic).
Practically speaking though, for startups without funding, or access to cloud credits, my advice would be to just train the best model you can, with the compute resources you have available. Try to close your first customer with an "MVP" model. Even if your model is not good enough for most customers - you can close one, get some incremental revenue, and keep iterating.
When we first started (2017), I trained models that were ~1/10 the size of our current models on a few K80s in AWS. These models were much worse compared to our models today, but they helped us make incremental progress to get to where we are now.
https://pytorch.org/docs/stable/notes/cuda.html#tensorfloat-...
This is mostly important because these settings can significantly affect the price/perf evaluation for your specific model & the available hardware.
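For reference, the PyTorch knobs from the linked docs are global flags - worth pinning explicitly so your benchmarks aren't silently mixing precisions across hardware:

```python
import torch

# On Ampere GPUs (A100, RTX 30xx), TF32 trades a few mantissa bits for a big
# matmul/conv speedup. PyTorch controls it with two global flags:
torch.backends.cuda.matmul.allow_tf32 = True  # matmuls may use TF32
torch.backends.cudnn.allow_tf32 = True        # cuDNN convolutions may use TF32

# Set both to False to force full FP32 when you want apples-to-apples
# price/perf numbers between Ampere and older GPUs.
```

Note the defaults have changed across PyTorch versions, which is exactly why relying on them skews comparisons.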
I wonder what the state of the art is for horizontal scaling here... preferably on Kubernetes.
PyTorch is tricky to integrate (using TorchElastic). You could use Dask or Ray Distributed. TensorFlow has its own mechanism that doesn't play nice with Kubernetes.
How are others doing it ?
1. Purchase GPU machines from Lambda Labs. I went with machines with 256 GB of CPU RAM, 24-core AMD Threadrippers, 2 NVIDIA RTX 3090s, and 10gbps Ethernet. You might want to choose even more expensive GPUs.
2. Make sure your electrical circuits have sufficient capacity to run your GPU machines at peak power consumption. I gave each machine its own US residential electrical circuit. If you are storing your GPU servers in a data center, look into whether they can get you enough electrical power for Lambda Labs's 8-GPU machines. When talking with a data center's sales team, make sure they understand how much electrical power you need. They might charge you a lot of money if you ask for much more electrical power than they usually install in a cabinet. Try to negotiate with multiple data centers to see who can give you the best offer.
3. Purchase storage machines from 45Drives. I recommend buying their 30-drive machines and setting up a ZFS pool of 10 3-drive mirrors. Do not bother with raidz because your read and write speeds will be too slow, bottlenecking your ETL and training jobs.
4. Serve files from your storage machines to your GPU machines using NFS. I like to use MergerFS to merge mounts from different NFS servers. Alternatively, you might want to use Ceph, Min.io, or Lustre.
5. Buy Intel NUCs to run miscellaneous services--like monitoring--that you wouldn't want to colocate with your storage or GPU machines. They are small, cheap, and don't require a lot of electrical power. I bought a couple of NUCs with 64 GB of RAM and a 1 TB NVMe SSD each. Then I purchased external 10gbps Ethernet cards to plug into each NUC's 40gbps Thunderbolt 3 port.
6. Buy 10gbps network switches. MikroTik has affordable 4-port, 8-port, and 16-port 10gbps switches. These use SFP+ cages, so you may need to buy copper (RJ45) transceivers or DAC cables. I really like MikroTik's balance of quality and affordability, so I also buy network routers and other equipment from MikroTik.
7. If possible, try to train models small enough that each model only needs one machine to train. For this reason, maybe you will want to buy one 10-GPU machine instead of five 2-GPU machines. There are Amdahl's Law-style coordination costs to using multiple machines to train the same model. When I do large hyperparameter searches over many candidate models, I minimize these coordination costs and maximize throughput by limiting each model to only one machine. Of course, this is impossible if you are like AssemblyAI and need 48 V100s to train a model.
8. If you do need to train a single model using multiple machines, I've heard good things about Horovod, but I'm also excited about Ray.io--which offers user-friendly distributed training wrappers around TensorFlow MultiWorkerMirroredStrategy, PyTorch's DistributedDataParallel, or Horovod (which itself can train TensorFlow, PyTorch, or MXNet).
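To make step 3 concrete, the mirror layout looks something like this. Sketch only - the pool name and device names are made up, and in practice you'd use stable /dev/disk/by-id paths rather than sdX names:

```shell
# One pool of three-drive mirrors; repeat the `mirror` groups until all
# 30 drives are used, giving 10 vdevs ZFS stripes reads and writes across.
zpool create tank \
  mirror /dev/sda /dev/sdb /dev/sdc \
  mirror /dev/sdd /dev/sde /dev/sdf \
  mirror /dev/sdg /dev/sdh /dev/sdi
  # ...and so on for the remaining mirror groups
```

Raidz would give more usable capacity, but each raidz vdev delivers roughly the IOPS of a single drive; many small mirrors are what keep ETL and training reads fast.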
I'm sort of joking, but also not. Specifically, Bayesian inference takes forever, and there's no really good way to speed it up (GPUs don't work as well, because the sampling is sequential).
Saved you a click.