1: The feature extraction ends with mean-summarizing across the entire audio clip, leaving no temporal information. This only works well for simple tasks. At least mentioning analysis windows and temporal modelling would be good, as the natural next step, be it an LSTM/GRU on the MFCCs or a CNN on the mel-spectrogram.
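To make the point concrete, here is a minimal NumPy sketch of what mean-summarizing does to an MFCC matrix (the shapes and variable names are illustrative, not taken from the article's code):

```python
import numpy as np

# Hypothetical MFCC matrix: 20 coefficients x 100 frames.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(20, 100))

# Mean-summarizing across time: one 20-dim vector per clip.
# All ordering information along the frame axis is discarded.
clip_features = mfcc.mean(axis=1)
print(clip_features.shape)  # (20,)

# A temporal model would instead consume the full sequence of
# frames, e.g. as (timesteps, features) input to an LSTM/GRU.
sequence = mfcc.T
print(sequence.shape)  # (100, 20)
```

Shuffling the frames of `mfcc` along axis 1 leaves `clip_features` unchanged, which is exactly why this representation cannot capture temporal structure.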
2: The folds of the Urbansound8k dataset are not respected in the evaluation. In Urbansound8k, different folds contain clips extracted from the same original audio files, usually very close in time. So mixing the folds for the test set means it is no longer entirely "unseen data". The model very likely exploits this data leakage, as the reported accuracy is above SOTA (for no data augmentation), which is unreasonable given the low-fidelity feature representation. At least mentioning this limitation, and that the reported number cannot be compared with other methods, would be prudent.
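The fold-respecting evaluation is simple to do: iterate over the ten predefined folds, holding out one fold at a time, so clips from the same source recording never appear on both sides of the split. A sketch with placeholder arrays (the feature/label/fold data here is randomly generated for illustration; in practice the fold id comes from the dataset's metadata CSV):

```python
import numpy as np

# Placeholder data: 50 clips, 20-dim features, 10 classes,
# fold ids 1..10 as in the Urbansound8k metadata.
rng = np.random.default_rng(0)
n = 50
features = rng.normal(size=(n, 20))
labels = rng.integers(0, 10, size=n)
folds = rng.integers(1, 11, size=n)

for test_fold in range(1, 11):
    test_mask = folds == test_fold
    train_mask = ~test_mask
    # Train on features[train_mask], evaluate on features[test_mask].
    # No clip is ever in both sets, so related clips from one
    # source file stay on the same side of the split.
    assert not np.any(train_mask & test_mask)
```

Averaging the per-fold scores gives the number that is actually comparable with the published Urbansound8k results.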
When I commented similarly on r/machinelearning the authors acknowledged these weaknesses, but did not update the article to reflect them.
This is a weird field: these are not difficult problems to solve, yet as far as I can tell, all of the popular choices available so far suck in their own unique way, and no option that I know of actually offers both convenience and high performance. FOSS options barely exist either, and they also suck.
For the things where Comet.ml would be too onerous to deal with, I still use pen and paper.
Thanks!
If you don't have the transcript, you'd use a transcription service that also gives you timestamps. E.g. there was a frontpage submission yesterday where someone used AWS Transcription to count the number of words in each minute of a talk: https://news.ycombinator.com/item?id=21635939
But if you want to do this on the audio, you chop up your audio stream into fixed-length (in time) analysis windows. The length of the window should be a bit longer than the sound of interest (the word). The windows normally overlap: with 90% overlap, the next window is created by moving forward by 10% of the window length. This gives the model multiple "shots" at detecting the word as it passes by. This is suitable for spotting a word and giving its time with something like 50 ms resolution.
For each analysis window you apply feature pre-processing and a model such as the one shown in the article.
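The chopping step above can be sketched in a few lines of NumPy (the sample rate, window length, and hop are illustrative numbers, not anything prescribed by the article):

```python
import numpy as np

def frame_signal(x, window_length, hop_length):
    """Split a 1-D signal into overlapping analysis windows.
    hop_length = window_length // 10 gives 90% overlap."""
    n_frames = 1 + (len(x) - window_length) // hop_length
    return np.stack([x[i * hop_length : i * hop_length + window_length]
                     for i in range(n_frames)])

# Illustrative setup: 16 kHz audio, 1 s windows, 100 ms hop.
sr = 16000
x = np.zeros(sr * 5)  # 5 seconds of placeholder audio
windows = frame_signal(x, window_length=sr, hop_length=sr // 10)
print(windows.shape)  # (41, 16000)
```

Each row of `windows` then goes through the feature pre-processing and classifier independently; the row index times the hop length gives the detection timestamp.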
This task sounds like what is called Keyword Spotting in the academic literature, which can be seen as a specific version of Audio Event Detection applied to spoken words.