1: The feature extraction ends with mean-summarizing across the entire audio clip, leaving no temporal information. This only works well for simple tasks. At least mentioning analysis windows and temporal modelling would be good, as the natural next step, be it an LSTM/GRU on the MFCCs or a CNN on the mel-spectrogram.
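To make the point concrete, here is a minimal NumPy sketch of what mean-summarizing does to an MFCC matrix (the shapes and variable names are illustrative, not taken from the article's code):

```python
import numpy as np

# Hypothetical MFCC matrix: 20 coefficients x 100 frames.
rng = np.random.default_rng(0)
mfcc = rng.normal(size=(20, 100))

# Mean-summarizing across time: one 20-dim vector per clip.
# All ordering information along the frame axis is discarded.
clip_features = mfcc.mean(axis=1)
print(clip_features.shape)  # (20,)

# A temporal model would instead consume the full sequence of
# frames, e.g. as (timesteps, features) input to an LSTM/GRU.
sequence = mfcc.T
print(sequence.shape)  # (100, 20)
```

Shuffling the frames of `mfcc` along axis 1 leaves `clip_features` unchanged, which is exactly why this representation cannot capture temporal structure.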
2: The folds of the Urbansound8k dataset are not respected in the evaluation. In Urbansound8k, different folds contain clips extracted from the same original audio files, usually very close in time. So mixing the folds for the test set means it is no longer entirely "unseen data". The model very likely exploits this data leakage, as the reported accuracy is above SOTA (for no data augmentation), which is unreasonable given the low-fidelity feature representation. At least mentioning this limitation, and that the reported number cannot be compared with other methods, would be prudent.
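The fold-respecting evaluation is simple to do: iterate over the ten predefined folds, holding out one fold at a time, so clips from the same source recording never appear on both sides of the split. A sketch with placeholder arrays (the feature/label/fold data here is randomly generated for illustration; in practice the fold id comes from the dataset's metadata CSV):

```python
import numpy as np

# Placeholder data: 50 clips, 20-dim features, 10 classes,
# fold ids 1..10 as in the Urbansound8k metadata.
rng = np.random.default_rng(0)
n = 50
features = rng.normal(size=(n, 20))
labels = rng.integers(0, 10, size=n)
folds = rng.integers(1, 11, size=n)

for test_fold in range(1, 11):
    test_mask = folds == test_fold
    train_mask = ~test_mask
    # Train on features[train_mask], evaluate on features[test_mask].
    # No clip is ever in both sets, so related clips from one
    # source file stay on the same side of the split.
    assert not np.any(train_mask & test_mask)
```

Averaging the per-fold scores gives the number that is actually comparable with the published Urbansound8k results.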
When I commented similarly on r/machinelearning the authors acknowledged these weaknesses, but did not update the article to reflect them.
This is a weird field: these are not difficult problems to solve, yet as far as I can tell, all of the popular choices available so far suck in their own unique way, and no option that I know of actually offers both convenience and high performance. FOSS options barely exist either, and they also suck.
For the things where Comet.ml would be too onerous to deal with, I still use pen and paper.
Thanks!
If you don't have the transcript, you'd use a transcription service that also gives you timestamps. E.g. there was a frontpage submission yesterday where someone used AWS Transcription to count the number of words in each minute of a talk: https://news.ycombinator.com/item?id=21635939
But if you want to do this on the audio, you chop up your audio stream into fixed-length (in time) analysis windows. The length of the window should be a bit longer than the sound of interest (the word). The windows normally overlap: with 90% overlap, the next window is created by moving forward by 10% of the window length. This gives the model multiple "shots" at detecting the word as it passes by. This is suitable for spotting a word and giving its time with something like 50 ms resolution.
For each analysis window you apply feature pre-processing and a model such as the one shown in the article.
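The chopping step above can be sketched in a few lines of NumPy (the sample rate, window length, and hop are illustrative numbers, not anything prescribed by the article):

```python
import numpy as np

def frame_signal(x, window_length, hop_length):
    """Split a 1-D signal into overlapping analysis windows.
    hop_length = window_length // 10 gives 90% overlap."""
    n_frames = 1 + (len(x) - window_length) // hop_length
    return np.stack([x[i * hop_length : i * hop_length + window_length]
                     for i in range(n_frames)])

# Illustrative setup: 16 kHz audio, 1 s windows, 100 ms hop.
sr = 16000
x = np.zeros(sr * 5)  # 5 seconds of placeholder audio
windows = frame_signal(x, window_length=sr, hop_length=sr // 10)
print(windows.shape)  # (41, 16000)
```

Each row of `windows` then goes through the feature pre-processing and classifier independently; the row index times the hop length gives the detection timestamp.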
This task sounds like what is called Keyword Spotting in the academic literature, which can be seen as a specific version of Audio Event Detection applied to spoken words.