...but I don't have the slightest clue what they mean. I've certainly dabbled in FFT, spectrogram and wavelet work, on top of a lot of IPA vowel work, but I'm missing the why behind the formulas given, and I'm missing how these plots are supposed to relate to frequencies visually.
A spectrogram of someone pronouncing vowels is extremely straightforward. Recognizing patterns of formants in spectrograms is quite simple.
So what is this trying to reveal that spectrograms don't? Besides that, what are the axes? Why are these circular or presumably polar? Why are they spiky? Why the particular blue/red bandpass filter? And what does autocorrelation have to do with vowels?
I'm not sure I've ever found myself so mystified by something I feel like I should have the background to understand quite easily.
If they're just supposed to be works of art then that's cool. But the title "visual morphology of vowels" suggests the plots are intended to reveal some kind of link between frequencies and the shape of the mouth, maybe? But the example images aren't even labeled by which vowel they represent, so I'm just baffled.
On these ACF images, consonant frequencies produce regular patterns that look good because of their regular structure. High and low frequencies map to different colors, which arrange themselves in a good-looking way; this effect is surprising to me. The interesting observation here is that the good-looking arrangements happen only for pleasing sounds. Different vowels, 29 in total, taken from Wikipedia's IPA table, produce different and distinct shapes; that's what I meant by "visual morphology".
The ACF data can be presented in any form; it's just data after all. But I'm not interested in just information: I want the image to convey the "harmonic nature" of sound, and polar coordinates happen to do this well.
There is a link to a demo there, and you can generate ACF images for any sounds you have; just make sure they are isolated 1-2 sec recordings. After looking at the images and listening to the sounds that correspond to them, you'll quickly notice patterns and will be able to guess the sound by looking at its image.
But they do! It’s entirely possible for even inexperienced phoneticians to reconstruct speech given only a spectrogram — and it isn’t even that hard to do so. I cannot make any firm statements about these ACF images, but given that they present no temporal information, I find it difficult to imagine this being possible with them.
And as for ‘conveying the nature of sound’, I invite you to consider e.g. [0] or [1]. It’s easy to see on the spectrogram that some sounds are noisy, some are resonant, some are strong, some are weak, and so on.
[0] https://home.cc.umanitoba.ca/~krussll/phonetics/acoustic/spe...
(They really are very pretty, though!)
How to Break Pink Trombone:
https://www.youtube.com/watch?v=djUxAqss4KY
Pink Trombone takes on "Take On Me"
It slightly improves the way ACF images are presented, but this small improvement makes a big difference. It works best on "small sounds" that last 1-2 sec, such as vowels or sample recordings of flute, violin and so on. The sound is analysed with an FFT over a sliding window of 1/4 sec that advances by 1/500 sec at a time until it covers the entire waveform. After computing the FFT spectrum for each frame, a basic bandpass filter is applied to separate high and low frequencies. The result is fed to the inverse FFT, thus computing the ACF, and presented in polar coordinates using a basic red-blue color scheme. The effect is that low frequencies appear red and high frequencies appear blue.
To my surprise, this basic method reveals a large variety of distinctive, yet visually appealing, shapes for vowel sounds.
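For anyone who wants to experiment, here is a rough NumPy sketch of the pipeline described above. It is not the demo's actual code. Several details are my own assumptions: the function name `band_acf_frames` is made up, the split between "low" and "high" is a single hypothetical 1000 Hz cutoff, each frame gets a Hann taper, and the ACF is taken as the inverse FFT of the power spectrum (the Wiener-Khinchin relation) with zero-padding so that it is linear rather than circular.

```python
import numpy as np


def band_acf_frames(signal, sample_rate, window_sec=0.25, hop_sec=1 / 500,
                    cutoff_hz=1000.0):
    """Return (low, high): per-frame autocorrelations of the two bands.

    cutoff_hz is a hypothetical split point, not a value from the post.
    """
    signal = np.asarray(signal, dtype=float)
    win = int(round(window_sec * sample_rate))        # samples per frame
    hop = max(1, int(round(hop_sec * sample_rate)))   # samples between frames
    if len(signal) < win:
        raise ValueError("signal is shorter than one analysis window")

    n_fft = 2 * win  # zero-pad so the ACF is linear, not circular
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)
    low_mask = freqs < cutoff_hz
    taper = np.hanning(win)  # assumption: the post does not name a window

    low, high = [], []
    for start in range(0, len(signal) - win + 1, hop):
        frame = signal[start:start + win] * taper
        power = np.abs(np.fft.rfft(frame, n=n_fft)) ** 2
        # Inverse FFT of the (band-limited) power spectrum gives the ACF.
        low.append(np.fft.irfft(power * low_mask, n=n_fft)[:win])
        high.append(np.fft.irfft(power * ~low_mask, n=n_fft)[:win])
    return np.array(low), np.array(high)
```

Calling `band_acf_frames(samples, 8000)` on a mono recording sampled at 8 kHz returns two arrays with one row per frame and one column per lag. For a steady tone, the low-band rows should peak again at a lag of about one period. How the frames are laid out in polar coordinates is not specified above, so the sketch stops at the two ACF arrays; coloring `low` red and `high` blue would be the natural next step.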
Excellent work though.