It's just an easier form of speech recognition, and speech recognition was already plenty advanced in 2002.
(Here is Shazam in a chip from 1988:
https://www.youtube.com/watch?v=kFth9K_IvwA
Now imagine you have an order of magnitude better signal fidelity and 10e8 times the storage and processing power.)