See also psychoacoustics. The ear doesn't just do frequency decomposition. It's not clear that it even does frequency decomposition. What actually happens is a lot of perceptual modelling and relative amplitude masking, which makes it possible to do real-time source separation.
Which is why we can hear individual instruments in a mix.
And this ability to separate sources can be trained, just as pitch perception can be trained, with results varying from increased acuity up to full perfect pitch.
A component near the bottom of all that is range-based perception of consonance and dissonance, based on the relationships between beat frequencies and fundamentals.
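To make the beat-frequency part concrete, here's a toy sketch. It assumes idealised tones with exact integer harmonics, and the function names are made up for illustration. It only computes beat rates between partials; it doesn't model how those beats are actually perceived.

```python
def beat_frequency(f1_hz, f2_hz):
    """Beat rate in Hz when two pure tones sound together: |f1 - f2|."""
    return abs(f1_hz - f2_hz)


def slowest_partial_beat(fund_a_hz, fund_b_hz, n_partials=6):
    """Smallest beat rate between any harmonic of A and any harmonic of B.

    Assumption: each note is a stack of exact integer harmonics,
    which real instruments only approximate.
    """
    return min(
        beat_frequency(i * fund_a_hz, j * fund_b_hz)
        for i in range(1, n_partials + 1)
        for j in range(1, n_partials + 1)
    )


a4 = 440.0
just_fifth = a4 * 3 / 2                # 3:2 ratio
tempered_fifth = a4 * 2 ** (7 / 12)    # equal-tempered fifth

print(slowest_partial_beat(a4, just_fifth))      # partials coincide exactly
print(slowest_partial_beat(a4, tempered_fifth))  # slow beating between near-coincident partials
```

With a just fifth the third harmonic of the lower note lands exactly on the second harmonic of the upper one, so there's no beating. The equal-tempered fifth is slightly flat, so the same pair of partials beats slowly.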
Instead of doing anything like a vanilla Fourier transform, the ear divides frequencies into multiple critical bands (q.v.) with different properties and effects.
What's interesting is that the critical bands seem to be dynamic, so they can be tuned to some extent depending on what's being heard.
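For a rough feel for how wide a critical band is, here's a sketch using the Glasberg and Moore equivalent rectangular bandwidth (ERB) approximation, which is one common stand-in for critical bandwidth. It's a static curve fit, so it says nothing about the dynamic behaviour described above, and the `same_band` helper is a hypothetical simplification rather than an established test.

```python
def erb_hz(center_hz):
    """Approximate auditory filter bandwidth (Hz) at a given centre frequency.

    Glasberg & Moore approximation: ERB = 24.7 * (4.37 * f_kHz + 1).
    """
    return 24.7 * (4.37 * center_hz / 1000.0 + 1.0)


def same_band(f1_hz, f2_hz):
    """Crude check: are two tones closer together than one ERB?

    Assumption: the band is centred on the midpoint of the two tones.
    """
    center = (f1_hz + f2_hz) / 2.0
    return abs(f1_hz - f2_hz) < erb_hz(center)


for f in (100, 1000, 4000):
    print(f, round(erb_hz(f), 1))      # bandwidth grows with frequency

print(same_band(1000, 1050))   # tones this close interact strongly
print(same_band(1000, 1500))   # tones this far apart are resolved separately
```

The point is that resolution isn't uniform across the spectrum the way FFT bins are: the bands are narrow at low frequencies and much wider higher up.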
Most audio theory has a vanilla EE take on all of this, with concepts like SNR, dynamic range, and frequency resolution.
But the experience of audio is hugely more complex. The brain-ear system is an intelligent system which actively classifies, models, and predicts sounds, speech, and music as they're being heard, at various perceptual levels, all in real time.