If you compare [ta] and [da], you find that the only difference is the time between when you release the consonant and when your vocal cords start vibrating (voice onset time, or VOT). In theory, VOT is a continuum, with any value being possible. However, in English it forms a trimodal distribution: /tʰ/, /t/, and /d/. The experimenters artificially edited a sound so that its VOT varied between /t/ and /d/, including VOTs between the two that do not occur in English. What they found is that people put all of the sounds into two boxes, and were unable to distinguish between sounds in the same box, even if their VOTs differed considerably.
However, when test subjects were played the same sounds but told they were listening to raindrops, this effect disappeared, and they were able to distinguish between sounds in the same box.
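
To make the two-box idea concrete, here is a rough toy model in Python. It is only a sketch, not the experiment's actual analysis: the 30 ms boundary, the 5 ms audibility threshold, and the function names are all made-up assumptions for illustration. A listener in "speech mode" only gets the box label, while a listener in "raindrop mode" gets the raw difference in VOT.

```python
# Toy model of the two listening conditions. All numbers are illustrative
# assumptions, not measurements from the experiment.
BOUNDARY_MS = 30.0  # hypothetical /d/-/t/ boundary on the VOT axis
JND_MS = 5.0        # hypothetical smallest VOT difference audible as raw sound


def label(vot_ms: float) -> str:
    """Put a stimulus in one of the two boxes based on its VOT."""
    return "d" if vot_ms < BOUNDARY_MS else "t"


def can_distinguish(vot_a: float, vot_b: float, speech_mode: bool) -> bool:
    """Return True if the toy listener hears the two stimuli as different."""
    if speech_mode:
        # Heard as speech: only the box label is available to the listener.
        return label(vot_a) != label(vot_b)
    # Heard as non-speech ("raindrops"): the raw difference is available.
    return abs(vot_a - vot_b) >= JND_MS


if __name__ == "__main__":
    # Three pairs with the same 20 ms spacing; only the middle one
    # straddles the assumed boundary.
    for a, b in [(0.0, 20.0), (20.0, 40.0), (40.0, 60.0)]:
        print(a, b, can_distinguish(a, b, True), can_distinguish(a, b, False))
```

In this toy version, pairs that are equally far apart in VOT are told apart in speech mode only when they fall on opposite sides of the assumed boundary, whereas in raindrop mode all three pairs are told apart.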