"When these tones arrive together at the microphone’s power amplifier, they are amplified as expected, but also multiplied due to fundamental non-linearities in the system"
"In practice, however, acoustic amplifiers maintain strong linearity only in the audible frequency range; outside this range, the response exhibits non-linearity."
That suggests to me that the nonlinear mixing isn't occurring in the MEMS structure, but rather in the amplification stage. Perhaps the authors' language is imprecise?
They do say immediately after the last bit:
"The diaphragm also exhibits similar behavior [non-linearity]."
Is the diaphragm's nonlinearity alone sufficient to produce the effect?
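For what it's worth, here is a minimal numerical sketch of the mixing itself. It assumes a memoryless response of the form y = A1*x + A2*x^2, and the sample rate, tone frequencies, and gains are made-up values for illustration, not figures from the paper. The point is only that a quadratic term anywhere in the chain, whether in the diaphragm or the amplifier, produces a component at the difference frequency, so the math by itself doesn't tell us which stage is responsible.

```python
import math

FS = 192_000             # sample rate in Hz (assumed)
F1, F2 = 40_000, 42_000  # two ultrasonic tones in Hz (assumed)
N = 1_920                # 10 ms of samples; every tone completes whole cycles

def tone_amplitude(signal, freq, fs=FS):
    """Amplitude of the component at `freq`, via a single-bin DFT."""
    re = sum(s * math.cos(2 * math.pi * freq * n / fs) for n, s in enumerate(signal))
    im = sum(s * math.sin(2 * math.pi * freq * n / fs) for n, s in enumerate(signal))
    return 2 * math.hypot(re, im) / len(signal)

# Input: the sum of two unit-amplitude ultrasonic tones.
x = [math.cos(2 * math.pi * F1 * n / FS) + math.cos(2 * math.pi * F2 * n / FS)
     for n in range(N)]

A1, A2 = 1.0, 0.1  # hypothetical linear and quadratic gains

linear = [A1 * s for s in x]                   # ideal linear stage
nonlinear = [A1 * s + A2 * s * s for s in x]   # stage with a quadratic term

diff = F2 - F1  # 2 kHz, well inside the audible band
amp_linear = tone_amplitude(linear, diff)
amp_nonlinear = tone_amplitude(nonlinear, diff)

print(f"{diff} Hz component, linear stage:    {amp_linear:.6f}")
print(f"{diff} Hz component, nonlinear stage: {amp_nonlinear:.6f}")
```

Expanding the square gives a cross term 2*cos(w1 t)*cos(w2 t) = cos((w1 - w2) t) + cos((w1 + w2) t), so the 2 kHz component comes out with amplitude A2 in the nonlinear case and essentially zero in the linear one.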