Microsoft's curation techniques for the Phi models remain proprietary. So we can't really criticize or praise their methods, because we don't know what they are. It might be GPT-4. It might be Artificial Artificial Intelligence (a warehouse in Pakistan). But the results speak for themselves.
The models are a bit janky in my testing (especially prone to leaking test materials, and highly specialized on a narrow domain), but fantastic for their size.
Intentional "under-generalization" seems like a fairly self-evident approach to making optimal (and economical, on the training side) use of smaller models.
As for whether it works for a general-purpose model, my intuition says that it does (i.e. cutting off the "long tail of knowledge" in favour of better handling of the mainstream, given the limited neurons available).
As for whether that tech exists, I reckon a simple tf-idf filter would get you 80% of those wins, but that might be ignorance/arrogance on my part.
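To make the tf-idf idea concrete, here's a toy sketch (pure Python, my own scoring heuristic — nothing to do with whatever Microsoft actually does): score each document by its mean tf-idf, so documents dominated by rare "long tail" vocabulary score high, then keep only the low-scoring mainstream slice of the corpus.

```python
import math
from collections import Counter

def tfidf_doc_scores(docs):
    """Mean tf-idf per document: high = lots of corpus-rare terms
    ("long tail"), low = mostly mainstream vocabulary.
    Assumes non-empty, whitespace-tokenizable docs (a toy tokenizer)."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    df = Counter()                      # document frequency per term
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        total = len(toks)
        # sum tf * idf over unique terms, then average
        score = sum((cnt / total) * math.log(n / df[t]) for t, cnt in tf.items())
        scores.append(score / len(tf))
    return scores

def keep_mainstream(docs, keep_fraction=0.8):
    """Drop the most long-tail-heavy (1 - keep_fraction) of the corpus."""
    scores = tfidf_doc_scores(docs)
    cutoff = sorted(scores)[int(len(docs) * keep_fraction) - 1]
    return [d for d, s in zip(docs, scores) if s <= cutoff]
```

A real pipeline would obviously want a proper tokenizer and smoothed idf (scikit-learn's TfidfVectorizer handles both), but even this crude version separates "everyday" documents from jargon-heavy outliers.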