Supposedly there are thousands of different features being scored; those are just the rolled-up categories that each needed their own separate ML pipeline step.
For example, a feature might be "this site has a favicon.ico that is unique and not used elsewhere" (page quality). Or "this page has ads, but they are below the fold" (page layout). Or "this site has > X inbound links from a hand-curated list of 'legitimate branded sites'" (page/site authority).
Google then picks a starting weight for each of these, and has human reviewers score the quality of the results, order of ranking, etc., based on a Google-written how-to-score document. It then tweaks the weights, re-runs the ML pipeline, and has the humans score again, in an iterative loop until the results seem good.
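To make the idea concrete, the weighted-feature scoring described above could look something like this toy sketch. Everything here is hypothetical for illustration: the feature names, weight values, and linear combination are made up, not Google's actual pipeline (which is far more complex and not public).

```python
# Toy sketch of weighted feature scoring. All names and numbers are
# invented for illustration; this is not Google's real system.

def score_page(features, weights):
    """Combine per-feature scores (each 0..1) into one ranking score."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical starting weights, later tweaked based on human-rater feedback.
weights = {
    "unique_favicon": 0.1,   # page quality signal
    "ads_below_fold": 0.3,   # page layout signal
    "branded_inlinks": 0.6,  # page/site authority signal
}

page = {"unique_favicon": 1.0, "ads_below_fold": 1.0, "branded_inlinks": 0.5}
print(score_page(page, weights))  # 0.1 + 0.3 + 0.3 = 0.7
```

The iterative loop in the comment would then be: run scoring over the index, have raters judge the ordering, nudge the weights, and repeat.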
There's a never-acted-on FTC report[1] describing how Google used this system to rank its competition (comparison shopping sites) lower in the search results.
[1] http://graphics.wsj.com/google-ftc-report/
Edit: Note that a lot of detail is missing here, like topic relevance: a site may rank well for some niche category it specializes in, but it wouldn't necessarily rank well for a completely different topic, even with good content, since it has no established signals suggesting it should.