
bagels 2y ago

You have to choose the number of clusters before running k-means.

Imagine you have a dataset that you think contains meaningful clusters, but you don't know how many. This is especially hard to judge when the data is high-dimensional.

If you pick a k that is too small, you lump unrelated points together.

If k is too large, your meaningful clusters get fragmented and you end up overfitting.

To make up for this, there are algorithms that try to estimate the number of clusters, or that search for the k that best fits the data.
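
One way to sketch the "find the k with the best fit" idea is to run k-means for several candidate values and keep the one with the highest mean silhouette score. This is only an illustration, not a specific algorithm named above: the helper names (`init_centroids`, `kmeans`, `silhouette`, `best_k`) are made up for this example, and the deterministic farthest-first initialization is an assumption chosen to keep the output reproducible.

```python
import math

def init_centroids(points, k):
    # Assumed init: farthest-first traversal (deterministic, but outlier-sensitive).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    return centroids

def kmeans(points, k, iters=100):
    """Plain Lloyd's algorithm; returns (centroids, labels)."""
    centroids = init_centroids(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        new = []
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                new.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new.append(centroids[c])  # keep an empty cluster's centroid in place
        if new == centroids:
            break
        centroids = new
    return centroids, labels

def silhouette(points, labels):
    """Mean silhouette coefficient; closer to 1 means tight, well-separated clusters."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes a silhouette of 0
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for m, other in clusters.items() if m != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def best_k(points, k_values):
    """Score every candidate k and return (best k, all scores)."""
    scores = {k: silhouette(points, kmeans(points, k)[1]) for k in k_values}
    return max(scores, key=scores.get), scores
```

Calling `best_k(points, range(2, 10))` on a list of coordinate tuples returns the winning k and the score for each candidate. The silhouette computation is quadratic in the number of points, so on a large dataset you would likely score a sample instead.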


keenmaster 2y ago

Couldn’t you make some educated guesses and stop when you arrive at a k that gives you meaningful clusters that are neither too high-level nor too atomized?

yarg 2y ago

Probably not the best in terms of efficiency.

It's easier to deliberately overshoot (with a k that is too high) and then merge any clusters that overlap too much.
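
A rough sketch of the merge step, assuming you already have labels from a k-means run with a deliberately high k. "Too much overlap" is not defined above, so the rule here is my own assumption: two clusters are merged when their centroids are closer than `factor` times the sum of their spreads (RMS distance of members to the centroid). `merge_overlapping` and `factor` are hypothetical names.

```python
import math

def merge_overlapping(points, labels, factor=2.0):
    """Return new labels where overlapping clusters share one label."""
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    ids = sorted(groups)

    # Centroid and spread (RMS distance to the centroid) of each cluster.
    centroid = {i: tuple(sum(x) / len(groups[i]) for x in zip(*groups[i])) for i in ids}
    spread = {
        i: math.sqrt(sum(math.dist(p, centroid[i]) ** 2 for p in groups[i]) / len(groups[i]))
        for i in ids
    }

    # Union-find, so chains of overlapping clusters collapse into one group.
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in ids:
        for b in ids:
            if a < b and math.dist(centroid[a], centroid[b]) < factor * (spread[a] + spread[b]):
                parent[find(b)] = find(a)

    return [find(l) for l in labels]
```

This is a single pass: centroids and spreads are not recomputed after a merge, and the result depends heavily on `factor`, so treat the threshold as something to tune rather than a principled default.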
