- Embed all the text documents.
- Project to 2D using UMAP which also creates its own emergent "clusters".
- Use k-means clustering with a high cluster count depending on dataset size.
- Feed the ChatGPT API ~10 examples from each cluster and ask it to provide a concise label for the cluster.
- Bonus: Use DBSCAN to identify arbitrary subclusters within each cluster.
It is extremely effective, and I have a theoretical implementation of a more practical use case that uses the same UMAP dimensionality reduction for better inference. There is evidence that current popular text embedding models (e.g. OpenAI's ada, which outputs 1536D embeddings) are far larger than most use cases need, and as a result may give poorly specified embedding-similarity results, in addition to raising costs across the entire pipeline.
Significant speedups can be achieved by discretising dimensions into buckets and doing a simple frequency count of associated buckets, keeping only the most highly related buckets per document. These 'signatures' can then be indexed LSH-style and a graph constructed from documents with similar hashes.
When the input set is sufficiently large, this graph contains 'natural' clusters, with no UMAP or k-means parameter tuning required. When implemented in BigQuery, I achieve sub-minute performance for 5-10 million documents, from indexing to clustering.
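A rough numpy sketch of what such bucket signatures could look like. To be clear, this is my own guess at the idea, not the poster's implementation: the bucket count, signature length, and the "keep the most extreme dimensions" criterion are all arbitrary assumptions for illustration.

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(1)
emb = rng.normal(size=(500, 64))  # toy embeddings

# Discretise each dimension into quantile buckets
n_buckets = 8
edges = np.quantile(emb, np.linspace(0, 1, n_buckets + 1)[1:-1], axis=0)
bucket_ids = np.array(
    [np.searchsorted(edges[:, d], emb[:, d]) for d in range(emb.shape[1])]
).T  # shape (n_docs, n_dims), values in 0..n_buckets-1

# Keep only a few highly informative dimensions per document as its signature
# (here: the largest-magnitude dimensions, an assumption for illustration)
sig_len = 4
top_dims = np.argsort(-np.abs(emb), axis=1)[:, :sig_len]
signatures = [
    frozenset((d, bucket_ids[i, d]) for d in top_dims[i])
    for i in range(emb.shape[0])
]

# Index LSH-style: documents sharing a (dim, bucket) key become graph neighbours
index = defaultdict(set)
for i, sig in enumerate(signatures):
    for key in sig:
        index[key].add(i)
```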
For a given k:
    for n = 30, 100, or 300 trials:
        subsample 80% of the points
        cluster them
        compute the Fowlkes-Mallows score (available in sklearn) of the subset against the original clustering, restricted to the instances in the subset (otherwise you can't compute it)
    output the average F-M score
This essentially measures how "stable" the clusters are. The Fowlkes-Mallows score decreases when instances hop over to other clusters in the subset. If you do this and plot the average score versus k, you'll see a sharp dropoff at some point. That's the maximal plausible k.
edit: Here's code
```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import fowlkes_mallows_score

def stability(Z, k):
    kmeans = KMeans(n_clusters=k, n_init="auto")
    kmeans.fit(Z)
    scores = []
    for i in range(100):
        # Randomly select 80% of the data, without replacement
        idx = np.random.choice(Z.shape[0], int(Z.shape[0] * 0.8), replace=False)
        kmeans2 = KMeans(n_clusters=k, n_init="auto")
        kmeans2.fit(Z[idx])
        # Compare the two clusterings on the shared instances
        score = fowlkes_mallows_score(kmeans.labels_[idx], kmeans2.labels_)
        scores.append(score)
    scores = np.array(scores)
    return np.mean(scores), np.std(scores)
```
https://en.m.wikipedia.org/wiki/Silhouette_(clustering)
Another elegant method is the Calinski-Harabasz Index
https://en.m.wikipedia.org/wiki/Calinski%E2%80%93Harabasz_in...
These aren't visualized: I inspect the identified clusters manually to find trends.
With the pipeline mentioned, it's much easier to look at cluster density to identify patterns and high-level trends.
https://scikit-learn.org/stable/modules/generated/sklearn.cl...
I used the bge-large-en-v1.5 model (https://huggingface.co/BAAI/bge-large-en-v1.5) because I could, but the common all-MiniLM-L6-v2 model is sufficient. The trick is to batch generate the embeddings on a GPU, which SentenceTransformers mostly does by default.
Other libraries are the typical ones (umap for UMAP, scikit-learn for k-means/DBSCAN, chatgpt-python for ChatGPT interfacing, plotly for viz, pandas for some ETL). You don't need to use a bespoke AI/ML package for these workflows and they aren't too complicated.
This is why I think the typical embedding dimensionality is far too high.
Please let me know if you fork this library and update it to the latter versions of Spark.
1. https://www.youtube.com/watch?v=kH-hqG34ylA&t=4788s&ab_chann...
2. https://www.youtube.com/watch?v=K7hWqxC_7Mw&ab_channel=Tsodi...
Essentially K-Means is a way of “learning” categories or other kinds of groupings within an unlabeled dataset, without any fancy deep learning. It’s handy for its simplicity and speed.
The demo works with simple 2D coordinates for illustrative purposes but the technique works with any number of dimensions.
Note that there may be some things I got wrong with the implementation and that there are other variations of the algorithm surely, but it still captures the basic idea well enough for an intro.
K-means is not that complicated, and a naive implementation with e.g. Euclidean distance is a couple dozen lines of code, so it should be practical enough for an interview.
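For instance, a naive Lloyd's-algorithm implementation in numpy really is only a couple dozen lines (a sketch, without the refinements like k-means++ initialisation that real libraries use):

```python
import numpy as np

def kmeans(X, k, iters=100, seed=0):
    """Naive Lloyd's algorithm with Euclidean distance."""
    rng = np.random.default_rng(seed)
    # Initialise centroids as k distinct random data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centroid
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        new = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new, centroids):
            break  # converged
        centroids = new
    return centroids, labels

# Two obvious blobs -> two recovered clusters
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.1, (50, 2)), rng.normal(5, 0.1, (50, 2))])
centroids, labels = kmeans(X, k=2)
```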
Participation At Scale Can Repair The Public Square https://www.noemamag.com/participation-at-scale-can-repair-t...
Polis: Scaling deliberation by mapping high dimensional opinion spaces https://scholar.google.com/scholar?q=Polis:+Scaling+delibera...
Restricting clustering to 2-5 groups impacts group aware/informed consensus and comment routing https://github.com/compdemocracy/polis/issues/1289
That said, something like hdbscan doesn’t suffer from this problem.
(n.b. Celebi's; note the use of Lab rather than HSL, respect the Cartesian / polar nature of inputs / outputs, and ask for a high k. 128 is what we went with, but it's arbitrary; you can get away with as few as 32 if you're e.g. doing brand color from a favicon.)
For each pixel, instead of a single color value it generated k mean color values using an online algorithm. These were then combined to produce the final pixel color.
The idea was that a pixel might have several distinct contributions (e.g. from different light sources), but due to the random sampling used in path tracing, the variance of sample values is usually large.
The value k then was chosen based on scene complexity. There was also a memory trade-off of course, as memory usage was linear in k.
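"Online" here presumably means the centroids are updated one sample at a time, so the samples themselves never need to be stored. A hedged sketch of what such an update might look like, MacQueen-style (the renderer's actual algorithm was likely more sophisticated):

```python
import numpy as np

class OnlineKMeans:
    """MacQueen-style online k-means: update one sample at a time,
    so memory usage is linear in k and independent of sample count."""

    def __init__(self, k, dim):
        self.centroids = np.zeros((k, dim))
        self.counts = np.zeros(k, dtype=int)

    def update(self, x):
        if (self.counts == 0).any():
            # Seed any still-empty centroid with the incoming sample
            j = int(np.argmax(self.counts == 0))
        else:
            # Otherwise pick the nearest centroid
            j = int(np.linalg.norm(self.centroids - x, axis=1).argmin())
        self.counts[j] += 1
        # Move the winning centroid toward x by a shrinking step 1/n,
        # which keeps it equal to the running mean of its samples
        self.centroids[j] += (x - self.centroids[j]) / self.counts[j]
        return j

# Stream alternating samples from two "light sources"
ok = OnlineKMeans(k=2, dim=3)
red, blue = np.array([1.0, 0.0, 0.0]), np.array([0.0, 0.0, 1.0])
for i in range(100):
    ok.update(red if i % 2 == 0 else blue)
```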
What does online mean here?
If you want accuracy at an order of magnitude more compute, you can use something like DBSCAN.
Built as part of a larger carpet based localisation project [2]
1: https://nbviewer.org/github/tim-fan/carpet_color_classificat...
2: https://github.com/tim-fan/carpet_localisation/wiki/Carpet-L...
So my question is... could you elaborate?
Assume you collect some kindergartners and top NBA players into a room and collect their heights. Now say you pass these to two hapless grad students and ask them to perform K-means clustering.
Suppose one of the grad students knew the composition of the people you measured and can guess that these heights should clump into two nice clusters. The other student, who doesn't know the composition of the room - what should they guess K to be?
I understood the GP's comment to refer to the state of the second grad student. How useful is K-means clustering without knowing K in advance?
k-means assumes these hard parts of the problem are taken care of and offers a trivial solution to the rest. Thanks for the help, I'll cluster it myself.
Imagine that you have a dataset, where you think there are likely meaningful clusters, but you don't know how many, especially where it's many-dimensioned.
If you pick a k that is too small, you lump unrelated points together.
If k is too large, your meaningful clusters will be fragmented/overfitted.
There are some algorithms that try to estimate the number of clusters or try to find the k with the best fit to the data to make up for this.
It’s honestly fine for just finding key differences, like a principal component, for light storytelling. They don’t need to be distinct clusters.
https://alliance.seas.upenn.edu/~cis520/dynamic/2022/wiki/in...
The other is that in practice, you typically want to bring your true optimization objective as close as possible to what the algorithm is optimizing, and what k-means optimizes for is usually pretty far removed. Even small tweaks (let's say, augmenting the data with some sparse labels, or modifying the loss-function weight based on some aspect of the embedding values) are difficult to do with k-means.
IMO it is very good as a final clustering algorithm once you've already applied some more complex transformations on your data to linearise it and account for deeper knowledge of the problem. This might be a spectral space transformation (you care about connectedness), or an embedding (you care about whatever the network was trained on) or descriptor (you care about the algorithm's similarity).
But once you've applied the transform, you then have a minimalist fast scalable clustering that just does clustering and doesn't need to know anything more about the problem being solved. Very unix-y feeling.