undefined | Better HN

0 pointstimClicks7mo ago0 comments

It's been a while since I've played in the area, but is PCA still the go to method for dimensionality reduction?

0 comments

wenc7mo ago

PCA (essentially SVD) the one that makes the fewest assumptions. It still works really well if your data is (locally) linear and more or less Gaussian. PLS is the regression version of PCA.

There are also nonlinear techniques. I’ve used UMAP and it’s excellent (particularly if your data approximately lies on a manifold).

https://umap-learn.readthedocs.io/en/latest/

The most general purpose deep learning dimensionality reduction technique is of course the autoencoder (easy to code in PyTorch). Unlike the above, it makes very few assumptions, but this also means you need a ton more data to train it.

ChadNauseam7mo ago

> PCA (essentially SVD) the one that makes the fewest assumptions

Do you mean it makes the *strongest* assumptions? "your data is (locally) linear and more or less Gaussian" seems like a fairly strong assumption. Sorry for the newb question as I'm not very familiar with this space.

wenc7mo ago

You’re correct in a mathematical sense: linearity and Gaussian are restrictive assumptions.

However I meant it colloquially in that those assumptions are trivially satisfied by many generating processes in the physical and engineering world, and there aren’t a whole lot of other requirements that need to be met.

Lerc7mo ago

From 2019 but a decent overview by Leland McInnes https://www.youtube.com/watch?v=9iol3Lk6kyU

There's a newer thing called PacMap which is an interesting thing that handles difference cases better. Not as robustly tested as UMAP but that could be said of any new thing. I'm a little wary that it might be overfitted to common test cases. To my mind it feels like PacMap seems like a partial solution of a better way of doing it.

The three stage process of PacMap is either asking to be developed into either a continuous system or finding a analytical reason/way to conduct a phase change.

wenc7mo ago

Leland McInnes is amazing. He's also the author of UMAP.

baq7mo ago

PCA is nice if you know relationships are linear. You also want to be aware of TSNE and UMAP.

wenc7mo ago

A lot of relationships are (locally) linear so this isn’t as restrictive as it might seem. Many real-life productionized applications are based on it. Like linear regression, it has its place.

T-SNE is good for visualization and for seeing class separation, but in my experience, I haven’t found it to work for me for dimensionality reduction per se (maybe I’m missing something). For me, it’s more of a visualization tool.

On that note, there’s a new algorithm that improves on T-SNE called PaCMAP which preserves local and global structures better. https://github.com/YingfanWang/PaCMAP

a_bonobo7mo ago

There's also Bonsai, it's parameter-free and supposedly 'better' than t-SNE, but it's clearly aimed at visualisation purposes (except that in Bonsai trees, distances between nodes are 'real' which is usually not the case in t-SNE)

https://www.biorxiv.org/content/10.1101/2025.05.08.652944v1....

energy1237mo ago

I'd add that PCA/OLS is linear in the functional form (linear combination), but the input variables can be non-linear (X_new := X_{old,1}*X_{old,2}^2), so if the non-linearities are simple, then basic feature engineering to strip out the non-linearities before fitting PCA/OLS may be acceptable.

j / k navigate · click thread line to collapse

0 comments

wenc7mo ago

PCA (essentially SVD) the one that makes the fewest assumptions. It still works really well if your data is (locally) linear and more or less Gaussian. PLS is the regression version of PCA.

There are also nonlinear techniques. I’ve used UMAP and it’s excellent (particularly if your data approximately lies on a manifold).

https://umap-learn.readthedocs.io/en/latest/

ChadNauseam7mo ago

> PCA (essentially SVD) the one that makes the fewest assumptions

wenc7mo ago

You’re correct in a mathematical sense: linearity and Gaussian are restrictive assumptions.

Lerc7mo ago

From 2019 but a decent overview by Leland McInnes https://www.youtube.com/watch?v=9iol3Lk6kyU

The three stage process of PacMap is either asking to be developed into either a continuous system or finding a analytical reason/way to conduct a phase change.

wenc7mo ago

Leland McInnes is amazing. He's also the author of UMAP.

baq7mo ago

PCA is nice if you know relationships are linear. You also want to be aware of TSNE and UMAP.

wenc7mo ago

A lot of relationships are (locally) linear so this isn’t as restrictive as it might seem. Many real-life productionized applications are based on it. Like linear regression, it has its place.

On that note, there’s a new algorithm that improves on T-SNE called PaCMAP which preserves local and global structures better. https://github.com/YingfanWang/PaCMAP

a_bonobo7mo ago

https://www.biorxiv.org/content/10.1101/2025.05.08.652944v1....

energy1237mo ago

j / k navigate · click thread line to collapse