> This may revolutionize data science: we introduce TabPFN, a new tabular data classification method that takes 1 second & yields SOTA performance (better than hyperparameter-optimized gradient boosting in 1h). Current limits: up to 1k data points, 100 features, 10 classes. 1/6
[Faster and more accurate than gradient boosting for tabular data: Catboost, LightGBM, XGBoost]
Many thanks also for open-sourcing your work and making the Colab notebook; I've been playing around with it a bit.
Edit: spelling
My main observation, just looking at your example pictures, is that its closest competitor is Gaussian Processes, which I've long been a fan of.
Just looking at those pictures, GP and TabPFN look very similar where there is data, but TabPFN is happier to extrapolate while the GP stays localised around the data (look at the top row, for example).
I can't decide whether that's a feature or a bug. I guess it's good to have the choice: either show that you're uncertain in regions where you've never seen data, or extrapolate from what you have seen.
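The "localised around the data" behaviour of GPs can be seen directly in the posterior variance: it shrinks near observed points and reverts to the prior variance far from them. A minimal 1-D GP regression demo in plain NumPy (all names here are mine, not from the paper):

```python
import numpy as np

def rbf(a, b, lengthscale=1.0):
    """Squared-exponential kernel matrix k(a, b) for 1-D input vectors."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(X_train, y_train, X_test, noise=1e-6):
    """Posterior mean and variance of a zero-mean GP at X_test."""
    K = rbf(X_train, X_train) + noise * np.eye(len(X_train))
    K_s = rbf(X_train, X_test)
    K_ss = rbf(X_test, X_test)
    alpha = np.linalg.solve(K, y_train)
    mean = K_s.T @ alpha
    v = np.linalg.solve(K, K_s)
    var = np.diag(K_ss - K_s.T @ v)
    return mean, var

X = np.array([-1.0, 0.0, 1.0])   # training inputs
y = np.sin(X)                    # training targets
X_new = np.array([0.5, 10.0])    # one point near the data, one far away

mean, var = gp_posterior(X, y, X_new)
# Near the data the variance is small; far away it reverts to ~1
# (the prior variance of the RBF kernel): the GP admits it knows nothing.
print(var)
```

So the GP's reluctance to extrapolate is built in by the kernel: far from the data, predictions fall back to the prior with maximal uncertainty, whereas a method that extrapolates confidently gives up that explicit "I haven't seen this region" signal.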
Would "tabular classification" usually refer to, say, extracting tabular data from a picture into text?
I tried googling and looking through the site, but it wasn't obvious to me what this actually does.