Deep Learning for Tabular Data: The Foundation Model Era

An update on the state of deep learning for tabular data in 2026, featuring practical model implementation.

Categories: deep learning, machine learning, boosting

Published: March 1, 2026

Introduction

It’s been nearly four years since I first summarized the state of deep learning (DL) for tabular data, and about three years since my follow-up post. Back then, the verdict was pretty clear: for most tabular data scenarios, especially those with heterogeneous features and even very large sample sizes, gradient boosting methods like XGBoost, LightGBM, and CatBoost were still the pragmatic and performant choice. Complex DL architectures like TabNet, and even transformer-based models, struggled to consistently outperform these boosting approaches, which meant they weren’t really viable options for most practitioners or production settings.

The question now is: has anything fundamentally changed?

Let’s see if the landscape has shifted, or if we’re still in “XGBoost is all you need” territory.

Quick Take (if you’re in a hurry)

Key takeaways:

  • Several new deep learning models show impressive performance compared to boosting, but context still matters.
  • It’s significantly easier to implement and experiment with modern tabular DL methods.
  • TabArena has created a more rigorous benchmarking environment with standardized datasets and evaluation protocols.
  • Foundation models for tabular data are emerging and have had some success, but it’s too early to say how transformative they’ll be in practice.
  • For most practitioners: you’d still do very well with boosting, especially for heterogeneous data and with limited time/resources. However, you should start to consider DL methods if predictive performance is critical and you have the resources to invest in experimentation.

Now, let’s dig into the details.

What’s Changed: Benchmarking and Accessibility

Better Evaluation Standards

One major improvement in tabular data modeling is the emergence of more rigorous benchmarking. TabArena (leaderboard) and TALENT provide standardized datasets, consistent evaluation metrics, and transparency in preprocessing—addressing the apples-to-oranges comparison problems from earlier papers. TabArena, which will be the primary reference here, has set a new standard for evaluating tabular DL models with:

  • A curated set of tabular datasets with standardized train/test splits
  • Consistent evaluation metrics across different model types
  • A public leaderboard tracking model performance
  • Transparency in preprocessing and hyperparameter choices
  • Focus on heterogeneous datasets that better reflect real-world tabular data scenarios
  • Use of Elo-style model ranking for easier model comparisons (a toy version of the mechanic is sketched below)
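To give a feel for the Elo idea: ratings update after each pairwise “match” between models, with upsets moving ratings more than expected wins. Below is a toy version with fabricated results; it is not TabArena’s actual implementation, just a sketch of the mechanic.

Show the code: Toy Elo-Style Ranking
import itertools
import numpy as np

def elo_update(r_a, r_b, score_a, k=32):
    # score_a is 1 if model A wins, 0 if it loses, 0.5 for a tie
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    return r_a + k * (score_a - expected_a), r_b - k * (score_a - expected_a)

rng = np.random.default_rng(42)
ratings = {m: 1000.0 for m in ['RealMLP', 'TabM', 'XGB', 'LGBM']}

# pretend each pair of models was compared on 50 datasets
for a, b in itertools.combinations(ratings, 2):
    for _ in range(50):
        outcome = rng.choice([0.0, 0.5, 1.0])  # fake per-dataset result
        ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], outcome)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))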

Current benchmarks address one of the major frustrations from my previous posts: papers often used different datasets, preprocessing pipelines, and hyperparameter tuning budgets, making fair comparisons nearly impossible. They also often focused primarily on very homogeneous data, which isn’t representative of typical real-world tabular data, and sometimes only on classification targets. This makes for a far better assessment, and gives us more confidence when comparing methods.

Current Leaderboard Landscape

As of February 2026, the TabArena leaderboard shows some interesting patterns:

Figure 1: TabArena Leaderboard

The key insight from these benchmarks is that some newer DL models are not only competitive with boosting, but a few typically win in head-to-head comparisons against boosting models. For example, the top 4 DL models win > 50% of the time against LGBM, XGB, and CatBoost.

Figure 2: Head to Head
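A head-to-head win rate like the one in Figure 2 is conceptually simple to compute from a table of per-dataset scores. A minimal sketch, with fabricated scores for illustration:

Show the code: Head-to-Head Win Rate
import pandas as pd

scores = pd.DataFrame({
    'RealMLP': [0.91, 0.84, 0.78, 0.88],
    'XGB':     [0.89, 0.85, 0.75, 0.86],
}, index=['dataset_a', 'dataset_b', 'dataset_c', 'dataset_d'])

# fraction of datasets where RealMLP scores higher (higher = better)
win_rate = (scores['RealMLP'] > scores['XGB']).mean()
print(f"RealMLP beats XGB on {win_rate:.0%} of datasets")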

Practical Implementation Tools

The even bigger story is accessibility. Modern tools make it trivial to try these methods, so let’s try it out for ourselves!

Making Tabular DL Practical

Using pytabkit

pytabkit provides implementations of state-of-the-art tabular DL models with:

  • Optimized hyperparameter defaults based on extensive benchmarking
  • Unified API similar to scikit-learn
  • Automated preprocessing for heterogeneous data
  • Support for both classification and regression

This is significant because one of the historical barriers to using tabular DL was implementation difficulty. While boosting modules were already easy to add to typical ML pipelines and often worked well out of the box, DL models often required custom code, careful tuning, and extensive experimentation to get good results.

Let’s see it in action with a simple example. We’ll generate a synthetic dataset with mixed numeric and categorical features, then train a RealMLP model and compare it to an XGBoost baseline.

Show the code: Create Classification Data
from pytabkit import RealMLP_TD_Classifier

from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score, roc_auc_score, log_loss, f1_score
from sklearn.model_selection import StratifiedKFold

import pandas as pd
import numpy as np
import string


# Generate a classification dataset
X, y = make_classification(
    n_samples=5000,
    n_features=15,
    n_informative=8,
    n_redundant=3,
    n_classes=2,
    random_state=42,
    # make it a more difficult problem
    flip_y=0.2,
    class_sep=0.1,
    n_clusters_per_class=2,
)


# Convert to DataFrame for easier handling
feature_names = [f"Feature_{i}" for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=feature_names)

# Add three categorical columns with 3, 5, and 12 categories, using letters as group names
X['Cat_3'] = pd.cut(X['Feature_0'], bins=3, labels=list(string.ascii_uppercase[:3])).astype('category')
X['Cat_5'] = pd.cut(X['Feature_1'], bins=5, labels=list(string.ascii_uppercase[:5])).astype('category')
X['Cat_12'] = pd.cut(X['Feature_2'], bins=12, labels=list(string.ascii_uppercase[:12])).astype('category')


# Use a single train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# CV splitter to create validation folds for pytabkit
cv_splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# collect the validation indices from each fold
val_idxs_list = [val_idxs for train_idxs, val_idxs in cv_splitter.split(X_train, y_train)]

# make sure that each validation set has the same length, so we can exploit vectorization
min_len = min([len(val_idxs) for val_idxs in val_idxs_list])
val_idxs_list = [val_idxs[:min_len] for val_idxs in val_idxs_list]
val_idxs = np.asarray(val_idxs_list)

print(f"Train size: {X_train.shape[0]}, Test size: {X_test.shape[0]}, Baseline rate: {np.mean(y):.2f}")

With data in hand, we can train a RealMLP model.

model_realmlp = RealMLP_TD_Classifier(
    device='mps',
    random_state=0,
    n_cv=5,
    val_metric_name='cross_entropy',
    verbosity=2,  # will show per epoch loss
    # example modifications
    # n_refit=1,
    # n_repeats=1,
    # n_epochs=256,
    # batch_size=256,
    # hidden_sizes=[256] * 3,
    # lr=0.04,
)

model_realmlp.fit(
    X_train,
    y_train,
    val_idxs=val_idxs,
    cat_col_names=['Cat_3', 'Cat_5', 'Cat_12']
)

On an old MacBook M1 this ran in around one minute with the default 256 epochs, and it doesn’t look like we needed that many (Figure 3). We definitely don’t need to sweat that kind of time on a dataset of this size regardless. Now let’s evaluate the model on the test set. Note that inference is near instant for this model.

Figure 3: RealMLP Training Log Loss
Show the code: Evaluate RealMLP
y_pred = model_realmlp.predict(X_test)
y_proba = model_realmlp.predict_proba(X_test)[:, 1]

metrics_realmlp = {
    "Accuracy": accuracy_score(y_test, y_pred),
    "ROC AUC": roc_auc_score(y_test, y_proba),
    "Log Loss": log_loss(y_test, y_proba),
    "F1 Score": f1_score(y_test, y_pred)
}

metrics_realmlp = pd.DataFrame(metrics_realmlp, index=["RealMLP"])
print(metrics_realmlp)
Model Accuracy ROC AUC Log Loss F1 Score
RealMLP 0.772 0.838 0.505 0.767

Now let’s compare to XGBoost. With pytabkit we can use defaults tuned across a variety of datasets (TD = tuned defaults), making it easy to get strong results out of the box. Model training with data of this size took just a couple of seconds.

from pytabkit import XGB_TD_Classifier

model_xgb = XGB_TD_Classifier(
    device='mps',
    random_state=0,
    n_cv=5
)

model_xgb.fit(
    X_train,
    y_train,
    val_idxs=val_idxs,
    cat_col_names=['Cat_3', 'Cat_5', 'Cat_12']
)
Show the code: Evaluate XGBoost
y_pred = model_xgb.predict(X_test)
y_proba = model_xgb.predict_proba(X_test)[:, 1]

metrics_xgb = {
    "Accuracy": accuracy_score(y_test, y_pred),
    "ROC AUC": roc_auc_score(y_test, y_proba),
    "Log Loss": log_loss(y_test, y_proba),
    "F1 Score": f1_score(y_test, y_pred)
}

metrics_xgb = pd.DataFrame(metrics_xgb, index=["XGB"])
print(pd.concat([metrics_xgb, metrics_realmlp]))
Model Accuracy ROC AUC Log Loss F1 Score
XGB 0.752 0.824 0.567 0.750
RealMLP 0.772 0.838 0.505 0.767

While boosting appears to do fine, in very little time we got not only competitive but even better results with a modern DL approach. Of course, this is just one (simulated) dataset that shows how easy it is, but the TabArena results suggest this generalizes beyond our example as well. There we see that RealMLP is one of the top performers across a variety of datasets, and beats XGBoost in head-to-head comparisons more than 75% of the time (Figure 2).

Even Easier: AutoGluon

AutoGluon is a popular AutoML library that has recently added support for some of the newer tabular DL models, including many of those in pytabkit. This means you can get competitive performance with minimal code and tuning for multiple models, whether boosting, DL, or other approaches. Here we use 3 DL models and 3 boosting models1.

from autogluon.tabular import TabularDataset, TabularPredictor

X_train_ag = TabularDataset(X_train.assign(target=y_train))
X_test_ag = TabularDataset(X_test)

model_ag = (
    TabularPredictor(label='target', eval_metric='log_loss')
    .fit(
        X_train_ag,
        num_bag_folds=5,
        # num_bag_sets=1,
        # num_stack_levels=0,
        # you can set these and/or let AutoGluon search 
        hyperparameters={
            'REALTABPFN-V2.5': {'device': 'mps'},
            'REALMLP': {},
            'TABM': {},
            'GBM': {},
            'XGB': {},
            'CAT': {},
        }
    )
)
Show the code: AutoGluon Predictions
y_pred = model_ag.predict(X_test_ag)
y_proba = model_ag.predict_proba(X_test_ag).loc[:, 1]

One very nice aspect of AutoGluon is its easily accessible ‘leaderboard’, so you can compare and contrast your models immediately. The best non-ensemble model was RealTabPFN-v2.5. The leaderboard also makes very clear just how much more time it took to get the best performance (fit times are in seconds), which is a lot. Whether it’s worth it will depend on your context.

ag_leaderboard = model_ag.leaderboard(X_test_ag.assign(target=y_test))
model score_test score_val eval_metric pred_time_test pred_time_val fit_time pred_time_test_marginal pred_time_val_marginal fit_time_marginal
WeightedEnsemble_L2 −0.858 −0.434 log_loss 128.786 450.092 451.637 0.002 0.001 0.057
RealTabPFN-v2.5_BAG_L1 −0.869 −0.434 log_loss 128.784 450.091 451.581 128.784 450.091 451.581
XGBoost_BAG_L1 −0.980 −0.505 log_loss 0.095 0.034 0.953 0.095 0.034 0.953
LightGBM_BAG_L1 −1.030 −0.498 log_loss 0.041 0.070 0.904 0.041 0.070 0.904
CatBoost_BAG_L1 −1.095 −0.487 log_loss 0.055 0.152 10.700 0.055 0.152 10.700
TabM_BAG_L1 −1.096 −0.488 log_loss 0.460 0.401 169.268 0.460 0.401 169.268
RealMLP_BAG_L1 −1.280 −0.483 log_loss 0.201 0.448 38.095 0.201 0.448 38.095


Now let’s compare all our approaches. Standard XGB with tuned defaults did pretty well, but our RealMLP model did better, and AutoGluon did even better still in terms of loss.

Model Accuracy ROC AUC Log Loss F1 Score
XGB 0.752 0.824 0.567 0.750
RealMLP 0.772 0.838 0.505 0.767
AutoGluon 0.798 0.865 0.464 0.793
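For reference, the combined table above can be assembled from pieces we already computed. This sketch assumes the y_pred/y_proba from the AutoGluon block are still in scope:

Show the code: Combine Metrics
# compute AutoGluon metrics and stack with the earlier results
metrics_ag = pd.DataFrame({
    "Accuracy": accuracy_score(y_test, y_pred),
    "ROC AUC": roc_auc_score(y_test, y_proba),
    "Log Loss": log_loss(y_test, y_proba),
    "F1 Score": f1_score(y_test, y_pred)
}, index=["AutoGluon"])

print(pd.concat([metrics_xgb, metrics_realmlp, metrics_ag]))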

What Affects Performance?

The performance of any model on tabular data can be influenced by several factors:

  • Dataset characteristics: The number of samples, features, and the degree of heterogeneity can impact which model performs best. DL models may excel with larger datasets and more complex feature interactions, while boosting methods may perform better with smaller datasets or more homogeneous features.
  • Hyperparameter tuning: DL models often require more careful tuning of hyperparameters (e.g., learning rate, architecture, batch size) compared to boosting methods, which can be more robust to default settings. However, tools like pytabkit and AutoGluon are making this easier by providing optimized defaults and automated tuning.
  • Computational resources: DL models typically require more computational power and longer training times to get that better performance, especially for larger datasets and more complex architectures. Boosting methods are generally faster to train and can be more efficient for smaller datasets or when resources are limited.
  • Ensembling: Combining multiple models can often improve performance, but the benefits may vary depending on the dataset and the models being combined. Some DL models may benefit more from ensembling than boosting methods, but this can also increase training time and complexity (a simple averaging sketch follows this list).
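On that last point, even a naive ensemble is easy to try with the models we already trained. A minimal soft-voting sketch that just averages predicted probabilities:

Show the code: Simple Probability Ensemble
# average the predicted probabilities of the two fitted models
proba_realmlp = model_realmlp.predict_proba(X_test)[:, 1]
proba_xgb = model_xgb.predict_proba(X_test)[:, 1]
proba_ens = (proba_realmlp + proba_xgb) / 2

print(f"Ensemble ROC AUC: {roc_auc_score(y_test, proba_ens):.3f}")
print(f"Ensemble log loss: {log_loss(y_test, proba_ens):.3f}")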

TabArena provides the following plot. The y-axis shows ‘improvability’: how much better the ‘best’ model is relative to the one in question. If your model did just as well as the best, improvability would be 0, as there would be no room for improvement. The x-axis is difficult to parse because it’s a combination of dataset characteristics like sample size and number of features, and the display choices make it hard to see the relationship between different models. TabPFN doesn’t gain much with increased training time, but that is partly because it does so well to begin with. Furthermore, we might see more improvement if that training time reflects more features or more samples, but it’s not clear from the plot. Note that each point represents a larger ensemble of different configurations of the model, from 1 parameter set on the left to 201 on the right, so the training time is the median time during that training process2.

Figure 4: Improvability vs. Train Time
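To make the y-axis concrete, here is the improvability measure as a couple of toy functions (the exact formulas are given in footnote 3):

Show the code: Improvability
def improvability(err_model, err_best):
    # percent reduction in error needed to match the best model
    return 100 * (err_model - err_best) / err_model

def improvability_auc(auc_model, auc_best):
    # for AUC, use the percentage-point gap instead
    return 100 * (auc_best - auc_model)

# e.g., our earlier XGB vs. AutoGluon log loss and AUC
print(improvability(0.567, 0.464))      # ~18% room for improvement
print(improvability_auc(0.824, 0.865))  # ~4 percentage points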

The takeaway from the plot is that deep learning has officially caught up to gradient-boosted trees on tabular data, but claiming that top spot requires a massive “training tax”—specifically, ensembling dozens of configurations to squeak past standard baselines. While models like RealMLP offer the absolute highest accuracy if you have unlimited resources, trees like CatBoost remain the practical choice, delivering stable, near-peak performance at a fraction of the compute and inference cost. Unless you are chasing the final fraction of a percent in accuracy at any price, traditional tree-based models are still the most efficient tool for the job.

If we actually break out the reported relationships in the TabArena paper, a couple of things become clearer. I looked at the top 4 DL models and their performance relative to the 3 standard boosting methods, and how sample size, number of features, and heterogeneity (based on the percentage of categorical features) relate to improvability3. I found that:

  • Increased sample size and feature count are correlated with better performance for boosting, but this relationship seems to hold mostly for categorical targets and/or the default parameters.
  • The percentage of categorical features seems a mixed bag for boosting and DL models, but may help boosting methods for binary and numeric targets. For DL models this seems to be more helpful for numeric targets, but not binary or multiclass.
  • For multiclass targets, increased feature set helps DL models.
  • Default settings benefit more from increased data, likely because the tuned/ensembled models are already doing much better in the same settings, so there’s less room for improvement.

While there may be some apparent trends, I think the overall takeaway here is that data nuances matter, and it is difficult to say ahead of time how a model will perform.

How long does it take?

When considering whether to use DL methods for tabular data, training and inference time are important factors. While DL models can offer performance improvements, they require more computational resources and longer training times compared to boosting methods. This is especially true for larger datasets and more complex architectures.

In our example, the RealMLP model took about a minute to train on a small dataset with an old M1 chip, while XGBoost took just a couple of seconds. The AutoGluon approach had several models to work through and then combine, but even then took less than 10 minutes. This sort of difference is manageable for many applications, especially if you’re looking for that extra bit of performance and have the resources to invest in tuning. AutoGluon makes it even easier to implement multiple models quickly, with ensemble results that will often be the best you’ll get.
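If you want to put numbers on the training-time gap for your own data, a quick sketch (timings will vary wildly by hardware and settings):

Show the code: Timing Model Fits
import time

for name, model in [("RealMLP", model_realmlp), ("XGB", model_xgb)]:
    start = time.perf_counter()
    model.fit(X_train, y_train, val_idxs=val_idxs, cat_col_names=['Cat_3', 'Cat_5', 'Cat_12'])
    print(f"{name}: {time.perf_counter() - start:.1f}s to fit")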

Of course, with larger data come more pronounced gaps. Some models like the TabPFN variants will not even be applicable beyond a certain point, and others will just not be viable for the data context. Some may already do quite well in smaller data settings (again, TabPFN), while others may generally do better with more data/ensembling.

For DL models and tabular data, the ease at which we can implement these models means it’s worth trying them out on your own data, especially if you’re already using boosting methods and want to see if you can eke out some extra performance.

What to Do Now

This is how I see things at this point; a toy code version of the framework follows the list below.

Start Here (Decision Framework)

  • Small data (<10k) and infrequent model runs: anything you want that works, including strong statistical models like generalized additive mixed models (GAMMs).
  • Small data and frequent runs: boosting/MLP-based models like RealMLP, TabM, TabICL.
  • Medium data (up to 250k) and infrequent runs: try DL models if resources permit.
  • Medium data and frequent runs: boosting/MLP-based models.
  • Large data (>250k): boosting is your primary go-to. If you really want to do DL, consider TabICL-v2.
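And the same framework as a toy function, just to make the thresholds explicit (a heuristic, not a rule):

Show the code: Decision Framework Sketch
def suggest_model(n_samples: int, frequent_runs: bool) -> str:
    if n_samples < 10_000:
        return ('boosting or fast MLP-style models (e.g., RealMLP, TabM)'
                if frequent_runs else 'anything that works, including GAMMs')
    if n_samples <= 250_000:
        return ('boosting or fast MLP-style models'
                if frequent_runs else 'try DL models if resources permit')
    return 'boosting first; consider TabICL-v2 if you really want DL'

print(suggest_model(5_000, frequent_runs=True))
print(suggest_model(1_000_000, frequent_runs=False))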

Other considerations will include multiclass outcomes, number of features, non-standard loss functions, computational resources available at inference, and more. But right now it seems you can get the best performance on small to moderate data with DL models; the cost of that performance is more time and resources, and the benefits are less clear for larger data.

Looking Ahead: Foundation Models for Tabular Data

The Promise

Recent work has explored pre-training models on large collections of tabular datasets and/or synthetic data, then fine-tuning for specific tasks. This is similar to the paradigm that revolutionized NLP and vision. But do they work?

The idea is that just like in other domains, we can leverage large amounts of tabular data to learn generalizable representations that can be fine-tuned for specific tasks. This could potentially allow us to:

  • Achieve better performance with less task-specific data
  • Handle heterogeneous features more effectively
  • Enable zero-shot or few-shot learning on new tabular tasks

Key models include TabPFN and variants, Mitra (Amazon), TabICL, and others.
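As a taste of the workflow, here is a minimal sketch assuming the tabpfn package (pip install tabpfn). There is no task-specific training loop: the pretrained transformer does in-context learning on the data passed to fit. Depending on the version, categorical columns may need to be encoded first.

Show the code: TabPFN Sketch
from tabpfn import TabPFNClassifier

clf = TabPFNClassifier()   # downloads pretrained weights on first use
clf.fit(X_train, y_train)  # 'fit' mostly just stores the context data
proba_tabpfn = clf.predict_proba(X_test)[:, 1]
print(f"TabPFN ROC AUC: {roc_auc_score(y_test, proba_tabpfn):.3f}")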

The Reality

Results are hit-or-miss. Some, such as TabPFN, obviously do well if we look at the leaderboard… if the data is not large. Others like TabICL are restricted to classification4. Note that TabArena data maxes out at 150k rows, which is considered ‘medium’ size. Even then, training and inference time may be prohibitive in many cases. Right now you could use MLP approaches like TabM and RealMLP and get faster yet comparable results. But more to the point, it is very common in the tabular domain to see much larger datasets, and boosting models still work fine even with millions of rows, and can utilize GPUs as well (a sketch follows).
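For scale, plain XGBoost remains a strong default, and GPU acceleration is a one-line change. A sketch using the standard xgboost API (XGBoost >= 2.0 syntax; assumes an NVIDIA GPU is available):

Show the code: GPU-Accelerated XGBoost
import xgboost as xgb

model_big = xgb.XGBClassifier(
    tree_method='hist',
    device='cuda',            # GPU training; use 'cpu' otherwise
    enable_categorical=True,  # handles our category-dtype columns directly
    n_estimators=500,
)
model_big.fit(X_train, y_train)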

What I’d Still Like to See

Reflecting on my 2022 wishlist, here’s what progress has been made and what is still missing:

Progress made:

  • ✅ Better implementation tools
  • ✅ Standardized benchmarking
  • ✅ More rigorous/realistic comparisons

Would like more:

  • ❌ Clearer guidance on when complexity pays off
  • ❌ Fair comparisons with better statistical methods than linear/knn regression (e.g., GAMM)

I think the biggest remaining gap is a breakdown of specific data characteristics and how they relate to model performance. Efficiency is certainly worth focusing on, but practitioners need to know ‘is my data too large for this model?’ or ‘do I have enough features for this model to be worth it?’5. I can appreciate trying to tie it all together in one plot, but a clearer delineation of these relationships would be very helpful for practitioners trying to decide which model to use for their specific problem.

Conclusion

Four years ago I concluded that DL wasn’t ready for most tabular tasks. Today, the tabular DL landscape has matured considerably. We have better benchmarks, better tools, and a little bit better understanding of where these methods excel. Some DL models now regularly outperform boosting on realistic benchmarks and are ready for use, but still not always necessary. In general:

XGBoost and friends are still the pragmatic starting point for most tabular data tasks, especially with heterogeneous features and modest sample sizes. They’re fast, well-understood, and more easily tunable than ever.

DL methods for tabular data have become practically competitive. However, as they often require more resources to implement effectively, it will be important to consider the specific context of your problem, including dataset size, feature characteristics, and computational constraints. For smaller datasets or when chasing that last bit of performance, DL models like TabICL and RealMLP can be a great choice.

The “all you need” framing from paper titles misses the point. The real question is: “What do you need for your specific problem?” In 2026, we have better tools to explore that question, but the answer will often still point to classical ML methods for tabular data work.

That said, the rapid progress in this space suggests checking back in another year or so might reveal even more interesting developments. Foundation models are still early in development, and the community is actively exploring how to make them more efficient and effective. In the meantime, the best approach is to experiment with your own data, use benchmarks like TabArena for inspiration, and don’t be afraid to start simple and iterate from there.

References

Erickson, Nick, Lennart Purucker, Andrej Tschalzev, David Holzmüller, Prateek Mutalik Desai, David Salinas, and Frank Hutter. 2025. “TabArena: A Living Benchmark for Machine Learning on Tabular Data.” https://arxiv.org/abs/2506.16791.
Gorishniy, Yury, Akim Kotelnikov, and Artem Babenko. 2025. “TabM: Advancing Tabular Deep Learning with Parameter-Efficient Ensembling.” https://arxiv.org/abs/2410.24210.
Grinsztajn, Léo, Klemens Flöge, Oscar Key, Felix Birkel, Brendan Roof, Phil Jund, Benjamin Jäger, et al. 2025. “TabPFN-2.5: Advancing the State of the Art in Tabular Foundation Models.”
Holzmüller, David, Léo Grinsztajn, and Ingo Steinwart. 2025. “Better by Default: Strong Pre-Tuned MLPs and Boosted Trees on Tabular Data.” https://arxiv.org/abs/2407.04491.
Qu, Jingang, David Holzmüller, Gaël Varoquaux, and Marine Le Morvan. 2026. “TabICLv2: A Better, Faster, Scalable, and Open Tabular Foundation Model.” https://arxiv.org/abs/2602.11139.

Footnotes

  1. Amazon provides its own tabular foundation model Mitra, but it wouldn’t run due to memory issues on my old machine.↩︎

  2. So many issues with this visualization. If you don’t understand it, don’t worry about it, it conflates many different things to the point of being minimally useful at best. This is even acknowledged in the paper.↩︎

  3. The TabArena paper reports RMSE for numeric outcomes, log loss for multiclass, and AUC for binary. Their conception of improvability is 100 * (err - best) / err, where err is the error of the model in question and best is the error of the best model for that dataset/setting/metric. For AUC, which is already on a 0-1 scale, I used the percentage point difference: 100 * (best_auc - model_auc), which is more interpretable than a ratio of proportions. This has no bearing on the correlation in the end.↩︎

  4. I’m not sure what the reasoning for focusing only on classification was, but TabICL-v2, released a few days before I started this post, is now applicable to regression tasks. The authors claim it beats everything.↩︎

  5. I think some of the data is going far overboard with the number of features. I’ve consulted across dozens of disciplines and the most I’ve ever had for a tabular model was a little over 100, which was mostly due to the client not having the domain knowledge to do feature selection (IIRC, the actual useful number was < 10).↩︎

Citation

BibTeX citation:
@online{clark2026,
  author = {Clark, Michael},
  title = {Deep {Learning} for {Tabular} {Data:} {The} {Foundation}
    {Model} {Era}},
  date = {2026-03-01},
  url = {https://m-clark.github.io/posts/2026-03-01-dl-for-tabular-foundational/},
  langid = {en}
}
For attribution, please cite this work as:
Clark, Michael. 2026. “Deep Learning for Tabular Data: The Foundation Model Era.” March 1, 2026. https://m-clark.github.io/posts/2026-03-01-dl-for-tabular-foundational/.