This is definitely not all you need

deep learning, machine learning

A summary of findings regarding deep learning for tabular data.

Michael Clark https://m-clark.github.io
2021-07-19

Motivation

I’ve been a little perplexed by the lack of attention deep learning (DL) has paid to what I consider ‘default’ data in my world, often referred to as tabular data: typically a two-dimensional input of observations (rows) and features (columns), where the features vary in type, scale, and source. Despite the ubiquity of such data in data science generally, and despite momentous advances in areas like computer vision and natural language processing, it’s not very clear at this point what the status of DL for tabular data is.

There have been developments in the area recently though, with some modeling approaches, such as TabNet, gaining traction. In June of 2021 I came across three papers on arXiv making related claims about the efficacy of DL for tabular data. As in many academic and practical (and life) pursuits, the results of these studies are nuanced, so I thought I’d help myself and others by summarizing them here.

Goal

I want to know whether, when time and/or resources are limited, it’s worth diving into a DL model if I already have a satisfactory, simpler/easier one ready to implement that does pretty well. Perhaps the answer is simply ‘if it ain’t broke, don’t fix it’, but given the advancements in other data domains, it’s worth assessing the current state of DL with tabular data.

Caveats

Quick Take

In case you don’t want any detail, here’s a quick summary based on my impressions from these articles. Right now, if you want to use DL on tabular data, don’t make a fuss of it: a simple architecture, even a standard multilayer perceptron, will likely do as well as more complicated ones. In general, though, the amount of effort put into prep and tuning may not be worth it for many typical tabular data settings, for example relative to a suitably flexible statistical model (e.g. a GAMM) or a default fast boosting implementation like XGBoost. However, DL tools are already built with ‘big data’ in mind, so for very large data a DL model might be a great choice, as other approaches may not be computationally viable. It also would not be surprising if some big hurdle is overcome in the near future, as we saw with DL applications in other fields, in which case some form of DL may indeed be ‘all you need’.
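For a sense of what ‘not making a fuss’ looks like on the boosting side, here is a minimal sketch of the kind of near-default fit I have in mind, using the xgboost R package. The names X, y, and X_test are hypothetical placeholders (a numeric feature matrix, a 0/1 label vector, and a test matrix); this is only an illustration, not any of the papers’ setups.

```r
library(xgboost)

# Hypothetical inputs: X is a numeric feature matrix, y a 0/1 label vector.
# Near-default settings; the point is that this takes essentially no effort.
fit <- xgboost(
  data      = as.matrix(X),
  label     = y,
  nrounds   = 200,
  objective = "binary:logistic",
  verbose   = 0
)

# Predicted probabilities on a hypothetical test matrix
pred <- predict(fit, as.matrix(X_test))
```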

Now, on to the rest!

Tabular Data: Deep Learning is Not All You Need

Paper Info

From the Abstract

We analyze the deep models proposed in four recent papers across eleven datasets, nine of which were used in these papers, to answer these questions. We show that in most cases, each model performs best on the datasets used in its respective paper but significantly worse on other datasets. Moreover, our study shows that XGBoost (Chen and Guestrin, 2016) usually outperforms the deep models on these datasets. Furthermore, we demonstrate that the hyperparameter search process was much shorter for XGBoost.

Overview

For each model, they used the datasets from the original papers (e.g. the datasets used in the TabNet article), along with the original authors’ suggested parameter settings. They then tested every model against its own paper’s data, the other papers’ data, and two additional datasets that were not used in any of the original papers.

Data

They use eleven datasets in total. Nine are those used in the original papers on TabNet, DNF-Net, and NODE (three from each paper), and Shwartz-Ziv & Armon add two Kaggle datasets not used in any of those papers. Sample sizes range from 7K to 1M and from 10 to 2000 features; two datasets have numeric targets, while the rest have categorical targets with 2-7 classes. The datasets are described in detail in the paper, along with links to the (publicly available) sources.

Models Explored

Brief summaries of the DL models are found in the paper.

Quick Summary

Not counting the ensemble methods, XGBoost generally performed best. Counting the ensemble methods, a simple ensemble of all the models came out on top. Among the individual models, XGBoost was notably better on ‘unseen’ data (datasets not used in any of the original papers) and was comparable to the best-performing ensemble. From the paper:

The ensemble of all the models was the best model with 2.32% average relative increase, XGBoost was the second best with 3.4%, 1D-CNN had 7.5%, TabNet had 10.5%, DNF-Net had 11.8% and NODE had 14.2% (see Tables 2 and 3 in the appendix for full results).

As a side note, XGBoost + DL combined was best, but that defeats the purpose in my opinion. Presumably almost any approach can be made better with enough added complexity, but unless there is an obvious way to add it, this is mostly an academic exercise. However, as the authors note, if the search is automated, maybe the complexity of combining the models is less of an issue.
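For reference, the basic idea of combining models can be as simple as averaging predicted probabilities across separately fitted models; the paper’s ensembles are more involved than this, so treat the sketch below as purely illustrative. The objects fit_xgb, fit_mlp, fit_tabnet, and X_test are hypothetical.

```r
# Hypothetical predicted probabilities on the same test set from separately fitted models
p_xgb    <- predict(fit_xgb, X_test)
p_mlp    <- predict(fit_mlp, X_test)
p_tabnet <- predict(fit_tabnet, X_test)

# A plain average of predictions; the paper's ensembles are more elaborate,
# but this conveys the basic idea of combining models.
p_ensemble <- (p_xgb + p_mlp + p_tabnet) / 3
```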

Other Stuff

Kudos

The authors cite the No Free Lunch theorem in the second paragraph, something that appears to be lost on many (most?) of these types of papers touting small increases in performance for some given modeling approach.

Issues

There are always things like training process/settings that are difficult to fully replicate. By the time authors publish any paper, unless exact records are kept, the iterations (including discussions that rule out various paths) are largely lost to time. This isn’t a knock on this paper, just something to keep in mind.

Opinion

I liked this one in general. They start by giving the competing models their best chance, using those models’ own settings and data, processed and trained in the same way. Even then, those models either didn’t perform best or performed relatively poorly on the other datasets.

Regularization is all you Need: Simple Neural Nets can Excel on Tabular Data

Paper Info

From the Abstract

Tabular datasets are the last “unconquered castle” for deep learning… In this paper, we hypothesize that the key to boosting the performance of neural networks lies in rethinking the joint and simultaneous application of a large set of modern regularization techniques. As a result, we propose regularizing plain Multilayer Perceptron (MLP) networks by searching for the optimal combination/cocktail of 13 regularization techniques for each dataset using a joint optimization over the decision on which regularizers to apply and their subsidiary hyperparameters.

We empirically assess the impact of these regularization cocktails for MLPs on a large-scale empirical study comprising 40 tabular datasets and demonstrate that (i) well-regularized plain MLPs significantly outperform recent state-of-the-art specialized neural network architectures, and (ii) they even outperform strong traditional ML methods, such as XGBoost.

We emphasize that some of these publications claim to outperform Gradient Boosted Decision Trees (GDBT) [1, 37], and other papers explicitly stress that their neural networks do not outperform GBDT on tabular datasets [38, 22]. In contrast, we do not propose a new kind of neural architecture, but a novel paradigm for learning a combination of regularization methods.

Overview

This paper is more about exploring regularization techniques (e.g. data augmentation, model averaging via dropout) than about suggesting any particular model is superior. Even in the second paragraph (quoted above), they acknowledge that prior neural network results do not consistently show a performance gain over boosting methods. Their focus is on potentially improving DL for tabular data through regularization, with two hypotheses: that a well-regularized plain MLP can outperform recent specialized DL architectures, and that it can even outperform GBDT methods like XGBoost.
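To make the ‘cocktail’ idea concrete, here is a minimal sketch of a plain MLP with a handful of common regularizers (L2 weight decay, dropout, early stopping) using the keras R package. The paper jointly searches over 13 such techniques and their hyperparameters per dataset, so this is only illustrative, and X_train / y_train are hypothetical (a numeric matrix and a 0/1 label).

```r
library(keras)

# A plain MLP with a few of the regularization techniques the paper 'cocktails' over:
# L2 weight decay, dropout, and early stopping. Hypothetical X_train / y_train.
model <- keras_model_sequential() %>%
  layer_dense(units = 128, activation = "relu", input_shape = ncol(X_train),
              kernel_regularizer = regularizer_l2(1e-4)) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 128, activation = "relu",
              kernel_regularizer = regularizer_l2(1e-4)) %>%
  layer_dropout(rate = 0.2) %>%
  layer_dense(units = 1, activation = "sigmoid")

model %>% compile(optimizer = "adam", loss = "binary_crossentropy", metrics = "accuracy")

model %>% fit(
  X_train, y_train,
  epochs = 200,
  validation_split = 0.2,
  callbacks = list(callback_early_stopping(patience = 20, restore_best_weights = TRUE)),
  verbose = 0
)
```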

Data

Forty datasets in total, ranging from as few as ~400 observations to over 400K, and from 4 to 2000 features. All targets were categorical, about half of them binary. All are available at openml.org, with the relevant IDs provided.

Models Explored

Comparison models included XGBoost and recent specialized DL architectures such as TabNet.

Quick Summary

Other Stuff

Kudos
Issues
Opinion

If we ignore the regularization focus and just look at the model comparisons, I’m not overly convinced we have a straightforward victory for cocktail vs. GB as implied in the conclusion. Results appear to be in favor of their proposed method, but not enough to be a near-guarantee in a particular setting, so we’re back to square one of just using the easier/faster/better tool. I’m also not sure who was questioning the use of regularization for neural networks or modeling in general, so the comparison to any model without some form of regularization isn’t as interesting to me. What is interesting to me is that we have another round of evidence that the fancier DL models like TabNet do not perform that well relative to GB or simpler DL architectures.

Revisiting Deep Learning Models for Tabular Data

Paper Info

From the Abstract

The necessity of deep learning for tabular data is still an unanswered question addressed by a large number of research efforts. The recent literature on tabular DL proposes several deep architectures reported to be superior to traditional “shallow” models like Gradient Boosted Decision Trees. However, since existing works often use different benchmarks and tuning protocols, it is unclear if the proposed models universally outperform GBDT. Moreover, the models are often not compared to each other, therefore, it is challenging to identify the best deep model for practitioners.

In this work, we start from a thorough review of the main families of DL models recently developed for tabular data. We carefully tune and evaluate them on a wide range of datasets and reveal two significant findings. First, we show that the choice between GBDT and DL models highly depends on data and there is still no universally superior solution. Second, we demonstrate that a simple ResNet-like architecture is a surprisingly effective baseline, which outperforms most of the sophisticated models from the DL literature. Finally, we design a simple adaptation of the Transformer architecture for tabular data that becomes a new strong DL baseline and reduces the gap between GBDT and DL models on datasets where GBDT dominates.

Overview

This paper compares different models on a variety of datasets. The authors are interested in the GB vs. DL debate but, like the previous paper, are also interested in how well a simpler DL architecture might perform and what steps might help the more complicated ones do better.
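For intuition on what a ‘ResNet-like architecture’ means for tabular inputs, here is a rough sketch (not the paper’s exact specification) using the keras R package: each block passes its input through dense layers and adds the input back via a skip connection. The name n_features is a hypothetical placeholder for the number of input columns.

```r
library(keras)

# One residual block: dense -> relu -> dropout -> dense, added back to the block input.
res_block <- function(x, units = 128, dropout = 0.1) {
  h <- x %>%
    layer_dense(units = units, activation = "relu") %>%
    layer_dropout(rate = dropout) %>%
    layer_dense(units = units)
  layer_add(list(x, h))  # the residual (skip) connection
}

inputs  <- layer_input(shape = n_features)       # hypothetical input dimension
x       <- inputs %>% layer_dense(units = 128)   # project to the block width
x       <- x %>% res_block() %>% res_block()
outputs <- x %>% layer_dense(units = 1, activation = "sigmoid")

model <- keras_model(inputs, outputs)
```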

Data

They have 11 datasets with a mix of binary, multiclass and numeric targets. Sizes range from 20K to 1M+. There appears to be some overlap with the first paper (e.g. Higgs, Cover type).

Models Explored

‘Baselines’: GBDT implementations such as XGBoost.
DL Comparisons: a plain MLP, the ResNet-like architecture, their Transformer adaptation, and several recent specialized models (e.g. NODE, TabNet).

In addition, they look at ensembles of these models, but this is not of interest to me for this post.

Quick Summary

Note that these refer to the ‘single model’ results, not the results for ensembles.

Other Stuff

Kudos
Issues
Opinion

These results seem consistent with the others at this point. Complex DL isn’t helping, and simpler architectures, even a standard MLP, show good performance. In the end, we still don’t have a clear winner over GB methods.

Overall Assessment

Taken together, these papers help paint a picture of where we are at present with deep learning for tabular data, especially data of mixed types. In this setting, more complicated DL models do not show any obvious gain over simpler architectures, which themselves do not consistently beat boosting methods. It may also be the case that for data of mixed types/sources, boosting is still the standard to beat.

Even though these articles are geared toward comparisons with GB/XGBoost, in several settings where I’ve applied boosting methods I have not necessarily had appreciably greater success than with a default-settings random forest (e.g. from the ranger package in R) or a sufficiently flexible statistical model. Unfortunately this comparison is missing from the papers, and it would have been nice to have, especially for smaller data settings where such models are still very viable. I think a viable, fast model, preferably one requiring no tuning (or simply taken off the shelf), should be the baseline.

In that light, for tabular data I think one should start with a baseline of penalized regression with appropriate interactions (e.g. ridge/lasso), or a more flexible penalized approach (a GAMM), the latter especially, as it can automatically incorporate nonlinear relationships, and tools like mgcv or gpboost in R can do so with very large data (1 million+ observations) in a matter of seconds. In relatively higher-dimensional settings, interactions and nonlinearities should be prevalent enough that basis function, tree, and DL models should be superior; whether they are practically so is the key concern even there. With smaller, noisier, lower-dimensional data, I suspect the tuning/time effort required by present-day DL models for tabular data will not be worth it. This may change very soon, however, so the assumption should be regularly checked.
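As a rough illustration of how little specification such baselines need, here is a sketch using glmnet, mgcv, and ranger. The data frame d, its features x1, x2, the factor g, and the binary factor outcome y are all hypothetical; this is a sketch of the idea, not a recommended pipeline.

```r
library(glmnet)
library(mgcv)
library(ranger)

# Hypothetical data: a data frame `d` with numeric features x1, x2, a grouping factor g,
# and a two-level factor outcome y.

# Penalized regression with two-way interactions (lasso by default; alpha = 0 gives ridge)
X <- model.matrix(y ~ .^2, data = d)[, -1]
fit_lasso <- cv.glmnet(X, d$y, family = "binomial")

# Flexible penalized (GAMM-style) baseline; bam() handles very large n quickly
fit_gam <- bam(y ~ s(x1) + s(x2) + s(g, bs = "re"),
               data = d, family = binomial, discrete = TRUE)

# Default-settings random forest, no tuning
fit_rf <- ranger(y ~ ., data = d, probability = TRUE)
```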



last updated: 2022-05-02

Neural Net image source from UC Business Analytics R Programming Guide

References

Gorishniy, Yuri, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. 2021. “Revisiting Deep Learning Models for Tabular Data.” arXiv preprint arXiv:2106.11959.
Kadra, Arlind, Marius Lindauer, Frank Hutter, and Josif Grabocka. 2021. “Regularization Is All You Need: Simple Neural Nets Can Excel on Tabular Data.” arXiv preprint arXiv:2106.11189.
Shwartz-Ziv, Ravid, and Amitai Armon. 2021. “Tabular Data: Deep Learning Is Not All You Need.” arXiv preprint arXiv:2106.03253.

Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com//m-clark/m-clark.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Clark (2021, July 19). Michael Clark: This is definitely not all you need. Retrieved from https://m-clark.github.io/posts/2021-07-15-dl-for-tabular/

BibTeX citation

@misc{clark2021this,
  author = {Clark, Michael},
  title = {Michael Clark: This is definitely not all you need},
  url = {https://m-clark.github.io/posts/2021-07-15-dl-for-tabular/},
  year = {2021}
}