Deep Learning for Tabular Data

Tags: deep learning, machine learning

A continuing exploration

Michael Clark https://m-clark.github.io
2022-05-01

Introduction

In a previous post, I offered a summary of several articles that came out over the summer of 2021 regarding the application of deep learning (DL) methods to tabular data. DL has shown astounding success in natural language processing, computer vision, and other fields, but when it comes to the sorts of data common in other situations, especially where data is usually smaller and of mixed source and type (e.g. demographic, social science, or biological data), results have mostly been unimpressive for complex DL architectures. In particular, it did not appear that DL methods could consistently compete with, much less consistently beat, common machine learning (ML) approaches such as gradient boosting (e.g. XGBoost). Here I provide a bit of an update, as another few articles have come along to continue the fight.

TLDR: the meta-analysis

I collected most of the results from the articles summarized here and those covered in the previous post to see whether we can come to any general conclusions about which methods are best, or which work best in certain settings. In the following tables, I excluded datasets I knew to be image data, as well as datasets where I thought results were indistinguishable across all models tested (e.g. less than 1% difference in accuracy). This left comparisons for 92 datasets across six articles. However, it's important to note that these were not independent datasets or studies. For example, Gorishniy et al. are the source of two papers with essentially the same testing situations, and other datasets were common across papers (e.g. Higgs Boson). In the rare situations where there was a tie, I gave the nod to boosting methods because a. the whole point is to do better than those, b. they are the easier model to implement, and c. they are not always given the same advantages in these studies (e.g. pre-processing).
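
For the curious, the tallying logic amounts to something like the following. This is a hypothetical sketch with made-up data and column names, not the actual code I used, but it shows the exclusion and tie-breaking rules in action.

```python
import pandas as pd

# Hypothetical long-format results: one row per (dataset, model_type) pair.
results = pd.DataFrame({
    'dataset':    ['a', 'a', 'a', 'b', 'b', 'b'],
    'model_type': ['Boost', 'MLP', 'DL_complex'] * 2,
    'accuracy':   [.91, .90, .89, .80, .80, .72],
})

def pick_winner(df, tol=.01):
    """Return the winning model type for one dataset.

    Datasets where all models land within `tol` of each other are dropped,
    and ties go to Boost, mirroring the rules described in the text.
    """
    if df['accuracy'].max() - df['accuracy'].min() < tol:
        return None  # indistinguishable results, excluded
    best = df[df['accuracy'] == df['accuracy'].max()]
    if 'Boost' in best['model_type'].values:
        return 'Boost'  # ties go to boosting
    return best['model_type'].iloc[0]

winners = results.groupby('dataset').apply(pick_winner).dropna()
print(winners.value_counts())
```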

Feature Type

The following shows results by feature type.

The results suggest that current DL approaches' strength is mostly with purely numeric data, while for heterogeneous data a simpler MLP or boosting will generally prevail. I initially thought that boosting would do even better with heterogeneous data, and I still suspect that with more heterogeneous data, and on more equal footing, results would tilt even further in its favor.

| winner_model_type | All Cat | All Num | Heterogeneous | Min. Combo |
|-------------------|---------|---------|---------------|------------|
| Boost             |       2 |      10 |            14 |          6 |
| MLP               |       2 |       4 |             9 |         11 |
| DL_complex        |       0 |      22 |             7 |          5 |

Collapsing the two mixed-feature columns into a single 'Any Combo' count:

| winner_model_type | All Cat | All Num | Any Combo |
|-------------------|---------|---------|-----------|
| Boost             |       2 |      10 |        20 |
| MLP               |       2 |       4 |        20 |
| DL_complex        |       0 |      22 |        12 |

Sample/Feature Set Size

The following suggests that complex DL methods are going to require a lot of data to perform better. This isn't that surprising, but the difference here is quite dramatic. Interestingly, MLP methods worked well with fewer features. N total in this case means the total size reported (not just the training set).

| winner_model_type | N features | N total |
|-------------------|------------|---------|
| Boost             |        209 | 133,309 |
| DL_complex        |        207 | 530,976 |
| MLP               |        114 | 114,164 |

Target Type

In the following we compare binary (bin), multiclass (mc), and numeric (num) target results¹, but there's no strong conclusion to draw here. The main thing to glean is that these papers do not test numeric targets nearly enough. Across the dozens of disciplines and countless datasets that I've come across in various settings, if anything, this ratio should be reversed.

| winner_model_type | bin | mc | num |
|-------------------|-----|----|-----|
| Boost             |  17 | 10 |   5 |
| DL_complex        |  17 | 11 |   6 |
| MLP               |  10 | 14 |   2 |

Combinations

In the following I look at datasets that are both heterogeneous and smaller (N < 200,000). A complex DL model will likely not do great in this setting.

| winner_model_type |  n |
|-------------------|----|
| Boost             | 19 |
| DL_complex        |  8 |
| MLP               | 19 |

Now, on to the details of some of the recent results that were included.

On Embeddings for Numerical Features in Tabular Deep Learning

Overview

Yura Gorishniy, Rubachev, and Babenko (2022) pit several architectures against one another, such as the standard multilayer perceptron (MLP), ResNet, and their own transformer approach (see Yuri Gorishniy et al. (2021)). Their previous work, which was summarized in my earlier post, focused on architecture, while here they focus on embedding approaches. The primary idea is to take the value of some feature and expand it into some embedding space, then use the embedding in lieu of the raw feature. It can essentially be seen as a pre-processing task.

One approach they use is piecewise linear encoding (PLE), which they at one point describe as 'a continuous alternative to one-hot encoding'². The other embedding they use is basically a Fourier transform, representing a feature with sine and cosine values at several frequencies.
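
Both embeddings are simple enough to sketch in a few lines. The following is a minimal numpy version of the two ideas as I understand them; in the paper the bin boundaries and frequencies are estimated from training data or learned, whereas here they are fixed for illustration, and edge cases are handled more carefully.

```python
import numpy as np

def ple(x, bins):
    """Piecewise linear encoding (PLE) of a numeric feature.

    `bins` holds boundaries b_0 < b_1 < ... < b_T (e.g. training-set
    quantiles). Each component of the encoding is 1 if x has fully passed
    that bin, 0 if it hasn't reached it, and a linear fraction in between,
    hence the description as 'a continuous alternative to one-hot encoding'.
    """
    x = np.asarray(x, dtype=float)[:, None]         # shape (n, 1)
    lo, hi = bins[:-1][None, :], bins[1:][None, :]  # shape (1, T)
    return np.clip((x - lo) / (hi - lo), 0.0, 1.0)  # shape (n, T)

def periodic(x, freqs):
    """Fourier-style embedding: sin/cos of the feature at several frequencies."""
    v = 2 * np.pi * np.asarray(x, dtype=float)[:, None] * np.asarray(freqs)[None, :]
    return np.concatenate([np.sin(v), np.cos(v)], axis=1)

x = np.array([0.1, 0.5, 2.3])
print(ple(x, bins=np.array([0.0, 1.0, 2.0, 3.0])))  # (3, 3) encoding
print(periodic(x, freqs=[1.0, 2.0, 4.0]).shape)     # (3, 6)
```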

Data

Models Explored

* Using proposed embeddings

Quick Summary

In general this was an interesting paper, and I liked the simple embedding approaches used. It was nice to see that they may be useful in some contexts. The Fourier transform is something that analysts (including our team at Strong) have used with boosting, so I'm a bit curious why they didn't include Boosting + embeddings in the comparison, whether for that one or for both embedding types. These embeddings can be seen as a pre-processing step, so nothing would keep someone from using them with any model (see the sketch below).
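
For instance, here is a hypothetical sketch of feeding PLE features to a boosting model with scikit-learn, reusing the `ple` function from the earlier snippet. The data are simulated just to have something to fit, and there's no claim this will improve over the raw feature; trees already split on numeric values directly.

```python
import numpy as np
from sklearn.ensemble import HistGradientBoostingRegressor
from sklearn.model_selection import train_test_split

# Simulated data, just to have something to fit.
rng = np.random.default_rng(0)
X = rng.uniform(0, 3, size=(1000, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(scale=0.1, size=1000)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Bin boundaries from training-set quantiles, as in the PLE setup;
# ple() is the function from the earlier snippet.
bins = np.quantile(X_tr[:, 0], np.linspace(0, 1, 9))

boost_raw = HistGradientBoostingRegressor().fit(X_tr, y_tr)
boost_ple = HistGradientBoostingRegressor().fit(ple(X_tr[:, 0], bins), y_tr)

print(boost_raw.score(X_te, y_te))
print(boost_ple.score(ple(X_te[:, 0], bins), y_te))
```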

Another interesting aspect was how little difference there was in model performance. For perhaps half the datasets, the differences between any of the model types were extremely small.

SAINT: Improved neural networks for tabular data via row attention and contrastive pre-training

Overview

This paper applies BERT-style attention over rows and columns, along with embedding and data augmentation. The authors distinguish the standard attention over features (columns) from intersample attention over rows. In addition, they use CutMix for data augmentation (originally devised for images), which basically combines pairs of observations to create a new observation⁴. Their model is called SAINT: the Self-Attention and Intersample Attention Transformer.
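
As a rough numpy sketch of the tabular version of that idea: each augmented row keeps most of its own feature values and swaps in a random subset from another row. This is my reading of the approach rather than the authors' code, and the swap probability is a hypothetical hyperparameter.

```python
import numpy as np

def tabular_cutmix(X, p_swap=0.3, rng=None):
    """Augment each row by swapping a random subset of its features
    with those of another randomly chosen row."""
    rng = np.random.default_rng() if rng is None else rng
    n, d = X.shape
    partners = rng.permutation(n)       # which row each row mixes with
    mask = rng.random((n, d)) < p_swap  # which features get swapped
    return np.where(mask, X[partners], X)

X = np.arange(12, dtype=float).reshape(4, 3)
print(tabular_cutmix(X, rng=np.random.default_rng(0)))
```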

Data

Models Explored

Quick Summary

This is an interesting result, but somewhat dampened by the lack of numeric targets and more heterogeneous data. The authors include small data settings, which is great, and are careful not to generalize despite some good results, which I can appreciate.

I really like the fact that they also compare a simple logistic regression to these models, because if you're not able to perform notably better than the simplest model one could use, then why would we care? The fact that logistic regression is at times competitive, and occasionally even beats the boosting/SAINT methods, gives me pause though. Perhaps some of these datasets are not complex enough to be useful in distinguishing the methods? Then again, it is realistic; while it's best not to assume as much, sometimes a linear model is appropriate for the features and target at hand.

Self-Attention Between Datapoints: Going Beyond Individual Input-Output Pairs in Deep Learning

Overview

This paper introduces Non-Parametric Transformers (NPTs), which focus on holistic processing of multiple inputs, attempting to consider an entire dataset as input as opposed to a single row. The model attempts to learn relations between data points to aid prediction. A mask identifies the points to be predicted, i.e. the entire \(X_{\textrm{not masked}}\) data is used to predict \(X_{\textrm{masked}}\). The \(X\) matrix actually includes the target (also masked vs. not masked). At prediction time, the model is able to make use of correlations with the training inputs to ultimately make its predictions.
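
Schematically, the setup looks something like this (illustrative shapes and plain numpy, not the authors' implementation): features and targets are stacked into a single matrix, the entries to be predicted are masked out, and the model is trained to reconstruct them by attending across rows.

```python
import numpy as np

rng = np.random.default_rng(0)
X_feat = rng.normal(size=(100, 5))  # features for all 100 data points
y = rng.normal(size=(100, 1))       # target

# The full data matrix includes the target as just another column.
data = np.hstack([X_feat, y])       # shape (100, 6)

# The mask marks entries the model must predict; here, the targets of
# the last 20 rows play the role of test points.
mask = np.zeros_like(data, dtype=bool)
mask[80:, -1] = True

model_input = np.where(mask, 0.0, data)  # masked entries hidden from the model
# An NPT then attends across the *rows* of model_input, so reconstructing
# data[mask] can exploit relations between the test and training points.
```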

Data

Models Explored

Quick Summary

When I first read the abstract, where they say "We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input.", I immediately was like 'What about this, that, and those?'. The key phrase was 'deep learning', because the authors note later that this idea has a very long history in the statistical modeling realm. I was glad to see that in their background of the research they explicitly noted the models that came to my mind, like Gaussian processes, kernel regression, etc. Beyond that, many are familiar with techniques like k-nn regression and predictive mean matching, so it's definitely not new to consider more than a single data point for prediction. I thought it was good of them to add k-nn regression to the model mix, even though it was not going to do well compared to the other approaches.

Though the authors acknowledge a clear thread/history here, I'm not sure this result is the fundamental shift they claim, versus a further extension/expansion into the DL domain. Even techniques that work on a single input at a time may ultimately be taking advantage of correlations among the inputs (e.g. spatial correlations in images). Also, automatic learning of feature interactions is standard even in basic regularized regression settings; here, though, their focus is on observation interactions (but see k-nn regression).

Conclusion

In the two reviews on DL for tabular data that I've done, it appears there is more work in store for DL methods applied to tabular data. While it'd be nice to have a technique that substantially improves prediction in such settings, I suspect the reported results are rosier than reality, since that is just about the case for any newly touted technique, and at least in some cases I don't think we're even making apples-to-apples comparisons.

That said, I do feel like some ground has been gained for DL applications to tabular data, in that architectures can now more consistently perform as well as boosting methods in certain settings, especially if we include the MLP. In the end though, results don't appear strong enough to warrant a switch from boosting for truly heterogeneous data, or even for tabular data in general. Maybe someday we'll have a breakthrough, but in the meantime we can just agree that messy data is hard to model, and the best tool is whichever one works for your specific situation.

Guidelines for future research

I was thinking about what would be a convincing result: the type of setting and setup where, if a DL technique were consistently performing statistically better than boosting methods, I'd be impressed. So I've made a list of things I'd like to see more of, things that would make for a better story if a DL method were to beat out the other techniques.

References

Gorishniy, Yura, Ivan Rubachev, and Artem Babenko. 2022. “On Embeddings for Numerical Features in Tabular Deep Learning.” arXiv preprint arXiv:2203.05556.
Gorishniy, Yuri, Ivan Rubachev, Valentin Khrulkov, and Artem Babenko. 2021. “Revisiting Deep Learning Models for Tabular Data.” arXiv preprint arXiv:2106.11959.
Kadra, Arlind, Marius Lindauer, Frank Hutter, and Josif Grabocka. 2021. “Regularization Is All You Need: Simple Neural Nets Can Excel on Tabular Data.” arXiv preprint arXiv:2106.11189.
Shwartz-Ziv, Ravid, and Amitai Armon. 2021. “Tabular Data: Deep Learning Is Not All You Need.” arXiv preprint arXiv:2106.03253.

  1. I don’t refer to numeric targets as ‘regression’ because that’s silly for so many reasons. 😄

  2. A quick look suggests it’s not too dissimilar from a b-spline.

  3. I’ll let you go ahead and make your own prediction about which method was best on that data set.

  4. It’s not clear to me how well this CutMix approach would actually preserve feature correlations. My gut tells me the feature correlations under this approach would be reduced relative to the observed data, since the variability of the new observations is likely reduced. This may ultimately not matter for predictive purposes or their use in the embeddings. However, I wonder if something like SMOTE, random (bootstrap) sampling, or other DL methods like autoencoders might do the same or better.


Corrections

If you see mistakes or want to suggest changes, please create an issue on the source repository.

Reuse

Text and figures are licensed under Creative Commons Attribution CC BY-SA 4.0. Source code is available at https://github.com/m-clark/m-clark.github.io, unless otherwise noted. The figures that have been reused from other sources don't fall under this license and can be recognized by a note in their caption: "Figure from ...".

Citation

For attribution, please cite this work as

Clark (2022, May 1). Michael Clark: Deep Learning for Tabular Data. Retrieved from https://m-clark.github.io/posts/2022-04-01-more-dl-for-tabular/

BibTeX citation

@misc{clark2022deep,
  author = {Clark, Michael},
  title = {Michael Clark: Deep Learning for Tabular Data},
  url = {https://m-clark.github.io/posts/2022-04-01-more-dl-for-tabular/},
  year = {2022}
}