A continuing exploration
In a previous post, I offered a summary of several articles that came out over the summer of 2021 regarding the application of deep learning (DL) methods to tabular data. DL has shown astounding success in natural language processing, computer vision, and other fields, but when it comes to the sorts of data common elsewhere, especially where datasets are smaller and of mixed source and type (e.g. demographic, social science, or biological data), results have mostly been unimpressive for complex DL architectures. In particular, it did not appear that DL methods could consistently compete with, much less consistently beat, common machine learning (ML) approaches such as gradient boosting (e.g. XGBoost). Here I provide a bit of an update, as another few articles have come along continuing the fight.
I collected most of the results from the articles summarized here and those covered in the previous post to see if we could come to any general conclusions about which methods are best, or work best in certain settings. In the following tables, I excluded datasets I knew to be image data, as well as datasets where I thought results were indistinguishable across all models tested (e.g. less than 1% difference in accuracy). This left comparisons for 92 datasets across six articles. However, it’s important to note that these were not independent datasets or studies. For example, Gorishniy et al. are the source of two of the papers, with essentially the same testing situations, and other datasets were common across papers (e.g. Higgs Boson). In the rare situations where there was a tie, I gave the nod to boosting methods because a. the whole point is to do better than those, b. they are the easier model to implement, and c. they are not always given the same advantages in these studies (e.g. pre-processing).
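As a rough illustration of how such tallies can be produced, here is a minimal pandas sketch; the `results` data frame and its column names are hypothetical stand-ins, not the actual collected results.

```python
import pandas as pd

# Hypothetical stand-in for the collected results: one row per dataset,
# recording the winning model type and the dataset's feature type.
results = pd.DataFrame({
    "winner_model_type": ["Boost", "MLP", "DL_complex", "Boost"],
    "feature_type": ["Heterogeneous", "All Num", "All Num", "All Cat"],
})

# Count wins by model type and feature type, as in the tables below.
print(pd.crosstab(results["winner_model_type"], results["feature_type"]))
```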
The following shows results by feature type.
The results suggest that current DL approaches’ strength is mostly with purely numeric data; for heterogeneous data, a simpler MLP or boosting will generally prevail. I initially thought that boosting would do even better with heterogeneous data, and I still suspect that with more heterogeneous data and a more equal footing, the results would tilt even further in its favor.
| winner_model_type | All Cat | All Num | Heterogeneous | Min. Combo |
|---|---|---|---|---|
| Boost | 2 | 10 | 14 | 6 |
| MLP | 2 | 4 | 9 | 11 |
| DL_complex | 0 | 22 | 7 | 5 |

| winner_model_type | All Cat | All Num | Any Combo |
|---|---|---|---|
| Boost | 2 | 10 | 20 |
| MLP | 2 | 4 | 20 |
| DL_complex | 0 | 22 | 12 |
The following suggests that complex DL methods are going to require a lot of data to perform better. This isn’t that surprising, but the difference here is quite dramatic. Interestingly, MLP methods worked well with fewer features. N total in this case means the total dataset size reported (not just training).
| winner_model_type | N features | N total |
|---|---|---|
| Boost | 209 | 133,309 |
| DL_complex | 207 | 530,976 |
| MLP | 114 | 114,164 |
In the following we compare results for binary (bin), multiclass (mc), and numeric (num) targets¹, but there’s no strong conclusion to draw here. The main thing to glean is that these papers do not test numeric targets nearly enough. Across the dozens of disciplines and countless datasets I’ve come across in various settings, if anything, this ratio should be reversed.
| winner_model_type | bin | mc | num |
|---|---|---|---|
| Boost | 17 | 10 | 5 |
| DL_complex | 17 | 11 | 6 |
| MLP | 10 | 14 | 2 |
In the following, I look at any heterogeneous, smaller dataset (N < 200,000). A complex DL model will likely not do great in this setting.
| winner_model_type | n |
|---|---|
| Boost | 19 |
| DL_complex | 8 |
| MLP | 19 |
Now, on to the details of some of the recent results that were included.
Gorishniy, Rubachev, and Babenko (2022) pit several architectures against one another, such as a standard multilayer perceptron (MLP), ResNet, and their own transformer approach (see Gorishniy et al. (2021)). Their previous work, which was summarized in my earlier post, focused on the architecture, while here they focus on embedding approaches. The primary idea is to take the value of some feature and expand it to some embedding space, then use the embedding in lieu of the raw feature. It can essentially be seen as a pre-processing task.
One approach they use is piecewise linear encoding (PLE), which they at one point describe as ‘a continuous alternative to the one-hot encoding’². Another embedding they use is basically a Fourier transform.
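To make the idea concrete, here is a minimal sketch of a piecewise linear encoding for a single numeric feature, assuming quantile-based bin edges computed from training data; it is an illustration of the general idea, not the authors’ implementation.

```python
import numpy as np

def piecewise_linear_encode(x, bin_edges):
    """Encode a numeric feature as piecewise linear values over bins.

    Each output component is 0 if x lies below that bin, 1 if above it,
    and the fractional position within the bin otherwise -- a 'continuous
    one-hot' of sorts.
    """
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    lower, upper = bin_edges[:-1], bin_edges[1:]
    frac = (x - lower) / (upper - lower)   # position within each bin
    return np.clip(frac, 0.0, 1.0)         # shape (n_samples, n_bins)

# Example with bin edges taken from training-data quantiles
rng = np.random.default_rng(0)
x_train = rng.normal(size=1000)
edges = np.quantile(x_train, np.linspace(0, 1, 9))   # 8 bins
x_ple = piecewise_linear_encode(x_train, edges)      # (1000, 8)
```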
In general this was an interesting paper, and I liked the simple embedding approaches used. It was nice to see that they may be useful in some contexts. The Fourier transform is something that analysts (including our team at Strong) have used with boosting, so I’m a bit curious why they didn’t include Boosting + embeddings in the comparison, for that or both embedding types. These embeddings can be seen as a pre-processing step, so nothing would keep someone from using them with any model.
Another interesting aspect was how little difference there was in model performance. For what seemed like half the datasets, the differences across model types were extremely small.
This paper applies BERT-style attention over rows and columns, along with embeddings and data augmentation. The authors distinguish standard attention over features from intersample attention over rows. In addition, they use CutMix for data augmentation (originally devised for images), which basically combines pairs of observations to create a new observation⁴. Their model is called SAINT, the Self-Attention and Intersample Attention Transformer.
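As a rough illustration of that style of augmentation, the following sketch mixes each row with a random partner row by swapping a random subset of feature values; the swap probability and other details are assumptions for illustration and differ from the paper’s exact scheme.

```python
import numpy as np

def tabular_cutmix(X, p_swap=0.3, rng=None):
    """CutMix-style augmentation for tabular rows (rough sketch).

    For each row, pick a random partner row and take a random subset of
    its feature values, creating a new, mixed observation.
    """
    rng = rng or np.random.default_rng()
    n, d = X.shape
    partners = rng.permutation(n)          # partner row for each row
    mask = rng.random((n, d)) < p_swap     # which features come from the partner
    return np.where(mask, X[partners], X)

X = np.random.default_rng(1).normal(size=(5, 4))
X_aug = tabular_cutmix(X, p_swap=0.3)
```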
It seems SAINT does quite well on some of the data, and its average AUROC across all datasets is higher than XGBoost’s.
The main table shows only 9 datasets though, which they call ‘representative’, but it’s not clear what that means when you only have 16 to start with. One dataset showed near perfect classification for all models³, so it will not be considered, leaving 15 in total.
SAINT benefits from data augmentation. This could have been applied to any of the other models, but that doesn’t appear to have been done.
At least they also used some form of logistic regression as a baseline, though I couldn’t find details on its implementation (e.g. regularization, including interactions). I don’t think this sort of simple baseline is utilized enough.
This is an interesting result, but somewhat dampened by the lack of numeric targets and more heterogeneous data. The authors do include small-data settings, which is great, and are careful not to generalize despite some good results, which I can appreciate.
I really like the fact that they also compare a simple logistic regression to these models, because if you can’t perform notably better than the simplest model one could use, then why would we care? The fact that logistic regression is at times competitive, and even occasionally beats the boosting/SAINT methods, gives me pause though. Perhaps some of these data are not sufficiently complex to be useful in distinguishing these methods? It is realistic though; while it’s best not to assume as much, sometimes a linear model is appropriate given the features and target at hand.
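For reference, here is a minimal sketch of the kind of penalized logistic regression baseline I have in mind, with interactions and regularization via scikit-learn; the pipeline settings are illustrative assumptions rather than what any of the papers actually used.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.linear_model import LogisticRegressionCV

# Standardize, add pairwise interactions, then fit an L2-penalized
# logistic regression with the penalty strength chosen by cross-validation.
baseline = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LogisticRegressionCV(Cs=10, penalty="l2", max_iter=5000),
)
# baseline.fit(X_train, y_train)
# baseline.score(X_test, y_test)
```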
This paper introduces Non-Parametric Transformers, which focus on holistic processing of multiple inputs, attempting to consider an entire dataset as input as opposed to a single row. The model attempts to learn relations between data points to aid prediction. A mask identifies the prediction points, so that the entire \(X_{\text{not masked}}\) data is used to predict \(X_{\text{masked}}\). The \(X\) matrix actually includes the target (also masked vs. not). At prediction time, the model is able to make use of the correlations among the training inputs to ultimately make a prediction.
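The following toy sketch is only meant to illustrate that setup, i.e. stacking features and target into one matrix and masking the target entries to be predicted; it is not the authors’ code, and the masking value and shapes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(6, 3))
y = rng.integers(0, 2, size=6).astype(float)

data = np.column_stack([X, y])           # the target is just another column
mask = np.zeros_like(data, dtype=bool)
mask[4:, -1] = True                      # 'test' rows: target entries are masked

masked_data = np.where(mask, 0.0, data)  # the model only sees unmasked values
# An NPT-style model takes (masked_data, mask) and predicts the masked
# entries, using relations to the unmasked rows as well as the features.
```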
When I first read the abstract, where they say “We challenge a common assumption underlying most supervised deep learning: that a model makes a prediction depending only on its parameters and the features of a single input.”, I immediately thought, ‘What about this, that, and those?’. The key phrase was ‘deep learning’, because the authors note later that this has a very long history in the statistical modeling realm. I was glad to see that in their background of the research they explicitly noted the models that came to my mind, like Gaussian processes, kernel regression, etc. Beyond that, many are familiar with techniques like k-nn regression and predictive mean matching, so it’s definitely not new to consider more than a single data point for prediction. I thought it was good of them to add k-nn regression to the model mix, even though it was not going to do well compared to the other approaches.
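As a trivial reminder of that older idea, here is a k-nearest-neighbors regression sketch, where each prediction depends directly on other data points rather than only on learned parameters; the data are simulated purely for illustration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(123)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.2, size=500)

# Each prediction is an average over the 10 nearest training observations.
knn = KNeighborsRegressor(n_neighbors=10).fit(X, y)
print(knn.predict([[0.5]]))
```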
Though the authors acknowledge a clear thread/history here, I’m not sure this result is the fundamental shift they claim, versus a further extension/expansion into the DL domain. Even techniques that work on a single input at a time may ultimately be taking advantage of correlations among the inputs (e.g. spatial correlations in images). Also, automatic learning of feature interactions is standard even in basic regularized regression settings, though here their focus is on observation interactions (but again, see k-nn regression).
Based on the two reviews of DL for tabular data that I’ve done, it appears there is more work in store for DL methods applied to tabular data. While it’d be nice to have a technique that would substantially improve prediction in such settings, I do suspect the reported results are rosier than they would be in practice, since that is the case for just about any newly touted technique, and, at least in some cases, I don’t think we’re even making apples-to-apples comparisons.
That said, I do feel like some ground has been gained for DL applications to tabular data, in that the architectures can now more consistently perform as well as boosting methods in certain settings, especially if we include the MLP. In the end though, the results don’t appear strong enough to warrant a switch from boosting for truly heterogeneous data, or even tabular data in general. Maybe someday we’ll have a breakthrough, but in the meantime, we can just agree that messy data is hard to model, and the best tool is whichever one works for your specific situation.
I was thinking about what would be a convincing result, i.e. the type of setting and setup where, if a DL technique consistently performed statistically better than boosting methods, I’d be impressed. So I’ve made a list of things I’d like to see more of, which would also make for a better story if a DL method were to beat out the other techniques.
* Always use heterogeneous data. For giggles, let’s say 20%+ of the minority feature type.
* Features should at least be minimally correlated, if not notably so.
* Image data results are not interesting (why would we use boosting on this in practice?).
* Numeric targets should be at least as much of a focus as categorical targets.
* Include ‘small’ datasets.
* Include very structured data (e.g. clustered/repeated observations, geographical points, time series).
* Use a flexible generalized additive model or similar penalized regression with interactions as a baseline statistical model.
* Maybe add survival targets to the mix.
* If a pre-processing step is done outside of modeling, it should likely be applied to the non-DL methods as well for a better comparison, especially if we’re only considering predictive accuracy and don’t care much about interpretation (see the sketch after this list).
* Note your model variants before analyzing any data. Tweaking/torturing the model architecture after results don’t pan out is akin to p-hacking in the statistical realm, and likewise wastes both the researcher’s and reader’s time.
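As a self-contained sketch of that pre-processing point (and of the Boosting + embeddings comparison mentioned earlier), here is what adding simple periodic (sine/cosine) expansions of numeric features before fitting a boosting model might look like; the data, frequencies, and model settings are all illustrative assumptions.

```python
import numpy as np
import xgboost as xgb

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))
y = (np.sin(3 * X[:, 0]) + X[:, 1] * X[:, 2] > 0).astype(int)

# Expand each numeric feature with sine/cosine terms at a few assumed frequencies.
freqs = np.array([1.0, 2.0, 4.0])
ang = X[:, :, None] * freqs                              # shape (n, d, k)
X_fourier = np.concatenate([np.sin(ang), np.cos(ang)], axis=2).reshape(len(X), -1)

# Fit boosting on the original features plus the periodic expansions.
model = xgb.XGBClassifier(n_estimators=300, max_depth=4, learning_rate=0.1)
model.fit(np.hstack([X, X_fourier]), y)
```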
1. I don’t refer to numeric targets as ‘regression’ because that’s silly for so many reasons. 😄
2. A quick look suggests it’s not too dissimilar from a B-spline.
3. I’ll let you go ahead and make your own prediction about which method was best on that dataset.
4. It’s not clear to me how well this CutMix approach would actually preserve feature correlations. My gut tells me the feature correlations of this approach would be reduced relative to the observed data, since the variability of the new observations is likely reduced. This ultimately may not matter for predictive purposes or their ultimate use in embeddings. However, I wonder if something like SMOTE, random (bootstrap) sampling, other DL methods like autoencoders, or similar approaches might do as well or better.