Appendix B — Dataset Descriptions

B.1 Movie reviews

B.2 World Happiness Report

B.3 Heart Failure

Dataset from Davide Chicco, Giuseppe Jurman: “Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Medical Informatics and Decision Making 20, 16 (2020)

https://www.kaggle.com/datasets/andrewmvd/heart-failure-clinical-data

Cardiovascular diseases (CVDs) are the number 1 cause of death globally, taking an estimated 17.9 million lives each year, which accounts for 31% of all deaths worldwide. Heart failure is a common event caused by CVDs and this dataset contains 12 features that can be used to predict mortality by heart failure.

Most cardiovascular diseases can be prevented by addressing behavioral risk factors such as tobacco use, unhealthy diet and obesity, physical inactivity and harmful use of alcohol using population-wide strategies.

People with cardiovascular disease or who are at high cardiovascular risk (due to the presence of one or more risk factors such as hypertension, diabetes, hyperlipidaemia or already established disease) need early detection and management wherein a machine learning model can be of great help.

age: Age

anaemia: Decrease of red blood cells or hemoglobin (boolean)

creatinine_phosphokinase: Level of the CPK enzyme in the blood (mcg/L)

diabetes: If the patient has diabetes (boolean)

ejection_fraction: Percentage of blood leaving the heart at each contraction (percentage)

high_blood_pressure: If the patient has hypertension (boolean)

platelets: Platelets in the blood (kiloplatelets/mL)

serum_creatinine: Level of serum creatinine in the blood (mg/dL)

serum_sodium: Level of serum sodium in the blood (mEq/L)

sex: Woman or man (binary)

smoking: If the patient smokes or not (boolean)

time: Follow-up period (days)

DEATH_EVENT: If the patient deceased during the follow-up period (boolean)

For booleans: Sex - Gender of patient Male = 1, Female =0 (renamed for our data) Age - Age of patient Diabetes - 0 = No, 1 = Yes Anaemia - 0 = No, 1 = Yes High_blood_pressure - 0 = No, 1 = Yes Smoking - 0 = No, 1 = Yes DEATH_EVENT - 0 = No, 1 = Yes

B.4 Heart Disease UCI

Dataset from Kaggle: https://www.kaggle.com/ronitf/heart-disease-uci

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to this date.

Attribute Information:

    name     role         type demographic                                        description  units missing_values

0 age Feature Integer Age None years no 1 sex Feature Categorical Sex None None no 2 cp Feature Categorical None None None no 3 trestbps Feature Integer None resting blood pressure (on admission to the ho… mm Hg no 4 chol Feature Integer None serum cholestoral mg/dl no 5 fbs Feature Categorical None fasting blood sugar > 120 mg/dl None no 6 restecg Feature Categorical None None None no 7 thalach Feature Integer None maximum heart rate achieved None no 8 exang Feature Categorical None exercise induced angina None no 9 oldpeak Feature Integer None ST depression induced by exercise relative to … None no 10 slope Feature Categorical None None None no 11 ca Feature Integer None number of major vessels (0-3) colored by flour… None yes 12 thal Feature Categorical None None None yes 13 num Target Integer None diagnosis of heart disease None no

Features and target were renamed for our data.

B.5 Fish

State wildlife biologists want to model how many fish are being caught by fishermen at a state park. A very simple data set with a count target variable

fish

A tibble of 250 rows and 6 columns. Good for demonstrating zero-inflated models.

nofish - We’ve never seen this explained. Originally 0 and 1, 0 is equivalent to livebait == ‘yes’, so it may be whether the primary motivation of the camping trip is for fishing or not.

livebait whether live bait was used or not

camper whether or not they brought a camper

persons how many total persons on the trip

child how many children present

count number of fish count