Appendix C — Dataset Descriptions

C.1 Movie Reviews

The movie reviews dataset is synthetic: we used an LLM to generate movie titles and reviews, as well as other features, in a specified way. With the features in hand, we then generated the rating outcome with specific feature-target relationships. It has 1000 rows and the following columns:

title: The title of the movie
review_year: The year the review was written
age: The age of the reviewer
children_in_home: The number of children in the reviewer’s home
education: The educational level of the reviewer (Post-Graduate, Completed College, Completed High School)
gender: The gender of the reviewer (Male or Female)
work_status: The work status of the reviewer (Employed, Retired, Unemployed, Student)
genre: The genre of the movie
release_year: The year the movie was released
length_minutes: The length of the movie in minutes
season: The season the movie was released (e.g., Fall, Winter)
total_reviews: The total number of reviews for the movie
rating: The rating of the movie
review_text: The text of the review
word_count: The number of words in the review
review_year_0: The review year starting from 0
release_year_0: The release year starting from 0
*_sc: Scaled (standardized) versions of age, length_minutes, total_reviews, and word_count
rating_good: A binary version of rating, where 1 is a good rating (>= 3) and 0 is a bad rating (< 3)
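
The derived columns at the end of this list can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used to build the file: the toy values are invented, "starting from 0" is assumed to mean subtracting the earliest year, and the scaled columns are assumed to use the sample standard deviation.

```python
from statistics import mean, stdev

# Toy values for illustration only; they are not rows from movie_reviews.csv.
review_year = [2000, 2017, 2022]
age = [18.0, 47.0, 80.0]
rating = [2.9, 3.0, 4.2]

# Assumption: "starting from 0" means subtracting the earliest year.
review_year_0 = [y - min(review_year) for y in review_year]

# Assumption: standardization uses the sample standard deviation.
age_mean, age_sd = mean(age), stdev(age)
age_sc = [(a - age_mean) / age_sd for a in age]

# From the column description: 1 if rating >= 3, otherwise 0.
rating_good = [int(r >= 3) for r in rating]
```

After standardization, a scaled column has mean 0 and standard deviation 1, which is a quick way to check a *_sc column against its source column.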

Link:

https://tinyurl.com/moviereviewsdata

Repo File:

data/movie_reviews.csv

Table C.1: Movie Reviews Dataset (String)

variable	empty	n_unique
title	0.0	100.0
education	0.0	3.0
gender	0.0	2.0
work_status	0.0	4.0
genre	0.0	8.0
season	0.0	4.0
review_text	0.0	442.0

Table C.2: Movie Reviews Dataset (Numeric)

variable	mean	sd	min	med	max
review_year	2015.8	5.1	2000.0	2017.0	2022.0
age	46.9	18.3	18.0	47.0	80.0
children_in_home	0.4	0.7	0.0	0.0	3.0
release_year	2008.1	9.6	1983.0	2010.0	2020.0
length_minutes	121.0	11.5	98.0	120.0	147.0
total_reviews	4921.7	2837.9	374.0	4464.0	9926.0
rating	3.1	0.6	1.0	3.1	5.0
word_count	10.3	5.1	2.0	9.0	32.0
rating_good	0.6	0.5	0.0	1.0	1.0

C.2 World Happiness Report

The World Happiness Report is a survey of the state of global happiness that ranks countries by how ‘happy’ their citizens perceive themselves to be. Additional details are available in the report’s supplemental documentation. Our 2018 data uses the values originally reported at that time (figure 2.2 in the corresponding report). It also contains a life ladder score from the most recent survey, which is similar to the originally reported score and very highly correlated with it.

The datasets contain the following columns:

country: The country name
year: The year of the survey
life_ladder: The life ladder (happiness) score; in the 2018 data, this is the value from the most recent survey
log_gdp_per_capita: The log of GDP per capita
social_support: The social support score
healthy_life_expectancy_at_birth: The healthy life expectancy at birth
freedom_to_make_life_choices: The freedom to make life choices score
generosity: The generosity score
perceptions_of_corruption: The perceptions of corruption score
positive_affect: The positive affect score
negative_affect: The negative affect score
confidence_in_national_government: The confidence in national government score
happiness_score: The happiness score
dystopia_residual: The dystopia residual score (difference from a ‘least happy’ country)

In addition, there are standardized/scaled versions of the features, which are suffixed with _sc.

Links:

All years: https://tinyurl.com/worldhappinessallyears
2018: https://tinyurl.com/worldhappiness2018

Repo Files:

data/world_happiness_all_years.csv
data/world_happiness_2018.csv

Table C.3: World Happiness Report Dataset (All Years)

variable	n_missing	mean	sd	min	med	max
year	0.0	2014.2	4.7	2005.0	2014.0	2022.0
happiness_score	0.0	5.5	1.1	1.3	5.4	8.0
log_gdp_per_capita	20.0	9.4	1.2	5.5	9.5	11.7
social_support	13.0	0.8	0.1	0.2	0.8	1.0
healthy_life_expectancy_at_birth	54.0	63.3	6.9	6.7	65.0	74.5
freedom_to_make_life_choices	33.0	0.7	0.1	0.3	0.8	1.0
generosity	73.0	0.0	0.2	−0.3	0.0	0.7
perceptions_of_corruption	116.0	0.7	0.2	0.0	0.8	1.0
positive_affect	24.0	0.7	0.1	0.2	0.7	0.9
negative_affect	16.0	0.3	0.1	0.1	0.3	0.7
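
The columns of a summary table like this one can be reproduced for a single feature with a short helper. This is a sketch only: the `summarize` function and the toy values are hypothetical, and it assumes that missing values are dropped before the statistics are computed and that results are rounded to one decimal place.

```python
from statistics import mean, median, stdev

def summarize(values):
    """Summarize one numeric column, treating None as a missing value."""
    observed = [v for v in values if v is not None]
    return {
        "n_missing": len(values) - len(observed),
        "mean": round(mean(observed), 1),
        "sd": round(stdev(observed), 1),
        "min": min(observed),
        "med": median(observed),
        "max": max(observed),
    }

# Toy column for illustration only; not taken from the dataset.
toy_generosity = [-0.3, 0.0, None, 0.1, 0.7, None]
summary = summarize(toy_generosity)
```

Here `summary["n_missing"]` counts the two `None` entries, and the remaining statistics are computed on the four observed values.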

Table C.4: World Happiness Report Dataset (2018)

variable	mean	sd	min	med	max
life_ladder	5.6	1.1	2.7	5.5	7.9
log_gdp_per_capita	9.3	1.2	6.6	9.5	11.5
social_support	0.8	0.1	0.5	0.8	1.0
healthy_life_expectancy_at_birth	64.7	6.8	48.2	66.7	75.0
freedom_to_make_life_choices	0.8	0.1	0.4	0.8	1.0
generosity	0.0	0.2	−0.3	0.0	0.5
perceptions_of_corruption	0.7	0.2	0.2	0.8	1.0
positive_affect	0.7	0.1	0.4	0.7	0.9
negative_affect	0.3	0.1	0.2	0.3	0.5
confidence_in_national_government	0.5	0.2	0.1	0.5	1.0
happiness_score	5.4	1.1	3.3	5.4	7.6
dystopia_residual	2.0	0.5	0.3	1.9	2.9

C.3 Heart Disease UCI

This classic dataset comes from the UCI ML repository. We took a version from Kaggle and renamed the features and target to be more intelligible. Here is a brief description from UCI:

This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).

age: Age in years
male: ‘yes’ or ‘no’
chest_pain_type: ‘typical’, ‘atypical’, ‘non-anginal’, ‘asymptomatic’
resting_bp: Resting blood pressure (mm-Hg)
cholesterol: Serum cholesterol (mg/dl)
fasting_blood_sugar: ‘> 120 mg/dl’ or ‘<= 120 mg/dl’
resting_ecg: ‘normal’, ‘left ventricular hypertrophy’, ‘ST-T wave abnormality’
max_heart_rate: Maximum heart rate achieved
exercise_induced_angina: ‘yes’ or ‘no’
st_depression: ST depression induced by exercise relative to rest
slope: ‘upsloping’, ‘flat’, ‘downsloping’
num_major_vessels: Number of major vessels (0-3) colored by fluoroscopy
thalassemia: ‘normal’, ‘fixed defect’, ‘reversible defect’
heart_disease: ‘yes’ or ‘no’
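
The UCI description above distinguishes presence (values 1, 2, 3, 4) from absence (value 0). A minimal sketch of that collapse is shown here. The function name `to_heart_disease` is hypothetical, and mapping onto the ‘yes’ and ‘no’ labels of heart_disease is an assumption about how the processed file was derived.

```python
def to_heart_disease(goal):
    """Collapse the original 0-4 'goal' value into presence or absence."""
    if goal not in range(5):
        raise ValueError(f"expected an integer from 0 to 4, got {goal!r}")
    # Assumption: any nonzero value counts as presence of heart disease.
    return "yes" if goal > 0 else "no"

labels = [to_heart_disease(g) for g in [0, 1, 2, 3, 4]]
```

Only a value of 0 maps to ‘no’; every other valid value maps to ‘yes’.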

Links:

Processed: https://tinyurl.com/heartdiseaseprocessed
Numeric features only: https://tinyurl.com/heartdiseaseprocessednumeric

Repo Files:

data/heart_disease_processed.csv
data/heart_disease_processed_numeric_sc.csv

Table C.5: Heart Disease UCI Dataset (String)

variable	empty	n_unique
chest_pain_type	0.0	4.0
fasting_blood_sugar	0.0	2.0
resting_ecg	0.0	3.0
exercise_induced_angina	0.0	2.0
slope	0.0	3.0
thalassemia	0.0	3.0
heart_disease	0.0	2.0

Table C.6: Heart Disease UCI Dataset (Numeric)

variable	mean	sd	min	med	max
age	54.5	9.0	29.0	56.0	77.0
male	0.7	0.5	0.0	1.0	1.0
resting_bp	131.7	17.8	94.0	130.0	200.0
cholesterol	247.4	52.0	126.0	243.0	564.0
max_heart_rate	149.6	22.9	71.0	153.0	202.0
st_depression	1.1	1.2	0.0	0.8	6.2
num_major_vessels	0.7	0.9	0.0	0.0	3.0

C.4 Fish

This is a very simple dataset with a count target variable. It’s also good if you want to try your hand at zero-inflated models. The background is that state wildlife biologists want to model how many fish are being caught by fishermen at a state park.

nofish: We’ve never seen this explained. It was originally coded 0 and 1, and 0 is equivalent to livebait equal to ‘yes’, so it may indicate whether or not fishing was the primary motivation for the camping trip.
livebait: Whether or not live bait was used
camper: Whether or not they brought a camper
persons: How many total persons on the trip
child: How many children present
count: Number of fish caught

Link:

https://tinyurl.com/fishcountdata

Repo File:

data/fish.csv

Table C.7: Fish Dataset (String)

variable	empty	n_unique
nofish	0.0	2.0
livebait	0.0	2.0
camper	0.0	2.0

Table C.8: Fish Dataset (Numeric)

variable	mean	sd	min	med	max
persons	2.5	1.1	1.0	2.0	4.0
child	0.7	0.9	0.0	0.0	3.0
count	3.3	11.6	0.0	0.0	149.0
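
The summary values for count hint at why zero-inflated models are worth trying. The rough check below uses only the mean and median reported in Table C.8. It treats the counts as if they came from a single Poisson distribution with that mean, which is a simplifying assumption; a fitted regression model would let the mean vary with the features.

```python
import math

# Summary values for `count` as reported in Table C.8.
count_mean = 3.3
count_median = 0.0

# Under a Poisson model with this mean, P(count = 0) = exp(-mean).
poisson_zero_share = math.exp(-count_mean)

# A median of 0 means at least half of the observed counts are 0.
observed_zero_share_lower_bound = 0.5 if count_median == 0 else 0.0

# True when the data hold more zeros than the simple Poisson model expects.
excess_zeros = observed_zero_share_lower_bound > poisson_zero_share
```

A plain Poisson model with mean 3.3 would expect only about 4% zeros, while the data contain at least 50%, so the excess is large.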