All data can be found in the book’s repo. Depending on when you access it, there may be more or less data available. We’ll try to clean it up to make it more clear eventually, but it’s easiest to use the code in the demonstrations to download the data directly.
Movie reviews
The movie reviews dataset was a fun way to use an LLM to create movie titles and reviews in a specific way, as well as other features. With features in hand, we then generated a rating outcome with specific feature-target relationships. It has 1000 rows and the following columns:
title
: The title of the movie
review_year
: The year the review was written
age
: The age of the reviewer
children_in_home
: The number of children in the reviewer’s home
education
: The education level of the reviewer (Post-Graduate, Completed College, Completed High School)
gender
: The gender of the reviewer (male or female)
work_status
: The work status of the reviewer (Employed, Retired, Unemployed, Student)
genre
: The genre of the movie
release_year
: The year the movie was released
length_minutes
: The length of the movie in minutes
season
: The season the movie was released (e.g., Fall, Winter)
total_reviews
: The total number of reviews for the movie
rating
: The rating of the movie
review_text
: The text of the review
word_count
: The number of words in the review
review_year_0
: The review year starting from 0
release_year_0
: The release year starting from 0
*_sc
: Scaled (standardized) versions of age, length_minutes, total_reviews, and word_count
rating_good
: A binary version of rating, where 1 is a good rating (>= 3) and 0 is a bad rating (< 3)
Short link:
Repo File:
World Happiness Report
The World Happiness Report is a survey of the state of global happiness that ranks countries by how ‘happy’ their citizens perceive themselves to be. You can also find additional details in their supplemental documentation. Our 2018 data is from what was originally reported at that time (figure 2.2 in the corresponding report), and it also contains a life ladder score from the most recent survey, which is similar and very highly correlated.
The datasets contains the following columns:
country
: The country name
year
: The year of the survey
life_ladder
: The happiness score
log_gdp_per_capita
: The log of GDP per capita
social_support
: The social support score
healthy_life_expectancy_at_birth
: The healthy life expectancy at birth
freedom_to_make_life_choices
: The freedom to make life choices score
generosity
: The generosity score
perceptions_of_corruption
: The perceptions of corruption score
positive_affect
: The positive affect score
negative_affect
: The negative affect score
confidence_in_national_government
: The confidence in national government score
happiness_score
: The happiness score
dystopia_residual
: The dystopia residual score (difference from a ‘least happy’ country)
In addition there are standardized/scaled versions of the features, which are suffixed with _sc
.
Short links:
Repo Files:
data/world_happiness_all_years.csv
data/world_happiness_2018.csv
Heart Disease UCI
This classic dataset comes from the UCI ML repository. We took a version from Kaggle, and features and target were renamed to be more intelligible. Here is a brief description from UCI:
This database contains 76 attributes, but all published experiments refer to using a subset of 14 of them. In particular, the Cleveland database is the only one that has been used by ML researchers to date. The “goal” field refers to the presence of heart disease in the patient. It is integer valued from 0 (no presence) to 4. Experiments with the Cleveland database have concentrated on simply attempting to distinguish presence (values 1,2,3,4) from absence (value 0).
age
: Age in years
male
: ‘yes’ or ‘no’
chest_pain_type
: ‘typical’, ‘atypical’, ‘non-anginal’, ‘asymptomatic’
resting_bp
: Resting blood pressure (mm Hg)
cholesterol
: Serum cholesterol (mg/dl)
fasting_blood_sugar
: ‘> 120 mg/dl’ or ‘<= 120 mg/dl’
resting_ecg
: ‘normal’, ‘left ventricular hypertrophy’, ‘ST-T wave abnormality’
max_heart_rate
: Maximum heart rate achieved
exercise_induced_angina
: ‘yes’ or ‘no’
st_depression
: ST depression induced by exercise relative to rest
slope
: ‘upsloping’, ‘flat’, ‘downsloping’
num_major_vessels
: Number of major vessels (0-3) colored by fluoroscopy
thalassemia
: ‘normal’, ‘fixed defect’, ‘reversible defect’
heart_disease
: ‘yes’ or ‘no’
Short links:
Repo Files:
data/heart_disease_processed.csv
data/heart_disease_processed_numeric_sc.csv
Fish
This is a very simple data set with a count target variable. It’s also good if you want to try your hand at zero-inflated models. The background is that state wildlife biologists want to model how many fish are being caught by fishermen at a state park.
nofish
: We’ve never seen this explained. Originally 0 and 1, 0 is equivalent to livebait == ‘yes’, so it may be whether the primary motivation of the camping trip is for fishing or not.
livebait
: Whether live bait was used or not
camper
: Whether or not they brought a camper
persons
: How many total persons on the trip
child
: How many children present
count
: Number of fish caught
Short Link:
Repo File: