Notes: ML Project Checklist
Appendix A: "Machine Learning Project Checklist" from the book Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron, annotated with details and suggestions from other chapters as well as my own experience. (Caveat: this experience is currently limited to writing notebooks for Kaggle competitions :))
Project Checklist
- Look at the big picture
- Get the data
- Explore and visualize the data to gain insights
- Prepare the data for ML algorithms
- Select a model and train it
- Fine-tune your model
- Present your solution
- Launch, monitor, and maintain your system
Look at the big picture
- Define the problem in business terms
- How will the solution be used?
- What are the current solutions/workarounds (if any)?
- How should you frame this problem?
- Supervised, unsupervised, semi-supervised, self-supervised, reinforcement learning?
- Classification, regression, or something else?
- Single or multiple regression (i.e. one or many input features)? Univariate or multivariate regression (i.e. one or many outputs)?
- Batch or online learning?
- Batch or online predictions?
- Interpretability vs. accuracy?
- How should performance be measured?
- RMSE generally preferred for regression
- Classification:
- Accuracy is common, but not great for skewed datasets
- Look at the confusion matrix: true pos, true neg, false pos, false neg
- Precision and recall
- Precision (accuracy of positive predictions): TP / (TP + FP)
- Recall (how many actual positives were detected): TP / (TP + FN)
- F1 score (harmonic mean of precision and recall): 2 x (P x R)/(P + R)
- Should choose this based on the domain and business needs! Sometimes precision is more important than recall, etc.
- And there's a tradeoff: optimizing for higher precision means lowering recall, etc. Choose a decision threshold by plotting P and R vs. threshold and making a business decision. "Let's do 99% precision." => "At what recall?"
- The ROC curve plots the true positive rate vs. the false positive rate. Compare classifiers with AUC (area under the ROC curve); see the metrics sketch at the end of this section
- When to use which?
- Accuracy if data is balanced and you care about both pos and neg predictions
- Precision/recall curve when positives are rare (or you care more about FP than FN)
- ROC curve and AUC otherwise
- Is the performance measure aligned with the business objective?
- What would be the minimum performance needed to reach the business objective?
- What are comparable problems? Can you reuse experience or tools?
- Is human expertise available?
- How would you solve the problem manually?
- List the assumptions you/others have made so far
- Verify assumptions if possible
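A minimal sketch of the classification metrics above, using scikit-learn on toy arrays (`y_val` and `y_scores` stand in for your validation labels and model scores):

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, precision_recall_curve)

y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1])                       # toy labels
y_scores = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.55, 0.9])  # toy model scores

y_pred = (y_scores >= 0.5).astype(int)          # default 0.5 decision threshold
print(precision_score(y_val, y_pred))           # TP / (TP + FP)
print(recall_score(y_val, y_pred))              # TP / (TP + FN)
print(f1_score(y_val, y_pred))                  # 2PR / (P + R)
print(roc_auc_score(y_val, y_scores))           # area under the ROC curve

# "Let's do 90% precision" => find the threshold, then report the recall you get there.
precisions, recalls, thresholds = precision_recall_curve(y_val, y_scores)
idx = (precisions[:-1] >= 0.90).argmax()        # first threshold reaching the target
print(thresholds[idx], recalls[idx])
```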
Get the data
Note: Automate as much as possible so you can easily get fresh data!
- List the data you need and how much you need
- Find and document where you can get that data
- Could be in a relational DB or other data store, spread across tables, etc.
- Check how much space it will take
- Check legal obligations, and get authorization if necessary
- Get access authorizations
- Create a workspace (with enough storage space)
- Colab etc
- Get the data
- Put this in a function/script
- Maybe schedule a job to do this at regular intervals
- Convert the data to a format you can easily manipulate (without changing the data itself)
- CSV to pandas DataFrame: `pd.read_csv`
- Ensure sensitive information is deleted or protected (e.g. anonymized)
- Check the size and type of data (time series, sample, geographical, etc.)
- Not to gain insights about the underlying distribution, but rather to understand what kind of data this is
- Column types, categorical values, numerical stats, nulls, etc.
- `df.head`, `.info`, `.value_counts`, `.describe`, `.hist` (w/ `plt.show`)
- These help identify units for numerical values, caps, scaling, and skew
- If your labels have issues (e.g. they are capped, but you need accurate predictions beyond the cap and your model can't learn values it never sees), you can:
- Collect proper labels, or
- Remove the examples from the training and test sets
- Sample a test set, put it aside, and never look at it (no data snooping!)
Seriously, stop here and put aside a test set before you look at the data any further!!!
Test set
Why? You might notice some patterns that lead you to choose a certain model, and your generalization error estimate will be too optimistic. This is known as data snooping bias.
- Shuffling and sampling on every execution will produce a different test set every time, and eventually you will see the whole dataset
- Saving the test set on the first run and reloading it, or setting an RNG seed, doesn't solve the problem of fetching updated data
- Could hash example identifiers and assign to test set if hash value is less than e.g. 20% of maximum hash value
- Random sampling can introduce sampling bias
- Stratified sampling tries to guarantee that the test set is representative of the overall data
- `train_test_split(..., stratify=df[col])`
- Based on domain knowledge you may want to create strata by binning continuous features (but not too many strata, and each one large enough)
- Can add a new feature column, stratify and split test/train, then delete the column (see the sketch below)
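A sketch of the two splitting strategies above, assuming a hypothetical DataFrame `df` with a stable `id` column and a continuous `income` column (the column names are illustrative):

```python
import numpy as np
import pandas as pd
from zlib import crc32
from sklearn.model_selection import train_test_split

# 1) Hash-based split: an example stays in the test set even after fresh data arrives.
def is_in_test_set(identifier, test_ratio=0.2):
    return crc32(np.int64(identifier)) < test_ratio * 2**32

in_test = df["id"].apply(is_in_test_set)
train_set, test_set = df.loc[~in_test], df.loc[in_test]

# 2) Stratified split on a binned copy of a continuous feature.
df["income_cat"] = pd.cut(df["income"],
                          bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                          labels=[1, 2, 3, 4, 5])
train_set, test_set = train_test_split(df, test_size=0.2,
                                       stratify=df["income_cat"], random_state=42)
train_set = train_set.drop(columns="income_cat")   # drop the helper column afterwards
test_set = test_set.drop(columns="income_cat")
```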
Explore and visualize the data to gain insights
Note: Try to get insights from a field expert for these steps.
- Create a copy of the data for exploration (downsample to manageable size if necessary)
- Create a Jupyter notebook to keep a record of the explorations
- Study each attribute and its characteristics:
- Name
- Type (categorical, int/float, bounded/unbounded, text, unstructured, etc)
- % of missing values
- Noisiness and type of noise (stochastic, outliers, rounding errors, etc)
- Usefulness for the task
- Type of distribution (Gaussian, uniform, logarithmic, etc)
- For supervised learning: identify target attributes
- Visualize the data (`df.plot`, `hist`, etc.)
- Study correlations (`df.corr`, `scatter_matrix`) - see the sketch at the end of this section
- How would you solve the problem manually?
- Identify promising transformations
- E.g. nonlinear correlation (1/x, etc), attribute combinations
- Identify extra data that would be useful (go back to "Get the data")
- Document what you have learned
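A quick exploration sketch along these lines; `df`, `target`, `feature_a`, and `feature_b` are placeholders for your own DataFrame and columns:

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

explore = df.copy()                        # explore a copy (downsample if huge)
explore.hist(bins=50, figsize=(12, 8))     # distribution of every numeric attribute
plt.show()

corr = explore.corr(numeric_only=True)     # only captures linear correlations!
print(corr["target"].sort_values(ascending=False))

scatter_matrix(explore[["target", "feature_a", "feature_b"]], figsize=(12, 8))
plt.show()
```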
Prepare the data for ML algorithms
Note!
- Work on copies of the data (not original dataset)
- Write functions for all transformations, for five reasons:
- Easily handle fresh/new datasets
- Can apply as a library in future projects
- Clean and prepare the test set
- Clean and prepare live data (new data instances) in production
- Easily treat choices as hyperparameters (find best combinations, etc)
Steps:
- Revert to a clean training set and separate predictors from labels
- Clean the data
- Fix or remove outliers (optional)
- Fill in missing values (imputation w/zero, mean, median, etc.: `fillna`), drop the instances (`dropna`), or even drop the entire attribute (`drop`)
- Note that you might need to e.g. impute values for attributes in the live data that didn't need it at training time, so build your transformers to handle this possibility
- `SimpleImputer`, `KNNImputer`, `IterativeImputer`
- Feature selection (optional)
- Drop attributes with no useful information for the task
- Feature engineering
- Bucket/discretize continuous features
- e.g. `cut`, `KBinsDiscretizer`
- If multimodal, treat the cut bucket as categorical. E.g. home age -> generation bucket (not comparable w.r.t. home value) -> generation as a categorical attribute
- Another option: RBF (`rbf_kernel(data, compare_to, gamma=...)`) can compare data to a specific set of fixed values (e.g. metro areas, generation medians, etc.) and return the similarity to those fixed points (see the sketch after this list)
- Decompose features (e.g. categorical via one-hot encoding, date/time, etc.)
- `df.apply` with custom functions
- `OneHotEncoder` will remember the categories it was trained on - useful for prod (rather than `pd.get_dummies`)
)- Large number of categories for an attribute -> can slow down training and prediction. Maybe replace the attribute with a useful numerical feature?
- Options: split it up into other data (country code -> GDP, population, etc.?), `category_encoders` from scikit-learn-contrib, or an embedding in a neural network. This is representation learning
- Add promising transformations of features (based on "explore the data"), e.g. log, sqrt, polynomial, inverse, etc
- Aggregate features into new features (feature crosses): `PolynomialFeatures`, custom transformations
- Feature scaling: ML algorithms don't perform well when the input numerical attributes have very different scales. Only fit to the training data!!
- Normalization: min/max scaling. `MinMaxScaler`
- Standardization: (val - mean)/stdev, i.e. value -> # of stdevs away from the mean. `StandardScaler`
- What if the feature distribution has a heavy tail (i.e. values far from the mean aren't exponentially rare)? First transform (sqrt, log) the feature, then scale it
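A sketch of the bucketing and RBF-similarity options from the feature-engineering step above, on a hypothetical `housing` DataFrame with a `housing_median_age` column:

```python
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import KBinsDiscretizer

ages = housing[["housing_median_age"]]     # 2D: a one-column DataFrame

# Option 1: bin the continuous feature and treat each bucket as a category.
binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile")
age_buckets = binner.fit_transform(ages)

# Option 2: RBF similarity to a fixed anchor value (e.g. a 35-year mode);
# gamma controls how quickly the similarity decays with distance.
age_similarity_35 = rbf_kernel(ages, [[35]], gamma=0.1)
```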
Note that you may need to transform the target values too! E.g. log. But then you'll need to `inverse_transform` the scaled predictions to get your final predictions (or use `TransformedTargetRegressor`).
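A minimal `TransformedTargetRegressor` sketch, assuming strictly positive targets and placeholder `X_train`/`y_train`/`X_test`:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,          # applied to y before fitting (targets must be positive)
    inverse_func=np.exp,  # applied to predictions automatically
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)   # already back on the original scale
```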
New transformers
- If it doesn't require training:
- write functions f [and g = f^-1]: np.ndarray -> np.ndarray
- build `FunctionTransformer(f, inverse_func=g)` (the inverse is optional). E.g. combining features w/ a ratio, etc.
- Else, write a class:
- implement `fit` (sets `feature_names_in_`, returns `self`), `transform`, and `fit_transform` (or inherit from `TransformerMixin` to get the latter)
- inherit from `BaseEstimator` to get `get_params` and `set_params` and support automatic hyperparameter tuning
- implement `get_feature_names_out` and `inverse_transform`
- use `check_array` and `check_is_fitted` from `sklearn.utils.validation` in `transform`
You can use other estimators in the implementation!
Use `check_estimator` to make sure you respect the sklearn API!
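A sketch of both options: a stateless `FunctionTransformer` and a trainable custom class (the `SimpleStandardizer` name and its behavior are illustrative, not a library class):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer
from sklearn.utils.validation import check_array, check_is_fitted

# 1) Stateless transformation: wrap plain functions (inverse is optional).
log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)

# 2) Trainable transformation: a toy standardizer written as a custom class.
class SimpleStandardizer(BaseEstimator, TransformerMixin):  # TransformerMixin adds fit_transform
    def __init__(self, with_mean=True):   # plain keyword args so get_params/set_params work
        self.with_mean = with_mean

    def fit(self, X, y=None):              # y is required by the API even if unused
        X = check_array(X)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]
        return self                        # always return self

    def transform(self, X):
        check_is_fitted(self)              # looks for learned attributes ending in "_"
        X = check_array(X)
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_
```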
Transformation Pipelines
- To visualize: `sklearn.set_config(display="diagram")`
- To chain transformers: `sklearn.pipeline.make_pipeline(*transformers)`
- `Pipeline` has the same interface as the final estimator
- `fit` calls and propagates results from all the `fit_transform` methods (but only calls `fit` on the last one)
- `transform` and `predict` are similar (but call that method on the last one)
- To handle different columns separately, use `sklearn.compose.make_column_transformer(...)`: it takes a sequence of (transformer, columns) tuples and applies each transformer to its specified columns (or select columns with `make_column_selector`)
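A sketch of a preprocessing pipeline built this way, assuming `num_cols`/`cat_cols` list your numerical and categorical columns and `X_train`/`X_test` are the usual splits:

```python
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))

preprocessing = make_column_transformer(
    (num_pipeline, num_cols),   # (transformer, columns) tuples
    (cat_pipeline, cat_cols),
)

X_train_prepared = preprocessing.fit_transform(X_train)   # fit on the training set only
X_test_prepared = preprocessing.transform(X_test)         # then only transform everything else
```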
Select a model and train it
Note:
- If the data is huge, sample smaller training sets. This will penalize complex models though!
- Automate!!
Steps:
- Train many quick-and-dirty models from different categories (e.g. linear, naive Bayes, SVM, random forest, neural net, etc) using standard parameters
- Measure and compare their performance
- Use N-fold cross-validation and compute the mean/stdev of the perf measure across the folds (see the sketch at the end of this section)
- `cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=10)` uses 10 subsets (folds), trains w/ 9 and validates with 1, and does this 10 times
- Useful plots: `PrecisionRecallDisplay`, `RocCurveDisplay`, `DetCurveDisplay`
- Analyze the most significant variables of each algorithm
- hyperparameters: scorers, learners, regularization, cross-validation folds, etc
- Analyze the types of errors the models could make / do make
- What data would a human have used to avoid these errors?
- Do we need to gather more training data? Do data augmentation?
- Consider specific slices of validation sets for relevant subgroups (e.g. disadvantaged groups, specific metros, etc)
- Quick round of feature selection and engineering
- Which attributes have the highest weights/importances?
- `df.corr` and `scatter_matrix`
- `feature_importances_` for trees
- Maybe remove some, add new ones, etc. E.g. `sklearn.feature_selection`
- Repeat the above one or two more times - quickly
- Shortlist the top three to five promising models, preferring models that make different types of errors
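A sketch of the cross-validation comparison described above, reusing the hypothetical `preprocessing` transformer and `X_train`/`y_train` from the earlier sketches:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(random_state=42),
}
for name, model in candidates.items():
    pipeline = make_pipeline(preprocessing, model)
    # 10-fold CV: train on 9 folds, validate on the 10th, repeat 10 times.
    rmse = -cross_val_score(pipeline, X_train, y_train,
                            scoring="neg_root_mean_squared_error", cv=10)
    print(f"{name}: mean RMSE {rmse.mean():.3f} (+/- {rmse.std():.3f})")
```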
Fine-tune your model
Use as much data as possible - and automate!
- Fine-tune the hyperparameters using cross-validation
- Treat data transformation choices as hyperparameters (e.g. impute with zero or median, or drop, etc)
- Prefer `RandomizedSearchCV` over `GridSearchCV` unless there are very few options to test. Consider Bayesian optimization if the search takes very long (see the sketch at the end of this section)
- Try ensemble methods - combining best models can produce better results
- Pick best model
- compare cross-validation scores
- get the tuned model and its score via `best_estimator_` and `best_score_` on the fitted `*SearchCV` object
- Estimate generalization error by measuring performance on the test set
- `metrics.accuracy_score`, `mean_squared_error(squared=False)`
- This will likely be worse than your cross-validation error, since you fine-tuned to the training set
- Don't go tweak hyperparameters to make this number look good - it won't generalize, you'll overfit the test set!
- Instead, go back to the beginning, create a new test set, reevaluate all your decisions, and build new models
- But doing this too many times while splitting train/test from the same dataset is still overfitting: with enough splits you'll eventually "see" all the data
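A sketch of random search over a pipeline plus a one-time test-set evaluation; the step names, hyperparameter ranges, and `X_train`/`y_train`/`X_test`/`y_test` are illustrative:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([("preprocessing", preprocessing),       # from the earlier sketch
                     ("forest", RandomForestRegressor(random_state=42))])

param_distributions = {                     # illustrative search space
    "forest__n_estimators": randint(100, 500),
    "forest__max_depth": randint(3, 30),
}
search = RandomizedSearchCV(pipeline, param_distributions, n_iter=20, cv=5,
                            scoring="neg_root_mean_squared_error", random_state=42)
search.fit(X_train, y_train)

best_model = search.best_estimator_         # already refit on the whole training set
print(-search.best_score_)                  # best cross-validation RMSE

# One-time estimate of the generalization error on the untouched test set.
final_rmse = mean_squared_error(y_test, best_model.predict(X_test), squared=False)
print(final_rmse)
```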
Present your solution
- Document what you have done
- Create a nice presentation - highlight the big picture first
- Explain why your solution achieves the business objective
- Present interesting points noticed along the way
- What worked and what didn't?
- List your assumptions and your system's limitations
- Ensure the key findings are communicated through beautiful visualizations and easy-to-remember statements (e.g. "the median income is the #1 predictor of housing prices")
Launch, monitor, and maintain your system
- Get your solution ready for production
- Save the best model: `joblib.dump` (see the sketch at the end of this section)
- How to use it in production?
- Load it with `joblib.load` and make predictions directly in your app
- Split the loading and predictions into a prediction service/API
- Upload to a cloud option (e.g. Google Vertex AI) that you call via API
- Plug into production data inputs, write unit tests, etc
- Write monitoring code to check the system's live performance at regular intervals and trigger alerts when it drops:
- Can monitor via downstream metrics (e.g. product sales)
- May require a human eval pipeline / human raters (crowdsourcing) to monitor performance directly
- Also monitor the input data quality! E.g. a sensor sending bad/random values, an upstream data dependency going stale, mean/stdev drifting too far from the training set, more missing values than expected, new categorical values, etc. Particularly important for online learning systems!
- Beware of slow degradation - models tend to "rot" as data evolves
- Retrain the models on a regular basis on fresh data - AUTOMATE!
- Schedule collecting fresh data and label it (e.g. w/human raters again)
- Schedule re-train: load fresh data, train model, fine-tune hyperparameters
- Schedule conditional deployment: compare live model performance with new model, deploy if not worse. If worse, alert and investigate why!
- Keep backups of every model and every dataset version and support easy rollbacks!
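A sketch of saving and reloading the model with joblib; the file name and `new_instances` DataFrame are illustrative:

```python
import joblib

joblib.dump(best_model, "my_model.pkl")       # at the end of training

# ... later, in the app or prediction service:
model = joblib.load("my_model.pkl")
predictions = model.predict(new_instances)    # new_instances: same columns as the training data
```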