Notes: ML Project Checklist
Appendix A: "Machine Learning Project Checklist" from the book Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow by Aurélien Géron, annotated with details and suggestions from other chapters as well as my own experience. (Caveat: this experience is currently limited to writing notebooks for Kaggle competitions :))
Project Checklist
- Look at the big picture
- Get the data
- Explore and visualize the data to gain insights
- Prepare the data for ML algorithms
- Select a model and train it
- Fine-tune your model
- Present your solution
- Launch, monitor, and maintain your system
Look at the big picture
- Define the problem in business terms
- How will the solution be used?
- What are the current solutions/workarounds (if any)?
- How should you frame this problem?
- Supervised, unsupervised, semi-supervised, self-supervised, reinforcement learning?
- Classification, regression, or something else?
- Single or multiple regression (i.e. one or many input features)? Univariate or multivariate regression (i.e. one or many outputs)?
- Batch or online learning?
- Batch or online predictions?
- Interpretability vs. accuracy?
- How should performance be measured?
- RMSE generally preferred for regression
- Classification:
- Accuracy is common, but not great for skewed datasets
- Look at the confusion matrix: true pos, true neg, false pos, false neg
- Precision and recall
- Precision (accuracy of positive predictions): TP / (TP + FP)
- Recall (how many actual positives were detected): TP / (TP + FN)
- F1 score (harmonic mean of precision and recall): 2 x (P x R)/(P + R)
- Should choose this based on the domain and business needs! Sometimes precision is more important than recall, etc.
- And there's a tradeoff: optimizing for higher precision means lowering recall, etc. Choose a decision threshold by plotting P and R vs. threshold and making a business decision. "Let's do 99% precision." => "At what recall?"
- The ROC curve plots the true positive rate vs. the false positive rate. Compare classifiers with AUC (area under the ROC curve); see the metrics sketch at the end of this section
- When to use which?
- Accuracy if data is balanced and you care about both pos and neg predictions
- Precision/recall curve when positives are rare (or you care more about FP than FN)
- ROC curve and AUC otherwise
- Is the performance measure aligned with the business objective?
- What would be the minimum performance needed to reach the business objective?
- What are comparable problems? Can you reuse experience or tools?
- Is human expertise available?
- How would you solve the problem manually?
- List the assumptions you/others have made so far
- Verify assumptions if possible
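A minimal sketch of the classification metrics above, using scikit-learn on toy arrays (`y_val` and `y_scores` stand in for your validation labels and model scores):

```python
import numpy as np
from sklearn.metrics import (precision_score, recall_score, f1_score,
                             roc_auc_score, precision_recall_curve)

y_val = np.array([0, 0, 1, 0, 1, 1, 0, 1])                       # toy labels
y_scores = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.7, 0.55, 0.9])  # toy model scores

y_pred = (y_scores >= 0.5).astype(int)          # default 0.5 decision threshold
print(precision_score(y_val, y_pred))           # TP / (TP + FP)
print(recall_score(y_val, y_pred))              # TP / (TP + FN)
print(f1_score(y_val, y_pred))                  # 2PR / (P + R)
print(roc_auc_score(y_val, y_scores))           # area under the ROC curve

# "Let's do 90% precision" => find the threshold, then report the recall you get there.
precisions, recalls, thresholds = precision_recall_curve(y_val, y_scores)
idx = (precisions[:-1] >= 0.90).argmax()        # first threshold reaching the target
print(thresholds[idx], recalls[idx])
```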
Get the data
Note: Automate as much as possible so you can easily get fresh data!
- List the data you need and how much you need
- Find and document where you can get that data
- Could be in a relational DB or other data store, spread across tables, etc.
- Check how much space it will take
- Check legal obligations, and get authorization if necessary
- Get access authorizations
- Create a workspace (with enough storage space)
- Colab etc
- Get the data
- Put this in a function/script
- Maybe schedule a job to do this at regular intervals
- Convert the data to a format you can easily manipulate (without changing the data itself)
- CSV to pandas DataFrame: `pd.read_csv`
- Ensure sensitive information is deleted or protected (e.g. anonymized)
- Check the size and type of data (time series, sample, geographical, etc.)
- Not to gain insights about the underlying distribution, but rather to understand what kind of data this is
- Column types, categorical values, numerical stats, nulls, etc.
- `df.head`, `.info`, `.value_counts`, `.describe`, `.hist` (w/ `plt.show`)
- These help identify units for numerical values, caps, scaling, and skew
- If your labels have issues (e.g. they are capped, but you need accurate predictions beyond the cap and your model can't learn values it never sees), you can:
- Collect proper labels, or
- Remove the examples from the training and test sets
- Sample a test set, put it aside, and never look at it (no data snooping!)
Seriously, stop here and put aside a test set before you look at the data any further!!!
Test set
Why? You might notice some patterns that lead you to choose a certain model, and your generalization error estimate will be too optimistic. This is known as data snooping bias.
- Shuffling and sampling on every execution will produce a different test set every time, and eventually you will see the whole dataset
- Saving the test set on the first run and reloading it, or setting an RNG seed, doesn't solve the problem of fetching updated data
- Could hash example identifiers and assign to test set if hash value is less than e.g. 20% of maximum hash value
- Random sampling can introduce sampling bias
- Stratified sampling tries to guarantee that the test set is representative of the overall data
- `train_test_split(..., stratify=df[col])`
- Based on domain knowledge you may want to create strata by binning continuous features (but not too many strata, and each one large enough)
- Can add a new feature column, stratify and split test/train, then delete the column (see the sketch below)
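A sketch of the two splitting strategies above, assuming a hypothetical DataFrame `df` with a stable `id` column and a continuous `income` column (the column names are illustrative):

```python
import numpy as np
import pandas as pd
from zlib import crc32
from sklearn.model_selection import train_test_split

# 1) Hash-based split: an example stays in the test set even after fresh data arrives.
def is_in_test_set(identifier, test_ratio=0.2):
    return crc32(np.int64(identifier)) < test_ratio * 2**32

in_test = df["id"].apply(is_in_test_set)
train_set, test_set = df.loc[~in_test], df.loc[in_test]

# 2) Stratified split on a binned copy of a continuous feature.
df["income_cat"] = pd.cut(df["income"],
                          bins=[0.0, 1.5, 3.0, 4.5, 6.0, np.inf],
                          labels=[1, 2, 3, 4, 5])
train_set, test_set = train_test_split(df, test_size=0.2,
                                       stratify=df["income_cat"], random_state=42)
train_set = train_set.drop(columns="income_cat")   # drop the helper column afterwards
test_set = test_set.drop(columns="income_cat")
```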
Explore and visualize the data to gain insights
Note: Try to get insights from a field expert for these steps.
- Create a copy of the data for exploration (downsample to manageable size if necessary)
- Create a Jupyter notebook to keep a record of the explorations
- Study each attribute and its characteristics:
- Name
- Type (categorical, int/float, bounded/unbounded, text, unstructured, etc)
- % of missing values
- Noisiness and type of noise (stochastic, outliers, rounding errors, etc)
- Usefulness for the task
- Type of distribution (Gaussian, uniform, logarithmic, etc)
- For supervised learning: identify target attributes
- Visualize the data (`df.plot`, `hist`, etc.)
- Study correlations (`df.corr`, `scatter_matrix`) - see the sketch at the end of this section
- How would you solve the problem manually?
- Identify promising transformations
- E.g. nonlinear correlation (1/x, etc), attribute combinations
- Identify extra data that would be useful (go back to "Get the data")
- Document what you have learned
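A quick exploration sketch along these lines; `df`, `target`, `feature_a`, and `feature_b` are placeholders for your own DataFrame and columns:

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

explore = df.copy()                        # explore a copy (downsample if huge)
explore.hist(bins=50, figsize=(12, 8))     # distribution of every numeric attribute
plt.show()

corr = explore.corr(numeric_only=True)     # only captures linear correlations!
print(corr["target"].sort_values(ascending=False))

scatter_matrix(explore[["target", "feature_a", "feature_b"]], figsize=(12, 8))
plt.show()
```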
Prepare the data for ML algorithms
Note!
- Work on copies of the data (not original dataset)
- Write functions for all transformations, for five reasons:
- Easily handle fresh/new datasets
- Can apply as a library in future projects
- Clean and prepare the test set
- Clean and prepare live data (new data instances) in production
- Easily treat choices as hyperparameters (find best combinations, etc)
Steps:
- Revert to a clean training set and separate predictors from labels
- Clean the data
- Fix or remove outliers (optional)
- Fill in missing values (imputation w/zero, mean, median, etc.: `fillna`), drop the instances (`dropna`), or even drop the entire attribute (`drop`)
- Note that you might need to e.g. impute values for attributes in the live data that didn't need it at training time, so build your transformers to handle this possibility
- `SimpleImputer`, `KNNImputer`, `IterativeImputer`
- Feature selection (optional)
- Drop attributes with no useful information for the task
- Feature engineering
- Bucket/discretize continuous features
- e.g. `cut`, `KBinsDiscretizer`
- If multimodal, treat the cut bucket as categorical. E.g. home age -> generation bucket (not comparable w.r.t. home value) -> generation as a categorical attribute
- Another option: RBF (`rbf_kernel(data, compare_to, gamma=...)`) can compare data to a specific set of fixed values (e.g. metro areas, generation medians, etc.) and return the similarity to those fixed points (see the sketch after this list)
- Decompose features (e.g. categorical via one-hot encoding, date/time, etc.)
- `df.apply` with custom functions
- `OneHotEncoder` will remember the categories it was trained on - useful for prod (rather than `pd.get_dummies`)
)- Large number of categories for an attribute -> can slow down training and prediction. Maybe replace the attribute with a useful numerical feature?
- Options: split it up into other data (country code -> GDP, population, etc.?), `category_encoders` from scikit-learn-contrib, or an embedding in a neural network. This is representation learning
- Add promising transformations of features (based on "explore the data"), e.g. log, sqrt, polynomial, inverse, etc
- Aggregate features into new features (feature crosses): `PolynomialFeatures`, custom transformations
- Feature scaling: ML algorithms don't perform well when the input numerical attributes have very different scales. Only fit to the training data!!
- Normalization: min/max scaling. `MinMaxScaler`
- Standardization: (val - mean)/stdev, i.e. value -> # of stdevs away from the mean. `StandardScaler`
- What if the feature distribution has a heavy tail (i.e. values far from the mean aren't exponentially rare)? First transform (sqrt, log) the feature, then scale it
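A sketch of the bucketing and RBF-similarity options from the feature-engineering step above, on a hypothetical `housing` DataFrame with a `housing_median_age` column:

```python
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.preprocessing import KBinsDiscretizer

ages = housing[["housing_median_age"]]     # 2D: a one-column DataFrame

# Option 1: bin the continuous feature and treat each bucket as a category.
binner = KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile")
age_buckets = binner.fit_transform(ages)

# Option 2: RBF similarity to a fixed anchor value (e.g. a 35-year mode);
# gamma controls how quickly the similarity decays with distance.
age_similarity_35 = rbf_kernel(ages, [[35]], gamma=0.1)
```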
Note that you may need to transform the target values too! E.g. log. But then you'll need to `inverse_transform` the scaled predictions to get your final predictions (or use `TransformedTargetRegressor`).
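A minimal `TransformedTargetRegressor` sketch, assuming strictly positive targets and placeholder `X_train`/`y_train`/`X_test`:

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

model = TransformedTargetRegressor(
    regressor=LinearRegression(),
    func=np.log,          # applied to y before fitting (targets must be positive)
    inverse_func=np.exp,  # applied to predictions automatically
)
model.fit(X_train, y_train)
predictions = model.predict(X_test)   # already back on the original scale
```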
New transformers
- If it doesn't require training:
- write functions f [and g = f^-1]: np.ndarray -> np.ndarray
- build `FunctionTransformer(f, inverse_func=g)` (the inverse is optional). E.g. combining features w/ a ratio, etc.
- Else, write a class:
- implement `fit` (sets `feature_names_in_`, returns `self`), `transform`, and `fit_transform` (or inherit from `TransformerMixin` to get the latter)
- inherit from `BaseEstimator` to get `get_params` and `set_params` and support automatic hyperparameter tuning
- implement `get_feature_names_out` and `inverse_transform`
- use `check_array` and `check_is_fitted` from `sklearn.utils.validation` in `transform`
You can use other estimators in the implementation!
Use `check_estimator` to make sure you respect the sklearn API!
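A sketch of both options: a stateless `FunctionTransformer` and a trainable custom class (the `SimpleStandardizer` name and its behavior are illustrative, not a library class):

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer
from sklearn.utils.validation import check_array, check_is_fitted

# 1) Stateless transformation: wrap plain functions (inverse is optional).
log_transformer = FunctionTransformer(np.log, inverse_func=np.exp)

# 2) Trainable transformation: a toy standardizer written as a custom class.
class SimpleStandardizer(BaseEstimator, TransformerMixin):  # TransformerMixin adds fit_transform
    def __init__(self, with_mean=True):   # plain keyword args so get_params/set_params work
        self.with_mean = with_mean

    def fit(self, X, y=None):              # y is required by the API even if unused
        X = check_array(X)
        self.mean_ = X.mean(axis=0)
        self.scale_ = X.std(axis=0)
        self.n_features_in_ = X.shape[1]
        return self                        # always return self

    def transform(self, X):
        check_is_fitted(self)              # looks for learned attributes ending in "_"
        X = check_array(X)
        if self.with_mean:
            X = X - self.mean_
        return X / self.scale_
```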
Transformation Pipelines
- To visualize: `sklearn.set_config(display="diagram")`
- To chain transformers: `sklearn.pipeline.make_pipeline(*transformers)`
- `Pipeline` has the same interface as the final estimator
- `fit` calls and propagates results from all the `fit_transform` methods (but only calls `fit` on the last one)
- `transform` and `predict` are similar (but call that method on the last one)
- To handle different columns separately, use `sklearn.compose.make_column_transformer(...)`: it takes a sequence of (transformer, columns) tuples and applies each transformer to its specified columns (or select columns with `make_column_selector`)
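A sketch of a preprocessing pipeline built this way, assuming `num_cols`/`cat_cols` list your numerical and categorical columns and `X_train`/`X_test` are the usual splits:

```python
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

num_pipeline = make_pipeline(SimpleImputer(strategy="median"), StandardScaler())
cat_pipeline = make_pipeline(SimpleImputer(strategy="most_frequent"),
                             OneHotEncoder(handle_unknown="ignore"))

preprocessing = make_column_transformer(
    (num_pipeline, num_cols),   # (transformer, columns) tuples
    (cat_pipeline, cat_cols),
)

X_train_prepared = preprocessing.fit_transform(X_train)   # fit on the training set only
X_test_prepared = preprocessing.transform(X_test)         # then only transform everything else
```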
Select a model and train it
Note:
- If the data is huge, sample smaller training sets. This will penalize complex models though!
- Automate!!
Steps:
- Train many quick-and-dirty models from different categories (e.g. linear, naive Bayes, SVM, random forest, neural net, etc) using standard parameters
- Measure and compare their performance
- Use N-fold cross-validation and compute the mean/stdev of the perf measure across the folds (see the sketch at the end of this section)
- `cross_val_score(model, X, y, scoring="neg_root_mean_squared_error", cv=10)` uses 10 subsets (folds), trains w/ 9 and validates with 1, and does this 10 times
- Useful plots: `PrecisionRecallDisplay`, `RocCurveDisplay`, `DetCurveDisplay`
- Analyze the most significant variables of each algorithm
- hyperparameters: scorers, learners, regularization, cross-validation folds, etc
- Analyze the types of errors the models could make / do make
- What data would a human have used to avoid these errors?
- Do we need to gather more training data? Do data augmentation?
- Consider specific slices of validation sets for relevant subgroups (e.g. disadvantaged groups, specific metros, etc)
- Quick round of feature selection and engineering
- Which attributes have the highest weights/importances?
- `df.corr` and `scatter_matrix`
- `feature_importances_` for trees
- Maybe remove some, add new ones, etc. E.g. `sklearn.feature_selection`
- Repeat the above one or two more times - quickly
- Shortlist the top three to five promising models, preferring models that make different types of errors
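A sketch of the cross-validation comparison described above, reusing the hypothetical `preprocessing` transformer and `X_train`/`y_train` from the earlier sketches:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

candidates = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(random_state=42),
}
for name, model in candidates.items():
    pipeline = make_pipeline(preprocessing, model)
    # 10-fold CV: train on 9 folds, validate on the 10th, repeat 10 times.
    rmse = -cross_val_score(pipeline, X_train, y_train,
                            scoring="neg_root_mean_squared_error", cv=10)
    print(f"{name}: mean RMSE {rmse.mean():.3f} (+/- {rmse.std():.3f})")
```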
Fine-tune your model
Use as much data as possible - and automate!
- Fine-tune the hyperparameters using cross-validation
- Treat data transformation choices as hyperparameters (e.g. impute with zero or median, or drop, etc)
- Prefer `RandomizedSearchCV` over `GridSearchCV` unless there are very few options to test. Consider Bayesian optimization if the search takes very long (see the sketch at the end of this section)
- Try ensemble methods - combining best models can produce better results
- Pick best model
- compare cross-validation scores
- get the tuned model and its score via `best_estimator_` and `best_score_` on the fitted `*SearchCV` object
- Estimate generalization error by measuring performance on the test set
- `metrics.accuracy_score`, `mean_squared_error(squared=False)`
- This will likely be worse than your cross-validation error, since you fine-tuned to the training set
- Don't go tweak hyperparameters to make this number look good - it won't generalize, you'll overfit the test set!
- Instead, go back to the beginning, create a new test set, reevaluate all your decisions, and build new models
- But doing this too many times while splitting train/test from the same dataset is still overfitting: with enough splits you'll eventually "see" all the data
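A sketch of random search over a pipeline plus a one-time test-set evaluation; the step names, hyperparameter ranges, and `X_train`/`y_train`/`X_test`/`y_test` are illustrative:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([("preprocessing", preprocessing),       # from the earlier sketch
                     ("forest", RandomForestRegressor(random_state=42))])

param_distributions = {                     # illustrative search space
    "forest__n_estimators": randint(100, 500),
    "forest__max_depth": randint(3, 30),
}
search = RandomizedSearchCV(pipeline, param_distributions, n_iter=20, cv=5,
                            scoring="neg_root_mean_squared_error", random_state=42)
search.fit(X_train, y_train)

best_model = search.best_estimator_         # already refit on the whole training set
print(-search.best_score_)                  # best cross-validation RMSE

# One-time estimate of the generalization error on the untouched test set.
final_rmse = mean_squared_error(y_test, best_model.predict(X_test), squared=False)
print(final_rmse)
```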
Present your solution
- Document what you have done
- Create a nice presentation - highlight the big picture first
- Explain why your solution achieves the business objective
- Present interesting points noticed along the way
- What worked and what didn't?
- List your assumptions and your system's limitations
- Ensure the key findings are communicated through beautiful visualizations and easy-to-remember statements (e.g. "the median income is the #1 predictor of housing prices")
Launch, monitor, and maintain your system
- Get your solution ready for production
- Save the best model: `joblib.dump` (see the sketch at the end of this section)
- How to use it in production?
- Load it with `joblib.load` and make predictions directly in your app
- Split the loading and predictions into a prediction service/API
- Upload to a cloud option (e.g. Google Vertex AI) that you call via API
- Plug into production data inputs, write unit tests, etc
- Write monitoring code to check the system's live performance at regular intervals and trigger alerts when it drops:
- Can monitor via downstream metrics (e.g. product sales)
- May require a human eval pipeline / human raters (crowdsourcing) to monitor performance directly
- Also monitor the input data quality! E.g. a sensor sending bad/random values, an upstream data dependency going stale, mean/stdev drifting too far from the training set, more missing values than expected, new categorical values, etc. Particularly important for online learning systems!
- Beware of slow degradation - models tend to "rot" as data evolves
- Retrain the models on a regular basis on fresh data - AUTOMATE!
- Schedule collecting fresh data and label it (e.g. w/human raters again)
- Schedule re-train: load fresh data, train model, fine-tune hyperparameters
- Schedule conditional deployment: compare live model performance with new model, deploy if not worse. If worse, alert and investigate why!
- Keep backups of every model and every dataset version and support easy rollbacks!
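A sketch of saving and reloading the model with joblib; the file name and `new_instances` DataFrame are illustrative:

```python
import joblib

joblib.dump(best_model, "my_model.pkl")       # at the end of training

# ... later, in the app or prediction service:
model = joblib.load("my_model.pkl")
predictions = model.predict(new_instances)    # new_instances: same columns as the training data
```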