# Data science & machine learning
Reusable, **project-agnostic** blueprint for **data science and machine learning** — the discipline of extracting knowledge from data and building predictive or generative models that create value.
Data science answers "how do we extract knowledge and build predictive models from data?" — a question that bridges PDLC hypothesis validation (P2) with SDLC model engineering (Build/Verify) and production model management (Operate).
| Document | Purpose |
|---|---|
| DATA-SCIENCE.md | ML lifecycle, statistical foundations, experiment design, model evaluation, responsible AI, competencies |
| Data Science ↔ SDLC ↔ PDLC bridge | How data science maps across SDLC phases A–F and PDLC phases P1–P6 — discovery, training, validation, monitoring |
| approaches/ | Methodological approaches: CRISP-DM, MLOps maturity model, experiment management, A/B testing |
| techniques/ | ML technique catalog: supervised, unsupervised, deep learning, NLP, computer vision, time series, recommender systems |
## Relationship to other packages
| Package | How Data Science relates |
|---|---|
| SDLC blueprint | ML models are software artifacts — they go through SDLC phases (design, build, test, deploy). Model training happens in Build; validation in Verify; model serving in Release. ML adds unique concerns (data versioning, experiment tracking, model registry) that extend the standard SDLC. |
| Product development lifecycle (PDLC) | PDLC P2 (Validate) may use ML prototypes for hypothesis testing. P5 (Grow) uses ML models for personalization, recommendation, and prediction. ML model performance metrics feed PDLC outcome measurement. |
| Business analysis (BA) | BA defines the business problems that ML may solve. Business understanding (CRISP-DM phase 1) maps directly to BA Strategy Analysis. BA acceptance criteria may include model performance thresholds. |
| Testing & quality assurance | ML models require specialized testing — data validation, model performance testing, fairness testing, adversarial testing. Traditional software testing (unit, integration) still applies to ML infrastructure code. |
| DevOps | MLOps extends DevOps for ML — CI/CD for models, model monitoring, automated retraining, model registry. |
| Big data & data engineering | Data engineering provides the data infrastructure that data science depends on — feature stores, training data pipelines, data quality, and governance. |
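To make the experiment-tracking and data-versioning concerns above concrete, here is a minimal sketch of the kind of run record tools like MLflow or DVC persist: parameters, metrics, and a data version tied together under a reproducible run identifier. This is an illustration using only the standard library — `log_run` and its fields are hypothetical names, not any real tracking API.

```python
# Hedged sketch: a minimal experiment-tracking record. The function and field
# names are illustrative assumptions, not a real library's API.
import hashlib
import json
import time


def log_run(params: dict, metrics: dict, data_version: str) -> dict:
    """Create a run record tying hyperparameters and metrics to a data version.

    The run_id is a content hash of params + metrics + data_version, so the
    same experiment inputs always map to the same identifier (timestamp is
    recorded but deliberately excluded from the hash).
    """
    record = {
        "params": params,
        "metrics": metrics,
        "data_version": data_version,  # e.g. a dataset snapshot/commit id
        "timestamp": time.time(),
    }
    payload = json.dumps(
        {k: record[k] for k in ("params", "metrics", "data_version")},
        sort_keys=True,
    )
    record["run_id"] = hashlib.sha256(payload.encode()).hexdigest()[:12]
    return record


run = log_run({"lr": 0.01, "epochs": 20}, {"auc": 0.91}, data_version="v42")
print(run["run_id"])  # stable 12-char id for these exact inputs
```

Real MLOps stacks add much more (artifact storage, model registry stages, lineage), but the core idea is the same: every trained model should be traceable to the code, parameters, and data that produced it.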
## Scope
This package covers data science as a discipline — not just model training notebooks. It includes:
- ML lifecycle — problem framing, data understanding, feature engineering, model training, evaluation, deployment, monitoring
- Statistical foundations — hypothesis testing, experimental design, causal inference
- Model evaluation — metrics selection, cross-validation, bias-variance trade-off, model comparison
- MLOps — model versioning, experiment tracking, automated retraining, model serving, monitoring
- Responsible AI — fairness, explainability, privacy, bias detection, governance
- Experiment management — A/B testing, multi-armed bandits, online experimentation platforms
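As a concrete instance of the model-evaluation bullet above, the sketch below implements k-fold cross-validation from scratch: split the data into k folds, hold each fold out in turn, train on the rest, and average the held-out accuracy. The toy 1-D dataset and the nearest-class-mean "model" are illustrative stand-ins (in practice you would use a library such as scikit-learn).

```python
# Minimal k-fold cross-validation sketch, pure standard library.
# The nearest-mean classifier and toy data are assumptions for illustration.
import random


def k_fold_indices(n: int, k: int, seed: int = 0) -> list[list[int]]:
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def nearest_mean_fit(X: list[float], y: list[int]) -> dict[int, float]:
    """'Train' by computing the mean feature value per class."""
    means = {}
    for label in set(y):
        vals = [x for x, lbl in zip(X, y) if lbl == label]
        means[label] = sum(vals) / len(vals)
    return means


def nearest_mean_predict(means: dict[int, float], x: float) -> int:
    """Predict the class whose mean is closest to x."""
    return min(means, key=lambda label: abs(x - means[label]))


def cross_validate(X: list[float], y: list[int], k: int = 5) -> float:
    """Average held-out accuracy over k folds."""
    accuracies = []
    for fold in k_fold_indices(len(X), k):
        held_out = set(fold)
        train = [i for i in range(len(X)) if i not in held_out]
        model = nearest_mean_fit([X[i] for i in train], [y[i] for i in train])
        correct = sum(nearest_mean_predict(model, X[i]) == y[i] for i in fold)
        accuracies.append(correct / len(fold))
    return sum(accuracies) / len(accuracies)


# Toy, well-separated data: class 0 clusters near 1.0, class 1 near 5.0.
X = [0.9, 1.1, 1.0, 0.8, 1.2, 4.9, 5.1, 5.0, 4.8, 5.2]
y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(cross_validate(X, y, k=5))  # separable data, so accuracy is 1.0
```

The point of holding folds out is to estimate generalization rather than training fit — the same pattern underlies the bias-variance and model-comparison topics listed above.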
Reference bodies of knowledge: CRISP-DM, MLOps maturity model (Google), Responsible AI practices.
Keep project-specific ML documentation in docs/architecture/ (model architecture decisions) and docs/product/ (ML feature descriptions), not in this file.
## Canonical source
Edit https://github.com/autowww/blueprints/blob/main/disciplines/data/data-science/README.md first; regenerate with docs/build-handbook.py.