# Data science & machine learning

Reusable, **project-agnostic** blueprint for **data science and machine learning** — the discipline of extracting knowledge from data and building predictive or generative models that create value.

Data science answers "how do we extract knowledge and build predictive models from data?" — a question that bridges PDLC hypothesis validation (P2) with SDLC model engineering (Build/Verify) and production model management (Operate).

| Document | Purpose |
| --- | --- |
| DATA-SCIENCE.md | ML lifecycle, statistical foundations, experiment design, model evaluation, responsible AI, competencies |
| Data Science ↔ SDLC ↔ PDLC bridge | How data science maps across SDLC phases A–F and PDLC phases P1–P6 — discovery, training, validation, monitoring |
| approaches/ | Methodological approaches: CRISP-DM, MLOps maturity model, experiment management, A/B testing |
| techniques/ | ML technique catalog: supervised, unsupervised, deep learning, NLP, computer vision, time series, recommender systems |

## Relationship to other packages

| Package | How Data Science relates |
| --- | --- |
| SDLC blueprint | ML models are software artifacts — they go through SDLC phases (design, build, test, deploy). Model training happens in Build; validation in Verify; model serving in Release. ML adds unique concerns (data versioning, experiment tracking, model registry) that extend the standard SDLC. |
| Product development lifecycle (PDLC) | PDLC P2 (Validate) may use ML prototypes for hypothesis testing. P5 (Grow) uses ML models for personalization, recommendation, and prediction. ML model performance metrics feed PDLC outcome measurement. |
| Business analysis (BA) | BA defines the business problems that ML may solve. Business understanding (CRISP-DM phase 1) maps directly to BA Strategy Analysis. BA acceptance criteria may include model performance thresholds. |
| Testing & quality assurance | ML models require specialized testing — data validation, model performance testing, fairness testing, adversarial testing. Traditional software testing (unit, integration) still applies to ML infrastructure code. |
| DevOps | MLOps extends DevOps for ML — CI/CD for models, model monitoring, automated retraining, model registry. |
| Big data & data engineering | Data engineering provides the data infrastructure that data science depends on — feature stores, training data pipelines, data quality, and governance. |
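Several of these touchpoints (BA acceptance criteria, MLOps CI/CD gates) come down to blocking a model release on agreed performance thresholds. A minimal sketch of such a gate — the metric names and threshold values here are purely illustrative, not prescribed by this package:

```python
# Hypothetical release gate: acceptance criteria expressed as metric floors.
THRESHOLDS = {"accuracy": 0.90, "recall": 0.85}

def passes_gate(metrics: dict, thresholds: dict = THRESHOLDS) -> bool:
    """True only if every thresholded metric meets or beats its floor.

    A missing metric counts as 0.0, so an incomplete evaluation report fails.
    """
    return all(metrics.get(name, 0.0) >= floor for name, floor in thresholds.items())

print(passes_gate({"accuracy": 0.93, "recall": 0.88}))  # → True
print(passes_gate({"accuracy": 0.93, "recall": 0.80}))  # → False (recall below floor)
```

In a CI/CD pipeline, a check like this would run after offline evaluation and block promotion to the model registry when any floor is missed.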

## Scope

This package covers data science as a discipline — not just model training notebooks. It includes:

- ML lifecycle — problem framing, data understanding, feature engineering, model training, evaluation, deployment, monitoring
- Statistical foundations — hypothesis testing, experimental design, causal inference
- Model evaluation — metrics selection, cross-validation, bias-variance trade-off, model comparison
- MLOps — model versioning, experiment tracking, automated retraining, model serving, monitoring
- Responsible AI — fairness, explainability, privacy, bias detection, governance
- Experiment management — A/B testing, multi-armed bandits, online experimentation platforms
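As one concrete instance of the model-evaluation bullet above, k-fold cross-validation can be sketched in pure Python. The majority-class baseline and toy data are illustrative only; in practice you would typically reach for scikit-learn's `KFold` and `cross_val_score`:

```python
import random
from statistics import mean

def k_fold_indices(n, k, seed=0):
    """Yield (train_idx, test_idx) pairs for shuffled k-fold cross-validation."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]          # k near-equal folds
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

def cross_val_accuracy(fit, predict, X, y, k=5):
    """Mean accuracy over k folds for a model given as fit/predict callables."""
    scores = []
    for train, test in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [predict(model, X[i]) for i in test]
        scores.append(mean(p == y[i] for p, i in zip(preds, test)))
    return mean(scores)

# Toy data: 13 negatives, 7 positives; the baseline predicts the majority class.
X = [[v] for v in range(20)]
y = [0] * 13 + [1] * 7

def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)

def predict_majority(model, x):
    return model

print(cross_val_accuracy(fit_majority, predict_majority, X, y))  # → 0.65
```

For this baseline the cross-validated accuracy equals the majority-class rate (13/20), a useful sanity floor when comparing real candidate models.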

Reference bodies of knowledge: CRISP-DM, MLOps maturity model (Google), Responsible AI practices.
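The A/B testing mentioned under experiment management typically reduces to a two-sample comparison. A hand-rolled two-proportion z-test using only the standard library (the conversion counts are made up; production analyses would normally use scipy.stats or statsmodels):

```python
from math import erf, sqrt

def two_proportion_z(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Two-sided two-proportion z-test on conversion counts; returns (z, p_value)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    p_pool = (conv_a + conv_b) / (n_a + n_b)               # pooled rate under H0
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))  # normal CDF via erf
    return z, p_value

# Illustrative numbers: 4.0% vs 5.2% conversion over 5,000 users per arm.
z, p = two_proportion_z(200, 5000, 260, 5000)
print(round(z, 2), round(p, 4))
```

Here the uplift is significant at the conventional 0.05 level; an online experimentation platform layers sequential monitoring and guardrail metrics on top of this basic test.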


Keep project-specific ML documentation in docs/architecture/ (model architecture decisions) and docs/product/ (ML feature descriptions), not in this file.

## Canonical source

Edit https://github.com/autowww/blueprints/blob/main/disciplines/data/data-science/README.md first; regenerate with docs/build-handbook.py.