Big data & data engineering

Reusable, project-agnostic blueprint for big data and data engineering — the discipline of designing, building, and maintaining the systems and infrastructure for collecting, storing, processing, and governing data at scale.

Data engineering answers "how do we engineer, govern, and process data at scale?" — a question that intersects SDLC delivery (building data pipelines) and PDLC strategy (data as a product asset).

| Document | Purpose |
| --- | --- |
| BIGDATA.md | Data engineering principles, data governance, data quality dimensions, data lifecycle, DMBOK alignment, competencies |
| Big Data ↔ SDLC ↔ PDLC bridge | How data engineering maps across SDLC phases A–F and PDLC phases P1–P6 — schema in Specify, pipelines in Build, validation in Verify, analytics in Grow |
| architectures/ | Data architecture patterns: Lambda, Kappa, data mesh, data lakehouse, medallion architecture |
| technologies/ | Processing framework taxonomy, storage systems, streaming platforms, orchestration tools |
| techniques/ | Operational data modeling — relational, NoSQL, indexing, transactions, migration, polyglot persistence |

Relationship to other packages

| Package | How Big Data relates |
| --- | --- |
| SDLC blueprint | Data pipelines are software — they go through SDLC phases (design, build, test, deploy). Schema design happens in Specify; pipeline implementation in Build; data validation in Verify. |
| Product development lifecycle (PDLC) | PDLC P1–P3 (Discover, Validate, Strategize) define data strategy and analytics requirements. P5 (Grow) relies on data infrastructure for usage analytics, A/B testing, and outcome measurement. |
| Business analysis (BA) | The BA Business Intelligence perspective covers data requirements, analytics, and data quality. Data engineering provides the infrastructure that makes BI possible. |
| Software architecture | Data architecture is a subset of system architecture. Storage choices, processing patterns, and data flow design are architectural decisions. |
| DevOps | DataOps applies DevOps principles to data pipelines — CI/CD for data, data quality gates, pipeline observability, and infrastructure as code for data platforms (see the quality-gate sketch after this table). |
| Data science & machine learning | Data engineering provides the infrastructure and data preparation that data science depends on. Feature stores, training data pipelines, and model serving infrastructure bridge the two disciplines. |
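
To ground the DataOps row above, here is a minimal sketch of a quality gate a CI job could run against a pipeline's landed output before promoting it. The CSV path, the order_id and amount columns, and the specific checks are all hypothetical; this package does not prescribe a particular tool.

```python
"""Minimal sketch of a DataOps quality gate, runnable in CI.

Assumptions (not prescribed by this package): the pipeline lands a CSV
at OUTPUT_PATH, and `order_id` / `amount` are hypothetical column names.
"""
import csv
import sys
from pathlib import Path

OUTPUT_PATH = Path("landing/orders.csv")  # hypothetical pipeline output
REQUIRED_COLUMNS = {"order_id", "amount"}

def run_gate(path: Path) -> list[str]:
    """Return a list of failure messages; empty means the gate passes."""
    failures: list[str] = []
    with path.open(newline="") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return ["output is empty"]
    missing = REQUIRED_COLUMNS - rows[0].keys()
    if missing:
        return [f"missing columns: {sorted(missing)}"]
    # Completeness: no blank values in required columns.
    blanks = sum(1 for r in rows if any(not r[c] for c in REQUIRED_COLUMNS))
    if blanks:
        failures.append(f"{blanks} rows with blank required fields")
    # Uniqueness: the primary key must not repeat.
    ids = [r["order_id"] for r in rows]
    if len(ids) != len(set(ids)):
        failures.append("duplicate order_id values")
    return failures

if __name__ == "__main__":
    problems = run_gate(OUTPUT_PATH)
    for p in problems:
        print(f"QUALITY GATE FAILED: {p}", file=sys.stderr)
    sys.exit(1 if problems else 0)
```

Wired into a CI pipeline, the non-zero exit code blocks promotion of bad data — the "gate" in the DataOps sense.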

Scope

This package covers data engineering as a discipline — not just database administration or ETL scripting. It includes:

  • Data architecture — storage systems, processing patterns, data flow design
  • Data pipelines — batch processing, stream processing, ELT/ETL patterns (illustrated in the first sketch after this list)
  • Data governance — data ownership, classification, lineage, cataloging, access control
  • Data quality — accuracy, completeness, consistency, timeliness, validity, uniqueness (four of these are scored in the second sketch after this list)
  • Data lifecycle — creation, storage, usage, archival, deletion, compliance (GDPR, retention); the third sketch after this list shows a retention-driven purge
  • DataOps — CI/CD for data, automated testing of data pipelines, data observability
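
The first sketch below illustrates the batch ETL pattern from the pipelines bullet: extract, transform, then load idempotently so a rerun of the same batch cannot duplicate rows. It uses stdlib sqlite3 only for self-containment; the events table and the cents-to-dollars transform are invented for illustration.

```python
"""Minimal batch ETL sketch (extract, transform, load) using stdlib sqlite3.

Hypothetical throughout: the source records, the `events` table, and the
cents-to-dollars transform. The upsert makes the load idempotent, so
rerunning the same batch (a common failure-recovery pattern) is safe.
"""
import sqlite3

def extract() -> list[dict]:
    # Stand-in for reading from an API, file drop, or message queue.
    return [
        {"id": 1, "amount_cents": 1250, "currency": "usd"},
        {"id": 2, "amount_cents": 300, "currency": "USD"},
    ]

def transform(records: list[dict]) -> list[tuple]:
    # Normalize currency casing and convert cents to a decimal amount.
    return [
        (r["id"], r["amount_cents"] / 100, r["currency"].upper())
        for r in records
    ]

def load(rows: list[tuple], conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS events ("
        "id INTEGER PRIMARY KEY, amount REAL, currency TEXT)"
    )
    # Upsert keyed on id: rerunning the batch overwrites, never duplicates.
    conn.executemany(
        "INSERT INTO events (id, amount, currency) VALUES (?, ?, ?) "
        "ON CONFLICT(id) DO UPDATE SET amount=excluded.amount, "
        "currency=excluded.currency",
        rows,
    )
    conn.commit()

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load(transform(extract()), conn)
    load(transform(extract()), conn)  # idempotent rerun: still 2 rows
    print(conn.execute("SELECT COUNT(*) FROM events").fetchone()[0])
```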
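
Next, a sketch scoring a batch against four of the six quality dimensions named above (completeness, uniqueness, validity, timeliness). The id, email, and updated_at fields and the 24-hour freshness window are assumptions; accuracy and consistency need reference data and cross-dataset comparison, which a single-batch check cannot provide.

```python
"""Sketch: scoring a batch against four data quality dimensions.

Field names (`id`, `email`, `updated_at`) and the 24h freshness window
are hypothetical placeholders for whatever a real dataset defines.
"""
import re
from datetime import datetime, timedelta, timezone

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # naive validity rule

def quality_scores(rows: list[dict]) -> dict[str, float]:
    n = len(rows)
    now = datetime.now(timezone.utc)
    # Completeness: both key fields are present and non-empty.
    complete = sum(1 for r in rows if r.get("id") and r.get("email"))
    # Uniqueness: distinct primary keys relative to row count.
    unique_ids = len({r.get("id") for r in rows})
    # Validity: email matches the (deliberately naive) pattern.
    valid = sum(1 for r in rows if EMAIL_RE.match(r.get("email") or ""))
    # Timeliness: record updated within the freshness window.
    fresh = sum(
        1 for r in rows if now - r["updated_at"] <= timedelta(hours=24)
    )
    return {
        "completeness": complete / n,
        "uniqueness": unique_ids / n,
        "validity": valid / n,
        "timeliness": fresh / n,
    }

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    batch = [
        {"id": 1, "email": "a@example.com", "updated_at": now},
        {"id": 1, "email": "not-an-email",
         "updated_at": now - timedelta(days=2)},
    ]
    print(quality_scores(batch))  # each dimension scored in [0, 1]
```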
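
Finally, for the data lifecycle bullet, a sketch of retention-driven deletion, the last lifecycle stage. The 90-day window and the events table are placeholders for whatever policy (for example, GDPR storage limitation) and schema a project actually defines.

```python
"""Sketch: retention-driven deletion at the end of the data lifecycle.

Hypothetical `events` table with an ISO-8601 `created_at` timestamp;
the 90-day window stands in for a project's real retention policy.
"""
import sqlite3
from datetime import datetime, timedelta, timezone

RETENTION = timedelta(days=90)  # hypothetical policy window

def purge_expired(conn: sqlite3.Connection) -> int:
    # ISO-8601 UTC strings compare lexicographically in timestamp order.
    cutoff = (datetime.now(timezone.utc) - RETENTION).isoformat()
    cur = conn.execute("DELETE FROM events WHERE created_at < ?", (cutoff,))
    conn.commit()
    return cur.rowcount  # rows removed, useful for an audit log

if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER, created_at TEXT)")
    old = (datetime.now(timezone.utc) - timedelta(days=400)).isoformat()
    new = datetime.now(timezone.utc).isoformat()
    conn.executemany("INSERT INTO events VALUES (?, ?)", [(1, old), (2, new)])
    print(purge_expired(conn))  # 1 expired row purged
```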

Reference bodies of knowledge: DAMA DMBOK (Data Management Body of Knowledge), data mesh principles (Zhamak Dehghani).


Keep project-specific data documentation in docs/product/data/ and data architecture decisions in docs/adr/, not in this file.

Canonical source

Edit https://github.com/autowww/blueprints/blob/main/disciplines/data/bigdata/README.md first; regenerate with docs/build-handbook.py.