Big Data ↔ SDLC ↔ PDLC bridge
Purpose
This document maps data engineering practices to the two lifecycle frameworks, PDLC and SDLC. Each lens answers a different question:
- PDLC — "Are we building the right product?"
- SDLC — "Are we building the product right?"
- Data Engineering — "How do we collect, process, store, and govern data to serve both?"
Data engineering provides the infrastructure that product analytics (PDLC P5) and software features (SDLC) depend on.
Canonical sources: BIGDATA.md (this package) · Product development lifecycle (PDLC) · Software development lifecycle (SDLC).
Document map
| Section | Contents |
|---|---|
| Purpose | Why this bridge exists; how data engineering relates to PDLC and SDLC |
| Comparison table | Data engineering vs SDLC vs PDLC across core dimensions |
| When one is missing | Consequences when any of the three lenses is absent |
| Data engineering across the lifecycle | Phase-by-phase activities and outputs (P1–P6, A–F) |
| Role mapping | Who owns data engineering decisions; alignment with SDLC archetypes and PDLC roles |
| Artifact flow | Handoffs between data engineering, delivery, and product |
| Calibration | When to deepen or lighten data engineering investment |
| Anti-patterns | Common failures when the bridge is weak or ignored |
| Worked example | End-to-end scenario: recommendation analytics and serving path |
| Related reading | Authoritative docs in this package and lifecycle packages |
Comparison table
| Dimension | Data engineering | SDLC | PDLC |
|---|---|---|---|
| Core question | How do we collect, process, store, and govern data so consumers can trust it and ship on it? | Are we building the product right? | Are we building the right product? |
| Scope | Ingestion through governed consumption: pipelines, storage tiers, metadata, quality, access, cost | Discovery through release: specify, design, build, verify, deploy | Problem discovery through sunset: validate, strategize, launch, grow, retire |
| Primary owner | Data engineering lead, data platform team, or domain data owner (e.g. data mesh) | Engineering / delivery team | Product manager / product trio |
| Timeline | Platform and pipeline roadmaps; cadence often tied to product increments and compliance cycles | Sprint, iteration, or release train | Quarters to years on the product horizon |
| Success metric | Quality (accuracy, completeness, consistency, timeliness, validity, uniqueness), pipeline SLAs, freshness, lineage coverage, cost per useful dataset | Velocity, defect rate, DORA-style delivery metrics, release health | Adoption, retention, revenue, experiment outcomes, strategic fit |
| Key artifacts | Schemas, data contracts, pipeline definitions, catalogs, quality rules, DataOps runbooks, IaC for data infrastructure | Specifications, code, tests, release artifacts | Research synthesis, metrics definitions, experiment readouts, roadmap |
| Risk focus | Silent corruption, schema drift, privacy and residency breaches, vendor lock-in, runaway compute/storage cost | Defects, security flaws, operational incidents | Wrong problem, weak fit, timing, viability |
| Failure mode | Data swamp; analytics nobody trusts; pipelines that block every release | Rework; production firefighting; mounting tech debt | Shipping features that do not move outcomes |
When one is missing
| Scenario | What happens |
|---|---|
| Data engineering without PDLC | Platforms and pipelines optimize for technical completeness while product bets lack clear outcomes; investment in lakehouse, mesh, or medallion layers does not trace to validated customer or business value. |
| Data engineering without SDLC | Data changes bypass versioned delivery: schema drift in production, migrations without rollback discipline, and no shared Definition of Done for data path changes alongside application releases. |
| SDLC without data engineering | Features ship without durable event capture, conformed dimensions, or governed interfaces; each team builds one-off ETL; quality and lineage are unknown until incidents or wrong decisions surface. |
| PDLC without data engineering | Discovery and growth rely on ad-hoc extracts and spreadsheet analytics; experiments lack trustworthy assignment, metrics, or timeliness; strategy debates proceed without a reliable data substrate. |
| PDLC and SDLC without shared data engineering standards | Product and delivery each define “truth” differently; dashboards disagree with the app; reconciliation waste replaces confident decisions. |
| All three practiced | Instrumented product learning, governed data products (DMBOK-style accountability), and coordinated pipeline/schema releases that support both validation and reliable shipping. |
Lifecycle labels used in this bridge
| Framework | Phase | Full name (for traceability to Software development lifecycle (SDLC) and Product development lifecycle (PDLC)) |
|---|---|---|
| PDLC | P1 | Discover Problem |
| PDLC | P2 | Validate Solution |
| PDLC | P3 | Strategize |
| PDLC | P4 | Launch |
| PDLC | P5 | Grow |
| PDLC | P6 | Mature / Sunset |
| SDLC | A | Discover |
| SDLC | B | Specify |
| SDLC | C | Design |
| SDLC | D | Build |
| SDLC | E | Verify |
| SDLC | F | Release |
Data engineering across the lifecycle
| Phase | Data engineering role | Key activities | Outputs |
|---|---|---|---|
| P1 Discover | Data assessor | Assess existing data landscape; identify data sources for research | Data source inventory, data availability assessment |
| P2 Validate | Data prototype builder | Build data prototypes for hypothesis testing; quick analytics pipelines | Prototype pipelines, experimental data sets |
| P3 Strategize | Data strategist | Define data strategy; estimate data infrastructure needs; data architecture options | Data strategy document, infrastructure cost model |
| A Discover | Requirements analyst | Identify data requirements for features; define data entities and relationships | Data requirements, entity-relationship models |
| B Specify | Data modeler | Design schemas; define data contracts; specify quality rules; plan migrations | Schema designs, data contracts, quality specifications |
| C Design | Pipeline architect | Design data flow; select processing patterns (batch/stream); integration architecture | Data flow diagrams, pipeline architecture, integration design |
| D Build | Pipeline builder | Implement data pipelines; build ETL/ELT jobs; configure data infrastructure | Working pipelines, schema migrations, IaC for data infra |
| E Verify | Data quality gatekeeper | Run data quality checks; validate schema compliance; reconciliation testing | Quality reports, validation results, data test suites |
| F Release | Migration operator | Execute data migrations; verify data integrity post-deployment | Migration scripts, integrity verification, rollback plan |
| P4 Launch | Analytics enabler | Enable production analytics; set up tracking pipelines; configure dashboards | Analytics pipeline, tracking events, launch dashboards |
| P5 Grow | Data platform operator | Scale data infrastructure; optimize costs; maintain quality; enable experimentation | Capacity plans, cost optimization reports, A/B test infrastructure |
| P6 Sunset | Data archivist | Data archival; retention compliance; cleanup of deprecated pipelines | Archival plan, retention compliance report |
Role mapping
Data engineering decisions should be explicit at phase boundaries so accountability matches Software development lifecycle (SDLC) delivery and Product development lifecycle (PDLC) product governance. The table below maps who typically owns data engineering judgment to PDLC roles and SDLC role archetypes from Roles, archetypes & methodology titles.
| Phase | Data engineering decision focus | Typical data engineering owner | PDLC role(s) | SDLC archetype emphasis |
|---|---|---|---|---|
| P1 Discover | Which sources exist, what is usable for research, legal/ethical constraints | Data engineer or analyst embedded with discovery | PM, UX research | Demand & value |
| P2 Validate | Fast, disposable pipelines vs production-grade paths for experiments | Data engineer + analyst | PM, experiment owner | Demand & value |
| P3 Strategize | Target architecture (lakehouse, mesh, batch/stream), cost envelope, governance model | Data architect / DE lead + finance partner | PM, exec sponsor | Steer & govern; Demand & value |
| A Discover | Data entities, relationships, and feature-level data needs | DE lead with product and engineering | PM (priorities) | Demand & value; Build & integrate |
| B Specify | Contracts, schemas, quality rules, migration approach | Data modeler / DE lead | Owner (scope), Implementer | Build & integrate |
| C Design | ETL/ELT vs streaming, Lambda/Kappa fit, integration boundaries | Pipeline architect / tech lead | Implementer, Architect | Build & integrate |
| D Build | Pipeline implementation, orchestration, IaC | Data / software engineers | Implementer | Build & integrate |
| E Verify | Data tests, reconciliation, contract validation in CI/CD | DE + QA | Implementer, QA | Assure & ship |
| F Release | Migrations, cutover, rollback drills for data paths | DE + release / SRE | Implementer, release | Assure & ship |
| P4 Launch | Production analytics, event pipelines, dashboard SLAs | Analytics engineer / DE | GTM, PM | Demand & value |
| P5 Grow | Scale, cost, freshness, experimentation plumbing | Platform DE / data SRE | PM, analytics | Flow & improvement; Demand & value |
| P6 Sunset | Retention, archival, decommissioning pipelines | DE + compliance | PM, legal | Steer & govern |
In small teams, one engineer may wear data modeling, pipeline build, and analytics hats; the table still defines where decisions must be made, even when the same person carries them across phases.
Artifact flow
Data engineering → SDLC
| Artifact | SDLC destination | Usage |
|---|---|---|
| Data requirements, ER views | A–B | Backlog and specification inputs for features touching master or event data |
| Data contracts, schema definitions | B–C | Design-time agreement between producers and consumers; breaking-change policy |
| Pipeline architecture, lineage notes | C–D | Implementation of batch/stream jobs, orchestration, and integration |
| Data test suites, quality reports | E | Gates for correctness, completeness, uniqueness, and timeliness before release |
| Migration scripts, rollback plans | F | Coordinated schema and data movement with application deploy |
| Data lineage and dependency map | C–F | Impact analysis for schema changes; on-call context for pipeline failures |
| Observability spec (alerts, SLOs for freshness) | D–F | DataOps feedback for late partitions, failed loads, and consumer lag |
Data engineering → PDLC
| Artifact | PDLC destination | Usage |
|---|---|---|
| Data source inventory, availability assessment | P1 | Grounds problem discovery in what can be measured |
| Prototype pipelines, experiment datasets | P2 | Supports hypothesis tests without over-investing in production paths |
| Data strategy, cost model | P3 | Informs investment and architecture choices at stage gates |
| Analytics pipelines, event specs, dashboards | P4–P5 | Launch metrics and growth loops; A/B infrastructure |
| Archival and retention evidence | P6 | Defensible sunset and compliance |
| Domain ownership matrix (data mesh) | P3 | Clarifies who approves schema and quality bar per bounded context |
| Metric definitions (“single source of truth” spec) | P2–P5 | Aligns experiment readouts and dashboards with conformed dimensions |
Feedback loops
| Source | Data engineering usage |
|---|---|
| PDLC P5 metrics and experiment results | Prioritize pipeline features, freshness SLAs, and domain data products |
| SDLC E/F defects tied to data | Feed quality rules, monitoring, and contract tests |
| Incidents and drift alerts | Drive DataOps improvements, retuning of checks, and catalog updates |
| Product sunset decisions | Trigger deprecation of pipelines and datasets per governance policy |
| Customer support data disputes | Surface contract gaps, missing keys, or timezone handling bugs for permanent tests |
| Finance / unit economics reviews | Refine partitioning, tiering, and job schedules to control storage and compute cost |
Calibration
Not every initiative warrants the same depth of data engineering. Calibrate using initiative shape, regulatory exposure, and how central data is to the bet.
By initiative type
| Initiative type | Data engineering investment | Reasoning |
|---|---|---|
| Data-heavy product (recommendations, personalization, fraud, search ranking) | Heavy — production contracts, strong quality dimensions, near-real-time or batch SLAs, clear lineage | Wrong or late data directly erodes the core value proposition |
| Analytics platform (self-serve BI, metrics store, experimentation hub) | Heavy — medallion or equivalent layering, catalog, access control, cost governance | Many consumers amplify the cost of ambiguity and poor metadata |
| Standard SaaS feature (CRUD workflows with reporting) | Medium — solid event capture, dimensional modeling for key entities, standard ELT | Balance speed with enough structure to avoid one-off extracts |
| Maintenance / bugfix | Light to minimal — touch pipelines only when schemas or migrations are in scope | Avoid gold-plating pipelines for changes that do not move data boundaries |
Signals to deepen or lighten
| Signal | Adjust |
|---|---|
| Frequent “numbers do not match” escalations | Deepen contracts, reconciliation tests, and catalog investment |
| High privacy or regulatory surface | Deepen governance, retention, and access patterns per DMBOK-style stewardship |
| Low data surface area and stable schema | Lighten ceremony; keep minimal contracts and monitoring |
| Rapid P2 experiments | Prefer disposable pipelines with a defined promotion path to production patterns |
Regulatory posture and data sensitivity
| Context | Typical adjustment |
|---|---|
| Highly regulated (finance, health, minors’ data) | Stronger stewardship, access logging, retention proofs, and separation of duties on schema change |
| Internal-only analytics | Lighter external compliance; still enforce contracts and quality to protect operational decisions |
| Cross-border residency | Fix architecture and pipeline design choices early, in P3 and SDLC C; mistakes are expensive to unwind |
How architecture choices shift the burden
| Pattern | When it helps | Trade-off |
|---|---|---|
| Lakehouse + medallion | Many consumers need curated, reusable datasets | Operational overhead to maintain bronze/silver/gold contracts |
| Data mesh | Scale beyond a central team; domain ownership of data products | Requires mature federated governance and platform enablement |
| Lambda / Kappa | Latency-sensitive features or streaming analytics | Higher complexity in operations, testing, and exactly-once semantics |
Anti-patterns
| Anti-pattern | Description | Fix |
|---|---|---|
| Data swamp | Data lake with no governance — everything dumped in, nothing findable | Implement data catalog, quality gates on ingestion, ownership per data domain |
| Pipeline spaghetti | Complex, undocumented dependencies between data pipelines | Data lineage tooling; pipeline dependency graphs; modular pipeline design |
| Schema-on-hope | No schema enforcement; consumers discover structure through trial and error | Define and enforce contracts; schema registry; breaking change process |
| All data, no insight | Collecting everything without defined use cases or consumers | Start from consumer needs; define data products; retire unused pipelines |
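
The "define and enforce contracts" fix for schema-on-hope can be pictured as a small ingestion gate. The following is a minimal sketch, not a prescribed implementation: `FieldSpec`, `IMPRESSION_CONTRACT`, `contract_violations`, and every field name are hypothetical, and a real platform would more likely rely on a schema registry or a validation library.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class FieldSpec:
    """One field in a hypothetical data contract."""
    name: str
    type_: type
    required: bool = True


# Hypothetical contract for an impression event; field names are illustrative.
IMPRESSION_CONTRACT = (
    FieldSpec("event_id", str),
    FieldSpec("item_id", str),
    FieldSpec("ts", float),
    FieldSpec("consent", bool),
    FieldSpec("placement", str, required=False),
)


def contract_violations(record: dict[str, Any],
                        contract: tuple[FieldSpec, ...]) -> list[str]:
    """Return readable violations; an empty list means the record conforms."""
    problems = []
    for spec in contract:
        if spec.name not in record or record[spec.name] is None:
            if spec.required:
                problems.append(f"missing required field: {spec.name}")
            continue
        if not isinstance(record[spec.name], spec.type_):
            problems.append(
                f"wrong type for {spec.name}: expected {spec.type_.__name__}"
            )
    # Unknown fields are reported so consumers never discover structure by accident.
    known = {spec.name for spec in contract}
    for extra in sorted(set(record) - known):
        problems.append(f"unexpected field: {extra}")
    return problems
```

Records with a non-empty violation list would be rejected or quarantined at ingestion; whether unknown fields should fail the gate or only warn is a policy choice this sketch does not settle.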
Worked example
Scenario: A B2C product team wants a recommendation strip on the home screen. Success depends on trustworthy behavioral signals, a governed feature store or serving path, and product experimentation to validate lift. The tables below are illustrative; names and tools vary by organization.
End-to-end artifact trail (representative)
| Lifecycle anchor | Representative data engineering outputs | Who consumes them |
|---|---|---|
| P1 | Source inventory, PII and consent constraints | PM, research, legal |
| P2 | Sandbox joins, prototype feature extract | DS / PM, experiment owner |
| P3 | Target architecture note, TCO model, domain ownership sketch | Exec sponsor, finance, platform |
| SDLC A–B | Data requirements packet, contracts, migration plan | Owner, Implementer, QA |
| SDLC C–D | Pipeline design, orchestration code, IaC modules | Implementer, SRE |
| SDLC E–F | Data test reports, migration runbook, post-deploy reconciliation | QA, release manager |
| P4–P5 | Event dictionaries, dashboard datasets, experiment exposure tables | PM, analytics, GTM |
| P6 | Retention evidence, archival manifest, deprecation checklist | Compliance, PM |
P1–P2 (discover and validate)
Product and research confirm that clickstream and purchase history exist but event definitions differ between web and mobile. Data engineering produces a short-lived prototype pipeline that joins the two sources for a sandbox cohort. P2 runs an offline evaluation: can a simple collaborative filter beat the baseline? Outcome: hypothesis is promising; gaps in event completeness and timeliness are documented. The team explicitly decides which quality dimensions must improve before production promotion (for example, stricter validity rules on item IDs and uniqueness on impression keys).
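
To make the web/mobile mismatch concrete, here is a rough sketch of what the prototype pipeline's first step might look like: mapping both event shapes onto one common record before joining. The raw payload keys (`eventType`, `itemId`, `tsMillis`, `event_name`, `item`, `timestamp`) and the "tap means click" rule are assumptions made up for illustration; the source does not specify how the two definitions differ.

```python
from datetime import datetime, timezone


def normalize_web(raw: dict) -> dict:
    """Assumed web shape: camelCase keys and epoch milliseconds."""
    return {
        "source": "web",
        "event_type": raw["eventType"].lower(),
        "item_id": str(raw["itemId"]),
        "ts": datetime.fromtimestamp(raw["tsMillis"] / 1000, tz=timezone.utc),
    }


def normalize_mobile(raw: dict) -> dict:
    """Assumed mobile shape: ISO-8601 timestamps with an offset; 'tap' means click."""
    event_type = {"tap": "click"}.get(raw["event_name"], raw["event_name"])
    return {
        "source": "mobile",
        "event_type": event_type,
        "item_id": str(raw["item"]),
        # Timestamps are expected to carry a UTC offset; naive values would be
        # interpreted in the local timezone by astimezone().
        "ts": datetime.fromisoformat(raw["timestamp"]).astimezone(timezone.utc),
    }
```

Once both sources share `event_type`, `item_id`, and a UTC `ts`, the sandbox join and the documented completeness and timeliness gaps can be measured on the same footing.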
P3 (strategize)
The team chooses a lakehouse landing with a medallion-style refinement (bronze raw events, silver conformed sessions, gold recommendation-ready features), batch scoring with a path toward near-real-time updates later. Data engineering supplies an infrastructure cost model and data governance notes: PII handling, retention, and which domains own which datasets. A Kappa-style stream path is captured as a future option if latency becomes a product constraint; v1 stays batch-first to reduce operational risk.
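
The medallion refinement can be illustrated with a deliberately tiny in-memory sketch. It simplifies the silver layer to conformed events rather than full sessions, and the functions `to_silver` and `to_gold` and the row fields are hypothetical; a real lakehouse would express these steps as table transformations in its processing engine.

```python
from collections import Counter


def to_silver(bronze: list[dict]) -> list[dict]:
    """Conform raw events: drop rows without item_id, de-duplicate on event_id."""
    seen = set()
    silver = []
    for row in bronze:
        if not row.get("item_id") or row["event_id"] in seen:
            continue
        seen.add(row["event_id"])
        silver.append({
            "event_id": row["event_id"],
            "item_id": row["item_id"],
            "event_type": row["event_type"].lower(),
        })
    return silver


def to_gold(silver: list[dict]) -> dict[str, dict[str, int]]:
    """Aggregate per-item impression and click counts as model-ready features."""
    impressions = Counter(r["item_id"] for r in silver if r["event_type"] == "impression")
    clicks = Counter(r["item_id"] for r in silver if r["event_type"] == "click")
    items = sorted(set(impressions) | set(clicks))
    return {i: {"impressions": impressions[i], "clicks": clicks[i]} for i in items}
```

Bronze stays untouched as the replayable record, which is what lets silver and gold rules change later without re-collecting data.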
SDLC A–B (discover and specify)
Engineering and product specify data requirements: impression events, click events, item metadata, and consent flags. Data engineering authors data contracts for each event, quality rules (e.g. uniqueness of event IDs, validity of timestamps), and a migration plan for adding new fields without breaking existing consumers. Non-functional expectations (freshness SLAs, partition strategy) are written so E Verify can automate checks rather than rely on manual spot checks.
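
A migration plan that adds fields "without breaking existing consumers" implies some compatibility rule that E Verify can automate. One possible rule, sketched below under stated assumptions: removing a field, changing its type, or adding a new required field counts as breaking, while adding an optional field does not. The schema representation and the function `breaking_changes` are hypothetical; schema registries typically offer their own compatibility modes.

```python
def breaking_changes(old: dict[str, dict], new: dict[str, dict]) -> list[str]:
    """Compare two schema versions shaped as {field: {"type": str, "required": bool}}."""
    problems = []
    for name, spec in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name]["type"] != spec["type"]:
            problems.append(
                f"type changed: {name} {spec['type']} -> {new[name]['type']}"
            )
    for name, spec in new.items():
        # A new required field breaks producers still emitting the old shape.
        if name not in old and spec.get("required", False):
            problems.append(f"new required field: {name}")
    return problems
```

Run in CI against the previously published contract version, an empty result lets the change proceed and a non-empty one routes it into the breaking-change process.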
SDLC C–D (design and build)
Architecture selects ELT into the lakehouse, orchestrated jobs for silver/gold layers, and a Lambda-leaning pattern: batch features for training and serving v1, with hooks for a faster layer later. Application teams implement instrumentation; data engineering implements pipelines and IaC alongside application services. Lineage from raw clickstream to scored recommendations is captured so breaking changes surface in review, not only in production.
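
Captured lineage becomes useful in review once it can answer "what does this change touch?". A minimal sketch of that impact query, assuming lineage is available as a mapping from each dataset to its direct readers; the dataset names and the `LINEAGE` structure are invented for illustration, and a catalog or lineage tool would normally supply this graph.

```python
from collections import deque

# Hypothetical lineage edges: upstream dataset -> datasets that read from it.
LINEAGE = {
    "bronze.clickstream": ["silver.sessions"],
    "bronze.purchases": ["silver.sessions"],
    "silver.sessions": ["gold.reco_features"],
    "gold.reco_features": ["serving.scored_recommendations", "dash.reco_ctr"],
}


def downstream_of(dataset: str, lineage: dict[str, list[str]]) -> list[str]:
    """Breadth-first walk returning every dataset affected by a change to `dataset`."""
    affected = []
    seen = {dataset}
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for child in lineage.get(current, []):
            if child not in seen:
                seen.add(child)
                affected.append(child)
                queue.append(child)
    return affected
```

A pull request that alters `bronze.clickstream` could list the result of `downstream_of` in its description so owners of the affected datasets are pulled into review.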
SDLC E–F (verify and release)
Data quality gatekeeper activities run in CI: schema compatibility, row-count reconciliation between sources and bronze, and sampling for consistency across web and mobile. Example checks mapped to dimensions:
| Dimension | Example automated check |
|---|---|
| Completeness | Null rate thresholds on mandatory fields; session stitching coverage |
| Timeliness | Bronze partition landed within SLA; lag vs application clock |
| Consistency | Aggregated CTR from events matches reporting cube within tolerance |
| Uniqueness | Duplicate event keys rejected or quarantined |
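
The four checks in the table above could be sketched as small pure functions that a CI job calls with its own thresholds. This is an illustrative outline only: the function names, row shapes, and the choice of a relative tolerance for consistency are assumptions, and teams often use a dedicated data-testing framework instead.

```python
import math
from datetime import datetime, timedelta, timezone


def null_rate(rows: list[dict], field: str) -> float:
    """Completeness: share of rows where a mandatory field is missing or None."""
    if not rows:
        return 0.0
    return sum(1 for r in rows if r.get(field) is None) / len(rows)


def landed_within_sla(partition_end: datetime, landed_at: datetime,
                      sla: timedelta) -> bool:
    """Timeliness: the partition landed no later than `sla` after it closed."""
    return landed_at - partition_end <= sla


def within_tolerance(events_value: float, cube_value: float,
                     rel_tol: float) -> bool:
    """Consistency: two aggregates agree within a relative tolerance."""
    return math.isclose(events_value, cube_value, rel_tol=rel_tol)


def split_duplicates(rows: list[dict], key: str) -> tuple[list[dict], list[dict]]:
    """Uniqueness: keep the first row per key and quarantine later repeats."""
    seen = set()
    kept, quarantined = [], []
    for row in rows:
        if row[key] in seen:
            quarantined.append(row)
        else:
            seen.add(row[key])
            kept.append(row)
    return kept, quarantined
```

Each function returns a value rather than raising, so the gate itself (fail the build, warn, or quarantine) stays a policy decision in the pipeline configuration.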
Release includes a migration operator checklist: backfill window, rollback for feature table versions, and integrity checks post-deploy.
P4–P5 (launch and grow)
Analytics enabler work wires launch dashboards and experiment assignment data so PDLC can read lift on recommendation CTR and downstream revenue. In P5, DataOps practices monitor freshness and drift; cost and capacity reviews prevent runaway scans. Product uses the same conformed metrics for growth experiments, closing the loop from PDLC back to pipeline priorities. If readouts disagree, the team traces through lineage to determine whether the gap is product logic, assignment, or data pipeline regression.
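
Reading lift on recommendation CTR reduces, at its simplest, to comparing click-through rates per experiment variant from the exposure tables. The sketch below assumes exposure rows already carry `variant`, `impressions`, and `clicks` columns (hypothetical names) and produces a point estimate only; it does not attempt significance testing or guardrail metrics.

```python
def ctr_by_variant(exposures: list[dict]) -> dict[str, float]:
    """Sum impressions and clicks per variant, then compute CTR for each."""
    totals: dict[str, dict[str, int]] = {}
    for row in exposures:
        t = totals.setdefault(row["variant"], {"impressions": 0, "clicks": 0})
        t["impressions"] += row["impressions"]
        t["clicks"] += row["clicks"]
    return {
        variant: t["clicks"] / t["impressions"]
        for variant, t in totals.items()
        if t["impressions"] > 0
    }


def relative_lift(ctrs: dict[str, float], control: str = "control",
                  treatment: str = "treatment") -> float:
    """Relative change of treatment CTR over control CTR."""
    base = ctrs[control]
    if base == 0:
        raise ValueError("control CTR is zero; relative lift is undefined")
    return (ctrs[treatment] - base) / base
```

Because the calculation consumes the same conformed exposure data as the dashboards, a disagreement between the two points to assignment or pipeline issues rather than to differing metric definitions.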
P6 (mature / sunset)
If the feature retires, data archivist activities archive historical training and serving tables per retention policy, drop obsolete pipelines, and update the catalog so consumers do not attach to deprecated datasets. Lessons feed the next initiative: which contracts prevented incidents, which checks caught drift early, and which prototypes should become platform templates.
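
Archiving "per retention policy" usually starts with deciding which partitions have aged out. A minimal sketch, assuming daily date partitions and a single retention window in days; the function name and the strict "older than the cutoff" rule are illustrative choices, and actual policies may differ per dataset or jurisdiction.

```python
from datetime import date, timedelta


def partitions_to_archive(partitions: list[date], today: date,
                          retention_days: int) -> list[date]:
    """Return partition dates strictly older than the retention cutoff, oldest first."""
    cutoff = today - timedelta(days=retention_days)
    return sorted(p for p in partitions if p < cutoff)
```

The returned list can feed the archival manifest, giving compliance a record of exactly which partitions were moved or dropped and why.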
Related reading
| Doc | Why |
|---|---|
| BIGDATA.md | Principles, governance, quality, pipeline patterns, DataOps |
| Data architectures (blueprint) | Lambda, Kappa, data mesh, data lakehouse |
| Data technologies (blueprint) | Processing frameworks, storage systems, orchestration |
| Software development lifecycle (SDLC) | Delivery phases A–F, DoD |
| Product development lifecycle (PDLC) | Product phases P1–P6, metrics framework |
Canonical source
Edit https://github.com/autowww/blueprints/blob/main/disciplines/data/bigdata/BIGDATA-SDLC-PDLC-BRIDGE.md first; regenerate with docs/build-handbook.py.