Data technologies (blueprint)

Purpose: Taxonomy and selection guidance for data engineering technologies. Each category describes the problem space, common tools, and selection criteria.

Why it matters: Technology selection should be driven by processing requirements — latency, throughput, state, semantics, and operational model — not by hype or a single vendor stack. Use the categories below to narrow options, then confirm with proofs of concept and cost models.

Audience: Teams adopting Big data & data engineering; project-specific tool choices are documented as ADRs in docs/adr/.

Processing paradigm spectrum

Linear flow diagram template

Interactive analytics often consumes outputs of batch or stream pipelines rather than replacing them.

Storage, streaming buses, orchestration, catalogs, and quality tools compose with engines: the same Spark or Flink job may land in Delta, register in DataHub, and be scheduled by Airflow. Start from workload shape, then fill in the surrounding toolchain.


Category Scope Guide
Processing engines & platforms Batch (Spark), micro-batch (Spark Structured Streaming), stream (Flink, Kafka Streams), portable pipelines (Beam), interactive SQL warehouses Data processing engines & platforms
Storage systems Relational (PostgreSQL, MySQL), columnar (ClickHouse, DuckDB), object (S3, GCS, Azure Blob), table formats (Delta Lake, Apache Iceberg, Apache Hudi) BIGDATA.md §3
Streaming platforms Apache Kafka, Apache Pulsar, AWS Kinesis, Azure Event Hubs, Google Pub/Sub BIGDATA.md §3
Orchestration Apache Airflow, Dagster, Prefect, Mage — DAG-based pipeline scheduling and monitoring BIGDATA.md §4
Data catalogs DataHub, OpenMetadata, Apache Atlas, Amundsen — metadata management and discovery BIGDATA.md §2
Data quality Great Expectations, dbt tests, Soda, Monte Carlo — automated data quality validation BIGDATA.md §2

Core knowledge: Big data & data engineering body of knowledge — principles, governance, quality, pipeline patterns.

Architectures: Data architectures (blueprint) — Lambda, Kappa, data mesh, data lakehouse, medallion; deep dives in Lambda, Kappa & unified data architectures and Data mesh: domain-oriented decentralized architecture.

Cross-reference: Big data & data engineering body of knowledge ties principles, governance, quality dimensions, and pipeline patterns to delivery practice; use it when justifying tool choices to stakeholders outside pure engineering.

When a category has no dedicated guide yet, treat Big data & data engineering body of knowledge and vendor-neutral overviews as the interim source; prefer ADRs for versioned tool pins (e.g., “we use Iceberg + Trino for zone X”).


Keep project-specific data architecture decisions in docs/adr/ and pipeline documentation in docs/development/, not in this file.