The Stack Has Settled (Mostly)
After years of fragmentation, the modern data stack has started to converge around a set of well-understood components. Here is what we currently deploy for production data platforms, with honest notes on trade-offs.
Ingestion: Airbyte (Self-Hosted) > Fivetran
Fivetran is polished and low-maintenance, but its pricing is based on monthly active rows (MAR), which gets expensive at scale. Airbyte OSS covers 300+ connectors, runs on Kubernetes, and can be extended with custom connectors written in Python. For clients with fewer than 50 sources, Fivetran's simplicity wins. For everyone else, we recommend Airbyte, either self-hosted or Airbyte Cloud.
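To make the "custom connectors in Python" point concrete, here is a minimal sketch of the record output a source connector produces under the Airbyte protocol: one JSON message per line, each wrapping a row with its stream name and an emission timestamp. In practice connectors are usually built on Airbyte's Python CDK, which also handles the spec, check, and discover steps; this sketch deliberately leaves those out. The `fetch_rows` helper and the `customers` stream are hypothetical placeholders, not part of any real source.

```python
import json
import time
from typing import Any, Dict, Iterable, Iterator


def to_airbyte_records(stream: str, rows: Iterable[Dict[str, Any]]) -> Iterator[str]:
    """Wrap each row in an Airbyte-protocol RECORD message, one JSON document per line."""
    for row in rows:
        message = {
            "type": "RECORD",
            "record": {
                "stream": stream,
                "data": row,
                "emitted_at": int(time.time() * 1000),  # epoch milliseconds
            },
        }
        yield json.dumps(message)


def fetch_rows() -> Iterable[Dict[str, Any]]:
    # Hypothetical stand-in for a call to the source system's API.
    return [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": "b@example.com"},
    ]


if __name__ == "__main__":
    # A connector writes these lines to stdout for the platform to consume.
    for line in to_airbyte_records("customers", fetch_rows()):
        print(line)
```

Keeping the message construction in a pure generator like this makes the connector logic easy to unit test without running Airbyte at all.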
Storage: Apache Iceberg on S3/GCS
Iceberg has effectively won the table format war (over Hudi and Delta Lake) for net-new deployments. The Apache Iceberg REST catalog specification (implemented by catalogs such as Polaris and Nessie) makes it engine-agnostic: you can query the same table from Spark, Trino, DuckDB, and Snowflake at the same time.
Transformation: dbt Core
dbt remains the gold standard for SQL-based transformation. Define model contracts and data tests on every model. The dbt Semantic Layer (built on MetricFlow) is maturing fast and addresses the "different numbers in different dashboards" problem by defining each metric once, in one place.
Orchestration: Dagster over Airflow
Airflow is battle-tested, but its task-centric DAG model makes software engineering practices (testing, typing, incremental development) awkward. Dagster's asset-centric model maps naturally onto dbt models and makes lineage first-class. For greenfield projects, Dagster is our default.
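The asset-centric idea can be illustrated without any orchestrator installed. The sketch below is a toy, not Dagster's API: the `asset` decorator, `materialize`, and `lineage` here are hypothetical stand-ins written for this example, though Dagster's real `@asset` decorator similarly infers upstream dependencies from function parameter names. The point is that declaring what each function produces and consumes is enough to derive both execution order and lineage.

```python
import inspect
from typing import Any, Callable, Dict, List

# Registry of asset name -> function that computes it.
ASSETS: Dict[str, Callable[..., Any]] = {}


def asset(fn: Callable[..., Any]) -> Callable[..., Any]:
    """Register a function as an asset; its parameter names are its upstream assets."""
    ASSETS[fn.__name__] = fn
    return fn


def materialize(name: str, cache: Dict[str, Any]) -> Any:
    """Compute an asset after its upstream assets, reusing already-computed values.

    Toy implementation: no cycle detection, retries, or persistence.
    """
    if name not in cache:
        upstream = inspect.signature(ASSETS[name]).parameters
        inputs = {dep: materialize(dep, cache) for dep in upstream}
        cache[name] = ASSETS[name](**inputs)
    return cache[name]


def lineage() -> Dict[str, List[str]]:
    """Derive lineage from the definitions themselves, with no separate DAG wiring."""
    return {n: list(inspect.signature(f).parameters) for n, f in ASSETS.items()}


@asset
def raw_orders():
    # Hypothetical sample data standing in for an ingested table.
    return [{"id": 1, "amount": 40.0}, {"id": 2, "amount": 60.0}]


@asset
def order_total(raw_orders):
    return sum(order["amount"] for order in raw_orders)


if __name__ == "__main__":
    print(materialize("order_total", {}))
    print(lineage())
```

Because each asset is a plain function, it can be tested in isolation by passing in a small input, for example `order_total([{"amount": 5.0}])`, which is the kind of workflow that is awkward when logic lives inside task operators.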
Query Engine: DuckDB for Local, Trino for Distributed
DuckDB is the most exciting development in data tooling in years: a zero-dependency, embeddable OLAP engine that reads Parquet and Iceberg from S3 at remarkable speed. We use it for local development, CI data testing, and single-node analytics (up to ~1TB). We use Trino for anything that requires distributed compute.
BI: Evidence.dev or Metabase
For engineering-led companies: Evidence.dev (code-first, version-controlled reports in Markdown + SQL). For business-user self-serve: Metabase (simple, RBAC, embedded analytics support).

