InfoDive Labs
AI/ML · Data Engineering · Analytics

The Modern Data Engineering Stack: From Ingestion to Analytics

Explore the modern data engineering stack including ELT pipelines, dbt, data lakehouses, orchestration with Dagster and Airflow, and real-time streaming architectures.

December 4, 2025 · 8 min read

The data engineering landscape has transformed dramatically over the past few years. The monolithic ETL tools that once dominated enterprise data teams have given way to a composable stack of specialized, open-source-first tools. Data lakehouses have blurred the line between data lakes and data warehouses. Orchestration has moved beyond simple scheduling into full-fledged data platform management. This guide walks through each layer of the modern data engineering stack, helping you make informed decisions about the tools and architectures that power analytics, machine learning, and operational intelligence.

ELT vs ETL: The Paradigm Shift

Traditional ETL (Extract, Transform, Load) processes data before loading it into the warehouse. The modern approach, ELT (Extract, Load, Transform), flips the last two steps by loading raw data first and transforming it inside the warehouse using SQL.

This shift became possible because cloud data warehouses like Snowflake, BigQuery, and Redshift made compute cheap and elastic. There is no longer a compelling reason to transform data on expensive middleware when the warehouse itself can handle it at scale.

The practical benefits of ELT are significant:

  • Raw data preservation: Loading data before transformation means the original data is always available. When business requirements change, you re-transform rather than re-extract.
  • SQL-based transformations: Analysts who know SQL can own transformation logic. This reduces the bottleneck on data engineering teams and speeds up iteration.
  • Simplified pipelines: Ingestion tools focus only on reliable extraction and loading. Transformation logic lives separately, is version-controlled, and is testable.

ELT is the right default for most analytics workloads. ETL still makes sense when you need to filter or redact sensitive data before it reaches the warehouse, when source data volumes are massive and you want to reduce storage costs, or when real-time transformation is required before loading.
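The ELT pattern can be sketched end to end with any SQL engine. Here is a minimal illustration using Python's built-in sqlite3 as a stand-in for a cloud warehouse (the table and column names are hypothetical):

```python
import sqlite3

# In ELT, raw records are loaded as-is; transformation happens later in SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount_cents TEXT, status TEXT)")

# Extract + Load: raw strings go straight in, with no cleaning on the way.
raw_rows = [("o1", "1200", "completed"), ("o2", "", "completed"), ("o3", "500", "refunded")]
conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", raw_rows)

# Transform: SQL inside the "warehouse" casts, filters, and aggregates.
conn.execute("""
    CREATE TABLE completed_revenue AS
    SELECT SUM(CAST(amount_cents AS INTEGER)) AS revenue_cents
    FROM raw_orders
    WHERE status = 'completed' AND amount_cents <> ''
""")
print(conn.execute("SELECT revenue_cents FROM completed_revenue").fetchone()[0])  # 1200
```

Because raw_orders is preserved, a change in business logic (say, counting refunds as negative revenue) only requires rebuilding completed_revenue, not re-extracting from the source.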

Ingestion: Getting Data In

The ingestion layer handles extracting data from source systems and loading it into your storage layer. The two dominant approaches are managed platforms and open-source tools.

Fivetran is the leading managed ingestion platform. It offers 300+ pre-built connectors, automated schema migration handling, and minimal configuration. The tradeoff is cost, which scales with monthly active rows and can become significant at high volumes.

Airbyte provides a similar connector-based approach but is open-source with a self-hosted option. It has a growing connector catalog and an active community building custom connectors. Self-hosting reduces per-row costs but adds operational burden.

Custom ingestion is warranted for sources without pre-built connectors or when you need fine-grained control over extraction logic. Python libraries like Singer taps, or simple scripts using source APIs, handle these cases.

A practical ingestion architecture often combines tools: Fivetran or Airbyte for standard SaaS sources (Salesforce, HubSpot, Stripe), custom scripts for internal APIs and databases, and event streaming (Kafka or Kinesis) for real-time event data.
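A custom extraction script usually boils down to paginating a source API and landing each page unchanged. A hedged sketch, where fetch_page stands in for an HTTP call to a hypothetical internal API:

```python
import json

def extract_all(fetch_page, page_size=100):
    """Pull every record from a cursor-paginated source, yielding raw rows unchanged."""
    cursor = None
    while True:
        page = fetch_page(cursor=cursor, limit=page_size)
        yield from page["records"]
        cursor = page.get("next_cursor")
        if cursor is None:
            break

# Stand-in for a real API client; a production version would make HTTP requests.
def fake_fetch(cursor=None, limit=100):
    pages = {None: {"records": [{"id": 1}, {"id": 2}], "next_cursor": "p2"},
             "p2": {"records": [{"id": 3}], "next_cursor": None}}
    return pages[cursor]

rows = list(extract_all(fake_fetch))
print(json.dumps(rows))  # landed as-is: [{"id": 1}, {"id": 2}, {"id": 3}]
```

In keeping with ELT, the script does no cleaning: it hands raw records to the loader and leaves transformation to the warehouse.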


Transformation with dbt

dbt (data build tool) has become the standard for managing SQL-based transformations. It brings software engineering practices to analytics code: version control, testing, documentation, and modularity.

A dbt project organizes transformations into layers:

  • Staging models clean and rename raw source tables into a consistent format
  • Intermediate models join and aggregate staging models to produce business entities
  • Mart models create the final tables optimized for specific analytics consumers (dashboards, ML features, operational reports)

Each model is a SQL SELECT statement stored as a file. dbt handles the DDL, dependency resolution, and incremental processing.

-- models/marts/finance/monthly_revenue.sql
WITH orders AS (
    SELECT * FROM {{ ref('stg_orders') }}
),
payments AS (
    SELECT * FROM {{ ref('stg_payments') }}
)
SELECT
    DATE_TRUNC('month', orders.ordered_at) AS revenue_month,
    SUM(payments.amount) AS total_revenue,
    COUNT(DISTINCT orders.order_id) AS total_orders
FROM orders
INNER JOIN payments ON orders.order_id = payments.order_id
WHERE payments.status = 'completed'
GROUP BY 1

dbt's testing framework catches data quality issues before they propagate downstream. Built-in tests validate not-null constraints, uniqueness, referential integrity, and accepted values. Custom tests handle business logic validation.
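In dbt, these tests are declared in YAML alongside the models. A sketch using dbt's built-in tests (the model and column names follow the revenue example above):

```yaml
# models/staging/schema.yml
version: 2
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'completed', 'refunded']
      - name: customer_id
        tests:
          - relationships:
              to: ref('stg_customers')
              field: customer_id
```

Each entry compiles to a SQL query that returns failing rows; `dbt test` fails the run when any query returns results.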

dbt Mesh, introduced for large organizations, enables multiple dbt projects to share models across teams through a contract-based interface, solving the governance challenges that arise when a single dbt project grows to hundreds of models.

The Data Lakehouse Architecture

The data lakehouse combines the low-cost, flexible storage of a data lake with the performance and ACID transaction guarantees of a data warehouse. This is achieved through open table formats that add structure to files stored in object storage.

Apache Iceberg has emerged as the leading open table format, with adoption by Snowflake, Dremio, Spark, Trino, and major cloud providers. It provides schema evolution, hidden partitioning, time travel, and efficient metadata management that enables query engines to skip irrelevant data files.

Delta Lake, created by Databricks, offers similar capabilities with deep integration into the Spark ecosystem. It excels within the Databricks platform but has narrower support from third-party engines compared to Iceberg.

The lakehouse architecture is compelling because it stores data once (in Parquet files on S3 or GCS) and allows multiple compute engines to access it. Your Spark jobs, Trino queries, and Snowflake external tables all read from the same underlying data.

When to adopt the lakehouse pattern:

  • You have large data volumes where warehouse storage costs are prohibitive
  • You need to support both SQL analytics and ML training on the same data
  • You want to avoid vendor lock-in by keeping data in open formats
  • You have streaming workloads that benefit from unified batch and stream processing

For teams running primarily on a cloud warehouse with moderate data volumes, the traditional warehouse approach remains simpler to operate. The lakehouse adds value when scale, cost, or workload diversity demands it.

Orchestration: Airflow, Dagster, and Prefect

Orchestration tools manage the execution of data pipelines: scheduling runs, handling dependencies, retrying failures, and providing observability into pipeline health.

Apache Airflow remains the most widely deployed orchestrator. It is battle-tested, has an enormous ecosystem of operators and providers, and is available as a managed service (MWAA on AWS, Cloud Composer on GCP). Its weaknesses are a complex local development experience and a DAG authoring model that can become unwieldy for large pipelines.

Dagster takes a fundamentally different approach by organizing pipelines around data assets rather than tasks. Each asset declares what data it produces and what it depends on. This asset-centric model provides a clearer mental model, better lineage tracking, and built-in data quality checks. Dagster also offers a significantly better local development experience than Airflow.

from dagster import asset

@asset(deps=["raw_orders"])
def cleaned_orders(context):
    """Clean and validate raw order data."""
    # load_from_warehouse / write_to_warehouse are project-specific helpers,
    # typically backed by an I/O manager or warehouse client.
    raw = load_from_warehouse("raw_orders")
    cleaned = raw.dropna(subset=["order_id", "customer_id"])
    cleaned = cleaned[cleaned["total"] > 0]
    write_to_warehouse(cleaned, "cleaned_orders")
    context.log.info(f"Processed {len(cleaned)} orders")

Prefect positions itself as the simplest orchestrator, with a Python-native API and minimal boilerplate. It is a strong choice for teams that want orchestration without the operational complexity of Airflow.

For new projects, Dagster offers the best combination of developer experience, asset-based modeling, and data quality integration. Airflow remains the pragmatic choice when your team already has Airflow expertise or when managed Airflow services align with your cloud provider.

Data Quality and Real-Time Streaming

Data quality is not a feature to add later. It must be embedded throughout the pipeline.

Implement checks at multiple levels:

  • Schema validation on ingestion: are expected columns present and correctly typed?
  • Row-level tests in dbt: uniqueness, not-null, referential integrity
  • Statistical checks on transformation outputs: row count changes within expected ranges, stable value distributions
  • Freshness monitoring: are tables being updated on schedule?

Tools like Great Expectations, Soda, and dbt's built-in testing provide frameworks for defining and executing these checks. Alert on failures immediately. A dashboard showing stale data without alerting is a dashboard that will be ignored.
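The statistical and freshness checks can be as simple as a few comparisons run after each pipeline stage. A minimal sketch, assuming you can query current and previous row counts and the table's last update time:

```python
from datetime import datetime, timedelta, timezone

def check_row_count_delta(previous: int, current: int, max_change: float = 0.5) -> bool:
    """Fail if the table grew or shrank by more than max_change (e.g. 50%)."""
    if previous == 0:
        return current == 0
    return abs(current - previous) / previous <= max_change

def check_freshness(last_updated: datetime, max_age: timedelta) -> bool:
    """Fail if the table has not been updated within the expected window."""
    return datetime.now(timezone.utc) - last_updated <= max_age

# Example: yesterday's load had 10,000 rows, today's has 4,000 -- a 60% drop fails.
assert not check_row_count_delta(10_000, 4_000)
# A table touched 30 minutes ago passes a 1-hour freshness SLA.
recent = datetime.now(timezone.utc) - timedelta(minutes=30)
assert check_freshness(recent, timedelta(hours=1))
```

Frameworks like Great Expectations and Soda formalize exactly these kinds of checks; the value they add is declarative configuration, history, and alerting on top of the comparisons themselves.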

For real-time streaming, the modern stack centers on Apache Kafka or cloud-native equivalents (Amazon Kinesis, Confluent Cloud) for event transport, combined with stream processing engines for transformation.

Apache Flink has become the leading stream processing engine, offering exactly-once semantics, event-time processing, and the ability to handle both streaming and batch workloads. Kafka Streams is a simpler alternative for applications that process data from and back to Kafka without needing a separate cluster.
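Event-time processing, the core idea behind Flink-style engines, groups records by when they happened rather than when they arrived. A toy tumbling-window aggregation in plain Python (no Flink involved) makes the concept concrete:

```python
from collections import defaultdict

def tumbling_window_counts(events, window_seconds=60):
    """Count events per fixed-size window keyed by event time, not arrival order."""
    counts = defaultdict(int)
    for event_ts, _payload in events:
        # Bucket each event by the window its timestamp falls into.
        window_start = (event_ts // window_seconds) * window_seconds
        counts[window_start] += 1
    return dict(counts)

# Events arrive out of order; event-time windowing still buckets them correctly.
events = [(125, "click"), (10, "click"), (70, "view"), (61, "click")]
print(tumbling_window_counts(events))  # {120: 1, 0: 1, 60: 2}
```

A real engine adds what this sketch omits: watermarks to decide when a window is complete despite late data, and fault-tolerant state so counts survive restarts.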

The decision between batch and streaming should be driven by business requirements. Most analytics workloads are well-served by hourly or daily batch processing. Reserve streaming for use cases where latency genuinely matters: fraud detection, real-time personalization, operational monitoring, and event-driven architectures.

Need help building this?

Our team specializes in turning these ideas into production systems. Let's talk.