Data Mesh Paradigm Shift in Data Platform Architecture (DDD EU 2020 talk summary)

Paradigm shift: a term coined by Thomas Kuhn (The Structure of Scientific Revolutions)

Normal science:

People work within established boundaries, with little critical thinking going on

Anomalies arise:

Things that cannot be explained using current models

Crisis phase:

When anomalies become more and more frequent, doubt arises about the correctness of current models.

This leads to Revolutionary Science

There is a crisis in how we deal with data

So much investment but not many proven results (there are exceptions, mostly among the big players)

The industry as a whole generally fails, specifically on transformational measures (treating data as a product)

Data technology solutions today

The great divide between:

  • operational systems architecture (DevOps, DDD, infrastructure, running the business …)
  • big data analytical architecture (insights, optimizing the business, analytical data, reports, ML …)

Big data analytical architecture

A couple of generations of approaches:

  • Data Warehouse (dates back to the seventies)
  • Data Lake / Hub (2010)
  • Data Lake in cloud

What data warehouse wanted to solve

ETL (extract transform load)

Humans do analyses and create reports and dashboards

SQL-ish interface

Transformed to a single schema (bottleneck)

e.g. BigQuery

Data Lake (solves the bottleneck issue)

ELT

From many sources

Transformed into multiple lakeshore marts

Raw file, API, or downstream DB access

But then analysts spend most of their time cleaning up and preparing data (the bottleneck moved downstream)

e.g. Amazon S3, Azure Data Lake …
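The difference between the two generations can be sketched in a few lines of Python. This is only an illustration under assumptions: the record shapes and the names `RAW`, `to_shared_schema`, `etl`, and `elt` are made up for this summary and do not come from the talk or from any product.

```python
from typing import Callable

# Hypothetical raw records from two source systems (illustrative only).
RAW = [
    {"source": "orders", "amount": "12.50"},
    {"source": "payments", "amount": "7"},
]


def to_shared_schema(record: dict) -> dict:
    """Transform step: coerce a raw record into one agreed schema."""
    return {"source": record["source"], "amount": float(record["amount"])}


def etl(raw: list[dict]) -> list[dict]:
    """Warehouse style: transform first, then load only conforming rows."""
    warehouse = []
    for record in raw:
        warehouse.append(to_shared_schema(record))  # transform before load
    return warehouse


def elt(raw: list[dict]) -> tuple[list[dict], Callable[[], list[dict]]]:
    """Lake style: load raw data untouched, transform later when it is read."""
    lake = list(raw)  # load as-is, no schema enforced

    def mart() -> list[dict]:
        # The clean-up work happens here, on the consuming side.
        return [to_shared_schema(r) for r in lake]

    return lake, mart
```

In the `etl` path the single schema is enforced up front, which is where the warehouse bottleneck sits. In the `elt` path the lake still holds the raw strings, and the same clean-up is deferred to whoever builds the mart.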

Data Lake in the cloud

Rely on central cloud storage

Why is this not working?

Centralization is an issue! (There is always a big data platform in the middle; producers and consumers hang on the outside)

Where are the domain boundaries? (they get lost this way)

We are building a good old gigantic monolith: we squeeze all of the different domains into one.

Instead of nice domain boundaries we have ingest -> process -> serve: no domains, only separation by technology. Tilted on its side, it resembles the good old layered architecture that we are trying so hard to get rid of.

When introducing a new data capability we have to touch all of the layers. Not good!

We want our nice localized vertical slices.

Data platform engineering teams are centralized and mostly detached from the domain itself, and even from the consumers

They spend most of their time fixing things they don’t know much about

They feel disconnected from data scientists

Hiring data engineers is really tough

The situation is similar to the dev / ops separation we had in the past: we need cross-functional teams in the data space

The amount of data is continuously increasing -> on the consuming side there is more and more pressure to make something of it

It seems we have been stuck in this “Normal Science” phase (Thomas Kuhn) for the last 40 years

Where do we go from here?

Data mesh (concept) principles:

  • data
  • DDD distributed architecture
  • self-serve infrastructure as a platform
  • product thinking
  • ecosystem governance

It is a synthesis of the approaches of the most productive clients, combined with what we are already doing in the operational space

DDD distributed architecture

On one side: source-oriented domain data (bounded contexts, e.g. orders, UI interaction, payments etc. …)

On the other side: consumer-oriented data sets, i.e. aggregations and projections for different purposes, e.g. shopping recommendations, customer lifetime value etc. … (these change more frequently)

The data pipeline is now distributed between these two

Domains become first-class concerns

Data pipelines are second class

Domain datasets are immutable
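A minimal Python sketch of the two sides, assuming a hypothetical orders domain. The names `OrderPlaced`, `ORDERS_DATASET`, and `customer_lifetime_value` are invented for illustration; the talk does not prescribe any particular representation.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderPlaced:
    """Source-oriented fact published by a hypothetical orders domain."""
    order_id: str
    customer_id: str
    total: float


# A tuple of frozen records: consumers can read the dataset but not alter it.
ORDERS_DATASET: tuple[OrderPlaced, ...] = (
    OrderPlaced("o-1", "c-1", 40.0),
    OrderPlaced("o-2", "c-2", 15.0),
    OrderPlaced("o-3", "c-1", 25.0),
)


def customer_lifetime_value(orders: tuple[OrderPlaced, ...]) -> dict[str, float]:
    """Consumer-oriented projection derived from the source-oriented data."""
    totals: dict[str, float] = defaultdict(float)
    for order in orders:
        totals[order.customer_id] += order.total
    return dict(totals)
```

The projection can be rebuilt or replaced whenever the consumer's needs change, while the source-oriented records stay untouched, which matches the note that consumer-oriented data sets change more frequently.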

Product thinking

We build datasets as a product

Think about consumers

Where do they want the data? (APIs, buckets …)

Build the product FOR the consumer

What is a data product:

  • shared / discoverable
  • addressable
  • trustworthy (defined and monitored SLOs)
  • self-describing
  • interoperable
  • secure
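One way to picture these attributes is as fields on a descriptor that every data product publishes. The sketch below is an assumption made for this summary, not an API from the talk: `DataProduct`, `Catalog`, `meets_slo`, `can_read`, and the `mesh://` address scheme are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataProduct:
    name: str
    address: str                # addressable: a stable, hypothetical URI
    schema: dict                # self-describing: field name -> type name
    freshness_slo_minutes: int  # trustworthy: a declared, checkable SLO
    allowed_roles: frozenset    # secure: who may read it


class Catalog:
    """Shared / discoverable: one place where products are registered."""

    def __init__(self) -> None:
        self._products: dict[str, DataProduct] = {}

    def register(self, product: DataProduct) -> None:
        self._products[product.address] = product

    def find(self, keyword: str) -> list[DataProduct]:
        return [p for p in self._products.values() if keyword in p.name]


def meets_slo(product: DataProduct, observed_lag_minutes: int) -> bool:
    """Monitoring check: is the observed data lag within the declared SLO?"""
    return observed_lag_minutes <= product.freshness_slo_minutes


def can_read(product: DataProduct, role: str) -> bool:
    """Access check against the roles the product owner has allowed."""
    return role in product.allowed_roles


orders = DataProduct(
    name="orders-placed",
    address="mesh://orders/placed-v1",
    schema={"order_id": "str", "customer_id": "str", "total": "float"},
    freshness_slo_minutes=60,
    allowed_roles=frozenset({"analyst"}),
)
catalog = Catalog()
catalog.register(orders)
```

Interoperability is not modelled here; in practice it would rely on conventions shared across domains, such as common identifiers and schema formats.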

Need cross-functional teams:

  • data engineer
  • software developer
  • infra developer

Domain Data Product Owner: we need these, and the role is different from that of a product owner

Data infrastructure as a platform: build it as thin and as domain-agnostic as possible (like cloud platforms). It’s about self-service and enablement

There is still a good reason to have data lakes / warehouses, but they are partitioned / modularized. It’s all about a shift of perspective
