Data Mesh Paradigm Shift in Data Platform Architecture (DDD EU 2020 talk summary)
Paradigm Shift — a term coined by Thomas Kuhn (The Structure of Scientific Revolutions)
Normal science:
people work within the boundaries of the accepted model, with little critical questioning of the model itself
Anomalies arise:
Things that cannot be explained using current models
Crisis phase:
When anomalies become more and more frequent, doubt grows about the correctness of the current models.
This leads to Revolutionary Science
There is a crisis in how we deal with data
So much investment, but few proven results (there are exceptions, mostly among the big players)
The industry as a whole generally fails, specifically at the transformational measures (treating data as a product)
Data technology solutions today
The great divide between:
- operational systems architecture (devops, ddd, infrastructure, running the business …)
- big data analytical-architecture (insights, optimizing the business, analytical data, reports, ml …)
Big data analytical-architecture
A couple of generations of approaches:
- Data Warehouse (dates back to the seventies)
- Data Lake / Hub (2010)
- Data Lake in cloud
What data warehouse wanted to solve
ETL (extract transform load)
Humans do analyses and create reports and dashboards
SQL-ish interface
Transformed to a single schema (bottleneck)
eg. BigQuery
Data Lake (solves the bottleneck issue)
ELT
From many sources
Transformed into multiple lakeshore marts
Raw file, api, or downstream db access
But then analysts spend most of their time cleaning up and preparing data (the bottleneck moved downstream)
eg. Amazon s3, Azure Data Lake …
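The ETL-vs-ELT contrast above can be sketched in a few lines. This is a hypothetical illustration (all names are mine, no real library is assumed): the warehouse conforms every row to one central schema before loading, while the lake stores rows raw and defers transformation to consumers.

```python
# Hypothetical sketch contrasting ETL (warehouse) and ELT (lake) data flows.
# All names are illustrative, not a real library or API.

def etl(sources, conform):
    """Warehouse style: transform into one central schema *before* loading."""
    warehouse = []
    for extract in sources:
        for row in extract():
            warehouse.append(conform(row))  # the central schema is the bottleneck
    return warehouse

def elt(sources):
    """Lake style: load raw data first (schema-on-read); transformation is
    deferred to consumers / lakeshore marts, where the cleanup bottleneck
    reappears downstream."""
    lake = []
    for extract in sources:
        lake.extend(extract())  # store as-is
    return lake

# Two toy sources with inconsistent shapes:
orders = lambda: [{"id": 1, "total": "9.99"}]
payments = lambda: [{"payment_id": "p-7", "amount": 9.99}]

# The single-schema transform the warehouse forces on every source:
conform = lambda row: {"key": row.get("id") or row["payment_id"],
                       "value": float(row.get("total") or row["amount"])}

warehouse = etl([orders, payments], conform)  # uniform rows
lake = elt([orders, payments])                # raw, heterogeneous rows
```

Note how `conform` must know about every source's quirks: that is exactly the coupling that makes the single schema a bottleneck.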
Data Lake in the cloud
Rely on central cloud storage
Why is this not working?
Centralization is the issue! (There is always a big data platform in the middle, with producers and consumers hanging on the outside)
Where are the domain boundaries? (they get lost this way)
We are building the good old gigantic monolith: all of the different domains squeezed into one.
Instead of clean domain boundaries we get ingest -> process -> serve: no domains, just separation by technology (tilted 90 degrees, it resembles the classic layered architecture we have been trying so hard to move away from)
When introducing new data capability we have to touch all of the layers — not good!
We want our nice localized vertical slices.
Data platform engineering teams are centralized and mostly detached from the domains themselves, and even from the consumers
They spend most of their time fixing things they don’t know much about
Feel disconnected from data scientists
Hiring data engineers is really tough
The situation resembles the dev / ops separation we had in the past; we need cross-functional teams in the data space
The amount of data keeps increasing -> on the consuming side there is more and more pressure to make something of it
Seems like we have been stuck in this “Normal Science” phase (Thomas Kuhn) for the last 40 years
Where do we go from here?
Data mesh (concept) principles:
- domain-driven distributed data architecture
- self-serve infrastructure as a platform
- product thinking
- ecosystem governance
It is a synthesis of the approaches of the most productive clients, plus applying what we are already doing in the operational space
DDD distributed architecture
On one side: source-oriented domain data (eg. orders, ui interaction, payments etc … — bounded contexts)
On the other side: consumer-oriented data sets, aggregations and projections for different purposes, eg. shopping recommendations, customer lifetime value etc … (these change more frequently)
Data pipeline is now distributed between these two
Domains become first-class concerns
Data pipelines become second-class concerns
Domain datasets are immutable
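One way to picture the two sides, as a hypothetical sketch (class and function names are mine, not from the talk): source-oriented domain data as immutable facts owned by a bounded context, with consumer-oriented data sets derived from them as projections.

```python
from dataclasses import dataclass

# Hypothetical sketch: the "orders" domain publishes immutable facts;
# a consumer-oriented dataset (customer lifetime value) is a projection
# derived from those facts, and it changes more frequently than they do.

@dataclass(frozen=True)  # frozen = the domain dataset is immutable
class OrderPlaced:
    order_id: int
    customer_id: str
    amount: float

def customer_lifetime_value(events):
    """Consumer-oriented projection: total spend per customer."""
    clv = {}
    for e in events:
        clv[e.customer_id] = clv.get(e.customer_id, 0.0) + e.amount
    return clv

events = [OrderPlaced(1, "alice", 30.0),
          OrderPlaced(2, "alice", 12.5),
          OrderPlaced(3, "bob", 8.0)]
ltv = customer_lifetime_value(events)
```

The pipeline is now split across the boundary: producing `OrderPlaced` facts belongs to the orders domain; computing `customer_lifetime_value` belongs to the consuming domain.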
Product thinking
We build datasets as a product
Think about consumers
Where do they want the data (apis, buckets …)
Build the product FOR the consumer
What is a data product:
- shared / discoverable
- addressable
- trustworthy (defined and monitored SLOs)
- self-describing
- inter-operable
- secure
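The attributes above could be captured in a product descriptor that the owning team publishes alongside the data. A hypothetical sketch (all field names and values are mine, not a standard):

```python
from dataclasses import dataclass

# Hypothetical sketch of a data product descriptor covering the
# attributes listed above. Names and values are illustrative only.

@dataclass
class DataProduct:
    name: str            # discoverable: registered under this name in a catalog
    address: str         # addressable: a stable URI consumers can rely on
    schema: dict         # self-describing: documented shape of the data
    slo: dict            # trustworthy: defined, monitorable SLOs
    formats: list        # inter-operable: standard access formats
    access_policy: str   # secure: who may read it

    def meets_slo(self, observed_freshness_minutes: float) -> bool:
        """Check an observed freshness measurement against the declared SLO."""
        return observed_freshness_minutes <= self.slo["max_freshness_minutes"]

orders = DataProduct(
    name="orders.order_placed",
    address="s3://data-products/orders/order_placed/",
    schema={"order_id": "int", "customer_id": "str", "amount": "float"},
    slo={"max_freshness_minutes": 60},
    formats=["parquet", "json"],
    access_policy="internal-analytics",
)
```

The point of `meets_slo` is that trustworthiness is not a promise but something monitored against a declared target.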
Need cross-functional teams:
data engineer
software developer
infra developer
Domain Data Product Owners — we need these; they are different from regular product owners
Data infrastructure as a platform: build it as thin and as domain-agnostic as possible (like cloud platforms); it is about self-service and enablement
There is still a good reason to have data lakes / warehouses, but they become partitioned / modularized; it is all about a shift of perspective
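The self-serve idea can be sketched as a thin, domain-agnostic provisioning API that domain teams call themselves, with no central data team in the loop. This is purely hypothetical (class, method, and naming scheme are mine):

```python
# Hypothetical sketch: a thin, domain-agnostic platform that lets a domain
# team self-serve storage and catalog registration for a new data product.

class DataPlatform:
    """Knows nothing about any domain; only provisions generic capabilities."""

    def __init__(self):
        self.catalog = {}  # product name -> address (makes products discoverable)

    def provision(self, product_name: str, storage: str = "object-store") -> str:
        """Allocate an address for a product and register it in the catalog."""
        address = f"{storage}://{product_name.replace('.', '/')}/"
        self.catalog[product_name] = address
        return address

platform = DataPlatform()
addr = platform.provision("payments.transactions")
```

Because the platform stays domain-agnostic, adding a new data capability touches only the owning domain team, not every layer of a central pipeline.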