Imagine entering a library just to find all the books lying on the floor, with no covers, bookshelves, or librarian! By the time you find the title you are looking for, it would be hard to care about its content. Unfortunately, this is an everyday reality when working with data.
Without the correct metadata - the properties that help us understand the data itself - the real work of gaining insights from data can become an unbearable task.
How many times have we seen data analysts using the wrong tables? Or backend teams silently changing schemas, with the damage only discovered a few weeks later? Can we trust the data powering our company’s core KPIs?
Collecting and storing data is no longer a struggle. The main challenge nowadays is making sense of all topics, tables, pipelines, dashboards, and ML models (to name a few!) that teams have been gathering and creating throughout the years.
Data practitioners have now realized that unlocking the value of data requires creating products that go beyond data and focus on people:
- Discovering the right assets for their use cases,
- Collaborating to make the best decisions, and
- Trusting their data.
While these ideas might resonate, actually achieving them means integrating multiple tools that may not naturally talk to each other.
Use case: Integrating multiple data streams in a centralized hub
Data platforms have multiple specialized teams - backend, data engineering, data scientists, and data analysts. Each of them focuses on different aspects of data. However, real value and understanding only come from providing proper context.
As an example:
- A backend team generating events and managing them with Redpanda
- An online database as a snapshot of the real-time data
- A set of ETLs curating the live data into a Data Warehouse
- Dashboards showing end users how the business is going
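To make the first piece of that architecture concrete, here is a minimal sketch of a backend producer emitting JSON events to a Redpanda topic. Since Redpanda is Kafka API-compatible, any Kafka client works; the topic name, event schema, and broker address below are illustrative assumptions, not part of the original setup.

```python
import json
import time

def make_order_event(order_id: str, amount: float) -> dict:
    """Build a minimal order event; this schema is hypothetical."""
    return {
        "order_id": order_id,
        "amount": amount,
        "ts": int(time.time() * 1000),
    }

def send_events(bootstrap_servers: str = "localhost:9092") -> None:
    # Redpanda speaks the Kafka protocol, so a standard Kafka client
    # such as kafka-python (pip install kafka-python) can produce to it.
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers=bootstrap_servers,
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
    producer.send("orders", make_order_event("o-1001", 42.50))
    producer.flush()

# send_events() would connect to a live broker, so it is defined
# here but not executed.
```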
Our goal is to break out of knowledge silos and share as much information as possible to bring joy to any data consumer - at any stage - by showing how all the pieces interact. Breaking the barriers among teams is the first step to a healthier, more profitable, and scalable data platform, and that can only happen through transparency and collaboration.
To achieve that, we will use OpenMetadata to ingest the metadata from these services and provide a single place to discover and collaborate. In this blog, we will focus on integrating Redpanda, a Kafka API-compatible streaming data platform that is easy to use and fast in terms of both latency and throughput.
Setting up OpenMetadata
The easiest way to spin up OpenMetadata on your local computer is by using the metadata CLI, as shown in the Quickstart Guide. In a nutshell, it requires us to run the following steps:

pip3 install --upgrade "openmetadata-ingestion[docker]"
metadata docker --start
This will spin up the OpenMetadata server, a MySQL instance as the metadata store, Elasticsearch for search and discovery, and the OpenMetadata Ingestion container, which will be used to extract metadata from your sources.
Setting up Redpanda
To prepare the source, we will use this repository. Clone it locally and run
docker compose up
The result will be a container with a Redpanda broker and a schema registry, preloaded with some sample topics.
You can follow these steps to configure and deploy the Redpanda metadata ingestion.
Note: In our setup, Redpanda and OpenMetadata run in separate Docker Compose deployments. Therefore, we need to access the sources via the local network, which we can do by configuring host.docker.internal as the hostname.
The OpenMetadata UI will walk us through two main steps:
- First, creating the Messaging Service: the Entity representing the source system that contains the metadata we want to extract. All the Entities - in this case, Topics - that we ingest from this service will belong to it. This lets us efficiently locate the origin of the metadata we manage.
- Creating and deploying the Ingestion Pipelines: these are internally handled by OpenMetadata using an Ingestion Framework - a Python library holding the logic to connect to multiple sources, translate their original metadata into the OpenMetadata standard, and send it to the server using the APIs.
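A workflow definition for this second step might look like the sketch below, expressed as the Python dict that the Ingestion Framework consumes. The service name is one we chose for this demo, and the ports and no-auth setting are assumptions matching a default local quickstart - verify them against the connector docs for your OpenMetadata version.

```python
# Illustrative ingestion workflow configuration for the Kafka-compatible
# connector, pointing at the Redpanda broker and schema registry reached
# through host.docker.internal.
workflow_config = {
    "source": {
        "type": "kafka",
        "serviceName": "local_redpanda",  # name chosen for this demo
        "serviceConnection": {
            "config": {
                "type": "Kafka",
                "bootstrapServers": "host.docker.internal:9092",
                "schemaRegistryURL": "http://host.docker.internal:8081",
            }
        },
        "sourceConfig": {"config": {}},
    },
    "sink": {"type": "metadata-rest", "config": {}},
    "workflowConfig": {
        "openMetadataServerConfig": {
            "hostPort": "http://localhost:8585/api",
            "authProvider": "no-auth",
        }
    },
}

# Running it requires a live OpenMetadata server, so the invocation is
# shown but not executed:
# from metadata.ingestion.api.workflow import Workflow
# workflow = Workflow.create(workflow_config)
# workflow.execute()
# workflow.raise_from_status()
```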
From the UI, we can directly interact with the service and the pipelines it has deployed without managing any other dependencies.
Moreover, engineers can directly import and use the Ingestion Framework package to configure and host their own ingestion processes. On top of that, any operation happening in the UI or Ingestion Framework is open and supported by the server APIs. This means full automation possibilities for any metadata activity, which can be achieved directly via REST or using the OpenMetadata SDKs.
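As a taste of that REST automation, the sketch below fetches a topic entity from the server by its fully qualified name. It assumes a local quickstart at localhost:8585 with auth disabled, and the /api/v1/topics/name/... path convention - treat both as assumptions to check against the API docs for your version.

```python
from urllib.parse import quote

BASE_URL = "http://localhost:8585/api/v1"

def topic_url(fqn: str) -> str:
    """Build the lookup-by-fully-qualified-name URL for a topic."""
    return f"{BASE_URL}/topics/name/{quote(fqn, safe='')}"

def get_topic(fqn: str) -> dict:
    # Requires a running OpenMetadata server; defined but not executed here.
    import requests  # pip install requests
    resp = requests.get(topic_url(fqn))
    resp.raise_for_status()
    return resp.json()
```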
Call to arms
Data is enjoyed the most in good company, and to help all teams and consumers, we need clear, up-to-date descriptions of our assets. If we check the metadata that has been ingested, we will find the following list after navigating to the service page:
If we then access any of the topics:
We can observe properties such as the partitions, replication factor, and schema definition. This is great, but only if you already know what this topic is about. On the other hand, imagine finding this same entity with the following information:
In the updated entity, we have:
- Added a proper description of the asset purpose and properties.
- Assigned the asset to an owner, enabling users to ask questions to the right person or team.
- Flagged the asset as Tier 1 to inform users that this is a business-critical source.
- Added PII tags so that consumers and governance teams understand that the data needs to be treated carefully.
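The same curation could be scripted against the server APIs rather than done in the UI. Below is a hedged sketch that builds a JSON Patch adding a description, an owner, a tier, and a PII tag to a topic; the owner ID is made up, and while Tier.Tier1 and PII.Sensitive follow OpenMetadata's default classification names, the exact field paths should be checked against the entity schema for your version.

```python
def build_curation_patch(description: str, owner_id: str) -> list:
    """Assemble a JSON Patch (RFC 6902) curating a topic entity."""
    return [
        {"op": "add", "path": "/description", "value": description},
        {"op": "add", "path": "/owner",
         "value": {"id": owner_id, "type": "team"}},
        {"op": "add", "path": "/tags/0",
         "value": {"tagFQN": "Tier.Tier1", "labelType": "Manual",
                   "state": "Confirmed", "source": "Classification"}},
        {"op": "add", "path": "/tags/1",
         "value": {"tagFQN": "PII.Sensitive", "labelType": "Manual",
                   "state": "Confirmed", "source": "Classification"}},
    ]

patch = build_curation_patch("Orders placed in the web shop.", "team-uuid")
# Send it with Content-Type: application/json-patch+json as a PATCH
# request against /api/v1/topics/{id} on a live server.
```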
Applying this same curation process to the rest of the data platform will help existing teams share information and collaborate more effectively while reducing the onboarding time for new members.
The data lifecycle
How often has a dashboard broken or an ETL pipeline failed because of planned (or surprise) schema changes? Teams and needs evolve, and data has to evolve with them. Unfortunately, this is a reality that won’t change. However, we can improve how we detect and communicate said changes. The goal is to minimize - and, when possible, prevent - any downtime or time-consuming activities such as backfilling.
Being able to deploy and schedule regular metadata ingestion workflows will take care of flagging differences between versions. Has a new column been added? The Table version gets bumped by 0.1. Deleting columns can be scary, so the version increases by 1.0, just as it does in software versioning.
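The rule above can be sketched as a tiny function: backward-compatible changes such as a new column bump the minor version, while breaking changes such as a deleted column bump the major version. This is an illustrative model of the behavior, not OpenMetadata's internal implementation; it parses versions into integer parts to avoid float rounding.

```python
def bump_version(version: str, breaking: bool) -> str:
    """Bump a 'major.minor' entity version string."""
    major, minor = (int(p) for p in version.split("."))
    if breaking:
        return f"{major + 1}.0"    # e.g. deleting a column: 0.3 -> 1.0
    return f"{major}.{minor + 1}"  # e.g. adding a column:  0.1 -> 0.2
```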
The best part is that the whole version history is stored and explorable. On top of that, all change events can be consumed by setting up a Webhook. Out of the box, OpenMetadata offers MS Teams and Slack integrations to notify teams. Moreover, pushing changes to a Redpanda topic is a possible approach for fine-grained control over how to respond to different events.
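That fine-grained control could look like the sketch below: change events land in a Redpanda topic, and a small router decides how loudly each one is announced. The topic name and the simplified event shape are stand-ins for the real ChangeEvent payload, so treat both as assumptions.

```python
def route_event(event: dict) -> str:
    """Decide where a metadata change event should be announced."""
    entity = event.get("entityType", "unknown")
    change = event.get("eventType", "unknown")
    if change == "entityDeleted":
        return f"page-oncall: {entity} deleted"   # breaking: loudest channel
    if change == "entityUpdated":
        return f"notify-owners: {entity} updated"
    return f"log-only: {entity} {change}"

# Consuming from Redpanda would reuse any Kafka client, e.g.:
# import json
# from kafka import KafkaConsumer
# for msg in KafkaConsumer("om-change-events",
#                          bootstrap_servers="localhost:9092"):
#     print(route_event(json.loads(msg.value)))
```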
In this post, we have explored the modern challenges of data teams, aiming to close the gap between data and people.
By setting up Redpanda together with OpenMetadata, we have prepared a metadata ingestion process and explored how to curate assets’ information, bringing context to where each piece of the architecture sits within the data platform.
Finally, we have also presented how data evolution impacts teams and reduces the value they can generate. With the help of OpenMetadata and Redpanda, engineers can detect changes early and automate business flows based on data evolution.
Take Redpanda for a test drive here. Check out the documentation to understand the nuts and bolts of how the platform works, or read more Redpanda blogs to see the plethora of ways to integrate with Redpanda. To ask Redpanda Solution Architects and Core Engineers questions and interact with other Redpanda users, join the Redpanda Community on Slack.