Schema Registry: The event is the API
The schema registry provides tools for describing your events.
Heads up: there's a newer version of this post. Read it here!
Introduction
Highly scalable, loosely coupled architectures often use an asynchronous event-driven design. In such systems, the contract between the producer and the consumer is the event - the event is the API.
It's important to document the API, and it's important to be able to evolve it. This is often done with a schema format such as Apache Avro, JSON Schema, or Protobuf.
We're pleased to announce the beta release of the schema registry subsystem of Redpanda, which provides an interface for managing schemas.
The schema registry is built into the Redpanda binary: schemas are stored on the same Raft-based storage engine as your data, and the RESTful interface is available on every broker. You get the same high availability as your data, there's nothing new to deploy, and it's available today.
To take advantage of the schema registry in an existing Redpanda installation, make sure you update to the latest version. Otherwise, follow the instructions in the Linux, macOS, Kubernetes, or Docker quick start guides to spin up a new Redpanda instance.
If you want to leave the infrastructure issues to us, sign up for Redpanda Cloud for the simplest way to run Redpanda.
To get down to business, skip ahead to the example.
Overview
A loosely coupled architecture not only reduces dependencies in the code, it also reduces communication overhead between and within teams. By defining the API, or in this case the event, with a schema, disparate teams can start work on the subsystems that produce and consume those events with minimal communication overhead.
Operational complexity
At Redpanda, we like to make things simple. Redpanda is an Apache Kafka®-compatible event streaming platform that eliminates Zookeeper® and the JVM, autotunes itself for modern hardware, and ships in a single binary.
We've built the schema registry directly into Redpanda; there are no new binaries to install, no new services to deploy and maintain, and the default configuration just works.
Schemas are stored in a standard compacted topic, and we use optimistic concurrency control at the topic level so that mutating REST calls can be sent to any broker. There's no need to configure leadership or failover strategies; every broker is symmetric.
Schema
A schema can be used as human readable documentation for an API, to verify data conforms to that API, to generate serialisers for the data, and to evolve the API with predefined levels of compatibility, allowing new versions of services to be rolled out independently.
Some data encodings are somewhat self-describing, but that can make them verbose. Some encodings are extensible. JSON, for example, pairs each property value with a property name. The name isn't part of the information, but it allows new fields to be easily added by the producer and ignored by the consumer.
A schema is an external mechanism to describe the data and its encoding, allowing a reduction in the amount of data transmitted, while keeping the same information. It also allows defaults for new fields, which means that it's possible to decouple the rollout of producers and consumers.
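To make the size argument concrete, here is a minimal sketch that compares a self-describing JSON encoding with a fixed binary layout whose field order and types are agreed out of band. The sample values and the `struct` layout are assumptions chosen for illustration; this is not Avro's actual wire format.

```python
import json
import struct
import uuid

# Hypothetical sensor reading, used only for illustration.
sample = {
    "timestamp": 1625097600000,
    "identifier": "123e4567-e89b-12d3-a456-426614174000",
    "value": 42,
}

# Self-describing: the field names travel with every message.
as_json = json.dumps(sample).encode("utf-8")

# Schema-described: names, order and types live in the shared layout,
# so only the values are sent (8-byte long, 16-byte UUID, 8-byte long).
LAYOUT = ">q16sq"
as_binary = struct.pack(
    LAYOUT,
    sample["timestamp"],
    uuid.UUID(sample["identifier"]).bytes,
    sample["value"],
)


def decode(payload):
    """Rebuild the record from the binary payload using the shared layout."""
    timestamp, raw_uuid, value = struct.unpack(LAYOUT, payload)
    return {
        "timestamp": timestamp,
        "identifier": str(uuid.UUID(bytes=raw_uuid)),
        "value": value,
    }


print(len(as_json), "bytes as JSON,", len(as_binary), "bytes with a shared layout")
```

Both sides need the same layout to make sense of the bytes, which is the role a schema plays.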
Example
Start Redpanda
Let's jump right in and start Redpanda using Docker on Linux:
docker network create redpanda-sr
docker volume create redpanda-sr
docker run \
  --pull=always \
  --name=redpanda-sr \
  --net=redpanda-sr \
  -v "redpanda-sr:/var/lib/redpanda/data" \
  -p 8081:8081 \
  -p 8082:8082 \
  -p 9092:9092 \
  --detach \
  docker.vectorized.io/vectorized/redpanda start \
  --overprovisioned \
  --smp 1 \
  --memory 1G \
  --reserve-memory 0M \
  --node-id 0 \
  --check=false \
  --pandaproxy-addr 0.0.0.0:8082 \
  --advertise-pandaproxy-addr 127.0.0.1:8082 \
  --kafka-addr 0.0.0.0:9092 \
  --advertise-kafka-addr redpanda-sr:9092
Now we're ready to start using the schema registry!
Endpoints are documented with Swagger at http://localhost:8081/v1 or on SwaggerHub.
I'm using jq to prettify and process the JSON responses.
For the Python examples, we'll use the popular requests module (pip install requests).
For the rest of the guide, we'll assume the following for an interactive Python session:
import requests
import json

def pretty(text):
    print(json.dumps(text, indent=2))

base_uri = "http://localhost:8081"
Schemas
The currently supported schema type is AVRO; we plan to support JSON and PROTOBUF.
You can query the schema registry for that:
curl -s \
"http://localhost:8081/schemas/types" \
| jq .
[
"AVRO"
]
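The same query can be sketched in Python. To keep the snippet runnable on its own, it uses the standard library's `urllib` instead of the requests session set up earlier; the helper names `schema_types_request` and `send` are hypothetical.

```python
import json
import urllib.request

BASE_URI = "http://localhost:8081"  # same address as the curl examples


def schema_types_request(base_uri=BASE_URI):
    """Build (but do not send) the GET request for the supported schema types."""
    return urllib.request.Request(f"{base_uri}/schemas/types", method="GET")


def send(request):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the container from the previous step running, `send(schema_types_request())` should return the same list that curl printed.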
Publish a schema
Schemas are registered against a subject, typically in the form {topic}-key or {topic}-value.
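As a small illustration of that naming convention, a hypothetical helper (not part of any client library) could derive the subject from the topic name:

```python
def subject_for(topic, part="value"):
    """Return the conventional subject name for a topic's key or value schema."""
    if part not in ("key", "value"):
        raise ValueError("part must be 'key' or 'value'")
    return f"{topic}-{part}"


print(subject_for("sensor"))         # sensor-value
print(subject_for("sensor", "key"))  # sensor-key
```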
Let's register an example Avro schema, which represents a measurement from a sensor, for the value of the sensor topic.
{
  "type": "record",
  "name": "sensor_sample",
  "fields": [
    {
      "name": "timestamp",
      "type": "long",
      "logicalType": "timestamp-millis"
    },
    {
      "name": "identifier",
      "type": "string",
      "logicalType": "uuid"
    },
    {
      "name": "value",
      "type": "long"
    }
  ]
}
We need to POST the AVRO schema to the /subjects/sensor-value/versions endpoint with a Content-Type of application/vnd.schemaregistry.v1+json:
curl -s \
-X POST \
"http://localhost:8081/subjects/sensor-value/versions" \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{"schema": "{\"type\":\"record\",\"name\":\"sensor_sample\",\"fields\":[{\"name\":\"timestamp\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"},{\"name\":\"identifier\",\"type\":\"string\",\"logicalType\":\"uuid\"},{\"name\":\"value\",\"type\":\"long\"}]}"}' \
| jq
{
"id": 1
}
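Hand-escaping the schema for the request body is error-prone. The sketch below lets `json.dumps` do the escaping by serialising the schema twice, once as the inner string and once as the outer body. It uses the standard library only, and `register_request` is a hypothetical helper name.

```python
import json
import urllib.request

BASE_URI = "http://localhost:8081"
CONTENT_TYPE = "application/vnd.schemaregistry.v1+json"

SENSOR_SCHEMA = {
    "type": "record",
    "name": "sensor_sample",
    "fields": [
        {"name": "timestamp", "type": "long", "logicalType": "timestamp-millis"},
        {"name": "identifier", "type": "string", "logicalType": "uuid"},
        {"name": "value", "type": "long"},
    ],
}


def register_request(subject, schema, base_uri=BASE_URI):
    """Build the POST that registers `schema` under `subject`.

    The registry expects the schema as a JSON string inside the JSON body,
    so the schema is serialised first and then embedded in the outer object.
    """
    body = json.dumps({"schema": json.dumps(schema, separators=(",", ":"))})
    return urllib.request.Request(
        f"{base_uri}/subjects/{subject}/versions",
        data=body.encode("utf-8"),
        headers={"Content-Type": CONTENT_TYPE},
        method="POST",
    )


request = register_request("sensor-value", SENSOR_SCHEMA)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` sends it; against a running registry the response body should carry the schema id, as in the curl output above.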
The id is unique for the schema in the Redpanda cluster.
Retrieve the schema by its ID
We can retrieve the schema directly using its ID:
curl -s \
"http://localhost:8081/schemas/ids/1" \
| jq .
{
"schema": "{\"type\":\"record\",\"name\":\"sensor_sample\",\"fields\":[{\"name\":\"timestamp\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"},{\"name\":\"identifier\",\"type\":\"string\",\"logicalType\":\"uuid\"},{\"name\":\"value\",\"type\":\"long\"}]}"
}
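The schema comes back as an escaped string inside the response, so a client has to decode JSON twice to work with it. A minimal sketch, using a shortened stand-in for the response above (the empty `fields` list is an assumption to keep the example short):

```python
import json


def unwrap_schema(response_text):
    """Decode a /schemas/ids/{id} response into the schema as a Python dict."""
    envelope = json.loads(response_text)   # outer JSON object
    return json.loads(envelope["schema"])  # schema stored as an escaped string


# Shortened, hypothetical response body in the same shape as the one above.
example = r'{"schema": "{\"type\":\"record\",\"name\":\"sensor_sample\",\"fields\":[]}"}'
schema = unwrap_schema(example)
print(schema["name"])
```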
List the subjects
Now that a schema is associated with a subject, let's list the subjects:
curl -s \
"http://localhost:8081/subjects" \
| jq .
[
"sensor-value"
]
Cool! We knew that, but now anyone can discover them.
Retrieve the schema versions for the subject
Schemas associated with subjects are versioned. That's how your API can evolve.
Let's query the versions for the sensor-value subject:
curl -s \
"http://localhost:8081/subjects/sensor-value/versions" \
| jq .
[
1
]
Retrieve a schema for the subject
If we know the subject and the version we want, we can query directly:
curl -s \
"http://localhost:8081/subjects/sensor-value/versions/1" \
| jq .
{
"subject": "sensor-value",
"id": 1,
"version": 1,
"schema": "{\"type\":\"record\",\"name\":\"sensor_sample\",\"fields\":[{\"name\":\"timestamp\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"},{\"name\":\"identifier\",\"type\":\"string\",\"logicalType\":\"uuid\"},{\"name\":\"value\",\"type\":\"long\"}]}"
}
Instead of a specific version, we can ask for the latest:
curl -s \
"http://localhost:8081/subjects/sensor-value/versions/latest" \
| jq .
{
"subject": "sensor-value",
"id": 1,
"version": 1,
"schema": "{\"type\":\"record\",\"name\":\"sensor_sample\",\"fields\":[{\"name\":\"timestamp\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"},{\"name\":\"identifier\",\"type\":\"string\",\"logicalType\":\"uuid\"},{\"name\":\"value\",\"type\":\"long\"}]}"
}
It's also possible to query for just the schema by appending /schema to the query path. That unwraps the escaped schema:
curl -s \
"http://localhost:8081/subjects/sensor-value/versions/latest/schema" \
| jq .
{
  "type": "record",
  "name": "sensor_sample",
  "fields": [
    {
      "name": "timestamp",
      "type": "long",
      "logicalType": "timestamp-millis"
    },
    {
      "name": "identifier",
      "type": "string",
      "logicalType": "uuid"
    },
    {
      "name": "value",
      "type": "long"
    }
  ]
}
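The three query forms in this section (a specific version, the latest version, and the schema alone) differ only in the path. A hypothetical helper that builds those paths might look like this; the subject is percent-encoded in case it contains characters that are not URL-safe.

```python
from urllib.parse import quote

BASE_URI = "http://localhost:8081"


def version_url(subject, version="latest", schema_only=False, base_uri=BASE_URI):
    """Build the URL for one version of a subject.

    version: a version number, or "latest"
    schema_only: append /schema to get the unwrapped schema
    """
    url = f"{base_uri}/subjects/{quote(subject, safe='')}/versions/{version}"
    return url + "/schema" if schema_only else url


print(version_url("sensor-value", 1))
print(version_url("sensor-value"))
print(version_url("sensor-value", schema_only=True))
```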
Compatibility
There are several types of compatibility:
BACKWARD
- Allows consumers of the new version to read the previous version.
FORWARD
- Allows consumers of the previous version to read the new version.
FULL
- Both forward and backward compatibility are ensured.
Each of these checks against the most recent version only. To check against all registered versions for a subject, append _TRANSITIVE (for example, BACKWARD_TRANSITIVE).
NONE
- No compatibility is required.
The default global compatibility is BACKWARD.
Compatibility can be set explicitly for a subject:
curl -s \
-X PUT \
"http://localhost:8081/config/sensor-value" \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{"compatibility": "BACKWARD"}' \
| jq .
{
"compatibility": "BACKWARD"
}
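A Python sketch of the same PUT, again with the standard library only; `set_compatibility_request` is a hypothetical helper name, and the request is built without being sent so it can be inspected first.

```python
import json
import urllib.request

BASE_URI = "http://localhost:8081"
CONTENT_TYPE = "application/vnd.schemaregistry.v1+json"


def set_compatibility_request(subject, level, base_uri=BASE_URI):
    """Build the PUT that sets the compatibility level for one subject."""
    body = json.dumps({"compatibility": level}).encode("utf-8")
    return urllib.request.Request(
        f"{base_uri}/config/{subject}",
        data=body,
        headers={"Content-Type": CONTENT_TYPE},
        method="PUT",
    )


request = set_compatibility_request("sensor-value", "BACKWARD")
print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```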
Evolving a schema
Posting a backward-incompatible change to a subject will fail.
For example, changing the type of the value field from long to int:
{
  "type": "record",
  "name": "sensor_sample",
  "fields": [
    {
      "name": "timestamp",
      "type": "long",
      "logicalType": "timestamp-millis"
    },
    {
      "name": "identifier",
      "type": "string",
      "logicalType": "uuid"
    },
    {
      "name": "value",
      "type": "int"
    }
  ]
}
curl -s \
-X POST \
"http://localhost:8081/subjects/sensor-value/versions" \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{"schema": "{\"type\":\"record\",\"name\":\"sensor_sample\",\"fields\":[{\"name\":\"timestamp\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"},{\"name\":\"identifier\",\"type\":\"string\",\"logicalType\":\"uuid\"},{\"name\":\"value\",\"type\":\"int\"}]}"}' \
| jq
{
"error_code": 409,
"message": "Schema being registered is incompatible with an earlier schema for subject \"{sensor-value}\""
}
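In a Python client built on `urllib`, a rejected registration surfaces as an exception rather than a normal response. The sketch below assumes the registry answers with a non-2xx HTTP status and a JSON error body like the one above; `send_json` and its `opener` parameter are hypothetical, the latter added so the function can be exercised without a live broker.

```python
import json
import urllib.error
import urllib.request


def send_json(request, opener=urllib.request.urlopen):
    """Send a request and return (status, decoded JSON body).

    urllib raises HTTPError for 4xx/5xx responses; the error object still
    carries the response body, which holds error_code and message.
    """
    try:
        with opener(request) as response:
            return response.status, json.loads(response.read().decode("utf-8"))
    except urllib.error.HTTPError as error:
        return error.code, json.loads(error.read().decode("utf-8"))
```

A caller can then branch on the returned status, for example treating 409 as "schema incompatible" and reporting the `message` field to the user.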
A backward-compatible change would be changing it from a long to a double:
{
  "type": "record",
  "name": "sensor_sample",
  "fields": [
    {
      "name": "timestamp",
      "type": "long",
      "logicalType": "timestamp-millis"
    },
    {
      "name": "identifier",
      "type": "string",
      "logicalType": "uuid"
    },
    {
      "name": "value",
      "type": "double"
    }
  ]
}
curl -s \
-X POST \
"http://localhost:8081/subjects/sensor-value/versions" \
-H "Content-Type: application/vnd.schemaregistry.v1+json" \
-d '{"schema": "{\"type\":\"record\",\"name\":\"sensor_sample\",\"fields\":[{\"name\":\"timestamp\",\"type\":\"long\",\"logicalType\":\"timestamp-millis\"},{\"name\":\"identifier\",\"type\":\"string\",\"logicalType\":\"uuid\"},{\"name\":\"value\",\"type\":\"double\"}]}"}' \
| jq
{
"id": 2
}
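The two outcomes above follow from Avro's type promotion rules: a reader may widen a written value, but never narrow it. The sketch below covers primitive types only and is an illustration, not the registry's implementation; a full compatibility check also has to handle records, unions, defaults, and more.

```python
# Avro's primitive promotions: data written as the key type
# can be read by a schema that declares any of the listed types.
PROMOTIONS = {
    "int": {"long", "float", "double"},
    "long": {"float", "double"},
    "float": {"double"},
    "string": {"bytes"},
    "bytes": {"string"},
}


def reader_can_read(writer_type, reader_type):
    """True if a reader declaring reader_type can decode data written as writer_type."""
    return writer_type == reader_type or reader_type in PROMOTIONS.get(writer_type, set())


# BACKWARD compatibility: the new schema is the reader, the old one the writer.
print(reader_can_read("long", "int"))     # False: the rejected change
print(reader_can_read("long", "double"))  # True: the accepted change
```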
Cleanup
Now we can clean up:
docker stop redpanda-sr
docker rm redpanda-sr
docker volume remove redpanda-sr
docker network remove redpanda-sr
Conclusion
We'll be adding more endpoints and more encodings. For an up-to-date list of features and their status see the schema registry features meta-issue on GitHub.
The schema registry is built on the same principles as Redpanda, but has not yet been optimized for performance. We are continuing to work on the schema registry, so make sure you join our Slack community to get updates on the progress!