In traditional data platforms, data engineers often dump production databases periodically, then load them to data warehouses to analyze. However, this approach suffers from a lack of real-time data capture, which is sometimes critical for software companies.
Let's say you're analyzing how users book hotel rooms in a booking platform using a conventional data platform, and there's a production issue that is preventing users from booking a hotel room. In a conventional data platform, this problem will only emerge after an extended period of time, when the analyzed data updates and shows a noticeable and unexplained decline in booking numbers.
This is why change data capture (CDC) is so important. Enabling CDC in the database, instead of periodically dumping the data, you can capture the data changes immediately. Combining this with Redpanda and visualization tools gives data engineers much better visibility of the current state of the data in real time. The following are some example use cases of applying CDC for software projects:
- Auditing databases : CDC can be used to record what data has been changed and who changed it and then save them to a persistent database for auditing purposes if needed.
- Invalidating cache: The caching mechanism is used to boost performance of the application and is often applied for APIs that retrieve data from a database. CDC can help to invalidate old caching data and replace it with the updated data, which ensures users always get the updated data while still maintaining appropriate application performance.
- Real time data for user behavior: CDC could also be used to gather real-time data about new product purchases in an e-commerce store, allowing you to see user behavior and trends as they happen.
CDC is also a key component of event-driven architectures, which enable a bunch of loosely coupled services to communicate with each other by sending events over a common data streaming platform, such as Redpanda. New data captures generated by the database are sent immediately to the streaming platform and the services listening for those events can take action.
This article demonstrates how CDC works in SQL Server and how to implement Redpanda with Debezium and SQL Server for CDC.
How CDC works in SQL Server
SQL Server has supported CDC for its enterprise product since its 2008 edition and for all editions since 2016. SQL Server uses SQL Server Agent to track create, update, and delete executions on the database tables.
To use CDC in SQL Server, you'll need to enable it first, using a specific SQL query and specifying the database where you want to capture changes. SQL Server will create additional data tables in order to store the changed logs.
Depending on the action (create, update, or delete query), SQL Server will capture the changes differently. With the update query, two records will be created to store the data before and after the change. With insert and delete statements, SQL Server will create one record for the data that was inserted or deleted.
To understand how CDC can be used, let's expand on the use cases mentioned in the introduction.
Use case: Capturing new purchases of products
A company could integrate a CDC pipeline into a downstream ETL application so that data changes are continuously collected from SQL Server when customers purchase their products.
When a user buys a new product, a new row is added in the order table of SQL Server. As a result, by applying CDC features, the ETL application will be automatically updated with new values of the order table without the need to periodically dump SQL Server to check whether there is any update or not. From this, business owners will be able to see the user behavior and make business plans accordingly.
Use case: Capturing availability
For capturing availability, let's take the example of a booking platform that lists available rooms in hotels for users. When they search for available rooms, hotels with less than five free rooms will present a warning that "There are only x rooms left."
With the use of CDC in SQL Server, when there are less than five rooms left, an event is sent from the room management service to the web frontend service with information about the available rooms that the hotel has.
The frontend service already applied websocket technology to continuously listen to the room management service for any updated events. When the frontend service receives the event message, it will show a notification to alert users about which hotels are running out of rooms.
CDC workflow with SQL
Building on the example of capturing availability, the workflow would look like the following: After enabling CDC for the room table, a corresponding change table for it will be automatically created. When there is an update to the room table, the SQL agent will read the transaction logs and insert these changes to the change table. These changes can be retrieved to store in different storage using SQL statements. The retention policy for the change table is configurable. The default value is three days and can be set up to one hundred years.
Implementing Redpanda with Debezium and SQL Server for CDC
Manually querying the change table in SQL Server is not practical for any real software projects. You should implement a data pipeline to capture changes as a sequence of streams to visualize data changes or sending events to the service that needs action. Let's see how you can do it with the help of Redpanda and Debezium.
For the demo and tutorial that follows, you can access the full code anytime in this GitHub repository.
The above diagram demonstrates the integration amongst SQL server, Debezium, and Redpanda in real world use cases. Web API sends requests to SQL server to update data in the table. The data change is captured and is sent to Debezium. Debezium then forwards the captured data to Redpanda. Finally, Redpanda can use that captured data to send an event to a web service in an event-driven architecture or the captured data can be visualized using Grafana.
Redpanda is a modern streaming platform that is API-compatible with Apache KafkaⓇ and can be installed as a standalone application without the need for ZooKeeper and Java virtual machines.
Debezium is an open source platform for change data capture, with support for a number of databases such as SQL Server, MySQL, and MongoDB. Debezium continuously checks for change data captures. Even if the database is down, it still checks for changes. So, new changes can still be captured successfully when the database goes back online, avoiding the loss of important changes.
Using the same example from the previous sections, capturing hotel room availability, this tutorial demonstrates how you can use Redpanda with Debezium and SQL Server for CDC.
Prerequisites
Make sure you already have Docker and Docker Compose installed on your system. Docker and Docker Compose help launch containers for SQL Server, Debezium, and Redpanda, which will be explained in detail in the next section.
Integrate SQL Server, Debezium, and Redpanda
You should note that the following steps may differ depending on the operating system you're working on. This tutorial used the Ubuntu 20.04 machine.
Create a new directory for storing the configuration and code
From your home directory, create a new directory to store configurations for SQL Server, Debezium, and Redpanda:
mkdir integrate-sql-server-debezium-redpanda
Navigate to that folder:
cd integrate-sql-server-debezium-redpanda
Configure files for Docker Compose
Create a Docker Compose file that includes the needed information to launch all the new services for SQL Server, Debezium, and Redpanda. First, create the file docker-compose.yaml using the following code:
nano docker-compose.yaml
Copy the following content to the docker-compose.yaml file:
version: '2'
services:
redpanda:
image: vectorized/redpanda:v22.2.2
ports:
- "9092:9092"
- "29092:29092"
command:
- redpanda
- start
- --overprovisioned
- --node-id
- "0"
- --kafka-addr
- PLAINTEXT://0.0.0.0:29092,OUTSIDE://0.0.0.0:9092
- --advertise-kafka-addr
- PLAINTEXT://redpanda:29092,OUTSIDE://redpanda:9092
- --check=false
sqlserver:
image: mcr.microsoft.com/mssql/server:2019-latest
ports:
- 1433:1433
environment:
- ACCEPT_EULA=Y
- MSSQL_PID=Standard
- SA_PASSWORD=Password!
- MSSQL_AGENT_ENABLED=true
connect:
image: quay.io/debezium/connect:1.9
ports:
- 8083:8083
links:
- redpanda
- sqlserver
environment:
- BOOTSTRAP_SERVERS=redpanda:9092
- GROUP_ID=1
- CONFIG_STORAGE_TOPIC=redpanda_connect_configs
- OFFSET_STORAGE_TOPIC=redpanda_connect_offsets
- STATUS_STORAGE_TOPIC=redpanda_connect_statuses
Explanation for Redpanda
Let's go into the details of the docker-compose.yaml file in order to understand why we need it.
For the Redpanda service, you'll use the Docker image version v21.11.15. 9092:9092
and 29092:29092
are used to map Docker container ports to the host machine ports.
The running commands for Redpanda specify the parameters for the overprovisioned
, smp
, memory
, reserve-memory
, node-id
, kafka-addr
, advertise-kafka-addr
, and check
values.
For details on how to set up Redpanda, please see the official documentation](https://docs.redpanda.com/docs/quickstart/quick-start-docker/) for the most up-to-date installation instructions.
Configuring SQL Server
You're using SQL Server with the Docker image version mcr.microsoft.com/mssql/server:2019-latest
. The configuration for the ports is 1433:1433
, to map Docker's port to the host machine's port. You will also set up the environment variables for SQL Server. ACCEPT_EULA=Y
means that you accept the user agreement. MSSQL_PID=Standard
is for using a free standard version of SQL Server. Use SA_PASSWORD=Password!
to set a strong password for the root account. MSSQL_AGENT_ENABLED=true
enables SQL Server Agent, which will be used to capture changes in SQL Server.
Configuring Debezium
The final service you'll set up is debezium/connect. This is used to connect Redpanda and SQL Server with Debezium so that data capture changes will be sent to debezium/connect, then forwarded to Redpanda. quay.io/debezium/connect:1.9
is the version of the Docker image you'll be using for debezium/connect. 8083:8083
maps the ports. links: redpanda, sqlserver
connects Debezium with Redpanda and debezium/connect with SQL Server. The following environment variables are needed: BOOTSTRAP_SERVERS=redpanda:9092
specifies the bootstrap server value for Redpanda; GROUP_ID=1
sets the group_id to identify the Redpanda clusters and workers you're working on; and CONFIG_STORAGE_TOPIC
and OFFSET_STORAGE_TOPIC
and STATUS_STORAGE_TOPIC
are required values to connect with Redpanda.
Launch the Docker services
That's about it for the setup; let's launch these three services using the following Docker Compose command:
docker-compose up
It will take a few minutes to launch all the services. You should be able to see the following output after running the command:
connect_1 | 2022-08-17 23:06:49,488 INFO || Started o.e.j.s.ServletContextHandler@32ab408e{/,null,AVAILABLE} [org.eclipse.jetty.server.handler.ContextHandler]
connect_1 | 2022-08-17 23:06:49,489 INFO || REST resources initialized; server is started and ready to handle requests [org.apache.kafka.connect.runtime.rest.RestServer]
connect_1 | 2022-08-17 23:06:49,489 INFO || Kafka Connect started [org.apache.kafka.connect.runtime.Connect]
connect_1 | 2022-08-17 23:06:49,516 INFO || [Worker clientId=connect-1, groupId=1] Session key updated [org.apache.kafka.connect.runtime.distributed.DistributedHerder]
The three containers should be up and running now. Run the following command to list all the running containers in Docker to verify whether all the three services are running:
docker ps
Entering the above code should produce an output similar to the following:
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
8ef9201f0044 quay.io/debezium/connect:1.9 "/docker-entrypoint.…" 25 minutes ago Up 25 minutes 0.0.0.0:8083->8083/tcp, :::8083->8083/tcp, 9092/tcp integrate-sql-server-debezium-redpanda_connect_1
329eea59f8f3 vectorized/redpanda:v21.11.15 "/entrypoint.sh redp…" 25 minutes ago Up 25 minutes 8081-8082/tcp, 0.0.0.0:9092->9092/tcp, :::9092->9092/tcp, 9644/tcp, 0.0.0.0:29092->29092/tcp, :::29092->29092/tcp integrate-sql-server-debezium-redpanda_redpanda_1
d28cb7e32dbf mcr.microsoft.com/mssql/server:2019-latest "/opt/mssql/bin/perm…" 25 minutes ago Up 25 minutes 0.0.0.0:1433->1433/tcp, :::1433->1433/tcp integrate-sql-server-debezium-redpanda_sqlserver_1
Now that all the three services are up and running, let's move on to setting up CDC.
Using CDC to capture hotel room availability
Let's try capturing changes for the hotel room availability use case.
Create SQL generator file
First create the SQL file to generate data for your SQL Server database:
nano sql-generator.sql
Copy the following content to the sql-generator.sql file:
-- Create the sqlCDC database for demonstrating capturing data changes
CREATE DATABASE sqlCDC;
GO
USE sqlCDC;
EXEC sys.sp_cdc_enable_db;
-- Create and populate rooms
CREATE TABLE rooms (
id INTEGER IDENTITY(101,1) NOT NULL PRIMARY KEY,
hotel_id VARCHAR(255) NOT NULL,
name VARCHAR(255) NOT NULL,
description VARCHAR(512),
total_rooms INTEGER,
used_rooms INTEGER,
left_rooms INTEGER
);
INSERT INTO rooms(hotel_id,name,description,total_rooms,used_rooms,left_rooms)
VALUES ('hiltonhn','king-size','A room of king size',20,5,15);
INSERT INTO rooms(hotel_id,name,description,total_rooms,used_rooms,left_rooms)
VALUES ('hiltonhn','queen-size','A room of queen size',40,10,30);
INSERT INTO rooms(hotel_id,name,description,total_rooms,used_rooms,left_rooms)
VALUES ('pullmanhn','queen-size','A room of queen size',10,2,8);
INSERT INTO rooms(hotel_id,name,description,total_rooms,used_rooms,left_rooms)
VALUES ('pullmanhn','single room','A room for single',6,1,5);
EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'rooms', @role_name = NULL, @supports_net_changes = 0;
GO
The above code creates a new database named sqlCDC
using CREATE DATABASE sqlCDC;
. Please note, to execute the SQL query, you need to have GO
command after the SQL query. To enable change capture, you need to run EXEC sys.sp_cdc_enable_db;
.
Additionally, to capture changes for the table rooms, the following command is also needed:
EXEC sys.sp_cdc_enable_table @source_schema = 'dbo', @source_name = 'rooms', @role_name = NULL, @supports_net_changes = 0;
Generate data to SQL server database
Let's apply these updates to the SQL Server container.
Execute the queries in the sql-generator.sql file in the sql-server container:
cat sql-generator.sql | docker exec -i integrate-sql-server-debezium-redpanda_sqlserver_1 bash -c '/opt/mssql-tools/bin/sqlcmd -U sa -P Password!'
You should see an output similar to the following:
Changed database context to 'sqlCDC'.
(1 rows affected)
(1 rows affected)
(1 rows affected)
(1 rows affected)
Job 'cdc.sqlCDC_capture' started successfully.
Job 'cdc.sqlCDC_cleanup' started successfully.
(1 rows affected)
Now check the topic list in the Redpanda container. First you need to go inside the docker container for Redpanda using the following command:
docker exec -it integrate-sql-server-debezium-redpanda_redpanda_1 /bin/bash
Once you're inside the container, run the following:
rpk topic list
You should see an output similar to the following:
NAME PARTITIONS REPLICAS
redpanda_connect_configs 1 1
redpanda_connect_offsets 25 1
redpanda_connect_statuses 5 1
Expose Debezium API to capture changes
For Debezium to capture changes from SQL Server and forward them to Redpanda, you need to use an exposed Debezium API to apply needed configuration. Let's create the register-sqlserver.json file first:
nano register-sqlserver.json
Copy the following content to the file:
{
"name": "inventory-connector",
"config": {
"connector.class" : "io.debezium.connector.sqlserver.SqlServerConnector",
"tasks.max" : "1",
"database.server.name" : "server1",
"database.hostname" : "sqlserver",
"database.port" : "1433",
"database.user" : "sa",
"database.password" : "Password!",
"database.dbname" : "sqlCDC",
"database.history.kafka.bootstrap.servers" : "redpanda:9092",
"database.history.kafka.topic": "schema-changes.inventory"
}
}
Here you set the database password and database name, along with other information, such as bootstrap server and Kafka topic.
Let's apply these configurations using curl to call the exposed Debezium REST API:
curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" http://localhost:8083/connectors/ -d @register-sqlserver.json
Check the topics in Redpanda after exposing Debezium REST API
Go back to the Redpanda container and check the topic list again:
rpk topic list
Now you should see more topics, including topics about the rooms
table:
NAME PARTITIONS REPLICAS
redpanda_connect_configs 1 1
redpanda_connect_offsets 25 1
redpanda_connect_statuses 5 1
schema-changes.inventory 1 1
server1 1 1
server1.dbo.rooms 1 1
Consume the events in the Redpanda topic
You can use the following command to check the changes in the room table in SQL server by consuming the topic server1.dbo.rooms
:
rpk topic consume server1.dbo.rooms
You should see the following output:
{
"topic": "server1.dbo.rooms",
"key": "{"schema":{"type":"struct","fields":[{"type":"int32","optional":false,"field":"id"}],"optional":false,"name":"server1.dbo.rooms.Key"},"payload":{"id":104}}",
"value": "{"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int32","optional":false,"field":"id"},{"type":"string","optional":false,"field":"hotel_id"},{"type":"string","optional":false,"field":"name"},{"type":"string","optional":true,"field":"description"},{"type":"int32","optional":true,"field":"total_rooms"},{"type":"int32","optional":true,"field":"used_rooms"},{"type":"int32","optional":true,"field":"left_rooms"}],"optional":true,"name":"server1.dbo.rooms.Value","field":"before"},{"type":"struct","fields":[{"type":"int32","optional":false,"field":"id"},{"type":"string","optional":false,"field":"hotel_id"},{"type":"string","optional":false,"field":"name"},{"type":"string","optional":true,"field":"description"},{"type":"int32","optional":true,"field":"total_rooms"},{"type":"int32","optional":true,"field":"used_rooms"},{"type":"int32","optional":true,"field":"left_rooms"}],"optional":true,"name":"server1.dbo.rooms.Value","field":"after"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"version"},{"type":"string","optional":false,"field":"connector"},{"type":"string","optional":false,"field":"name"},{"type":"int64","optional":false,"field":"ts_ms"},{"type":"string","optional":true,"name":"io.debezium.data.Enum","version":1,"parameters":{"allowed":"true,last,false,incremental"},"default":"false","field":"snapshot"},{"type":"string","optional":false,"field":"db"},{"type":"string","optional":true,"field":"sequence"},{"type":"string","optional":false,"field":"schema"},{"type":"string","optional":false,"field":"table"},{"type":"string","optional":true,"field":"change_lsn"},{"type":"string","optional":true,"field":"commit_lsn"},{"type":"int64","optional":true,"field":"event_serial_no"}],"optional":false,"name":"io.debezium.connector.sqlserver.Source","field":"source"},{"type":"string","optional":false,"field":"op"},{"type":"int64","optional":true,"field":"ts_ms"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"id"},{"type":"int64","optional":false,"field":"total_order"},{"type":"int64","optional":false,"field":"data_collection_order"}],"optional":true,"field":"transaction"}],"optional":false,"name":"server1.dbo.rooms.Envelope"},"payload":{"before":null,"after":{"id":104,"hotel_id":"pullmanhn","name":"single room","description":"A room for single","total_rooms":6,"used_rooms":1,"left_rooms":5},"source":{"version":"1.9.5.Final","connector":"sqlserver","name":"server1","ts_ms":1660780662026,"snapshot":"last","db":"sqlCDC","sequence":null,"schema":"dbo","table":"rooms","change_lsn":null,"commit_lsn":"00000025:000009f8:0001","event_serial_no":null},"op":"r","ts_ms":1660780662026,"transaction":null}}",
"timestamp": 1660780662683,
"partition": 0,
"offset": 3
}
From the above captured data, you can see the changes in the room table, with information about before and after changes:
"payload":{"before":null,"after":{"id":104,"hotel_id":"pullmanhn","name":"single room","description":"A room for single","total_rooms":6,"used_rooms":1,"left_rooms":5}
Update a row to test that CDC is working
Let's try updating a row in the table rooms
to check whether you can see the changes from the Redpanda container.
Go to the SQL Server container:
docker exec -it redpanda-debezium-code_sqlserver_1 /bin/bash
Then run the following command to log in to SQL Server inside the container:
/opt/mssql-tools/bin/sqlcmd -S localhost -U SA -P 'Password!'
Use the following code to update the room table:
USE sqlCDC;
GO
UPDATE rooms set used_rooms=2, left_rooms=4 where id=104;
GO
Return to the Redpanda container, and you'll see that the change has been captured:
{
"topic": "server1.dbo.rooms",
"key": "{"schema":{"type":"struct","fields":[{"type":"int32","optional":false,"field":"id"}],"optional":false,"name":"server1.dbo.rooms.Key"},"payload":{"id":104}}",
"value": "{"schema":{"type":"struct","fields":[{"type":"struct","fields":[{"type":"int32","optional":false,"field":"id"},{"type":"string","optional":false,"field":"hotel_id"},{"type":"string","optional":false,"field":"name"},{"type":"string","optional":true,"field":"description"},{"type":"int32","optional":true,"field":"total_rooms"},{"type":"int32","optional":true,"field":"used_rooms"},{"type":"int32","optional":true,"field":"left_rooms"}],"optional":true,"name":"server1.dbo.rooms.Value","field":"before"},{"type":"struct","fields":[{"type":"int32","optional":false,"field":"id"},{"type":"string","optional":false,"field":"hotel_id"},{"type":"string","optional":false,"field":"name"},{"type":"string","optional":true,"field":"description"},{"type":"int32","optional":true,"field":"total_rooms"},{"type":"int32","optional":true,"field":"used_rooms"},{"type":"int32","optional":true,"field":"left_rooms"}],"optional":true,"name":"server1.dbo.rooms.Value","field":"after"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"version"},{"type":"string","optional":false,"field":"connector"},{"type":"string","optional":false,"field":"name"},{"type":"int64","optional":false,"field":"ts_ms"},{"type":"string","optional":true,"name":"io.debezium.data.Enum","version":1,"parameters":{"allowed":"true,last,false,incremental"},"default":"false","field":"snapshot"},{"type":"string","optional":false,"field":"db"},{"type":"string","optional":true,"field":"sequence"},{"type":"string","optional":false,"field":"schema"},{"type":"string","optional":false,"field":"table"},{"type":"string","optional":true,"field":"change_lsn"},{"type":"string","optional":true,"field":"commit_lsn"},{"type":"int64","optional":true,"field":"event_serial_no"}],"optional":false,"name":"io.debezium.connector.sqlserver.Source","field":"source"},{"type":"string","optional":false,"field":"op"},{"type":"int64","optional":true,"field":"ts_ms"},{"type":"struct","fields":[{"type":"string","optional":false,"field":"id"},{"type":"int64","optional":false,"field":"total_order"},{"type":"int64","optional":false,"field":"data_collection_order"}],"optional":true,"field":"transaction"}],"optional":false,"name":"server1.dbo.rooms.Envelope"},"payload":{"before":{"id":104,"hotel_id":"pullmanhn","name":"single room","description":"A room for single","total_rooms":6,"used_rooms":1,"left_rooms":5},"after":{"id":104,"hotel_id":"pullmanhn","name":"single room","description":"A room for single","total_rooms":6,"used_rooms":2,"left_rooms":4},"source":{"version":"1.9.5.Final","connector":"sqlserver","name":"server1","ts_ms":1660781847687,"snapshot":"false","db":"sqlCDC","sequence":null,"schema":"dbo","table":"rooms","change_lsn":"00000026:00000400:0002","commit_lsn":"00000026:00000400:0003","event_serial_no":2},"op":"u","ts_ms":1660781850390,"transaction":null}}",
"timestamp": 1660781850561,
"partition": 0,
"offset": 4
}
You can see that there is a before and after value for room
with id 104
:
"payload":{"before":{"id":104,"hotel_id":"pullmanhn","name":"single room","description":"A room for single","total_rooms":6,"used_rooms":1,"left_rooms":5},"after":{"id":104,"hotel_id":"pullmanhn","name":"single room","description":"A room for single","total_rooms":6,"used_rooms":2,"left_rooms":4}
And that's how changes in SQL Server are captured in real time using Debezium and Redpanda.
Conclusion
By applying CDC, data engineers can continuously monitor and visualize data streams in real time so that they can provide insights to business owners about how customers are interacting with the product.
Moreover, for backend engineers or software engineers in general, CDC also helps to efficiently apply the event-driven architecture. With every data change, software engineers can implement a mechanism to send a new event to the receiver service so that this service can execute an appropriate action according to which data has been changed.
As a reminder, you can access the code used in this tutorial anytime in this GitHub repo.
Now that you know how to integrate SQL Server with Debezium and Redpanda to capture data changes and store them, you might be interested in learning how to integrate other storage solutions. Redpanda supports sending data from many different storage options, including PostgreSQL, Snowflake, and Amazon S3 and Google Cloud.
Take Redpanda for a test drive here. Check out our documentation to understand the nuts and bolts of how the platform works, or read more of our blogs to see the plethora of ways to integrate with Redpanda. To ask our Solution Architects and Core Engineers questions and interact with other Redpanda users, join the Redpanda Community on Slack.
Let's keep in touch
Subscribe and never miss another blog post, announcement, or community event. We hate spam and will never sell your contact information.