14 May 2024, by Christian Del Monte
Keep data changes under control with Change Data Capture
In a distributed software system, data changes always pose a challenge. How is it possible to track the change history of data located in one part of the system in order to synchronise connected data stores in other subsystems? Change Data Capture (CDC) offers an answer to this question.
CDC is a technique that captures all data changes in a database, collects them and prepares them for transfer and replication to other systems, either as a batch process or as a stream. An illustrative example is a system with a central database and an associated search index: CDC keeps the two in sync by capturing and logging every change and applying the changes to the search index in the same order in which they occurred, so that both stores end up in the same state.
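As a minimal illustration of this ordering guarantee, the following Java sketch replays a hypothetical change log against an in-memory stand-in for a search index. The ChangeEvent record, the operations and the map-based "index" are invented for the example and do not correspond to any particular CDC product.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CdcReplayDemo {

    // Hypothetical change event: operation, key and new value of a record.
    record ChangeEvent(String op, String key, String value) {}

    public static void main(String[] args) {
        // Changes captured from the primary database, in commit order.
        List<ChangeEvent> changeLog = List.of(
                new ChangeEvent("INSERT", "policy-1", "draft"),
                new ChangeEvent("UPDATE", "policy-1", "active"),
                new ChangeEvent("INSERT", "policy-2", "draft"),
                new ChangeEvent("DELETE", "policy-2", null));

        // Stand-in for the search index: replaying the log in order
        // yields the same final state as the source database.
        Map<String, String> searchIndex = new LinkedHashMap<>();
        for (ChangeEvent event : changeLog) {
            switch (event.op()) {
                case "INSERT", "UPDATE" -> searchIndex.put(event.key(), event.value());
                case "DELETE" -> searchIndex.remove(event.key());
            }
        }
        System.out.println(searchIndex); // {policy-1=active}
    }
}
```

Replaying the same log in the same order always produces the same final state, which is exactly the property CDC relies on when synchronising downstream stores.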
CDC and Informatica
CDC has become increasingly important over the last ten years, which is why many software manufacturers have developed dedicated CDC solutions. With products such as PowerExchange and PowerCenter, Informatica offers a platform for the capture, transformation and integration of data. PowerExchange acts as a capture agent that identifies and captures data changes using techniques such as reading database transaction logs, which minimises the impact on the source system's performance. PowerCenter processes this data through transformations and mappings before forwarding it to the target systems. These systems can optionally be integrated with streaming platforms such as Apache Kafka. An example would be an insurance company using Informatica PowerExchange and PowerCenter to capture data changes in a PostgreSQL database and send them to Kafka:
- 1. Data capture with PowerExchange: PowerExchange connects to the PostgreSQL database and monitors transaction logs to capture changes such as new policies and payments in real time.
- 2. Data processing in PowerCenter: Captured changes are transmitted to PowerCenter, where they are validated, normalised and enriched with further information. PowerCenter also implements specific business rules, such as contract status updates.
- 3. Publishing to Kafka: The transformed data is sent as structured messages to Kafka, which serves as the central platform for distributing this data to various internal consumers.
- 4. Data consumption: Various systems subscribe to the relevant Kafka topics to consume the data and update their own systems. For example, the claims management system processes new claims, while analytics dashboards update performance metrics in real time (a minimal consumer sketch follows this list).
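As a sketch of step 4, the following snippet shows how a downstream system such as the claims management application might consume the published change events using the standard Kafka consumer API. The broker address, consumer group and the topic name policy-changes are placeholder assumptions; the actual topics would be defined by the Informatica pipeline.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class ClaimsSystemConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // assumed broker address
        props.put("group.id", "claims-management");        // one group per consuming system
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("policy-changes"));  // hypothetical topic name
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // In a real system this would update the claims database.
                    System.out.printf("key=%s value=%s%n", record.key(), record.value());
                }
            }
        }
    }
}
```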
This approach enables the insurance company to maintain efficient and accurate synchronisation between systems, reduce the time between data collection and availability for analysis, and improve responsiveness and operational efficiency by accessing timely and accurate data in real time. In addition, Informatica's processing and integration capabilities improve data quality and compliance with corporate policies prior to distribution.
CDC with Kafka
Informatica's CDC solution uses proprietary software for a centralised approach, with PowerCenter centrally managing the transformations and data flows. In contrast, Apache Kafka is suitable for a decentralised, open source-based solution. With the help of connectors such as Debezium, Kafka also supports CDC: Debezium records changes from the transaction logs of the source databases and forwards them to Kafka. There, the data is normalised and enriched using Kafka Streams and KSQL, which enables scalable real-time processing. The processed data is then distributed to relevant topics and utilised by the subscribed systems.
The above example of an insurance company can be adapted as follows for a Kafka-based solution to effectively monitor, transform and distribute data changes. The system uses Apache Kafka, Debezium for Kafka Connect and Kafka Streams:
- 1. Data capture with Debezium and Kafka Connect: Debezium is connected to the PostgreSQL database to capture changes from the transaction logs in real time and deliver them to Kafka (see the connector registration sketch after this list).
- 2. Data processing with Kafka Streams: The captured changes are processed in real time, including format validation, normalisation and enrichment of the data. Kafka Streams can implement specific business rules, such as updating the contract status (see the Kafka Streams sketch after this list).
- 3. Publishing to Kafka topics: The transformed data is distributed as structured messages to the relevant Kafka topics.
- 4. Data consumption: Various systems subscribe to these topics and update their databases and applications, for example the claims management system, which assesses and resolves claims faster, and analytics dashboards, which update performance metrics in real time.
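As a sketch of step 1, the following snippet registers a Debezium PostgreSQL source connector with the Kafka Connect REST API. The host names, credentials, table list and topic prefix are placeholders, and the exact configuration property names can vary between Debezium versions.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterPolicyConnector {
    public static void main(String[] args) throws Exception {
        // Connector configuration for a Debezium PostgreSQL source connector.
        // Host names, credentials and table names are placeholders.
        String config = """
                {
                  "name": "policy-db-connector",
                  "config": {
                    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
                    "plugin.name": "pgoutput",
                    "database.hostname": "policy-db",
                    "database.port": "5432",
                    "database.user": "cdc_user",
                    "database.password": "cdc_password",
                    "database.dbname": "insurance",
                    "table.include.list": "public.policies,public.payments",
                    "topic.prefix": "insurance"
                  }
                }
                """;

        // Register the connector with the Kafka Connect REST API.
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://connect:8083/connectors"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(config))
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Once the connector is running, Debezium writes the change events of each captured table to its own topic, by default named after the topic prefix, schema and table (for example insurance.public.policies).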
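As a sketch of step 2, a minimal Kafka Streams topology could look as follows. The topic names as well as the validation and enrichment steps are placeholder assumptions; a real topology would parse the Debezium JSON payload and apply the actual business rules, for example for updating the contract status.

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class PolicyChangeProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "policy-change-processor");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Raw change events written by the Debezium connector (hypothetical topic name).
        KStream<String, String> rawChanges = builder.stream("insurance.public.policies");

        rawChanges
                // Format validation: drop events without a payload.
                .filter((key, value) -> value != null && !value.isBlank())
                // Normalisation/enrichment placeholder: a real topology would parse the
                // JSON payload and apply business rules such as contract status updates.
                .mapValues(value -> value.trim())
                // Publish the transformed events for downstream consumers.
                .to("policy-changes");

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```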
Comparing the CDC solutions Informatica and Kafka
When comparing the two CDC solutions, some differences become clear:
- Source system impact: Both solutions use a log-based approach to minimise the impact on the source system, but their integration with the source system and the resulting overhead can vary depending on the specific implementation and database.
- Data processing: Informatica offers more comprehensive data transformation capabilities geared towards complex business logic. In contrast, Kafka is optimised for real-time event stream processing with simpler transformations.
- Architecture and scalability: Kafka is distributed from the ground up and designed to process large amounts of data in real time. Informatica uses a more centralised architecture suitable for traditional and complex data integration scenarios.
- Schema management: Kafka's Schema Registry provides a robust system for managing schema compatibility in data streams (a producer configuration sketch follows this list). Informatica manages the schema by defining mappings in PowerCenter, with a more traditional, design-oriented approach.
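To make the Schema Registry point concrete, the following sketch shows a producer configured with Confluent's Avro serializer, which registers the record schema with the Schema Registry and fails if the schema violates the compatibility rules configured for the topic's subject. The broker and registry addresses, the topic and the schema itself are placeholder assumptions.

```java
import java.util.Properties;
import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class AvroPolicyProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Confluent's Avro serializer registers the record schema with the
        // Schema Registry and rejects writes whose schema breaks the
        // compatibility rules configured for the subject.
        props.put("value.serializer",
                "io.confluent.kafka.serializers.KafkaAvroSerializer");
        props.put("schema.registry.url", "http://schema-registry:8081"); // placeholder URL

        // Hypothetical schema for a policy change event.
        Schema schema = new Schema.Parser().parse("""
                {"type":"record","name":"PolicyChange","fields":[
                  {"name":"policyId","type":"string"},
                  {"name":"status","type":"string"}]}""");
        GenericRecord change = new GenericData.Record(schema);
        change.put("policyId", "policy-1");
        change.put("status", "active");

        try (KafkaProducer<String, GenericRecord> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("policy-changes", "policy-1", change));
        }
    }
}
```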
The decision between Informatica and Kafka depends on the specific technical requirements, the complexity of the transformations, the existing architecture and the preferences for system management. Kafka offers clear advantages in terms of scalability and resilience with its decentralised, horizontally scalable architecture that can efficiently process large amounts of data. However, Kafka Streams and KSQL are primarily designed for processing event streams and are less suitable for complex business logic or batch transformations. In such cases, Informatica is probably the better choice.
Best Practices
Although CDC is useful, critical aspects such as performance management in high-availability environments and the handling of data discrepancies need to be considered; both require specific measures.
Regarding the first aspect, efficient management of network and processing resources is critical in systems with CDC. It is important to design the infrastructure in such a way that load peaks in data acquisition and transmission can be handled without affecting the performance of the source system:
- Techniques such as data partitioning and efficient indexing can significantly reduce overload (see the keyed-producer sketch after this list).
- In addition, the use of load balancing techniques to evenly distribute data traffic and database queries can prevent bottlenecks. This is particularly important when data is collected and sent to multiple users simultaneously.
- Finally, in high-availability environments, data replication can help reduce the risk of data loss and improve read performance.
- Implementing synchronous or asynchronous replication, depending on consistency and latency requirements, can help maintain data integrity throughout the system.
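A minimal sketch of these ideas with Kafka: keying each change event by its entity identifier spreads the write load across the partitions of a topic while preserving the per-entity order, and acks=all makes the broker replicate an event before acknowledging it. The topic name, keys and payloads are placeholder assumptions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PartitionedChangePublisher {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        // Wait for all in-sync replicas, so an acknowledged change event
        // survives the loss of a single broker.
        props.put("acks", "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Using the entity id as the message key spreads the load across the
            // partitions of the topic while keeping all changes of one policy in
            // order on the same partition.
            producer.send(new ProducerRecord<>("policy-changes", "policy-1", "status=active"));
            producer.send(new ProducerRecord<>("policy-changes", "policy-2", "status=draft"));
        }
    }
}
```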
To manage data discrepancies, it is critical to use transactions that group CDC operations into logical units of work which are either executed completely or rolled back in the event of an error, and to perform regular data integrity tests and validations to ensure that the transferred data is correct and complete. In addition, error detection mechanisms such as checksums or hashes should be implemented to identify deviations in the transmitted data. As soon as errors are recognised, automatic or semi-automatic correction mechanisms can be activated to restore the correctness of the data. Maintaining detailed CDC logs and continuously monitoring data integrity with telemetry tools and dashboards also provides immediate alerts in the event of anomalies and helps to quickly identify and resolve potential discrepancies.
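As a small sketch of the checksum idea, the following snippet computes a SHA-256 digest of a record payload; comparing the digest calculated on the source side with the one calculated on the target side reveals records that were corrupted or altered in transit. The payload format is a placeholder assumption.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HexFormat;

public class RecordChecksum {

    // Computes a SHA-256 digest of a record payload.
    static String checksum(String payload) throws Exception {
        MessageDigest digest = MessageDigest.getInstance("SHA-256");
        byte[] hash = digest.digest(payload.getBytes(StandardCharsets.UTF_8));
        return HexFormat.of().formatHex(hash);
    }

    public static void main(String[] args) throws Exception {
        String sourceRecord = "policy-1;status=active";   // payload read from the source
        String targetRecord = "policy-1;status=active";   // payload read from the target

        boolean consistent = checksum(sourceRecord).equals(checksum(targetRecord));
        System.out.println(consistent ? "records match" : "discrepancy detected");
    }
}
```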
These measures not only increase the resilience and efficiency of CDC systems, but also ensure the accuracy and reliability of data, minimising the risk of incorrect business decisions based on inaccurate data.
Conclusions
Change Data Capture is an efficient method for automatically synchronising data changes in distributed software systems. By monitoring, capturing and replicating changes, CDC promotes data consistency across different subsystems. Well thought-out planning and implementation of CDC further strengthen this consistency, increase operational efficiency and thus improve responsiveness to customer requirements and market changes.