THE CHALLENGE: With the adoption of an event-driven architecture, our customer, a federal agency within the Department of Homeland Security, has achieved a desirable temporal and location decoupling of its services. Yet the data structures exchanged between producers and consumers leave a form of coupling that, while unavoidable, requires careful management. Inevitably, requirements will change and the implicit contract that binds asynchronous services will need to evolve accordingly. It is extremely important that these contracts are explicitly formalized so that this coupling does not hinder the independent release of the services bound by them. While the agency has adopted JSON Schema for defining and publishing these contracts, its use is predominantly limited to exchanges to and from the Intake domain. More importantly, there is no formal and strictly enforced process to ensure that changes to these schemas do not break downstream consumers. The point-to-point mindset currently embedded at the agency gives false comfort in the informal coordination of these changes and belies the fact that such an architecture implies consumers are unknown. As the number of Kafka topics continues to grow, the event streams they represent will attract an increasing number of consumers that will want to leverage these rich data sources in any number of ways. Without a deliberate approach to schema management, an intractable situation can arise, making it exceedingly difficult to navigate change in the data ecosystem.

OUR APPROACH:

Schemas are invaluable for describing a data structure in a language- and platform-neutral manner, and there are a number of formats available, including XML, JSON, Avro, Thrift, and Protobuf. The advantages and disadvantages of each usually boil down to robustness of expression, support for reuse, on-wire compactness, serialization efficiency, simplicity, and readability. The agency has generally standardized on JSON Schema for form data, and this is a good choice for developing common data models, as reuse is well supported via references. In a few cases where performance is the primary concern, Avro would be a better choice given its compact on-wire format and much faster serialization. Whatever the format, defining and publishing schemas is the first step in managing their change over time. At present, with few exceptions, schemas are only being leveraged for messaging between the intake and case management domains. Our recommendation is to widen their use such that schemas are formally published for all event streams across all Kafka topics.
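To make the recommendation concrete, the sketch below shows one way a form event schema could be defined with a reusable definition and formally published to Schema Registry. It is a minimal illustration only: the subject name, registry URL, and schema content are hypothetical, and it assumes the Confluent schema-registry client (5.5 or later) with JSON Schema support on the classpath.

```java
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import io.confluent.kafka.schemaregistry.json.JsonSchema;

public class PublishFormSchema {

    public static void main(String[] args) throws Exception {
        // Hypothetical event schema that reuses a shared "address" definition via $ref,
        // illustrating how JSON Schema supports common data models through references.
        String schemaString = """
            {
              "$schema": "http://json-schema.org/draft-07/schema#",
              "title": "BenefitFormSubmitted",
              "type": "object",
              "properties": {
                "receiptNumber":    { "type": "string" },
                "applicantAddress": { "$ref": "#/definitions/address" }
              },
              "required": ["receiptNumber"],
              "definitions": {
                "address": {
                  "type": "object",
                  "properties": {
                    "street": { "type": "string" },
                    "city":   { "type": "string" },
                    "zip":    { "type": "string" }
                  }
                }
              }
            }
            """;

        // Registry URL and subject name are placeholders for illustration only.
        SchemaRegistryClient client =
            new CachedSchemaRegistryClient("http://schema-registry.internal:8081", 100);

        // Register the schema under the subject used for the topic's record values.
        int schemaId = client.register("benefit-form-submitted-value", new JsonSchema(schemaString));
        System.out.println("Registered schema id: " + schemaId);
    }
}
```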
Governance and standardization are important in the development of common data models to ensure uniformity of the data structures being defined across the enterprise. To that aim, the data working group and the CDO have worked closely with domain stakeholders to bring this rigor to bear on the form schemas being developed in support of draft processing, ingest, and adjudication of immigrant benefits. While form data represents a large portion of the data being published across event streams, there is a great deal of non-form data that could benefit from this same rigor. We are beginning to see the appearance of other schema repositories, and we recommend consolidating schemas into a central repository for visibility, collaboration, and governance. At the very least, the maintainers of these new repositories should engage with data SMEs early and include them on all PRs as reviewers.
As they say, the only certainty is change, and without question our schemas will need to change as business requirements evolve. In the face of this inevitability, we need to be sure that downstream consumers are unaffected as we publish events that conform to a newer version of our schema. A key advantage of adopting a microservice architecture is the ability to deploy services independently; a lockstep release of services to accommodate a change in their contract would be an unacceptable step backward, requiring downtime. This means we must ensure that the changes we propose to our contracts are not "breaking" changes. We should evolve our schemas in a compatible manner such that consumers are able to process events written with both the old and new schemas. This decouples consumers from producers, allowing consumers to avail themselves of the business benefit of the latest contract when they are ready to do so. As the number of event streams and their consumers proliferates at the agency, it will become increasingly imperative to adopt a strategy for schema evolution.
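As a sketch of what a guarded evolution could look like, the example below asks Schema Registry whether a proposed new version of the (hypothetical) form schema, which adds an optional property, satisfies the subject's compatibility rules before it is registered. Names and URLs are illustrative, and it assumes the same Confluent client as above.

```java
import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient;
import io.confluent.kafka.schemaregistry.client.SchemaRegistryClient;
import io.confluent.kafka.schemaregistry.json.JsonSchema;

public class CheckSchemaEvolution {

    public static void main(String[] args) throws Exception {
        // Proposed v2 of the hypothetical form schema (other properties elided):
        // it adds an *optional* "caseStatus" property, so consumers using the
        // new schema can still process events written with the old one.
        String proposedSchema = """
            {
              "$schema": "http://json-schema.org/draft-07/schema#",
              "title": "BenefitFormSubmitted",
              "type": "object",
              "properties": {
                "receiptNumber": { "type": "string" },
                "caseStatus":    { "type": "string" }
              },
              "required": ["receiptNumber"]
            }
            """;

        SchemaRegistryClient client =
            new CachedSchemaRegistryClient("http://schema-registry.internal:8081", 100);

        // Ask the registry to apply the subject's compatibility rules before we
        // register (and start producing with) the new version.
        boolean compatible =
            client.testCompatibility("benefit-form-submitted-value", new JsonSchema(proposedSchema));

        if (!compatible) {
            throw new IllegalStateException("Proposed schema would break downstream consumers");
        }
        client.register("benefit-form-submitted-value", new JsonSchema(proposedSchema));
    }
}
```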
The agency has invested considerably in the Confluent Platform, which offers Schema Registry, a robust toolset for schema management that enables us to evolve our schemas safely over time.
Schema Registry provides a REST API for managing and querying our schemas. It allows us to assign a compatibility type to a subject that dictates how Schema Registry compares a new schema against previous versions; it applies the relevant rules to determine whether the changes we are making will guarantee the type of compatibility we are looking for.
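For illustration, the snippet below uses that REST API to pin a subject's compatibility type to BACKWARD, meaning each new version must be able to read data written with the previous one. The registry URL and subject name are placeholders.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SetSubjectCompatibility {

    public static void main(String[] args) throws Exception {
        HttpClient http = HttpClient.newHttpClient();

        // PUT /config/{subject} sets the compatibility type for a single subject.
        HttpRequest setCompatibility = HttpRequest.newBuilder()
            .uri(URI.create("http://schema-registry.internal:8081/config/benefit-form-submitted-value"))
            .header("Content-Type", "application/vnd.schemaregistry.v1+json")
            .PUT(HttpRequest.BodyPublishers.ofString("{\"compatibility\": \"BACKWARD\"}"))
            .build();

        HttpResponse<String> response =
            http.send(setCompatibility, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

The same API exposes compatibility checks for a proposed schema under /compatibility/subjects/{subject}/versions, which is what the client-side testCompatibility call shown earlier uses under the hood.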
Schema Registry also provides SerDes components that enable producers and consumers to integrate with the registry at run time. Schema validation, if enabled, will prevent producers from polluting topics with non-compliant data, and it allows consumers to validate records against a schema at the framework level without having to call out to an internally hosted service, as is the case today at the agency. Finally, the agency's kafka-commons library now provides crypto-enabled versions of these components, allowing local encryption and decryption in conjunction with schemas.
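The following sketch shows how a producer might wire in the registry-aware SerDes, assuming the stock Confluent JSON Schema serializer; the kafka-commons crypto-enabled equivalents would be configured in the same way. Broker addresses, topic name, and the payload class are hypothetical.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import io.confluent.kafka.serializers.json.KafkaJsonSchemaSerializer;

public class FormEventProducer {

    // Hypothetical POJO representing the event payload; the serializer derives
    // or looks up the corresponding JSON schema via Schema Registry.
    public static class BenefitFormSubmitted {
        public String receiptNumber;
        public String caseStatus;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka.internal:9092");                    // placeholder
        props.put("schema.registry.url", "http://schema-registry.internal:8081"); // placeholder
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", KafkaJsonSchemaSerializer.class.getName());

        try (KafkaProducer<String, BenefitFormSubmitted> producer = new KafkaProducer<>(props)) {
            BenefitFormSubmitted event = new BenefitFormSubmitted();
            event.receiptNumber = "ABC1234567890";
            event.caseStatus = "RECEIVED";
            // The serializer embeds the registered schema id in each record, so
            // consumers can resolve and validate the schema at the framework level.
            producer.send(new ProducerRecord<>("benefit-form-submitted", event.receiptNumber, event));
        }
    }
}
```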
While further detail on the Confluent Schema Registry is beyond the scope of this discussion, we believe adopting a schema registry will help our customer tackle organizational challenges around data management, especially as it moves further into the realm of stream processing. Different projects need to collaborate on the same data, and teams need access to the metadata that describes the structure and data types of the topics they are interested in. Reference examples detailing how we can use Schema Registry at the agency can be found in the Kafka-Patterns GHE repository. It is a valuable tool, available to teams at the agency today, that has yet to be taken advantage of. We believe we can be of great help in seeing to its successful adoption.

Schema Management & Data Integration Strategy: an important enabler of collaboration and communication across the different projects that bind your producers and consumers