Re: Can Pubsub Schema refer to previously declared schema

ankit-pradhan · 07-03-2023 11:10 PM

Hi,

Suppose we have a Department schema as shown below, and Employee has a Department field.
Can we refer to the already-created Department schema when we create the Employee schema, or do we need to repeat the whole Department schema text inside the Employee schema?

{
  "type" : "record",
  "name" : "Department",
  "fields" : [
    {
      "name" : "name",
      "type" : ["string","null"]
    },
    {
      "name" : "desc",
      "type" : "string"
    }
  ]
}

{
  "type" : "record",
  "name" : "Employee",
  "fields" : [
    {
      "name" : "experience",
      "type" : ["int","null"]
    },
    {
      "name" : "age",
      "type" : "int"
    },
    {
      "name" : "department",
      "type" : "Department"
    }
  ]
}

ms4446

Pub/Sub does allow the use of existing schemas directly within new ones. This functionality was introduced in the Pub/Sub schema evolution's general release.

This means you can refer to an already existing schema in a new schema by using the @schema keyword. For instance, in the code below, the Department schema is referenced in the Employee schema:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {
      "name": "experience",
      "type": ["int", "null"]
    },
    {
      "name": "age",
      "type": "int"
    },
    {
      "name": "department",
      "@schema": "gs://my-bucket/department.avsc"
    }
  ]
}

In this case, the @schema keyword indicates that the department field is of type Department. The schema for this Department type is stored in the department.avsc file in the my-bucket bucket in GCS.

For more details see: https://cloud.google.com/blog/products/data-analytics/pub-sub-schema-evolution-is-now-ga

ankit-pradhan

Hi ms4446,

Can you share any documentation that describes the @schema keyword for referring to an existing schema?
Is it mandatory to keep the schema as an .avsc file in a GCS bucket?
Or can we also refer to it directly from the schemas available in https://console.cloud.google.com/cloudpubsub/schema/list?project={projectId}
It would be very helpful if you could share documentation on storing all schemas in a GCS bucket as .avsc files and using them to create a Pub/Sub Avro schema.

ankit-pradhan

Hi @ms4446 ,

Can you help me with documentation for the @schema keyword?
It's not working for me, and I can't find any documentation for it.

ms4446

Documentation that elaborates on the usage of @schema in Google Cloud Pub/Sub is available. Here are some pertinent links:

The @schema annotation is employed to designate the schema for a Pub/Sub message. It requires a single parameter, namely the schema's name. This schema must align with either Avro or Protocol Buffer schema requirements.

For instance, the code snippet below demonstrates the application of the @schema annotation to determine a Pub/Sub message's schema:

import avro.schema

@avro.schema("my_schema.avsc")
def my_message():
    pass

The my_schema.avsc file is an Avro schema outlining the structure of the Pub/Sub message.

Should a Pub/Sub message be published to a topic with a schema, the message is required to be validated against the defined schema. A message that fails to validate against the schema is rejected.

The @schema annotation serves as a formidable tool to maintain the uniformity of Pub/Sub messages. By leveraging the @schema annotation, you can guarantee that all messages published to a topic adhere to the same structure. This not only mitigates errors but also enhances the dependability of your Pub/Sub applications.
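
The validation step described above (a message that does not match the topic's schema is rejected) can be pictured with a small sketch. This is only an illustration in plain Python, not Pub/Sub's implementation: the matches function is a hypothetical helper, it checks already-decoded values rather than Avro-encoded bytes, and it handles only records, unions, and the string, int, and null types used in this thread.

```python
# Hypothetical helper: checks a decoded value against a small subset of Avro.
def matches(value, schema):
    if isinstance(schema, list):  # union: any branch may match
        return any(matches(value, branch) for branch in schema)
    if isinstance(schema, dict):
        if schema.get("type") == "record":
            if not isinstance(value, dict):
                return False
            names = {field["name"] for field in schema["fields"]}
            # Every declared field must be present, and no extra keys allowed.
            return set(value) == names and all(
                matches(value[field["name"]], field["type"])
                for field in schema["fields"]
            )
        return matches(value, schema["type"])  # e.g. {"type": "string"}
    if schema == "null":
        return value is None
    if schema == "string":
        return isinstance(value, str)
    if schema == "int":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return (isinstance(value, int) and not isinstance(value, bool)
                and -2**31 <= value < 2**31)
    raise NotImplementedError(f"type not covered by this sketch: {schema!r}")


DEPARTMENT = {
    "type": "record",
    "name": "Department",
    "fields": [
        {"name": "name", "type": ["string", "null"]},
        {"name": "desc", "type": "string"},
    ],
}

print(matches({"name": "Sales", "desc": "Sells things"}, DEPARTMENT))
print(matches({"name": "Sales"}, DEPARTMENT))  # "desc" is missing
```

A message for which such a check fails is the kind of message that would be rejected at publish time.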

ankit-pradhan

Hi @ms4446 ,

I understand that the @schema annotation is very useful, and I want to use it.
We could use it to maintain a registry of schemas in GCS and then refer to them from a Pub/Sub schema resource.

I went through all the links above, but none of them has an example showing the use of the @schema annotation.

Can you share any documentation or example that shows how to use the @schema attribute from a Google Pub/Sub schema resource?

ms4446

I'm sorry for the previous confusion. To clear things up, Google Cloud Pub/Sub does not directly use the @schema annotation as I initially implied. Instead, the system requires you to create a schema first, which you can then refer to when publishing and subscribing messages.

Recently, Pub/Sub schema evolution has reached general availability as mentioned in the blog post "Pub/Sub schema evolution is now GA."

"Schema evolution" allows you to make forward-compatible changes to your schema. This means you can add new fields or make existing ones optional, without disrupting your current data operations. This ensures that consumers using the old schema version can still interpret data generated with the updated schema.

It's important to note, however, that schema evolution doesn't imply the ability to reference previously declared schemas as initially suggested. Instead, it gives you the flexibility to make changes that are forward-compatible—like adding new fields or making existing ones optional—without breaking your current data pipelines.

In the context of Avro, forward-compatible changes mean that data written with new schemas can be read by users of old schemas. This allows you to add fields (as long as they have a default value) or make existing fields optional.

Avro's schema evolution supports these forward-compatible changes, ensuring that old schema users can still read data written with the new schema. This is a vital feature to maintain data accessibility as systems change and evolve.
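
To make this concrete, here is a rough sketch of how a reader on the old schema ends up with a usable record. It mimics Avro's record resolution rule (fields are matched by name, unknown fields are ignored, missing fields take the reader's default) on plain Python dicts. The resolve_record function and the sample schema are made up for illustration; real Avro libraries do this while decoding binary data with both the writer's and reader's schemas.

```python
# Hypothetical helper: project a decoded record onto the reader's schema.
def resolve_record(record, reader_schema):
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]       # field known to both sides
        elif "default" in field:
            resolved[name] = field["default"]   # writer omitted it: use default
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved  # fields the reader does not know about are dropped


# The "old" schema a consumer is still using.
OLD_EMPLOYEE = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "experience", "type": ["null", "int"], "default": None},
        {"name": "age", "type": "int"},
    ],
}

# A record written with a newer schema that added "email".
new_record = {"experience": 5, "age": 30, "email": "someone@example.com"}
print(resolve_record(new_record, OLD_EMPLOYEE))
```

The extra email field is simply ignored, which is why adding a field does not break consumers on the old schema.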

Here are the guidelines for Avro's schema evolution:

You can add fields to your schema, as long as they have default values.
You can make existing fields optional.
You can change field types, provided the old and new types are compatible as per Avro's rules.
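You can rename a field only if the schema used for reading declares the other name as an alias; a plain rename is not compatible.

Remember, however, not all schema changes are forward-compatible. For example, removing a field that has no default value in the old schema, or renaming a field without an alias, would make data written with the new schema unreadable to users of the old schema.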

The purpose of schema evolution is to allow changes to the schema without disrupting existing systems. By following Avro's rules for schema evolution, you can ensure that your data remains accessible despite changes and evolution in your systems.
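
Two of the guidelines above can be checked mechanically before you commit a new revision. The sketch below is a simplified check, not Avro's full compatibility algorithm: it only looks at top-level record fields, and compatibility_problems is a hypothetical helper name. An added field needs a default so that readers on the new schema can still read old data; a removed field is a problem when the old schema gave it no default, because readers on the old schema then have nothing to fill in.

```python
def _fields(schema):
    # Index a record schema's fields by name.
    return {field["name"]: field for field in schema["fields"]}


# Hypothetical helper: list the risky changes between two record schemas.
def compatibility_problems(old_schema, new_schema):
    old, new = _fields(old_schema), _fields(new_schema)
    problems = []
    for name, field in new.items():
        if name not in old and "default" not in field:
            problems.append(f"added field {name!r} has no default")
    for name, field in old.items():
        if name not in new and "default" not in field:
            problems.append(f"removed field {name!r} had no default")
    return problems


OLD = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "experience", "type": ["null", "int"], "default": None},
        {"name": "age", "type": "int"},
    ],
}

NEW = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "experience", "type": ["null", "int"], "default": None},
        {"name": "age", "type": "int"},
        {"name": "team", "type": "string", "default": "unassigned"},
        {"name": "email", "type": "string"},  # no default: flagged
    ],
}

for problem in compatibility_problems(OLD, NEW):
    print(problem)
```

Type changes and renames are not covered here; those still need to be checked against Avro's resolution rules by hand or with an Avro library.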

ankit-pradhan

Hi @ms4446 ,

Thanks for describing schema evolution for Pub/Sub.
But this doesn't answer my question.
Can one schema refer to another schema that already exists?

ms4446

As of today, Cloud Pub/Sub does not support direct referencing of one schema from another schema. However, it is possible to achieve this using a schema registry, such as the Confluent Schema Registry.

A schema registry is a centralized repository for storing and managing schemas. It allows you to register schemas and then refer to them by name in other schemas.

To use a schema registry to reference schemas, you would first need to register the schemas with the registry. Once the schemas are registered, you can then refer to them by name in other schemas.

For example, you could create the following schemas and register them with the Confluent Schema Registry:

# Department schema
department_schema = {
  "type": "record",
  "name": "Department",
  "fields": [
    {
      "name": "name",
      "type": ["string", "null"]
    },
    {
      "name": "desc",
      "type": "string"
    }
  ]
}

# Employee schema
employee_schema = {
  "type": "record",
  "name": "Employee",
  "fields": [
    {
      "name": "experience",
      "type": ["int", "null"]
    },
    {
      "name": "age",
      "type": "int"
    },
    {
      "name": "department",
      "type": "Department"  # refer to the Department record by name
    }
  ]
}

The Employee schema would refer to the Department schema by its name, Department. This is the standard way to reference schemas in Avro and the Schema Registry.

Once the schemas are registered, you can then use them to publish and subscribe to messages in Pub/Sub.

Please note that this is a hypothetical example and not a standard feature of Pub/Sub. Direct integration between the Confluent Schema Registry and Pub/Sub is not yet available.
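
If the goal is mainly to avoid retyping Department in every schema, one workaround that needs no registry is to keep each definition in one place and generate the self-contained schema text before creating the Pub/Sub schema. The sketch below relies on the Avro rule that a named type may be defined inline where it is first used and referred to by name afterwards. The inline_references helper is hypothetical, ignores namespaces, and only walks records, unions, arrays, and maps.

```python
import json


# Hypothetical helper: replace the first reference to each known name with
# its full definition; later references stay as plain names.
def inline_references(schema, definitions, seen=None):
    seen = set() if seen is None else seen
    if isinstance(schema, str):
        if schema in definitions and schema not in seen:
            seen.add(schema)
            return inline_references(definitions[schema], definitions, seen)
        return schema
    if isinstance(schema, list):  # union
        return [inline_references(branch, definitions, seen) for branch in schema]
    if isinstance(schema, dict):
        result = dict(schema)  # shallow copy; the input is left untouched
        kind = result.get("type")
        if kind == "record":
            seen.add(result["name"])  # the record's own name is now defined
            result["fields"] = [
                {**field, "type": inline_references(field["type"], definitions, seen)}
                for field in result["fields"]
            ]
        elif kind == "array":
            result["items"] = inline_references(result["items"], definitions, seen)
        elif kind == "map":
            result["values"] = inline_references(result["values"], definitions, seen)
        return result
    return schema


DEPARTMENT = {
    "type": "record",
    "name": "Department",
    "fields": [
        {"name": "name", "type": ["string", "null"]},
        {"name": "desc", "type": "string"},
    ],
}

EMPLOYEE = {
    "type": "record",
    "name": "Employee",
    "fields": [
        {"name": "experience", "type": ["int", "null"]},
        {"name": "age", "type": "int"},
        {"name": "department", "type": "Department"},
    ],
}

resolved = inline_references(EMPLOYEE, {"Department": DEPARTMENT})
print(json.dumps(resolved, indent=2))
```

The printed JSON is a single Employee definition with Department embedded, which is the form you would supply as the schema definition. Department is still maintained in only one place in your own source.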

ankit-pradhan

Hi,
As mentioned:

"Direct integration between the Confluent Schema Registry and Pub/Sub is not yet available."

Does this mean there is no solution or workaround for referring to one schema from another?
Does it mean there is no workaround for integrating the Confluent Schema Registry with Pub/Sub?

This could be a showstopper for Pub/Sub: real-world entities depend on each other, and it is not practical to define the same properties again and again without referencing.😭

ms4446

Hi @ankit-pradhan ,

Yes, direct integration between the Confluent Schema Registry and Pub/Sub is not currently available. However, there are some workarounds that you can use to refer to one schema from another and to integrate the Confluent Schema Registry with Pub/Sub.

One approach is to employ a proxy service. This service would act as an intermediary, translating between the Confluent Schema Registry and Pub/Sub. First, you'd register your schemas with the Confluent Schema Registry. Then, you would publish and subscribe to messages via this proxy service. The service would validate the messages against the schemas in the Confluent Schema Registry before forwarding them to Pub/Sub.

Another option involves using a message broker, such as Kafka. You'd start by registering your schemas with the Confluent Schema Registry. Following that, you'd publish to and subscribe from Kafka. Schema validation would typically happen at this stage, at the producer level when data is sent to Kafka. Then, a Kafka Connect connector would be utilized to transfer these validated messages from Kafka to Pub/Sub.

While these workarounds might not offer the simplicity and directness of a native integration between the Confluent Schema Registry and Pub/Sub, they serve as practical solutions in its absence.
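
As a rough sketch of the proxy idea, the class below validates each record and only then hands it to a forwarding function. Everything here is hypothetical: ValidatingProxy, the per-topic validator functions, and the forward callable are stand-ins. In a real deployment the validators would be built from schemas fetched from the registry, and forward would call a Pub/Sub publisher client.

```python
import json
from typing import Callable, Dict

Validator = Callable[[dict], bool]
Forwarder = Callable[[str, bytes], None]


class ValidatingProxy:
    """Validate records per topic, then forward them as JSON bytes."""

    def __init__(self, validators: Dict[str, Validator], forward: Forwarder) -> None:
        self._validators = validators
        self._forward = forward

    def publish(self, topic: str, record: dict) -> None:
        if topic not in self._validators:
            raise KeyError(f"no schema registered for topic {topic!r}")
        if not self._validators[topic](record):
            raise ValueError(f"record rejected for topic {topic!r}: {record!r}")
        self._forward(topic, json.dumps(record).encode("utf-8"))


# Stand-ins for the registry lookup and the Pub/Sub publisher.
sent = []
proxy = ValidatingProxy(
    validators={"employees": lambda r: isinstance(r.get("age"), int)},
    forward=lambda topic, data: sent.append((topic, data)),
)

proxy.publish("employees", {"age": 30})       # forwarded
try:
    proxy.publish("employees", {"age": "x"})  # rejected before forwarding
except ValueError as err:
    print(err)
print(sent)
```

Keeping validation and forwarding as separate callables means the same proxy logic works whether the schemas come from a registry or from local files.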
