Mastering Avro: Defining Schemas with Object Lists


Apache Avro is a powerful data serialization system that enables efficient data exchange between applications. It’s widely used in the Hadoop ecosystem but is versatile enough for many other data streaming scenarios. In this blog post, we will delve into defining schemas for objects in Avro, focusing on object lists, a common pattern in data processing.

What is Avro?

Avro is a row-oriented remote procedure call and data serialization framework developed within the Apache Hadoop project. The primary advantages of Avro are:

  1. Compact: Stored data is binary and thus more space-efficient.
  2. Schema Evolution: You can modify schemas without breaking data compatibility.
  3. Language Agnostic: You can create data in one programming language and read it in another.

For a deeper dive into Apache Avro, check the official documentation.

Understanding Avro Schemas

At the heart of Avro lies its schema definition. Avro schemas are defined using JSON and describe the structure of the data, including its types, names, and the relationships between different data structures.
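
For example, the simplest valid Avro schema is just a primitive type name written as a JSON string; the longer object form is equivalent:

"string"

{"type": "string"}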

Schema Basics

An Avro schema consists of the following key components:

  • Types: The basic data types supported by Avro include null, boolean, int, long, float, double, bytes, and string.
  • Records: This is the fundamental building block, where you define complex types.
  • Enums: For a finite set of named values.
  • Arrays: To represent a list of items.
  • Maps: A collection of key-value pairs.
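
As an illustration, here is a sketch of a record combining these complex types (the Server record and its fields are made up for this example and are not part of the schemas that follow):

{
  "type": "record",
  "name": "Server",
  "fields": [
    {"name": "status", "type": {"type": "enum", "name": "Status", "symbols": ["UP", "DOWN", "UNKNOWN"]}},
    {"name": "tags", "type": {"type": "array", "items": "string"}},
    {"name": "metrics", "type": {"type": "map", "values": "double"}}
  ]
}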

Defining an Avro Schema

Let’s create a schema for a simple employee record where each employee might have a list of skills represented as a string array. Here's how to define that schema:

{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "name", "type": "string"},
    {"name": "skills", "type": {"type": "array", "items": "string"}}
  ]
}

Schema Breakdown

  • type: We define it as a record because an employee has multiple fields.
  • fields: Each field has a name and its corresponding Avro type.
    • The skills field is defined as an array of strings.

This simple structure is flexible and future-proof: thanks to Avro's schema-evolution rules, you can add fields later without breaking readers of existing data.
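
For example, a later version of the Employee schema could add an optional email field (a hypothetical field, used here only to illustrate evolution) with a null default, so readers with either version of the schema can still process the data:

{"name": "email", "type": ["null", "string"], "default": null}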

Creating an Object List in Avro

Avro is particularly strong in representing hierarchical data through nested records and arrays.

Nested Object Lists

Suppose we want to extend our schema to include the concept of departments, where each employee belongs to a department and each department can have multiple employees. This requires a second record type, Department, whose employees field is an array of Employee records.

Here’s how we could structure this (note that the array's items type refers to the Employee record by name):

{
  "type": "record",
  "name": "Department",
  "fields": [
    {"name": "deptId", "type": "int"},
    {"name": "deptName", "type": "string"},
    {"name": "employees", "type": {"type": "array", "items": "Employee"}}
  ]
}

Complete Schema

A named type may only be defined once, so when both records live in a single .avsc file, the Employee definition is inlined where it is first used and referenced by name everywhere else. Combining both schemas, we have:

{
  "type": "record",
  "name": "Department",
  "fields": [
    {"name": "deptId", "type": "int"},
    {"name": "deptName", "type": "string"},
    {
      "name": "employees",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "Employee",
          "fields": [
            {"name": "id", "type": "int"},
            {"name": "name", "type": "string"},
            {"name": "skills", "type": {"type": "array", "items": "string"}}
          ]
        }
      }
    }
  ]
}

Alternatively, keep the two schemas in separate files and parse Employee first with the same Schema.Parser instance; parsing registers the name, so the Department schema can then reference it.

Explanation of the Hierarchical Model

  • Each Department contains an array of Employee objects.
  • Both Employee and Department can be extended with additional fields as the application requirements evolve.
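
To visualize the nesting, a single Department datum rendered as JSON might look like this (the values are illustrative):

{
  "deptId": 101,
  "deptName": "Software Engineering",
  "employees": [
    {"id": 1, "name": "John Doe", "skills": ["Java", "Spring", "Hibernate"]}
  ]
}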

Implementing in Java

To leverage Avro in Java, you must include the Avro library in your Maven or Gradle project.

Maven Dependency

<dependency>
    <groupId>org.apache.avro</groupId>
    <artifactId>avro</artifactId>
    <version>1.11.0</version> <!-- Use the latest stable version -->
</dependency>
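
If you use Gradle instead, the equivalent dependency declaration is:

implementation 'org.apache.avro:avro:1.11.0' // Use the latest stable version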

Generating Java Classes from Schema

Avro provides a tool to generate Java classes from your schema files. You might run a command like this:

java -jar avro-tools-1.11.0.jar compile schema employee.avsc .

This will generate Java classes corresponding to the defined schemas.
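
The generated classes implement SpecificRecord and come with type-safe builders. As a quick sketch, assuming the Employee schema above was compiled with string fields mapped to java.lang.String (avro-tools' -string option; the default is CharSequence), creating an instance looks like this:

// Sketch: uses the builder generated from the Employee schema above
Employee employee = Employee.newBuilder()
    .setId(1)
    .setName("John Doe")
    .setSkills(Arrays.asList("Java", "Spring", "Hibernate"))
    .build();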

Example Code: Writing Avro Data

Now let's write some Avro data using our defined schemas:

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.io.DatumWriter;

import java.io.File;
import java.io.IOException;
import java.util.Arrays;

public class AvroExample {
    public static void main(String[] args) {
        // First, define the schemas. Both are parsed with the same Schema.Parser,
        // Employee first, so that Department's reference to "Employee" resolves.
        String employeeSchemaString = "{...}"; // Add your Employee schema JSON here
        String departmentSchemaString = "{...}"; // Add your Department schema JSON here

        Schema.Parser parser = new Schema.Parser();
        Schema employeeSchema = parser.parse(employeeSchemaString);
        Schema departmentSchema = parser.parse(departmentSchemaString);

        // Create an employee record
        GenericData.Record employee1 = new GenericData.Record(employeeSchema);
        employee1.put("id", 1);
        employee1.put("name", "John Doe");
        employee1.put("skills", Arrays.asList("Java", "Spring", "Hibernate"));

        // Create a department record that contains the employee
        GenericData.Record department = new GenericData.Record(departmentSchema);
        department.put("deptId", 101);
        department.put("deptName", "Software Engineering");
        department.put("employees", Arrays.asList(employee1));

        // Write to file; generic records need a GenericDatumWriter
        DatumWriter<GenericData.Record> datumWriter = new GenericDatumWriter<>(departmentSchema);
        try (DataFileWriter<GenericData.Record> dataFileWriter = new DataFileWriter<>(datumWriter)) {
            dataFileWriter.create(departmentSchema, new File("departments.avro"));
            dataFileWriter.append(department);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Code Commentary

  1. Schema Parsing: We parse both schemas from their JSON strings with a single Schema.Parser, Employee first, so that Department's reference to it resolves.
  2. Record Creation: We create GenericData.Record instances for both Employee and Department.
  3. Data Writing: Finally, we wrap a GenericDatumWriter in a DataFileWriter and write the records to an Avro file.

Reading Avro Data

Reading Avro data is equally straightforward. The following snippet reads the data back:

import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;

import java.io.File;
import java.io.IOException;

public class AvroReadExample {
    public static void main(String[] args) {
        File file = new File("departments.avro");
        // Generic records are read back with a GenericDatumReader; the writer's
        // schema is stored in the file header, so none needs to be supplied here
        GenericDatumReader<GenericData.Record> datumReader = new GenericDatumReader<>();

        try (DataFileReader<GenericData.Record> dataFileReader = new DataFileReader<>(file, datumReader)) {
            while (dataFileReader.hasNext()) {
                GenericData.Record department = dataFileReader.next();
                System.out.println("Department: " + department);
                System.out.println("Name: " + department.get("deptName")); // Fields are accessed by name
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}

Bringing It All Together

Avro provides an excellent way to define complex schemas, including lists of objects, which is invaluable in real-world applications involving nested data structures. By understanding how to set up Avro schemas and use them in Java, you can work with data more effectively.

For further reading, explore the Avro Maven Plugin and learn how to integrate Avro with more complicated systems, such as Apache Kafka for stream processing.

By mastering Avro and its schema definitions, you can unlock the potential of efficient, structured data interchange, a core part of modern data architecture. Happy coding!