Mastering Avro: Defining Schemas with Object Lists
Apache Avro is a powerful data serialization system that enables data exchange between applications in a highly efficient manner. It’s widely used in the Hadoop ecosystem but is versatile enough for various data streaming scenarios. In this blog post, we will delve into defining schemas for objects in Avro, focusing on object lists—a common scenario in data processing.
What is Avro?
Avro is a row-oriented remote procedure call and data serialization framework developed within the Apache Hadoop project. The primary advantages of Avro are:
- Compact: Stored data is binary and thus more space-efficient.
- Schema Evolution: You can modify schemas without breaking data compatibility.
- Language Agnostic: You can create data in one programming language and read it in another.
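To make schema evolution concrete: a new field can be added without breaking readers of old data as long as it carries a default value. Here is a hypothetical extension of an Employee record (the email field and its placement are illustrative):

```json
{
  "type": "record",
  "name": "Employee",
  "fields": [
    {"name": "id", "type": "int"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
```

Because email defaults to null, a reader using this newer schema can still decode records that were written before the field existed.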
For a deeper dive into Apache Avro, check the official documentation.
Understanding Avro Schemas
At the heart of Avro lies its schema definition. Avro schemas are defined using JSON and describe the structure of the data, including its types, names, and the relationships between different data structures.
Schema Basics
An Avro schema consists of the following key components:
- Types: The basic data types supported by Avro include null, boolean, int, long, float, double, bytes, and string.
- Records: This is the fundamental building block, where you define complex types.
- Enums: For a finite set of named values.
- Arrays: To represent a list of items.
- Maps: A collection of key-value pairs.
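To illustrate how these complex types look in practice, here is a hypothetical Ticket record that combines an enum, an array, and a map (all names here are illustrative, not part of the schemas built later in this post):

```json
{
  "type": "record",
  "name": "Ticket",
  "fields": [
    {"name": "status", "type": {"type": "enum", "name": "Status", "symbols": ["OPEN", "CLOSED"]}},
    {"name": "tags", "type": {"type": "array", "items": "string"}},
    {"name": "attributes", "type": {"type": "map", "values": "string"}}
  ]
}
```

Note that enums and records are named types, while arrays and maps are anonymous and defined inline where they are used.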
Defining an Avro Schema
Let’s create a schema for a simple employee record where each employee might have a list of skills represented as a string array. Here's how to define that schema:
{
"type": "record",
"name": "Employee",
"fields": [
{"name": "id", "type": "int"},
{"name": "name", "type": "string"},
{"name": "skills", "type": {"type": "array", "items": "string"}}
]
}
Schema Breakdown
- type: We define it as a record because an employee has multiple fields.
- fields: Each field has a name and its corresponding Avro type.
- The skills field is defined as an array of strings.
This simple structure is flexible and future-proof, accommodating potential changes without compromising data integrity.
Creating an Object List in Avro
Avro is particularly strong in representing hierarchical data through nested records and arrays.
Nested Object Lists
Suppose we want to extend our schema to include the concept of departments, where an employee belongs to a specific department, and each department can have multiple employees. This requires an additional layer with a department schema.
Here’s how we could structure this:
{
"type": "record",
"name": "Department",
"fields": [
{"name": "deptId", "type": "int"},
{"name": "deptName", "type": "string"},
{"name": "employees", "type": {"type": "array", "items": "Employee"}}
]
}
Complete Employee Schema
Combining both schemas, we have the following. Note that because Department refers to Employee by name, the Employee schema must be parsed first (or defined in the same file) so the reference can be resolved:
{
"type": "record",
"name": "Employee",
"fields": [
{"name": "id", "type": "int"},
{"name": "name", "type": "string"},
{"name": "skills", "type": {"type": "array", "items": "string"}}
]
}
{
"type": "record",
"name": "Department",
"fields": [
{"name": "deptId", "type": "int"},
{"name": "deptName", "type": "string"},
{"name": "employees", "type": {"type": "array", "items": "Employee"}}
]
}
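If you prefer a single self-contained .avsc file, the Employee record can instead be defined inline at the point where it is first used, and Avro will register the name for any later references. A sketch of that alternative:

```json
{
  "type": "record",
  "name": "Department",
  "fields": [
    {"name": "deptId", "type": "int"},
    {"name": "deptName", "type": "string"},
    {"name": "employees", "type": {"type": "array", "items": {
      "type": "record",
      "name": "Employee",
      "fields": [
        {"name": "id", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "skills", "type": {"type": "array", "items": "string"}}
      ]
    }}}
  ]
}
```

This form is convenient when the nested type is not reused elsewhere; if Employee is shared across schemas, keeping it in its own file and parsing it first is usually cleaner.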
Explanation of the Hierarchical Model
- Each Department contains an array of Employee objects.
- Both Employee and Department can be extended with additional fields as the application requirements evolve.
Implementing in Java
To leverage Avro in Java, you must include the Avro library in your Maven or Gradle project.
Maven Dependency
<dependency>
<groupId>org.apache.avro</groupId>
<artifactId>avro</artifactId>
<version>1.11.0</version> <!-- Use the latest stable version -->
</dependency>
Generating Java Classes from Schema
Avro provides a tool to generate Java classes from your schema files. You might run a command like this:
java -jar avro-tools-1.11.0.jar compile schema employee.avsc .
This will generate Java classes corresponding to the defined schemas.
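As an alternative to running avro-tools by hand, class generation can be wired into the build with the Avro Maven plugin. A sketch of the plugin configuration (the source and output directories are illustrative defaults; adjust them to your project layout):

```xml
<plugin>
  <groupId>org.apache.avro</groupId>
  <artifactId>avro-maven-plugin</artifactId>
  <version>1.11.0</version>
  <executions>
    <execution>
      <phase>generate-sources</phase>
      <goals>
        <goal>schema</goal>
      </goals>
      <configuration>
        <sourceDirectory>${project.basedir}/src/main/avro/</sourceDirectory>
        <outputDirectory>${project.basedir}/src/main/java/</outputDirectory>
      </configuration>
    </execution>
  </executions>
</plugin>
```

With this in place, classes are regenerated automatically on every build whenever a schema file changes.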
Example Code: Writing Avro Data
Now let's write some Avro data using our defined schemas:
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import java.io.File;
import java.io.IOException;
import java.util.Arrays;

public class AvroExample {
    public static void main(String[] args) {
        // First define the schemas
        String employeeSchemaString = "{...}"; // Add your Employee schema JSON here
        String departmentSchemaString = "{...}"; // Add your Department schema JSON here
        // Use a single parser for both schemas so Department can resolve
        // its named reference to Employee
        Schema.Parser parser = new Schema.Parser();
        Schema employeeSchema = parser.parse(employeeSchemaString);
        Schema departmentSchema = parser.parse(departmentSchemaString);
        // Create employee
        GenericData.Record employee1 = new GenericData.Record(employeeSchema);
        employee1.put("id", 1);
        employee1.put("name", "John Doe");
        employee1.put("skills", Arrays.asList("Java", "Spring", "Hibernate"));
        // Create department
        GenericData.Record department = new GenericData.Record(departmentSchema);
        department.put("deptId", 101);
        department.put("deptName", "Software Engineering");
        department.put("employees", Arrays.asList(employee1));
        // Write to file; try-with-resources closes the writer even on failure
        try (DataFileWriter<GenericData.Record> dataFileWriter =
                new DataFileWriter<>(new GenericDatumWriter<>(departmentSchema))) {
            dataFileWriter.create(departmentSchema, new File("departments.avro"));
            dataFileWriter.append(department);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Code Commentary
- Schema Parsing: We parse the Avro schema from JSON strings.
- Record Creation: We create GenericData.Record instances for both Employee and Department.
- Data Writing: Finally, we use DataFileWriter to write the records to an Avro file.
Reading Avro Data
Reading Avro data is equally straightforward. Using the following snippet, you can read the data back:
import org.apache.avro.file.DataFileReader;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import java.io.File;
import java.io.IOException;

public class AvroReadExample {
    public static void main(String[] args) {
        File file = new File("departments.avro");
        // The writer schema is stored in the file, so no schema is needed here
        GenericDatumReader<GenericData.Record> datumReader = new GenericDatumReader<>();
        try (DataFileReader<GenericData.Record> dataFileReader = new DataFileReader<>(file, datumReader)) {
            while (dataFileReader.hasNext()) {
                GenericData.Record department = dataFileReader.next();
                System.out.println("Department: " + department);
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Bringing It All Together
Avro provides an excellent way to define complex schemas, including lists of objects, which is invaluable in real-world applications involving nested data structures. By understanding how to set up Avro schemas and use them in Java, you can work with data more effectively.
For further reading, explore the Avro Maven Plugin and learn how to integrate Avro with more complicated systems, such as Apache Kafka for stream processing.
By mastering Avro and its schema definitions, you can unlock the potential of efficient, structured data interchange, a core part of modern data architecture. Happy coding!