Moving data from diverse sources to the right location for AI use is a challenging task. That’s where data orchestration technologies like Apache Airflow fit in.
Today, the Apache Airflow community is out with its biggest update in years: the 3.0 release, the project’s first major version update in four years. Airflow has hardly been idle, though, steadily iterating on the 2.x series, including the 2.9 and 2.10 updates in 2024, which both had a heavy focus on AI.
In recent years, data engineers have adopted Apache Airflow as their de facto standard, and it is now the leading open-source workflow orchestration platform, with over 3,000 contributors and widespread adoption across Fortune 500 companies. There are also multiple commercial services built on the platform, including Astronomer Astro, Google Cloud Composer, Amazon Managed Workflows for Apache Airflow (MWAA) and Microsoft Azure Data Factory Managed Airflow, among others.
As organizations struggle to coordinate data workflows across disparate systems, clouds and, increasingly, AI workloads, their orchestration needs keep growing. Apache Airflow 3.0 addresses those critical enterprise needs with an architectural redesign that could improve how organizations build and deploy data applications.
“To me, Airflow 3 is a new beginning; it is a foundation for a much greater set of capabilities,” Vikram Koka, Apache Airflow PMC (project management committee) member and Chief Strategy Officer at Astronomer, told VentureBeat in an exclusive interview. “This is almost a complete refactor based on what enterprises told us they needed for the next level of mission-critical adoption.”
Enterprise data complexity has changed data orchestration needs
As businesses increasingly rely on data-driven decision-making, the complexity of data workflows has exploded. Organizations now manage intricate pipelines spanning multiple cloud environments, diverse data sources and increasingly sophisticated AI workloads.
Airflow 3.0 emerges as a solution specifically designed to meet these evolving enterprise needs. Unlike previous versions, this release breaks away from a monolithic package, introducing a distributed client model that provides flexibility and security. This new architecture allows enterprises to:
- Execute tasks across multiple cloud environments.
- Implement granular security controls.
- Support diverse programming languages.
- Enable true multi-cloud deployments.
Airflow 3.0’s expanded language support is another notable change. While previous versions were primarily Python-centric, the new release natively supports multiple programming languages.
Airflow 3.0 is set to support Python and Go, with planned support for Java, TypeScript and Rust. This approach means data engineers can write tasks in their preferred programming language, reducing friction in workflow development and integration.
Event-driven capabilities transform data workflows
Airflow has traditionally excelled at scheduled batch processing, but enterprises increasingly need real-time data processing capabilities. Airflow 3.0 addresses that need.
“A key change in Airflow 3 is what we call event-driven scheduling,” Koka explained.
Instead of running a data processing job every hour, Airflow now automatically starts the job when a specific data file is uploaded or when a particular message appears. This could include data loaded into an Amazon S3 cloud storage bucket or a streaming data message in Apache Kafka.
The event-driven scheduling capability addresses a critical gap between traditional ETL [Extract, Transform and Load] tools and stream processing frameworks like Apache Flink or Apache Spark Structured Streaming, allowing organizations to use a single orchestration layer for both scheduled and event-triggered workflows.
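To make the shift concrete, here is a minimal sketch of what an event-triggered pipeline can look like in DAG code. It assumes Airflow 3.0’s asset-based scheduling exposed through the new Task SDK; the S3 path and task body are hypothetical placeholders, and exact import paths may differ in a given installation.

```python
# Minimal sketch: a DAG that runs when new data lands, not on a cron schedule.
# Assumes Airflow 3.0's Asset-based scheduling; the S3 path is a placeholder.
from airflow.sdk import Asset, dag, task

raw_events = Asset("s3://example-bucket/raw/events.json")  # hypothetical object to watch

@dag(schedule=[raw_events])  # trigger on updates to the asset instead of a time interval
def process_new_events():

    @task
    def transform():
        # In a real pipeline this would read the newly arrived file and
        # write transformed records to a warehouse or feature store.
        print("New data arrived, processing it now")

    transform()

process_new_events()
```

The key difference from 2.x-style dataset scheduling is that the update signal can now come from an external system, such as an S3 upload or a Kafka message, rather than only from an upstream Airflow task.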
Airflow will accelerate enterprise AI inference execution and compound AI
Event-driven data orchestration will also help Airflow support rapid inference execution.
As an example, Koka detailed a use case where real-time inference is used for professional services like legal time tracking. In that scenario, Airflow can be used to help collect raw data from sources like calendars, emails and documents. A large language model (LLM) can be used to transform unstructured information into structured data. Another pre-trained model can then be used to analyze the structured time tracking data, determine if the work is billable, then assign appropriate billing codes and rates.
Koka referred to this approach as a compound AI system – a workflow that strings together different AI models to complete a complex task efficiently and intelligently. Airflow 3.0’s event-driven architecture makes this type of real-time, multi-step inference process possible across various enterprise use cases.
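As a rough illustration of how such a compound AI workflow maps onto Airflow, the sketch below chains the three stages described above as separate tasks. The DAG structure follows standard TaskFlow conventions; the task bodies are hypothetical placeholders standing in for real LLM and billing-model calls.

```python
# Sketch of a "compound AI" pipeline: each model call is its own task, so the
# chain is explicit, observable and independently retryable.
# Task bodies are placeholders for real LLM and billing-model services.
from airflow.sdk import dag, task

@dag(schedule=None)  # could equally be asset-triggered, as in the earlier example
def time_tracking_pipeline():

    @task
    def collect_raw_data() -> list[dict]:
        # Pull raw entries from calendars, email and documents.
        return [{"source": "calendar", "text": "Call with client about contract review"}]

    @task
    def extract_structured(entries: list[dict]) -> list[dict]:
        # An LLM would turn unstructured text into structured time records here.
        return [{"client": "Acme", "minutes": 30, "description": e["text"]} for e in entries]

    @task
    def assign_billing(records: list[dict]) -> list[dict]:
        # A second, pre-trained model decides billability and assigns billing codes.
        return [{**r, "billable": True, "billing_code": "LIT-101"} for r in records]

    assign_billing(extract_structured(collect_raw_data()))

time_tracking_pipeline()
```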
Compound AI is an approach that was first defined by the Berkeley Artificial Intelligence Research (BAIR) lab in 2024 and is a bit different from agentic AI. Koka explained that agentic AI allows for autonomous AI decision-making, whereas compound AI has predefined workflows that are more predictable and reliable for business use cases.
Playing ball with Airflow: how the Texas Rangers look to benefit
Among the many users of Airflow is the Texas Rangers Major League Baseball team.
Oliver Dykstra, full-stack data engineer at the Texas Rangers Baseball Club, told VentureBeat that the team uses Airflow hosted on Astronomer’s Astro platform as the ‘nerve center’ of baseball data operations. He noted that all player development, contracts, analytics and, of course, game data is orchestrated through Airflow.
“We’re looking forward to upgrading to Airflow 3 and its enhancements to event-driven scheduling, observability and data lineage,” Dykstra stated. “As we already rely on Airflow to manage our critical AI/ML pipelines, the added efficiency and reliability of Airflow 3 will help increase trust and resiliency of these data products within our entire organization.”
What this means for enterprise AI adoption
For technical decision-makers evaluating data orchestration strategy, Airflow 3.0 delivers actionable benefits that can be implemented in phases.
The first step is evaluating current data workflows that would benefit from the new event-driven capabilities. Organizations can identify data pipelines that currently run on fixed schedules but could be managed more efficiently with event-based triggers. This shift can significantly reduce processing latency while eliminating wasteful polling operations.
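In practice, that change can be as small as swapping a DAG’s schedule argument, as in this minimal before/after sketch that reuses the assumed Asset API from the earlier example (the bucket path is a placeholder):

```python
from airflow.sdk import Asset, dag, task

new_orders = Asset("s3://example-bucket/orders/latest.json")  # placeholder path

# Before: schedule="@hourly" polled on a fixed interval whether or not new data arrived.
# After: run only when the tracked object is actually updated.
@dag(schedule=[new_orders])
def orders_pipeline():

    @task
    def load():
        print("Loading the updated orders file")

    load()

orders_pipeline()
```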
Next, technology leaders should assess their development environments to determine if Airflow’s new language support could consolidate fragmented orchestration tools. Teams currently maintaining separate orchestration tools for different language environments can begin planning a migration strategy to simplify their technology stack.
For enterprises leading the way in AI implementation, Airflow 3.0 represents a critical infrastructure component that can address a significant challenge in AI adoption: orchestrating complex, multi-stage AI workflows at enterprise scale. The platform’s ability to coordinate compound AI systems could help organizations move beyond proof-of-concept to enterprise-wide AI deployment with proper governance, security and reliability.