Big data processing can take place in many contexts. Sometimes you’re prototyping new pipelines, and at other times you’re deploying them to run at scale. Sometimes you’re working on-premises, and at other times you’re in the cloud. Sometimes you care most about speed of execution, and at other times you want to optimize for the lowest possible processing cost. The best deployment option often depends on this context. It also changes over time; new data processing engines become available, each optimized for specific needs — from the venerable Hadoop MapReduce to Storm, Spark, Tez or Flink, all in open source, as well as cloud-native services. Today’s optimal choice of big data runtime might not be tomorrow’s.
But in all these cases, what remains true is that you need an easy-to-use, powerful and flexible programming model that makes developers productive. And no one wants to have to rewrite their algorithm for a specific runtime.
We believe the Dataflow programming model, based on years of experience at Google, can provide maximum developer productivity and seamless portability. That’s why in December we open sourced the Cloud Dataflow SDK, which offers a set of primitives for large-scale distributed computing, including rich semantics for stream processing. This allows the same program to execute either in stream or batch mode.
Today, we’re taking the next step in ensuring the portability of the Dataflow programming model by working with Cloudera to make Dataflow run on Spark. There are currently three runners available to allow Dataflow programs to execute in different environments:
- Direct Pipeline: The “Direct Pipeline” runner executes the program on the local machine.
- Google Cloud Dataflow: The Google Cloud Dataflow service is a hosted and fully managed execution environment for Dataflow programs on Google Cloud Platform. Programs can be deployed on it via a runner. This service is currently in alpha phase and available to a limited number of users; you can apply here.
- Spark: Thanks to Cloudera, the Spark runner allows the same Dataflow program to execute on a Spark cluster, whether in the cloud or on-premises. The runner is part of the Cloudera Labs effort and is available in this GitHub repo. You can find out more about Dataflow and the Spark runner from Cloudera’s Josh Wills in this blog post.
We are delighted that Cloudera is joining us, and we look forward to the future growth of the Dataflow ecosystem. We’re confident that Dataflow programs will make data more useful in an ever-growing number of environments, in cloud or on-premises. Please join us – whether by using the Dataflow SDK (deploying via one of the three runners listed above) for your own data processing pipelines, or by creating a new Dataflow runner for your favorite big data runtime.
-Posted by William Vambenepe, Product Manager
Feed Source: Google Cloud Platform Blog
Article Source: Easily run Dataflow Big Data pipelines anywhere, thanks to Cloudera