Dataflow Flex Template

It is possible to run a jar file you created on Dataflow directly from Cloud Shell. However, the best solution for both manageability and triggering jobs from the console is to create a Flex Template. This way, parameters can be supplied from the outside, and workflows can be built on top of the template. Additionally, using Artifact Registry allows for versioning, making the process more robust.

Source: Flex Template Github, Dataflow Flex Template video, Flex Template parameters

Implementation

First, a local project is created and the pipeline code is written. The code is then packaged as a jar file:

 mvn clean package
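For a Flex Template, the jar should be self-contained. One common way to achieve this is the maven-shade-plugin; the sketch below shows a minimal pom.xml build configuration (the plugin version and main class shown are assumptions for illustration):

```xml
<!-- pom.xml, inside <build><plugins> -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>3.4.1</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>shade</goal>
      </goals>
      <configuration>
        <transformers>
          <!-- Optional: the Flex Template launcher can also receive the
               main class via FLEX_TEMPLATE_JAVA_MAIN_CLASS, as shown later -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
            <mainClass>org.example.analytics.DecryptionPoc</mainClass>
          </transformer>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>
```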

A metadata file is created for this jar. The metadata file includes the name, description, and other basic information of the Flex Template. Additionally, the parameters required for the job to run are defined here. Parameters can be specified as either required or optional.
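As a sketch, a metadata.json for this template might look like the following (the parameter names and regexes here are illustrative assumptions, not the exact file used in this project):

```json
{
  "name": "Dataflow Flex Template Test",
  "description": "An example Flex Template job.",
  "parameters": [
    {
      "name": "output",
      "label": "Output file prefix",
      "helpText": "Cloud Storage path prefix for output files, e.g. gs://your-bucket/output-",
      "regexes": ["^gs:\\/\\/[^\\n\\r]+$"],
      "isOptional": false
    },
    {
      "name": "numShards",
      "label": "Number of output shards",
      "helpText": "Optional number of shards to write.",
      "isOptional": true
    }
  ]
}
```

Parameters marked with "isOptional": true appear under the optional section of the job launch form; all others must be filled in before the job can run.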

A bucket is created in GCP to store the necessary files.

export BUCKET="your-bucket"
gcloud storage buckets create gs://$BUCKET

Next, an Artifact Registry Repository is created. This allows for the secure storage and management of packages, container images, and other software artifacts used in the development process. This process can also be done through the interface.

export REGION="your-region"
export REPOSITORY="your-repository"
gcloud artifacts repositories create $REPOSITORY \
--repository-format=docker \
--location=$REGION

Once the steps are completed, all details, including versions, can be viewed in the Artifact Registry.

As the final step, the jar file and metadata file are uploaded via CLI, and the code block that creates the template is executed.

gcloud dataflow flex-template build gs://path/dataflow_flex_template_test.json \
--image-gcr-path "path/dataflow_flex_template_test:1.1" \
--sdk-language "JAVA" \
--flex-template-base-image JAVA8 \
--metadata-file "metadata.json" \
--jar "dataflow_flex_template_test.jar" \
--env FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.example.analytics.DecryptionPoc"

The location of the template to be created is specified with gs://path/dataflow_flex_template_test.json. The Artifact Registry image path and version are given with --image-gcr-path "path/dataflow_flex_template_test:1.1". The SDK language is set to "JAVA" with --sdk-language, and the Flex Template base image is chosen as JAVA8 with --flex-template-base-image. The metadata file and jar file are provided with --metadata-file "metadata.json" and --jar "dataflow_flex_template_test.jar", respectively. The main class to be executed is specified with --env FLEX_TEMPLATE_JAVA_MAIN_CLASS="org.example.analytics.DecryptionPoc".

Execution

When this code is executed, an image will be created under the Artifact Registry, and the JSON file will be generated under the specified bucket. With this JSON file, jobs can be executed either through Cloud Shell or the interface.

gcloud dataflow flex-template run "flex-`date +%Y%m%d-%H%M%S`" \
--template-file-gcs-location "gs://path/dataflow_flex_template_test.json" \
--region $REGION \
--parameters output="gs://$BUCKET/output-"

To run the job from the console instead, open Dataflow, choose "Create job from template", fill in the required fields, select "Custom Template," and provide the path to the generated JSON file.

After selecting the JSON file, the parameters defined in the metadata file will be displayed on the screen. Any optional parameters that were added will be visible in their own section. Fill in the required parameters and run the job.

During execution, the steps and parameters are displayed. You can monitor the progress by viewing the details from the interface.

Author: Gökay Solak, Senior Data Engineer, Oredata

Oredata is a premier Google Cloud Partner specialized in

  • Cloud Migration Services
  • Data & Analytics Services
  • Infrastructure Services
  • Google Workspace

If you are interested in joining us, feel free to apply to our job openings: https://www.linkedin.com/company/oredata/jobs/


Contact us