Dataflow is a serverless data capture and processing service on Google Cloud. In this tutorial, you'll learn how Datastream integrates seamlessly with Dataflow through streamlined Dataflow templates to maintain up-to-date materialized views in BigQuery for analytics.

2024/06/10 02:23:34

Author: Yu Wenbao, Public Cloud Architect, MeshCloud

Background

For enterprises with many independent data sources, accessing enterprise data across the organization, especially in real time, is not easy. The result is limited and slow data access, which prevents organizations from examining their data in a timely way.

Datastream provides near real-time access to change data from a variety of on-premises and cloud-based data sources, creating access to your organization's data. Datastream offers a simple setup experience and a unified consumption API, giving the organization access to the freshest available enterprise data across the organization to support integrated near real-time scenarios.

One scenario is to move data from a source database into a cloud storage service or messaging queue, and then transform the data into a form that can be read by other applications and services that communicate with that storage service or messaging queue. One such service is Dataflow, a serverless data capture and processing service on Google Cloud.

In this tutorial, you will learn how Datastream integrates seamlessly with Dataflow through streamlined Dataflow templates to maintain up-to-date materialized views in BigQuery for analysis.

You will learn how to use Datastream to stream changes (inserted, updated, or deleted data) from a source MySQL database to a folder in a Cloud Storage bucket.

You will configure the Cloud Storage bucket to send notifications to Dataflow about any new files that contain changes to data that Datastream streams from the source database. A Dataflow job then processes these files and transfers the changes to BigQuery.

Architecture diagram

[Architecture diagram: Datastream streams changes from the source MySQL database into a Cloud Storage bucket; Pub/Sub notifications trigger a Dataflow job that processes the change files and writes them to BigQuery.]

Preparation work

  1. Enable Datastream API.
  2. Make sure you have assigned the Datastream Admin role to your user account.
  3. Make sure you have a source database that Datastream can access. This tutorial uses a MySQL database as the source.
  4. Ensure that the data, tables and schema are available in the source database.
  5. Configure the source database to allow incoming connections from the Datastream public IP addresses. For a list of all Datastream regions and their associated public IP addresses, see IP permission lists and regions.
  6. Set up change data capture (CDC) for the source database. For more information, see Configure a source MySQL database. (A command-line sketch for this step appears after this list.)
  7. Make sure you have configured a target Cloud Storage bucket that Datastream can access.
  8. Make sure you meet all prerequisites to enable Pub/Sub notifications for Cloud Storage.
  9. You will create a target bucket in Cloud Storage and enable Pub/Sub notifications for the bucket. With this set up, Dataflow can receive notifications about new files that Datastream writes to the bucket. These files contain the data changes that Datastream streams from the source database to the bucket.
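If you prefer the command line, the following is a minimal sketch of roughly equivalent commands for some of these preparation steps. It assumes that the gcloud CLI and a MySQL client are installed; the host, the user name datastream_user, and the password are placeholders, and the exact privileges required should be verified against the Datastream documentation for your MySQL version.

    # Enable the APIs used in this tutorial.
    gcloud services enable datastream.googleapis.com dataflow.googleapis.com \
        pubsub.googleapis.com bigquery.googleapis.com

    # Confirm that binary logging (required for CDC) is enabled on the source MySQL database.
    mysql -h SOURCE_HOST -u admin -p -e "SHOW VARIABLES LIKE 'log_bin';"

    # Create a replication user for Datastream (example privileges; adjust as needed).
    mysql -h SOURCE_HOST -u admin -p -e "
      CREATE USER 'datastream_user'@'%' IDENTIFIED BY 'REPLACE_WITH_PASSWORD';
      GRANT REPLICATION SLAVE, REPLICATION CLIENT, SELECT ON *.* TO 'datastream_user'@'%';
      FLUSH PRIVILEGES;"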

Requirements

Datastream offers a variety of source options, destination options, and network connectivity methods.

In this tutorial, we assume that you are using a standalone MySQL database and target Cloud Storage service. For the source database, you should be able to configure the network to add inbound firewall rules. The source database can be on-premises or on a cloud provider. For Cloud Storage destinations, there is no need to configure a connection.

Because we don't know the specific details of your environment, we can't provide detailed steps for network configuration.

In this tutorial, you will select the IP permission list as the network connection method. IP permission lists are a security feature typically used to restrict and control access to the data in your source database to trusted users only. You can use IP permission lists to create a list of trusted IP addresses or IP address ranges through which your users and other Google Cloud services (such as Datastream) can access this data. To use IP permission lists, you must open the source database or its firewall to incoming connections from Datastream.
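For example, if your source MySQL database runs on a Compute Engine VM, a firewall rule like the following sketch would allow the Datastream public IP addresses to reach port 3306. The network name, target tag, and IP ranges here are placeholders, not values from this tutorial; substitute the Datastream public IP addresses listed for your region.

    # Hypothetical firewall rule opening MySQL (port 3306) to Datastream's public IPs.
    gcloud compute firewall-rules create allow-datastream-mysql \
        --network=default \
        --direction=INGRESS \
        --action=ALLOW \
        --rules=tcp:3306 \
        --source-ranges=DATASTREAM_IP_RANGE_1,DATASTREAM_IP_RANGE_2 \
        --target-tags=mysql-source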

Create a bucket in Cloud Storage

In this section, you create a bucket in Cloud Storage. Datastream streams the schemas, tables, and data from the source MySQL database into this destination bucket.

  1. Go to the Cloud Storage browser page in the Google Cloud Console.
  2. Click Create bucket. The Create a bucket page is displayed.
  3. In the text field of the Name your bucket section, enter my-integration-bucket-001, and then click Continue.
  4. Accept the default settings for each remaining section of the page by clicking Continue at the bottom of each section.
  5. Click Create.
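Alternatively, the bucket can be created from Cloud Shell. This is a sketch that assumes the default settings used above; the location is an example and can be any region you prefer.

    # Create the destination bucket (choose a location appropriate for your project).
    gsutil mb -l us-central1 gs://my-integration-bucket-001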


Enabling Pub/Sub notifications for a Cloud Storage bucket

In this section, you enable Pub/Sub notifications for the Cloud Storage bucket that you created. By doing so, you configure the bucket to send notifications that Dataflow can use to learn about any new files that Datastream writes to the bucket. These files contain changes to the data that Datastream streams from the source MySQL database to the bucket.

Before enabling notifications for the bucket, make sure that all prerequisites for enabling Pub/Sub notifications for Cloud Storage are met.

  1. Go to the Cloud Storage bucket that you created. The bucket details page is displayed.
  2. Click the Activate Cloud Shell button in the upper-right corner of the page.
  3. At the prompt, enter the following command:
  4. gsutil notification create -t my_integration_notifs -f json gs://my-integration-bucket-001
  5. By entering this command, you create the my_integration_notifs topic in Pub/Sub. This topic is configured to send notifications to Dataflow about changes to the data that Datastream streams to the Cloud Storage bucket (my-integration-bucket-001).
  6. Any changes made to this data are captured in the Pub/Sub topic. This information is available to any subscriber to the topic (such as Dataflow).
  7. Optional: If you see the Authorize Cloud Shell window, click Authorize.
  8. Verify that the command output confirms that the notification was created and that it references the projects/project-name/topics/my_integration_notifs topic, where:


  1. project-name is a placeholder for your Google Cloud project name.
  2. Go to the Pub/Sub Topics page in the Google Cloud Console.


  1. Click the my_integration_notifs topic that you created.
  2. On the my_integration_notifs page, scroll to the bottom of the page. The Subscriptions tab is active, and a message indicates that there are no subscriptions to display.
  3. Next, you create a subscription to the my_integration_notifs topic. Applications (such as Dataflow) that use this subscription can retrieve the information in the topic. This information is associated with the changes that Datastream streams from the source database to the Cloud Storage bucket.
  4. Click Create subscription, and then select Create subscription from the drop-down menu that appears.
  5. Populate the Add subscription to topic page as follows:
  1. In the Subscription ID field, enter the ID of the subscription. For this tutorial, enter my_integration_notifs_sub in the field.
  2. Retain all other default values on the page.
  3. Click Create.
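If you prefer Cloud Shell over the console, the subscription can also be created with a command like the following sketch.

    # Create the subscription that Dataflow will use to receive bucket notifications.
    gcloud pubsub subscriptions create my_integration_notifs_sub \
        --topic=my_integration_notifs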


Later in this tutorial, you will create a Dataflow job from a template. When creating this job, you specify Dataflow as a subscriber to the my_integration_notifs_sub subscription. With this specified, Dataflow can receive notifications about new files that Datastream writes to Cloud Storage, process those files, and transfer the data changes to BigQuery.

Creating a dataset in BigQuery

In this section, you create a dataset in BigQuery. BigQuery uses datasets to contain the data it receives from Dataflow. This data represents changes in the source MySQL database that Datastream will stream to the Cloud Storage bucket.

  1. Go to the SQL workspace page of BigQuery in the Google Cloud Console.
  2. In the explorer pane, click the view actions button to the right of your Google Cloud project name. This button looks like a vertical ellipsis.
  3. Select Create dataset from the drop-down menu that appears.
  4. Populate the Create Dataset window as follows:
  1. In the Dataset ID field, enter the ID of the dataset. For this tutorial, enter My_integration_dataset_log in the field.
  2. Retain all other defaults in the window.
  3. Click Create dataset.


  1. Dataflow will use the My_integration_dataset_log dataset to stage data changes received from Datastream.
  2. In the explorer pane, click the node icon to the left of your Google Cloud project name and verify that you can see the dataset you created.
  3. Follow the steps in this procedure to create a second dataset: My_integration_dataset_final.
  4. In the My_integration_dataset_final dataset, the changes staged in the My_integration_dataset_log dataset are merged to create one-to-one replicas of the tables in the source database.
  5. Expand the nodes to the left of each dataset.
  6. Verify that each dataset is empty.
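The two datasets can also be created from Cloud Shell with the bq tool, as in this sketch. The location is an example; use the location that matches the rest of your setup.

    # Create the staging and replica datasets used later by the Dataflow template.
    bq mk --dataset --location=US My_integration_dataset_log
    bq mk --dataset --location=US My_integration_dataset_final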


After Datastream streams the data changes from the source database to the Cloud Storage bucket, the Dataflow job processes the file containing the changes and transfers the changes to the BigQuery dataset.

Create connection profiles in Datastream

In this section, you create connection profiles in Datastream for the source database and target location. When creating the connection profile, you need to select MySQL as the source connection profile type and Cloud Storage as the destination connection profile type.

Datastream uses the information defined in the connection profiles to connect to both the source and the destination so that it can stream data from the source database to the destination bucket in Cloud Storage.

Create a source connection profile for the MySQL database

  1. Go to Datastream's connection profile page in the Google Cloud Console.
  2. Click Create profile.
  3. In the Create Connection Profile page, click the MySQL profile type (because you want to create a source connection profile for the MySQL database).
  4. Provide the following information in the Define connection settings section of the Create MySQL profile page:
  • Enter My Source Connection Profile as the connection profile name for the source database.
  • Retain the automatically generated connection profile ID.
  • Select the region in which to store the connection profile.
  • Like all resources, connection profiles are saved in a region, and a stream can only use connection profiles stored in the same region as the stream. Region selection does not affect whether Datastream can connect to the source or destination, but availability can be affected if the region experiences downtime.
  • Enter the connection details :
  • In the Hostname or IP field, enter the hostname or public IP address that Datastream can use to connect to the source database. You will provide a public IP address because the IP permission list will be used as the networking method for this tutorial.
  • In the port field, enter the port number reserved for the source database. For MySQL databases, the default port is usually 3306.
  • Enter the username and password to authenticate to the source database.


  1. In the Define connection settings section, click Continue. The Secure your connection to your source section of the Create MySQL profile page becomes active.
  2. From the Encryption type menu, select None. To learn more about this menu, see Create a connection profile for a MySQL database.
  3. In the Secure your connection to your source section, click Continue. The Define connectivity method section of the Create MySQL profile page becomes active.
  4. Select the networking method to use to establish connectivity between the source database and the destination bucket in Cloud Storage. For this tutorial, select IP permission list from the Connectivity method drop-down menu.
  5. Configure the source database to allow incoming connections from the Datastream public IP addresses that are displayed.
  6. In the Define connectivity method section, click Continue. The Test connection profile section of the Create MySQL profile page becomes active.
  7. Click Run test to verify that the source database and Datastream can communicate with each other.
  8. Verify that you see the "Test passed" status.
  9. If the test fails, fix the problem in the appropriate part of the flow, and then retest.


  1. Click Create.
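A roughly equivalent Cloud Shell command is sketched below. The flag names reflect the gcloud datastream CLI as recalled here and may differ between gcloud versions, so verify them with gcloud datastream connection-profiles create --help; the hostname, user, password, and region are placeholders.

    # Sketch: create the MySQL source connection profile from the command line.
    gcloud datastream connection-profiles create my-source-connection-profile \
        --location=us-central1 \
        --type=mysql \
        --display-name="My Source Connection Profile" \
        --mysql-hostname=SOURCE_HOST_OR_PUBLIC_IP \
        --mysql-port=3306 \
        --mysql-username=datastream_user \
        --mysql-password=REPLACE_WITH_PASSWORD \
        --static-ip-connectivity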

Create a target connection profile for Cloud Storage

  1. Go to Datastream's connection profile page in the Google Cloud Console.
  2. Click Create profile.
  3. In the Create Connection Profile page, click the Cloud Storage profile type (because you want to create a target connection profile for Cloud Storage).
  4. Provide the following information in the Create Cloud Storage Profile page:
  • Enter My Destination Connection Profile as the connection profile name for the target Cloud Storage service.
  • Retain the automatically generated connection profile ID.
  • Select the region in which to store the connection profile.
  • In the Connection details pane, click Browse and select the my-integration-bucket-001 bucket that you created earlier in this tutorial. Datastream transfers data from the source database into this bucket. After making the selection, click Select.
  • Your bucket appears in the Bucket name field of the Connection details pane.
  • In the Connection profile path prefix field, provide a prefix to append to the bucket name when Datastream streams data to the destination. For this tutorial, enter /integration/tutorial in the field.
  • Note: Any path that you enter in this field must begin with a slash (/).
  1. Click Create.


After you create the source connection profile for the MySQL database and the destination connection profile for Cloud Storage, you can use them to create a stream.

Creating a stream in Datastream

In this section, you create a stream. This stream moves data from the source MySQL database to the destination bucket in Cloud Storage.

Creating a stream involves:

  • Defining the settings of the stream.
  • Selecting the connection profile that you created for the source database (the source connection profile). In this tutorial, this is My Source Connection Profile.
  • Configuring information about the source database for the stream by specifying the tables and schemas in the source database that Datastream:
  • Can transfer to the destination.
  • Is restricted from transferring to the destination.
  • Determining whether Datastream backfills historical data and streams ongoing changes to the destination, or streams only changes to the data.
  • Selecting the connection profile that you created for Cloud Storage (the destination connection profile). In this tutorial, this is My Destination Connection Profile.
  • Configuring information about the destination bucket for the stream. This information includes:
  • The folder of the destination bucket to which Datastream transfers the schemas, tables, and data from the source database.
  • The output format of files written to Cloud Storage. Datastream currently supports two output formats: Avro and JSON. In this tutorial, the Avro format is used.

Datastream provides a wizard to help you create a stream. The wizard contains six panels: Getting Started, Define and Test Sources, Configure Sources, Define Destination, Configure Destination, and Review and Create. The following sections describe how to populate each panel.

Define the settings of the data flow

  1. Go to the Datastream Streams page in the Google Cloud Console.
  2. Provide the following information in the Define stream details panel of the Create stream page:
  • Enter My Stream as the stream name.
  • Retain the automatically generated stream ID.
  • From the Region menu, select the region in which you created the source and destination connection profiles.


  • From the source type menu, select the MySQL profile type.
  • From the target type menu, select the Cloud Storage profile type.



Select the target connection profile

  1. From the Target Connection Profile menu, select the target connection profile for Cloud Storage.
  2. Click Continue. The Configure stream destination panel of the Create stream page is displayed.


Configure information about the destination of the data stream

  1. In the Output format field, select the format for files written to Cloud Storage. In this tutorial, the Avro file format is used.
  2. In the Stream path prefix field, you can provide a prefix to append to the bucket name when Datastream transfers data to the destination. This is the path in your Cloud Storage bucket into which Datastream transfers the schemas, tables, and data from the source MySQL database.
  3. Because you provided the /integration/tutorial path when you created the destination connection profile for Cloud Storage, you do not need to populate this field.
  4. Click Continue. The Review stream details and create panel of the Create stream page is displayed.


Create the data flow

  1. Verify the details of the stream, as well as the source and destination connection profiles that the stream uses to move data from the source MySQL database to the destination bucket in Cloud Storage.
  2. Click Run validation to validate the stream. By validating a stream, Datastream checks that the source is configured correctly, verifies that the stream can connect to both the source and the destination, and verifies the end-to-end configuration of the stream.
  3. If a validation check passes, a checkmark icon appears to the left of the check.
  4. If a validation check fails, an exclamation mark icon is displayed to the left of the check, and a View error details button is displayed below it. Clicking the button opens a window that explains why the check failed and how to correct the problem. After making the corrections, click Revalidate.
  5. For more information on troubleshooting failed validation checks, see Diagnose issues.
  6. After all validation checks pass, click Create.
  7. In the Create stream? dialog box, click Create.

After creating the data stream, you can start it.


Starting the data flow

In the previous part of this tutorial, you created a stream but did not start it. You can start it now.

In this tutorial, you create and start the stream separately so that the process of creating the stream does not add load to the source database. To defer that load, you create the stream without starting it, and then start the stream when your source database can handle the load.

Datastream moves data, schemas, and tables from a source database to a target location by initiating a data flow.

  1. Select the checkbox to the left of the stream that you want to start. For this tutorial, this is My Stream.
  2. Click Start.
  3. In the dialog box, click Start. The status of the stream changes from Not started to Starting to Running.
  4. The stream takes approximately 30 seconds to start running because background resources must be started first.

After you start the stream, you can verify that Datastream transfers data from the source database to the destination.

Validating the data flow

In this section, you confirm that Datastream transfers data from all tables in the source MySQL database to the /integration/tutorial folder of the Cloud Storage destination bucket. For this tutorial, your bucket is named my-integration-bucket-001.

  1. In the Cloud Storage browser, click the link consisting of the name of the bucket and the folder to which Datastream transfers the schemas, tables, and data from the source database. When you created the destination connection profile for Cloud Storage, you specified this folder as /integration/tutorial, so the link should read my-integration-bucket-001/integration/tutorial.
  2. Verify that you see folders representing the tables of the source database.
  3. Click one of the table folders and drill down until you see the data associated with that table.
  4. The first folder is [schema]_[table], followed by folders indicating the specific time (year, month, day, hour, and minute) at which Datastream transferred the data from the source database to the destination bucket in Cloud Storage.
  5. Datastream creates a new folder every minute when there is new data to write.
  6. When a file reaches 250 MB in size or the schema changes, a new file is created. If the table is partitioned, files are created for each partition.
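You can also verify the output from Cloud Shell. The following sketch lists everything that Datastream has written under the stream's path prefix; the exact folder layout depends on your schemas and tables.

    # Recursively list the files Datastream has written under the path prefix.
    gsutil ls -r gs://my-integration-bucket-001/integration/tutorial/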


Creating a Dataflow job

In this section, you create a job in Dataflow. After Datastream streams data changes from the source MySQL database to the Cloud Storage bucket, notifications about the new files containing the changes are sent to Dataflow. The Dataflow job processes these files and transfers the changes to BigQuery.

  1. Go to the Dataflow Jobs page in the Google Cloud Console and click Create job from template.
  2. On the Create job from template page, in the job name field, enter the name of the Dataflow job that you want to create. For this tutorial, enter my-dataflow-integration-job in the field.
  3. From the region endpoint menu, select the region where the job will be stored. This region is the region you select for the source connection profile , target connection profile , and data flow you create.
  4. From the Dataflow Template menu, select the template you want to use to create the job. For this tutorial, choose Datastream to BigQuery.
  5. After you make a selection, additional fields related to this template appear.
  6. The Datastream to BigQuery template is a streaming pipeline that reads Datastream data and replicates it into BigQuery. The template uses Pub/Sub notifications to read the data from Cloud Storage and copies it into a time-partitioned BigQuery staging table. After the copy, the template executes a MERGE in BigQuery to upsert all change data capture (CDC) changes into a replica of the source table.
  7. To minimize the cost of frequent MERGE operations, we recommend starting with an initial frequency of every 6-12 hours. Once all backfilling is complete and the data is successfully copied, lower this value to the desired frequency.
  8. For more information on the Datastream to BigQuery template, see Datastream to BigQuery (Stream).
  9. In the File location for Datastream file output in Cloud Storage field, enter the path that contains the name of your Cloud Storage bucket. For this tutorial, enter gs://my-integration-bucket-001 in the field.
  10. In the Pub/Sub subscription used in Cloud Storage notification policy field, enter the path that contains the Pub/Sub subscription name. For this tutorial, enter projects/project-name/subscriptions/my_integration_notifs_sub in the field.
  11. project-name is a placeholder for your Google Cloud project name. Additionally, you created the my_integration_notifs_sub subscription in the Enable Pub/Sub notifications for a Cloud Storage bucket section of this tutorial.
  12. Enter avro in the Datastream output file format (avro/json) field because, for this tutorial, Avro is the format of the files that Datastream writes to Cloud Storage.
  13. In the Name or template for the dataset to contain staging tables field, enter My_integration_dataset_log, because Dataflow uses this dataset to stage the data changes received from Datastream.
  14. In the Template for the dataset to contain replica tables field, enter My_integration_dataset_final, because the changes staged in the My_integration_dataset_log dataset are merged into this dataset to create one-to-one replicas of the tables in the source database.
  15. You created the My_integration_dataset_log and My_integration_dataset_final datasets in the Creating a dataset in BigQuery section of this tutorial.
  16. In the Dead letter queue directory field, enter a path that contains the name of your Cloud Storage bucket and a folder for the dead letter queue. Data changes that Dataflow cannot transfer to BigQuery are stored in this queue. You can fix the contents of the queue so that Dataflow can reprocess them.
  17. For this tutorial, enter gs://my-integration-bucket-001/dlq in the Dead letter queue directory field (where dlq is the folder for the dead letter queue).
  18. Click Run job.

Your Dataflow job may take a few minutes to start running because background resources must be started. In addition, the job must process the files containing data changes received from Cloud Storage before those changes can be transferred to BigQuery.
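The same job can also be launched from Cloud Shell. The sketch below reflects the Datastream to BigQuery Flex Template as recalled here; the template path and parameter names may vary by template version, so verify them against the template's reference documentation before running, and replace project-name with your Google Cloud project.

    # Sketch: launch the Datastream to BigQuery template from the command line.
    gcloud dataflow flex-template run my-dataflow-integration-job \
        --region=us-central1 \
        --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/Cloud_Datastream_to_BigQuery \
        --parameters inputFilePattern=gs://my-integration-bucket-001/integration/tutorial/ \
        --parameters gcsPubSubSubscription=projects/project-name/subscriptions/my_integration_notifs_sub \
        --parameters inputFileFormat=avro \
        --parameters outputStagingDatasetTemplate=My_integration_dataset_log \
        --parameters outputDatasetTemplate=My_integration_dataset_final \
        --parameters deadLetterQueueDirectory=gs://my-integration-bucket-001/dlq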


Validating the integration

In the Validating the data flow section of this tutorial, you confirmed that Datastream transferred data from all tables of the source MySQL database to the /integration/tutorial folder of the Cloud Storage destination bucket.

In this section, you verify that Dataflow processes the files containing the changes associated with this data and transfers the changes to BigQuery. This confirms that there is end-to-end integration between Datastream and BigQuery.

  1. Go to BigQuery's SQL workspace page in the Google Cloud Console.
  2. In the explorer pane, expand the node to the left of the Google Cloud project name.
  3. Expand the nodes to the left of the My_integration_dataset_log and My_integration_dataset_final datasets.
  4. Verify that each dataset now contains data. This confirms that Dataflow has processed the files associated with the data that Datastream streamed to Cloud Storage, and that those changes have been transferred to BigQuery.
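You can also check the replica dataset from Cloud Shell. In the following sketch, example_table is a hypothetical table name; substitute a table that actually exists in your source database.

    # List the tables Dataflow created, then sample rows from one replica table.
    bq ls My_integration_dataset_final
    bq query --use_legacy_sql=false \
        'SELECT * FROM My_integration_dataset_final.example_table LIMIT 10'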


Problem description:

Problem 1: The MySQL binary log (binlog) must be enabled on the source database for this process to work.

Problem 2: The file location for Datastream file output in Cloud Storage must be specified down to the path that contains the Avro files.
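For reference, enabling the binary log on a self-managed MySQL server typically involves settings like the following sketch in my.cnf, followed by a restart. These are standard MySQL options rather than values taken from this tutorial; confirm the values that Datastream recommends for your MySQL version.

    # Example my.cnf settings for CDC with Datastream (restart MySQL after editing).
    [mysqld]
    server-id        = 1
    log_bin          = mysql-bin
    binlog_format    = ROW
    binlog_row_image = FULL
    expire_logs_days = 7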
