Staging for BigQuery - DSS 6 | Data Source Solutions Documentation


Staging for BigQuery

Data Source Solutions (DSS) stages data in a temporary directory (staging files) before loading it into the target BigQuery location during Burst Integrate and Bulk Refresh. For more information about staging files on BigQuery, refer to the Google BigQuery documentation.

The following outlines how DSS implements staging for BigQuery:

  1. DSS first writes data into the staging directory on Google Cloud Storage. Since 6.2.5/1, the staging files can be stored in CSV or Parquet format.

Known issues with CSV and Parquet:

  • Dates and timestamps before the year 1583 can be replicated incorrectly due to the shift from the Julian to the Gregorian calendar.

  • TIMESTAMP WITH TIMEZONE values with fractional seconds in years before 1698 or after 2242 may be reported as mismatches by Compare due to a rounding error during data querying.

  • Parquet timestamps are interpreted as UTC. For replication to work correctly, ensure the replication environment uses UTC by setting the environment variable TZ=UTC. You can define this environment variable using the action Environment.

  2. DSS then uses the BigQuery SQL command `INSERT INTO` to ingest data into the BigQuery target tables from the staging directory.

     To ingest data into a BigQuery target table, DSS pulls the compressed data from the staging directory (located on Google Cloud Storage) into the target table. A special 'external table' must exist for each target table that DSS loads data into. DSS creates these tables in BigQuery with names following the suffix patterns `__x` or `__bx`.
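As a rough illustration of step 2, the staged-load pattern (an external table over the staging files, drained into the target with `INSERT INTO ... SELECT`) can be sketched as below. The helper function, table names, and column names are hypothetical; the actual SQL that DSS issues is internal to the product.

```python
# Hypothetical sketch of the staged-load SQL pattern described above.
def staged_insert_sql(target: str, columns: list) -> str:
    """Build an INSERT INTO ... SELECT that pulls rows from the
    external table (named with the `__x` suffix pattern) into the
    BigQuery target table."""
    external = f"{target}__x"  # external table over the GCS staging files
    cols = ", ".join(columns)
    return f"INSERT INTO {target} ({cols}) SELECT {cols} FROM {external}"

print(staged_insert_sql("orders", ["id", "amount"]))
# INSERT INTO orders (id, amount) SELECT id, amount FROM orders__x
```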

Configuring Staging

Setting up staging on Google Cloud Storage (GCS) requires configuring both the GCS environment and DSS.

Prerequisites

  • A GCS location (bucket) to store staging files. For more information about creating and configuring a GCS bucket, refer to the GCS documentation.

  • A Google Cloud user (service account) to access the GCS bucket.

  • Configure permissions required on GCS and BigQuery. For more information, refer to the Google BigQuery documentation.

DSS Location Configuration

To connect DSS to GCS, define the following location properties when creating a BigQuery location.
For an existing BigQuery location, you can define these properties by editing the location's source and target properties.

The Location Property equivalent to the UI field is shown in parentheses below.

  1. In the STORAGE drop-down menu (File_Scheme), select the protocol for connecting DSS to Google Cloud Storage:

    • Google Cloud Storage (HTTPS)
    • Google Cloud Storage (HTTP)
  2. In the BUCKET field (GS_Bucket), enter the name of the GCS bucket.

  3. In the STAGING DIRECTORY field (Staging_Directory), specify the path to the directory where DSS will create temporary staging files within the GCS bucket (e.g. /my_staging_dir).

  4. Since 6.2.5/1, in the STAGING FILE FORMAT drop-down menu (Staging_File_Format), select the format for storing the staging files:

    • CSV

Known issues with CSV: TIMESTAMP WITH TIMEZONE values with fractional seconds in years before 1698 or after 2242 may be reported as mismatches by Compare due to a rounding error during data querying.

    • Parquet

Known issues with Parquet:

  • Dates and timestamps before the year 1583 can be replicated incorrectly due to the shift from the Julian to the Gregorian calendar.

  • TIMESTAMP WITH TIMEZONE values with fractional seconds in years before 1698 or after 2242 may be reported as mismatches by Compare due to a rounding error during data querying.

  • Parquet timestamps are interpreted as UTC. For replication to work correctly, ensure the replication environment uses UTC by setting the environment variable TZ=UTC. You can define this environment variable using the action Environment with parameters Name=TZ and Value=UTC.
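A minimal sketch of why the TZ=UTC setting matters, assuming a POSIX environment (Python's `time.tzset()` is Unix-only): once TZ=UTC is in effect for the process, interpreting a timestamp in "local" time yields the same wall-clock value as interpreting it in UTC, which is what the Parquet handling above expects.

```python
import os
import time
from datetime import datetime, timezone

# Illustration only: force this process into UTC, as the Environment
# action (Name=TZ, Value=UTC) does for the replication process.
os.environ["TZ"] = "UTC"
time.tzset()  # POSIX-only: make the C library re-read TZ

# With TZ=UTC, the "local" interpretation of an epoch value matches
# the UTC interpretation exactly.
local = datetime.fromtimestamp(0)                               # local tz
utc = datetime.fromtimestamp(0, tz=timezone.utc).replace(tzinfo=None)
print(local == utc)  # True
```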