Google Cloud Storage Requirements - DSS 6 | Data Source Solutions Documentation
Google Cloud Storage Requirements
This section describes the requirements, access privileges, and other features of Data Source Solutions DSS when using Google Cloud Storage (GCS) for replication.
Supported Platforms
- See the Platform Support Matrix page for the DSS versions (6.1.0, 6.1.5, 6.2.0, and 6.2.5) that support Google Cloud Storage.
Data Management
- Learn how DSS maps data types between source and target DBMSes or file systems on the Data Type Mapping page.
DSS uses the GCS S3-compatible API (via the cURL library) to connect to, read from, and write data to Google Cloud Storage during Capture, Continuous Integrate, Bulk Refresh, and Direct File Compare.
Permissions
To run a Capture, Refresh, or Integrate in a Google Cloud Storage location, it is recommended that the GCS user has the Storage Admin role (roles/storage.admin).
The minimal permission set for a capture or integrate location is:
- storage.buckets.get
- storage.multipartUploads.list
- storage.objects.list
- storage.objects.get
- storage.objects.create
- storage.objects.delete
For more information on the Google Cloud Storage role permissions, refer to the Google Cloud Storage documentation.
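As an illustration, the minimal permission set above can be verified against a list of granted permissions (for example, one returned by the IAM testIamPermissions API) with a few lines of set arithmetic. The helper name missing_permissions is ours, not part of DSS:

```python
# Minimal permissions DSS needs on a capture or integrate location
# (the list above in this section).
REQUIRED_PERMISSIONS = {
    "storage.buckets.get",
    "storage.multipartUploads.list",
    "storage.objects.list",
    "storage.objects.get",
    "storage.objects.create",
    "storage.objects.delete",
}

def missing_permissions(granted):
    """Return the required permissions absent from `granted`, sorted."""
    return sorted(REQUIRED_PERMISSIONS - set(granted))

# A Storage Admin grant is a superset, so nothing is missing:
print(missing_permissions(REQUIRED_PERMISSIONS | {"storage.buckets.update"}))  # []
```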
Hive External Tables
To Compare files that reside in a Google Cloud Storage location, DSS allows you to create Hive external tables on top of Google Cloud Storage. The Hive ODBC connection can be enabled for a Google Cloud Storage location by selecting the Hive External Tables field while creating a location or editing an existing location's file connection properties. For more information about configuring Hive external tables, refer to the Apache Hadoop documentation.
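For orientation, a Hive external table over a GCS path typically looks like the following. The bucket, path, and column names are placeholders, and DSS generates its own table definitions when the Hive External Tables option is enabled; this sketch only shows the general shape of such a table:

```sql
-- Hypothetical example: an external table over files in a GCS bucket.
-- Requires the GCS connector to be configured on the Hadoop cluster.
CREATE EXTERNAL TABLE orders_compare (
  order_id BIGINT,
  status   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 'gs://my-bucket/dss/orders/';
```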
ODBC Connection
DSS uses an ODBC connection to the Hadoop cluster, for which it requires an ODBC driver for Hive (Amazon ODBC or Hortonworks ODBC) installed on the DSS machine (or on a machine in the same network). The Amazon and Hortonworks ODBC drivers are similar, and both work with the Hive 2.x release. However, it is recommended to use the Amazon ODBC driver for Amazon Hive and the Hortonworks ODBC driver for Hortonworks Hive. For information about the supported ODBC driver versions, refer to the DSS release notes (dss.rel) available in the DSS_HOME directory or on the download page.
On Linux, DSS additionally requires unixODBC.
By default, DSS uses the Amazon ODBC driver for connecting to Hadoop. To use the (user-installed) Hortonworks ODBC driver instead, specify it in the ODBC Driver field in the DSS UI while creating a location or editing an existing location's file connection properties.
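On Linux, unixODBC discovers drivers through its odbcinst.ini file. A driver registration typically looks like the fragment below; the driver name and library path are illustrative and depend on where your driver package installs its shared library, so check the driver's installation documentation:

```ini
; /etc/odbcinst.ini - illustrative unixODBC driver registration;
; the section name and Driver path vary by installation.
[Hortonworks Hive ODBC Driver]
Description = Hortonworks ODBC driver for Apache Hive
Driver      = /usr/lib/hive/odbc/lib/libhortonworkshiveodbc64.so
```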
Channel Configuration
For the file formats (CSV, JSON, and Avro), the following action definitions are required to handle certain limitations of the Hive deserialization implementation during Bulk or Row-wise Compare:
- For CSV

  Group | Table | Action          | Parameter(s)
  S3    | *     | FileFormat      | NullRepresentation=\\N
  S3    | *     | TableProperties | CharacterMapping="\x00>\\0;\n>\\n;\r>\\r;">\""
  S3    | *     | TableProperties | MapBinary=BASE64

- For JSON

  Group | Table | Action          | Parameter(s)
  S3    | *     | TableProperties | MapBinary=BASE64
  S3    | *     | FileFormat      | JsonMode=ROW_FRAGMENTS

- For Avro

  Group | Table | Action     | Parameter(s)
  S3    | *     | FileFormat | AvroVersion=v1_8

  Note: v1_8 is the default value for the AvroVersion parameter, so it is not mandatory to define this action.
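To make the intent of these parameters concrete, the sketch below imitates in plain Python what the settings accomplish: NullRepresentation writes NULL as \N, CharacterMapping escapes the NUL, newline, carriage-return, and quote characters, MapBinary=BASE64 renders binary columns as base64 text, and JsonMode=ROW_FRAGMENTS emits one JSON object per row rather than a single enclosing array. This is our illustration of the behaviour, not DSS code:

```python
import base64
import json

# CharacterMapping="\x00>\\0;\n>\\n;\r>\\r;">\"" - escape characters that
# the Hive CSV deserializer cannot handle in raw form.
CHARACTER_MAPPING = {"\x00": "\\0", "\n": "\\n", "\r": "\\r", '"': '\\"'}

def csv_field(value):
    """Render one CSV field the way the actions above configure it."""
    if value is None:
        return "\\N"                                    # NullRepresentation=\N
    if isinstance(value, bytes):
        return base64.b64encode(value).decode("ascii")  # MapBinary=BASE64
    for ch, escaped in CHARACTER_MAPPING.items():
        value = value.replace(ch, escaped)              # CharacterMapping
    return value

def json_row_fragments(rows):
    """JsonMode=ROW_FRAGMENTS: one JSON object per line, no enclosing array."""
    return "\n".join(json.dumps(row) for row in rows)

print(csv_field(None))         # \N
print(csv_field('a"b\nc'))     # a\"b\nc
print(csv_field(b"\x00\xff"))  # AP8=
print(json_row_fragments([{"id": 1}, {"id": 2}]))
```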