Databricks as Target - DSS 6 | Data Source Solutions Documentation

Databricks as Target

Data Source Solutions (DSS) supports integrating changes into a Databricks location. This section describes the configuration requirements for integrating changes into a Databricks location using Integrate and Refresh. For the list of supported Databricks versions into which DSS can integrate changes, see Integrate changes into location in Capabilities.

The preferred methods for writing data into Databricks are Burst Integrate and Bulk Refresh, as they provide better performance. However, Burst Integrate and Bulk Refresh require staging files to be created in a temporary location. For more information about staging, see section Staging for Databricks.

Due to technical limitations, Databricks on Azure is not supported in DSS releases 6.1.5/3 through 6.1.5/9.

Grants for Integrate and Refresh

This section lists the grants/permissions required for integrating changes into Databricks.

The following permissions are required for a cluster that is part of a Unity Catalog:

  1. To create target tables using Refresh, the DSS database user must be granted CREATE TABLE, USE SCHEMA, and USE CATALOG privileges:

    GRANT CREATE TABLE ON SCHEMA <em>schema_name</em> TO <em>`user or group`</em>;
    
    GRANT USE SCHEMA ON SCHEMA <em>schema_name</em> TO <em>`user or group`</em>;
    
    GRANT USE CATALOG ON CATALOG <em>catalog_name</em> TO <em>`user or group`</em>;
    
  2. To Integrate changes or Refresh data into target tables, the DSS database user must be granted MODIFY privilege:

    GRANT MODIFY ON TABLE <em>table_name</em> TO <em>`user or group`</em>;
    
  3. To query a table, the DSS database user must be granted SELECT, USE SCHEMA, and USE CATALOG privileges:

    GRANT SELECT ON TABLE <em>table_name</em> TO <em>`user or group`</em>;
    
    GRANT USE SCHEMA ON SCHEMA <em>schema_name</em> TO <em>`user or group`</em>;
    
    GRANT USE CATALOG ON CATALOG <em>catalog_name</em> TO <em>`user or group`</em>;
    
  4. When using Unity Catalog and Databricks version 15+, the DSS database user must be granted SELECT and USE SCHEMA privileges to discover primary key constraints:

    GRANT SELECT ON TABLE information_schema.key_column_usage TO <em>`user or group`</em>;
    
    GRANT SELECT ON TABLE information_schema.table_constraints TO <em>`user or group`</em>;
    
    GRANT USE SCHEMA ON SCHEMA information_schema TO <em>`user or group`</em>;
    
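With the information_schema grants above in place, DSS can discover primary key constraints. The exact statement DSS issues is internal; a query of the same general shape joins the two views (catalog, schema, and table names below are placeholders):

```sql
-- Illustrative sketch only: list primary key columns of a table by
-- joining the two information_schema views the grants make readable.
SELECT kcu.column_name,
       kcu.ordinal_position
FROM   <catalog>.information_schema.table_constraints tc
JOIN   <catalog>.information_schema.key_column_usage kcu
       ON  tc.constraint_catalog = kcu.constraint_catalog
       AND tc.constraint_schema  = kcu.constraint_schema
       AND tc.constraint_name    = kcu.constraint_name
WHERE  tc.constraint_type = 'PRIMARY KEY'
AND    kcu.table_schema   = '<schema>'
AND    kcu.table_name     = '<table>'
ORDER BY kcu.ordinal_position;
```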

Primary Key support can be disabled by setting the DSS_CAPABILITY environment variable to HasPkUniqueKey=0.

For more information about these privileges, refer to the Databricks documentation - Azure, AWS.

Additional Considerations and Permissions

  • Cluster access to cloud store: The Databricks cluster must have access to the cloud store configured as the staging area. This is crucial for the COPY INTO command to function without errors when using Staging for Databricks.

  • External tables and locations: When using Unity Catalog and external tables, configure an external location for the root location of external tables for Azure Databricks or AWS Databricks platforms. This ensures external tables are correctly accessed and managed within Databricks. For more information about external locations, refer to the Databricks documentation - Azure, AWS.

Dropping Columns from Delta Tables

Databricks Runtime version 11 introduced support for the ALTER TABLE DROP COLUMN command. If the AdaptDDL action is configured and you want to avoid a full Refresh of the target table when ALTER TABLE DROP COLUMN is run on a source table, you must enable Delta Lake column mapping on the target table in Databricks.

For more information about enabling Delta Lake column mapping, refer to the Databricks documentation.

Prior to Databricks Runtime version 11, Databricks supported only the ALTER TABLE ADD COLUMN operation. DSS handled any other DDL change through an extensive process: copying the target table in Databricks to a temporary table, dropping and re-creating the target table, and then transferring the rows from the temporary table back to the new target table.
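
For reference, that pre-Runtime-11 rebuild corresponds roughly to the following sequence. Table and column names are placeholders; DSS generated the actual statements internally:

```sql
-- 1. Preserve existing rows in a temporary copy.
CREATE TABLE target_tmp AS SELECT * FROM target;
-- 2. Drop and re-create the target with the new column layout.
DROP TABLE target;
CREATE TABLE target (col1 STRING, col2 INT);  -- new definition without the dropped column
-- 3. Copy the rows back, listing only the surviving columns.
INSERT INTO target SELECT col1, col2 FROM target_tmp;
DROP TABLE target_tmp;
```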

To enable column mapping for existing tables, use the following command:

  ALTER TABLE <em>table_name</em> SET TBLPROPERTIES (
    'delta.minReaderVersion' = '2',
    'delta.minWriterVersion' = '5',
    'delta.columnMapping.mode' = 'name'
  );

For new tables, Delta Lake column mapping is not enabled by default. When creating target tables in Databricks using Refresh, you must set the Delta table properties to enable column mapping. This setting affects only new tables and does not override or replace properties set on existing tables.

For more information about Delta table properties, refer to the Databricks documentation.

To enable column mapping for new tables, use the following SparkSession configuration commands:

SET spark.databricks.delta.properties.defaults.autoOptimize.optimizeWrite = true
SET spark.databricks.delta.properties.defaults.autoOptimize.autoCompact = true
SET spark.databricks.delta.properties.defaults.minReaderVersion = 2
SET spark.databricks.delta.properties.defaults.minWriterVersion = 5
SET spark.databricks.delta.properties.defaults.columnMapping.mode = name

Alternatively, you can set these Delta table properties by defining the action Environment for the Databricks location:

| Action | Parameters |
| --- | --- |
| Environment | Name=DSS_DATABRICKS_TBLPROPERTIES<br>Value="delta.autoOptimize.optimizeWrite = true, delta.autoOptimize.autoCompact = true, delta.minReaderVersion = 2, delta.minWriterVersion = 5, delta.columnMapping.mode = 'name'" |

Column Names Forced Case

<b>Since</b> v6.1.0/37

This option allows you to manage the case sensitivity of column names created in the target Databricks tables. This applies to Activating Replication, Refresh, and Compare.

Available options are:

  • No default: Create column names in the same case as received from the source, which can be either lowercase or uppercase.
  • Upper: Create column names in uppercase.
  • Lower: Create column names in lowercase.

This option is equivalent to the location property ForceCase.

Location Time Zone

<b>Since</b> v6.1.5/5

The time zone of the DSS Agent and the Databricks database must be the same. This option allows you to specify the time zone for the Databricks location. In the DSS User Interface, you can specify the time zone using this option while creating a location (under the Configure Capture/Integrate for Databricks Location section) or by editing the existing location's source and target properties.

The equivalent location property for this field is Connection_Timezone_Name.

Instructions for DSS versions up to 6.1.5/5

If the time zone of the DSS Agent and the Databricks database are not the same, define the following environment variables, TZ and DSS_SQL_INIT, to make them the same:

| Group | Table | Action | Parameter(s) |
| --- | --- | --- | --- |
| Databricks | * | Environment | Name=TZ<br>Value=time_zone |
| Databricks | * | Environment | Name=DSS_SQL_INIT<br>Value=SET TIME ZONE 'time_zone' |

The environment variable DSS_SQL_INIT runs the SQL statement SET TIME ZONE 'time_zone' when the database connection is established.
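
The effect of the init statement can be verified on an established session; current_timezone() is a built-in Databricks SQL function:

```sql
SET TIME ZONE 'America/New_York';   -- what DSS_SQL_INIT runs at connect time
SELECT current_timezone();          -- returns the session time zone just set
```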

Intermediate Directory

{% partial file="dss6/requirements/source-and-target-requirements/intermediate-directory.template.md" /%}

Intermediate Directory is Local

{% partial file="dss6/requirements/source-and-target-requirements/intermediate-directory-local.template.md" /%}

Version Recovery

This option allows you to enable version-based recovery for target Databricks tables. When enabled, DSS can restore the table to its previous state if the Integrate job fails, avoiding the creation of duplicate rows.

DSS achieves this by using the DESCRIBE HISTORY and RESTORE TABLE ... TO VERSION AS OF ... statements.
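
For illustration, performing the same recovery manually uses these statements (table name and version number are placeholders):

```sql
-- Inspect the Delta transaction log to find the last good version.
DESCRIBE HISTORY <catalog>.<schema>.<target_table>;
-- Roll the table back to that version.
RESTORE TABLE <catalog>.<schema>.<target_table> TO VERSION AS OF <version_number>;
```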

When this option is selected and the BurstCommitFrequency parameter in the Integrate action is set to CYCLE, DSS treats it as TABLE to support version-based recovery. For best performance, we recommend TABLE. Setting it to STATEMENT results in a higher number of SQL statements being issued.

When this option is selected, it adds a new column (last_successful_version) to the burst state table. You do not need to recreate the table manually; if the additional column is missing, it will be automatically added.

This option does not handle recovery of enrichment rows. To prevent duplicates created by enrichment, enable enrichment cleanup by setting the DSS_BURST_ENRICHMENT_CLEANUP environment variable to -1 or 1.

This option is equivalent to the location property Databricks_Version_Recovery.

Limitations

For Databricks Runtime versions earlier than 11.2, the DEFAULT constraint is not supported. As a result, Databricks cannot automatically supply a default value for a column when its nullability changes. Therefore, if you need to alter a nullable column to NOT NULL in the target, you must first ensure that all existing NULL values in the source are updated to non-NULL values.
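
For example, assuming a nullable column col1 (names and the replacement value below are hypothetical), the change has to be staged in two steps: first eliminate NULLs on the source side, then tighten the constraint on the target:

```sql
-- Step 1 (on the source): replace remaining NULLs with a non-NULL value.
UPDATE <source_table> SET col1 = '<replacement_value>' WHERE col1 IS NULL;
-- Step 2 (on the Databricks target): once no NULLs remain, tighten the constraint.
ALTER TABLE <catalog>.<schema>.<target_table> ALTER COLUMN col1 SET NOT NULL;
```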