Keys - DSS 6 | Data Source Solutions Documentation

Documentation: Keys - DSS 6 | Data Source Solutions Documentation

Keys

In a database management system, a key is an attribute or a set of attributes that help to uniquely identify a row in a table. This section describes the types of keys and their roles in Data Source Solutions DSS.

Primary Keys

Primary Key is an ID unique for each specific row in a database table. It consists of a single mandatory column or a set of mandatory columns. Primary keys are typically validated by transaction processing databases. However, analytical databases can consider the primary key to be declarative and not enforce it.

Replication Keys

Replication Key is a unique key that DSS uses to identify a row in a target table. DSS requires every table in a channel to have a replication key. When a channel is created, DSS inspects a source database for a "good" replication key. If it sees a primary key, then that is used for its replication key. If there is no primary key, DSS will look for a unique index as a replication key. If none of the above exists, then DSS will define the replication key using all non-LOB columns. This last case is called an implicit replication key.

If there are multiple uniqueness constraints on the capture table (e.g., a primary key and several unique indices), DSS uses a certain hierarchy rule to decide which one to use as a replication key (e.g. a primary key would 'win' and be considered).

The hierarchy rule is based on:

  • the source dictionary;

  • a unique key with mandatory columns;

  • or all non-LOB columns in the table.

You can add a column to a replication key (assuming that the implicit key has no explicitly-defined columns). This can be achieved using action ColumnProperties with parameter Key or TimeKey.

DSS ignores a table's non-unique indexes. On Oracle, it also ignores unique indexes if that index only contains a single nullable column.

Distribution Keys

For distributed DBMSes, such as Greenplum, DSS has a concept of Distribution Key for each table. A distribution key is a column or group of columns used to determine the distribution of portions of the table across logical slices or nodes. These are the column(s) that DSS uses for hashing the storage when DSS creates its tables in a target database.

Some databases do not support distribution keys, and some have their own limitations. For example, Redshift only supports a single column as a distribution key, while Greenplum and Teradata support multi-column distribution keys. For good distribution of data across a database cluster, the distribution key values should ideally be unique, or otherwise, be well-distributed and almost unique.

There are two types of distribution keys in DSS:

  • Explicit Distribution Key means that the distribution key has been explicitly defined by a user in DSS using action ColumnProperties with parameter DistributionKey.
  • Implicit Distribution Key means that the distribution key has not been explicitly defined by a user in DSS. In this case, DSS will derive the distribution key from the first column of the replication key. To influence the selection of the key, refer to the TableProperties action. For example, you can eliminate unsuitable columns (DistributionKeyAvoidPattern) or set a maximum number of columns (DistributionKeyLimit) to be considered for the distribution key.

If action Integrate with parameter Method=Burst is defined, DSS creates target tables in a distributed DBMS, but they can also be created directly. When DSS creates the target tables, it will use a distribution key for the distributed DBMS. Additionally, the table will have the DSS replication key as an index (either with or without uniqueness depending on whether action TableProperties with parameter NoDuplicateRows is defined).

DSS also creates staging tables (ending with '__b') for delivering bursts of changes into its target tables. They do not get an index (not needed), but DSS will use for them the same distribution key as in the target tables. It is essential that the distribution key of the burst table matches the distribution key of the target table, otherwise the burst updates will be misaligned and the integrate job will either fail or be too slow.

NoDuplicateRows Parameter

Parameter NoDuplicateRows only affects tables that have an implicit replication key. It has no effect on tables with an explicit replication key. This parameter can be used if a table without a unique constraint does not contain multiple identical rows in the database. Doing so will increase performance because more efficient queries can be used in the target database.

If no replication key column(s) are defined on the table and this parameter is not set, then duplicates are allowed and all updates are treated as key updates and are replicated as a delete and an insert. Additionally, each delete is integrated using a special SQL subselect, which ensures only a single row is deleted, not multiple rows. Incorrect use of the NoDuplicateRows parameter can lead to errors if there are duplicates.