How to Build Hash Keys in Oracle

Hash keys can be used instead of sequence numbers to build surrogate keys in a Data Warehouse. This is often done in Data Vault environments, and there are some good reasons to do that. But there are also several issues and restrictions with hash keys. So, use hash keys only if you have good reasons for them. And if you decide to use hash keys in your Oracle Data Warehouse, make sure that you do it the right way.

A hash function is a deterministic function that maps a (usually large) number of input values to a set of distinct hash values. Hash algorithms are used for different purposes in computer science (e.g. in compilers and parsers, in cryptography, or for key generators). In the Oracle database, hash functions are used for hash joins, hash partitioning and hash clusters – to name just a few examples.
To illustrate the principle of a hash function, I use the expression MOD(n, 7) as a hash function in the following example. The function is deterministic, i.e. for the same input value, the result is always the same. It returns the remainder of a division by 7, therefore always a number between 0 and 6.
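The principle can be tried out directly in SQL. A minimal sketch (the input values are arbitrary):

```sql
-- MOD(n, 7) as a toy hash function: every number lands in one of 7 buckets (0..6)
SELECT n, MOD(n, 7) AS hash_value
FROM (SELECT ROWNUM + 14 AS n
        FROM dual CONNECT BY ROWNUM <= 10);
```

In the output, the input values 15 and 22 both get the hash value 1 – the hash collision discussed below.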

[Figure: basic principle of a hash function]
As you can see in the example above, the hash values are not unique. For the input values 15 and 22, the hash key (1) is the same. This is called a hash collision. In many situations, this is not an issue. If a hash algorithm is used for the even distribution of rows to multiple hash partitions, it is clear that more than one row is stored in each partition. But for a hash key generator, hash collisions have to be avoided.
In this blog post, I will write about a special use case of hash functions: building hash keys in a Data Warehouse to replace the business key with a surrogate primary key. In this case, the business key (one or more attributes of a source table) is the input value. The result of the hash function is a hash key. Hash keys are often used in Data Warehouses with Data Vault models; they are even one of the main concepts of Data Vault 2.0. But hash keys can also be used in dimensional models as primary keys of the dimension tables, in historized head & version models, or in any 3NF data model.
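For a composite business key (more than one attribute), a common practice is to concatenate the attributes with a delimiter before hashing, so that for example ('AB', 'C') and ('A', 'BC') do not produce the same hash key. A hypothetical sketch with the 12c function STANDARD_HASH described later in this post (the table and column names are invented):

```sql
-- Composite business key: delimiter prevents ambiguous concatenations
SELECT STANDARD_HASH(
         order_no || '|' || TO_CHAR(order_date, 'YYYY-MM-DD'),
         'MD5') AS order_hk
  FROM orders;
```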

Sequence Numbers or Hash Keys?

There are many discussions about the right approach to build surrogate keys. Should I use sequence numbers or hash keys? Or even UUIDs? As usual in real life, there are advantages and disadvantages to each of the concepts. Unfortunately, there is an almost religious "key discussion" in the Data Vault community. What is my opinion? I don't want to be a "good guy" or a "bad guy", depending on whether I use sequence numbers or hash keys. For me as a Data Warehouse consultant, it is important to understand the rationale of the different concepts, their advantages and disadvantages. Then I can decide for each project which approach fits best for the specific requirements and technology. Here is just a short summary of the two approaches:
Sequence Numbers: For every new business key in a Hub, a new sequence number is created. In Oracle, this can be implemented classically with a sequence, or in Oracle 12c with an identity column. To determine the foreign keys on Links or Satellites, a key lookup on the Hub is needed. For each business key, the corresponding surrogate key is derived from the Hub. Key lookup operations are very common in Data Warehouses and are available in every ETL tool. In SQL, a key lookup is usually just a (left outer) join on the lookup table (here on the Hub) to determine the right sequence number.
[Figure: sequence numbers with key lookup]
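In SQL, such a key lookup could look like the following sketch (the Hub, Satellite and staging table names are invented for illustration):

```sql
-- Load a Satellite: derive the surrogate key via a join on the Hub
INSERT INTO sat_customer (customer_id, load_date, customer_name)
SELECT h.customer_id, SYSDATE, s.customer_name
  FROM stg_customer s
  JOIN hub_customer h
    ON h.customer_bk = s.customer_bk;
```

If a business key is missing in the Hub, the inner join silently drops the row – hence the Hubs must be loaded first.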
The main issue of this approach is the load dependency: all Hubs must be loaded before the Links and Satellites to guarantee that all key lookups return a result. Additionally, the sequence numbers in different environments (e.g. development, test and production database) are not in sync. If data has to be compared or moved from one environment to another, the sequence numbers must be recalculated. The third challenge is the reload of data from scratch: if – for any reason – a table is truncated and initially loaded, each row will get a new sequence number, so the dependent tables must be reloaded as well.
Hash Keys: If a hash algorithm is used to determine the surrogate keys, the hash key can be derived from the business key directly, without a key lookup. Because a hash function is deterministic, it will always return the same hash key for the same business key: on the Hub as well as on the referring Links and Satellites. Therefore, there is no special load order required; all tables can (theoretically) be loaded in parallel. And because of the deterministic function, the hash keys would be the same in multiple environments and after a reload of the same data.
[Figure: hash keys calculated with a hash function]
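With hash keys, the same Satellite load needs no join on the Hub; the key is computed from the business key itself. A sketch using the STANDARD_HASH function described below (table names invented, as in the previous example):

```sql
-- Load a Satellite: the hash key is derived directly from the business key
INSERT INTO sat_customer (customer_hk, load_date, customer_name)
SELECT STANDARD_HASH(s.customer_bk, 'MD5'), SYSDATE, s.customer_name
  FROM stg_customer s;
```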
The concurrent load of all Hubs, Links and Satellites is not always possible. If you have foreign key constraints defined on your Data Vault tables, you still have to load the Hubs before the Links and Satellites. One option would be to omit the constraints (which I definitely do not recommend). Reliable constraints can help here (see my blog post Foreign Key Constraints in an Oracle Data Warehouse). But if you don't trust the source data and want to verify data quality with constraints, you still need a predefined load order, even with hash keys.
Another issue of hash keys is the risk of hash collisions. Hence it is important to choose a suitable hash function. The probability of a hash collision with the MD5 or SHA-1 algorithm (see below) is very small – but theoretically it still exists. So, what happens if two business keys result in the same hash key? In the best case, the load job fails with a primary key violation. In the worst case, data of two different business entities is mixed, e.g. sales orders of one customer are assigned to another customer. Sounds terrible, but in many Data Warehouses, this is not a real issue. The risk of wrong results because of bad source data quality or bugs in the ETL jobs is higher than the chance of a hash collision.
The decision whether to use sequence numbers or hash keys depends on many aspects. If you have good reasons to use hash keys (or sequence numbers), then use them. But don't do it just because "that's how it's done".

The Wrong Way: ORA_HASH

A well-known hash function in Oracle is the SQL function ORA_HASH. This is a useful function to distribute data to multiple subsets, but the generated hash keys are far away from uniqueness. Many people are impressed by the maximum number of buckets (i.e. the number of possible return values) of this hash function. The default (and highest possible) value of the optional parameter MAX_BUCKET is 4294967295. That means the hash function returns 32 bits – not very much. So, hash collisions will occur frequently, even with small data sets (see Connor McDonald's blog post). The probability of two input values with the same hash key is quite high. With 9300 input rows, the probability is 1%, with 50'000 rows already 25%, and with 77'000 rows 50%.
The mathematical background for this behavior is the so-called Birthday Problem: what is the probability that two people have their birthday on the same date of the year, depending on the total number of people? Did you know that with only 23 people, this probability is already 50%? A good explanation of the birthday problem can be found on Wikipedia. If you don't like mathematics and prefer more practical exercises, run the following query on an Oracle database:

WITH input_data AS
(SELECT ROWNUM n, ORA_HASH(ROWNUM) hash
FROM dual CONNECT BY ROWNUM <= 1000000)
SELECT hash, COUNT(*)
FROM input_data
GROUP BY hash HAVING COUNT(*) > 1;

The result is 122 hash collisions for an input of 1 million rows. With 100'000 rows, there are already 3 collisions. Definitely not a good choice for a Data Warehouse! So, we need a better hash function.

Cryptographic Hash Functions in Oracle

Several common cryptographic hash algorithms are available that are suitable to generate (almost) unique hash keys with a very small probability of hash collisions. The most famous ones are MD5 (message-digest algorithm) as well as SHA-1 and SHA-2 (secure hash algorithm). SHA-2 consists of multiple variants with a different number of output bits. The higher the number of output bits, the smaller the probability of a hash collision.
The Oracle database contains different implementations of cryptographic hash functions, but not all of them are available in every version. Here is an overview of the different packages and functions.
DBMS_OBFUSCATION_TOOLKIT: This PL/SQL package contains some overloaded procedures and functions to generate MD5 hash keys. The package is deprecated since Oracle 11.2 and should not be used anymore. Because of the overloaded versions of the MD5 function, it is not possible to use the function directly in SQL. Instead, a PL/SQL wrapper function must be written to be called in SQL:

CREATE OR REPLACE 
FUNCTION hash_key (v_input VARCHAR2)
RETURN dbms_obfuscation_toolkit.varchar2_checksum
AS
BEGIN
RETURN dbms_obfuscation_toolkit.md5(input_string => v_input);
END hash_key;
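The wrapper function can then be called directly in a SQL statement; it returns the MD5 hash as a VARCHAR2 checksum:

```sql
-- The business key is passed as a string to the PL/SQL wrapper
SELECT hash_key('42') AS md5_key FROM dual;
```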

DBMS_CRYPTO.HASH: The cryptography package of Oracle contains a HASH function that can be used to generate hash keys. It is overloaded for the data types RAW, BLOB and CLOB, but not for VARCHAR2. The return value is always a RAW value. The hash algorithm can be chosen with the second parameter. The predefined constants HASH_MD4, HASH_MD5 and HASH_SH1 were already available in previous Oracle releases. In Oracle 12c, three additional constants for different lengths of SHA-2 (256, 384 and 512 bits) were added:

HASH_MD4   CONSTANT PLS_INTEGER := 1;
HASH_MD5   CONSTANT PLS_INTEGER := 2;
HASH_SH1   CONSTANT PLS_INTEGER := 3;
HASH_SH256 CONSTANT PLS_INTEGER := 4;
HASH_SH384 CONSTANT PLS_INTEGER := 5;
HASH_SH512 CONSTANT PLS_INTEGER := 6;

Unfortunately, the constants cannot be used directly in SQL. Instead, the corresponding numbers must be passed as parameters, and an explicit type conversion to RAW, CLOB or BLOB must be included in the statement. So, it is still easier to define a wrapper function in PL/SQL:

CREATE OR REPLACE 
FUNCTION hash_key (v_input VARCHAR2) RETURN RAW DETERMINISTIC
AS
PRAGMA UDF;
BEGIN
RETURN dbms_crypto.hash(utl_raw.cast_to_raw(v_input), dbms_crypto.HASH_MD5);
END hash_key;
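This wrapper can also be used directly in SQL; the result is a 16-byte RAW value. For example, to generate MD5 hash keys for a set of test values:

```sql
-- Generate MD5 hash keys for 1000 generated test values
SELECT n, hash_key(TO_CHAR(n)) AS hk
  FROM (SELECT ROWNUM AS n FROM dual CONNECT BY ROWNUM <= 1000);
```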

STANDARD_HASH: This new SQL function was introduced with Oracle 12.1. It supports the same hash algorithms as DBMS_CRYPTO, but is easier to use in SQL – and much faster. The first parameter is an input value of any data type except LONG and LOB. The second (optional) parameter is a string that defines the hash algorithm. The default is 'SHA1'; other possible values are 'MD5', 'SHA256', 'SHA384' and 'SHA512'. The function can be used directly in SQL:

WITH input_data AS
(SELECT ROWNUM n, STANDARD_HASH(ROWNUM, 'MD5') hash
FROM dual CONNECT BY ROWNUM <= 1000000)
SELECT hash, COUNT(*)
FROM input_data
GROUP BY hash HAVING COUNT(*) > 1;

The statement generates 1 million hash keys without any hash collisions. This is also the case for all functions described above (except ORA_HASH). The different implementations of the Oracle cryptographic hash functions are all suitable to generate unique hash keys, but the performance is different. The query above was used to run all of the available hash algorithms of the PL/SQL packages and the SQL function. The following table compares the execution times.
[Figure: SQL runtime to generate 1 million hash keys]
As you can see from the execution times, the SQL function STANDARD_HASH is much faster than DBMS_CRYPTO. So, if you decided to use hash keys in your Data Warehouse and you work with Oracle 12c, the best option is to use the function STANDARD_HASH. The default hash algorithm SHA-1 is a good compromise: 160 bits are enough to avoid hash collisions, but the performance is better than with the SHA-2 algorithms. DBMS_CRYPTO should only be used for backward compatibility, and DBMS_OBFUSCATION_TOOLKIT should not be used anymore at all.
