Collect data lineage in Data Warehouses by analysis SQL script

This tool is aimed to collect data lineage information in Data Warehouses environment by analysis SQL script especially stored procedure like PL/SQL that used in a ETL process.

The result generated by this tool would be very useful to provide impact analysis feature to ETL mappings in data warehouses and marts.

Once we have such kind of data lineage in hand, we can do these kind of things easily:

  1. Trace Data Lineage: find the path from the source to the target object that we selected.
  2. Data Impact Analysis: find out which data items in target database and/or applications are affected if we change one source object, like a column or a table or another source object.

Table of Contents

  1. Run this tool(online live)
  2. How this tool works
  3. Supported database platform and SQL statement
  4. The relationship between target/source columns
    1. Dataflow
    2. Dataflow depends on table record set
    3. Impact
    4. Further reading

Run this tool(online live)

You may give a try of the online dataflow live demo.

How this tool works

This tool is based on General SQL Parser and works in the following steps:

  1. Analyze SQL script, build relationship among the source/taget table columns.
  2. Generate XML output describe the data flow from the source to target table/column.
  3. Visualize the XML result, make it easy to trace the data lineage and do data impact analysis.

Supported database platform and SQL statement

Support those major database platforms: db2, greenplum, hana, hive, impala, informix, mysql, netezza, oracle, postgresql, redshift, sqlserver, sybase, teradata

  • SQL statements: DML(Insert,Delete,Update,Merge), DDL(create, alter)
  • Stored procudure: Oracle PL/SQL, SQL Server T-SQL, DB2, PostgreSQL, Sybase.
  • Teradata FastExport, MultiLoad script, BTEQ command.

The relationship between target/source columns

There are three types of relationship between target and source columns in the generated result.

Dataflow

The most common case is that value of the target column depends on the source column

create view vEmp(eName) as
SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000

vEmp.eName depends on scott.emp.empName.

Dataflow depends on table record set

The value of the target columns are influenced by a source table itself, for example by the number of records. This is caused by the use of aggregate function in the query.

create view vSal as
SELECT a.deptno "Department", 
    a.num_emp/b.total_count "Employees", 
    a.sal_sum/b.total_sal "Salary"
FROM
    (SELECT deptno, COUNT() num_emp, SUM(SAL) sal_sum
        FROM scott.emp
       Where city = 'NYC'
       GROUP BY deptno) a,
    (SELECT COUNT() total_count, SUM(sal) total_sal
        FROM scott.emp
       Where city = 'NYC') b

vSal.Salary depends on the record number of table: scott.emp. This is due to the use of aggregate function COUNT(), SUM() in the query, vSal.Salary depends on the scott.emp.deptno column which is used in the group by clause as well.

The city column in the where clause also determine the value of vSal.Salary. This is different from the impact which is discussed below that doesn’t influence the value of target columns.

Impact

Some of the columns in source tables such as WHERE clause do not influence the value of target columns but are crucial for the selected row set, so they are also saved for impact analyses, with relationship to the target columns.

create view vEmp(eName) as
SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000

The value of vEmp.eName doesn’t depends on scott.emp.sal, but the number of records in the vEmp depends on the scott.emp.sal, so this tool record this kind of relationship as well.

Another type of impact is that the value of target column is depends on the value of the source column

select
	case when a.kamut=1 and b.teur IS null
			 then 'no locks'
		 when a.kamut=1
			then b.teur
	else 'locks'
	end teur
from tbl a left join TT b on (a.key=b.key)

The value of select result: teur depends on the source column tbl.kamut in case expression, although it’s value is not derived from tbl.kamut directly.

Further reading

Please check this document for the detailed explanation of the relationship used in the dataflow.