Tutorials

Collect data lineage in Data Warehouses by analysis SQL script

This tool is aimed to collect data lineage information in Data Warehouses environment by analysis SQL script especially stored procedure like PL/SQL that used in a ETL process.

The result generated by this tool would be very useful to provide impact analysis feature to ETL mappings in data warehouses and marts.

Once we have such kind of data lineage in hand, we can do these kind of things easily:

Trace Data Lineage: find the path from the source to the target object that we selected.
Data Impact Analysis: find out which data items in target database and/or applications are affected if we change one source object, like a column or a table or another source object.

create view vSal as
SELECT a.deptno "Department", 
    a.num_emp/b.total_count "Employees", 
    a.sal_sum/b.total_sal "Salary"
FROM
    (SELECT deptno, COUNT() num_emp, SUM(SAL) sal_sum
        FROM scott.emp
       Where city = 'NYC'
       GROUP BY deptno) a,
    (SELECT COUNT() total_count, SUM(sal) total_sal
        FROM scott.emp
       Where city = 'NYC') b

vSal.Salary depends on the record number of table: scott.emp. This is due to the use of aggregate function COUNT(), SUM() in the query, vSal.Salary depends on the scott.emp.deptno column which is used in the group by clause as well.

The city column in the where clause also determine the value of vSal.Salary. This is different from the impact which is discussed below that doesn’t influence the value of target columns.

Impact

Some of the columns in source tables such as WHERE clause do not influence the value of target columns but are crucial for the selected row set, so they are also saved for impact analyses, with relationship to the target columns.

create view vEmp(eName) as
SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000

The value of vEmp.eName doesn’t depends on scott.emp.sal, but the number of records in the vEmp depends on the scott.emp.sal, so this tool record this kind of relationship as well.

Another type of impact is that the value of target column is depends on the value of the source column

select
	case when a.kamut=1 and b.teur IS null
			 then 'no locks'
		 when a.kamut=1
			then b.teur
	else 'locks'
	end teur
from tbl a left join TT b on (a.key=b.key)

The value of select result: teur depends on the source column tbl.kamut in case expression, although it’s value is not derived from tbl.kamut directly.

General SQL Parser

Tutorials

Collect data lineage in Data Warehouses by analysis SQL script

Table of Contents

Run this tool(online live)

How this tool works

Supported database platform and SQL statement

The relationship between target/source columns

Dataflow

Dataflow depends on table record set

Impact

Further reading

Extracts the table/column from SQL script

SQL script rewrite using General SQL Parser: expression

SQL script rewrite using General SQL Parser

How to process commnets used in the SQL scripts

More demos for General SQL Parser