Collect data lineage in Data Warehouses by analysis SQL script

This tool is aimed to collect data lineage information in Data Warehouses environment by analysis SQL script especially stored procedure like PL/SQL that used in a ETL process.

The result generated by this tool would be very useful to provide impact analysis feature to ETL mappings in data warehouses and marts.

Once we have such kind of data lineage in hand, we can do these kind of things easily:

  1. Trace Data Lineage: find the path from the source to the target object that we selected.
  2. Data Impact Analysis: find out which data items in target database and/or applications are affected if we change one source object, like a column or a table or another source object.

Table of Contents

  1. Run this tool
  2. How this tool works
  3. Supported database platform and SQL statement
  4. The online live demo
  5. The relationship between target/source columns
    1. Dataflow
    2. Dataflow depends on table record set
    3. Impact
    4. Further reading

Run this tool

In order to run this tool, JRE 1.5 or higher need to be installed. In command line, run java -Xms256m -Xmx1024m -jar DlineageDemo.jar

  1. Select the database vendor from menu -> Vendor
  2. Pickup the SQL script from menu -> File -> Load SQL Files, or just paste SQL query into the SQL editor.
  3. Click “Analyze” button in the tool bar.
  4. Result will be shown in the diagram window and data flow xml file will be generated under the output directory in the same directory of DlineageDemo.jar.

How this tool works

This tool is based on General SQL Parser and works in the following steps:

  1. Analyze SQL script, build relationship among the source/taget table columns.
  2. Generate XML output describe the data flow from the source to target table/column.
  3. Visualize the XML result, make it easy to trace the data lineage and do data impact analysis.

Supported database platform and SQL statement

Support those major database platforms: db2, greenplum, hana, hive, impala, informix, mysql, netezza, oracle, postgresql, redshift, sqlserver, sybase, teradata

  • SQL statements: DML(Insert,Delete,Update,Merge), DDL(create, alter)
  • Stored procudure: Oracle PL/SQL, SQL Server T-SQL, DB2, PostgreSQL, Sybase.
  • Teradata FastExport, MultiLoad script, BTEQ command.

The online live demo

You may give a try of the online dataflow live demo.

The relationship between target/source columns

There are three types of relationship between target and source columns in the generated result.


The most common case is that value of the target column depends on the source column

create view vEmp(eName) as
SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000

vEmp.eName depends on scott.emp.empName.

Dataflow depends on table record set

The value of the target columns are influenced by a source table itself, for example by the number of records. This is caused by the use of aggregate function in the query.

create view vSal as
SELECT a.deptno "Department", 
    a.num_emp/b.total_count "Employees", 
    a.sal_sum/b.total_sal "Salary"
    (SELECT deptno, COUNT() num_emp, SUM(SAL) sal_sum
        FROM scott.emp
       Where city = 'NYC'
       GROUP BY deptno) a,
    (SELECT COUNT() total_count, SUM(sal) total_sal
        FROM scott.emp
       Where city = 'NYC') b

vSal.Salary depends on the record number of table: scott.emp. This is due to the use of aggregate function COUNT(), SUM() in the query, vSal.Salary depends on the scott.emp.deptno column which is used in the group by clause as well.

The city column in the where clause also determine the value of vSal.Salary. This is different from the impact which is discussed below that doesn’t influence the value of target columns.


Some of the columns in source tables such as WHERE clause do not influence the value of target columns but are crucial for the selected row set, so they are also saved for impact analyses, with relationship to the target columns.

create view vEmp(eName) as
SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000

The value of vEmp.eName doesn’t depends on scott.emp.sal, but the number of records in the vEmp depends on the scott.emp.sal, so this tool record this kind of relationship as well.

Another type of impact is that the value of target column is depends on the value of the source column

	case when a.kamut=1 and b.teur IS null
			 then 'no locks'
		 when a.kamut=1
			then b.teur
	else 'locks'
	end teur
from tbl a left join TT b on (a.key=b.key)

The value of select result: teur depends on the source column tbl.kamut in case expression, although it’s value is not derived from tbl.kamut directly.

Further reading

Please check this document for the detailed explanation of the relationship used in the dataflow.