Collect data lineage in Data Warehouses by analysis SQL script
This tool is aimed to collect data lineage information in Data Warehouses environment by analysis SQL script especially stored procedure like PL/SQL that used in a ETL process.
The result generated by this tool would be very useful to provide impact analysis feature to ETL mappings in data warehouses and marts.
Once we have such kind of data lineage in hand, we can do these kind of things easily:
- Trace Data Lineage: find the path from the source to the target object that we selected.
- Data Impact Analysis: find out which data items in target database and/or applications are affected if we change one source object, like a column or a table or another source object.
Table of Contents
Run this tool(online live)
You may give a try of the online dataflow live demo.
How this tool works
This tool is based on General SQL Parser and works in the following steps:
- Analyze SQL script, build relationship among the source/taget table columns.
- Generate XML output describe the data flow from the source to target table/column.
- Visualize the XML result, make it easy to trace the data lineage and do data impact analysis.
Supported database platform and SQL statement
Support those major database platforms: db2, greenplum, hana, hive, impala, informix, mysql, netezza, oracle, postgresql, redshift, sqlserver, sybase, teradata
- SQL statements: DML(Insert,Delete,Update,Merge), DDL(create, alter)
- Stored procudure: Oracle PL/SQL, SQL Server T-SQL, DB2, PostgreSQL, Sybase.
- Teradata FastExport, MultiLoad script, BTEQ command.
The relationship between target/source columns
There are three types of relationship between target and source columns in the generated result.
Dataflow
The most common case is that value of the target column depends on the source column
create view vEmp(eName) as
SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000
vEmp.eName depends on scott.emp.empName.
Dataflow depends on table record set
The value of the target columns are influenced by a source table itself, for example by the number of records. This is caused by the use of aggregate function in the query.
create view vSal as
SELECT a.deptno "Department",
a.num_emp/b.total_count "Employees",
a.sal_sum/b.total_sal "Salary"
FROM
(SELECT deptno, COUNT() num_emp, SUM(SAL) sal_sum
FROM scott.emp
Where city = 'NYC'
GROUP BY deptno) a,
(SELECT COUNT() total_count, SUM(sal) total_sal
FROM scott.emp
Where city = 'NYC') b
vSal.Salary
depends on the record number of table: scott.emp
. This is due to the use of aggregate function COUNT()
, SUM()
in the query, vSal.Salary
depends on the scott.emp.deptno column
which is used in the group by clause as well.
The city
column in the where clause also determine the value of vSal.Salary
. This is different from the impact which is discussed below that doesn’t influence the value of target columns.
Impact
Some of the columns in source tables such as WHERE clause do not influence the value of target columns but are crucial for the selected row set, so they are also saved for impact analyses, with relationship to the target columns.
create view vEmp(eName) as
SELECT a.empName "eName"
FROM scott.emp a
Where sal > 1000
The value of vEmp.eName
doesn’t depends on scott.emp.sal
, but the number of records in the vEmp
depends on the scott.emp.sal
, so this tool record this kind of relationship as well.
Another type of impact is that the value of target column is depends on the value of the source column
select
case when a.kamut=1 and b.teur IS null
then 'no locks'
when a.kamut=1
then b.teur
else 'locks'
end teur
from tbl a left join TT b on (a.key=b.key)
The value of select result: teur
depends on the source column tbl.kamut
in case expression, although it’s value is not derived from tbl.kamut
directly.
Further reading
Please check this document for the detailed explanation of the relationship used in the dataflow.
Extracts the table/column from SQL script
This tool will fetch table, column and column index inside the SQL script including PL/SQL. You can find this tool in demo\gettablecolumns shipped together with general SQL parser library Java and .NET version.
SQL script rewrite using General SQL Parser: expression
This page is under construction.
SQL script rewrite using General SQL Parser
After successfully parse a SQL script, You can modify the Parse Tree Nodes (PTN) of the input SQL script which is created by General SQL Parser(GSP) to generate the new SQL script as needed.
How to process commnets used in the SQL scripts
Parsing of comments and accessing of the block and inline comments of a SQL script