SSIS Dynamic Data Driven Extracts

Extracting data from multiple tables: a dynamic SSIS Data-Driven Approach


This blog entry looks at how you can use SSIS to extract data from multiple tables leveraging a ‘source table’ that provides the table names to extract data from. This solution utilizes a For Each Container, Package Variables, Package Expressions applied to a package variabe along with an OLEDB Source whose SQL Extract Command is stored in a package variable.


In our sample scenario we receive sales data in tables from a 3’d party. This data is delivered on an unpredictable schedule and when received we’d rather not have to modify our SSIS solution. Instead, we rely on inserting a new row into the source table that tells SSIS what it needs to know about the new source table.



Our Source Table / data driven approach
Our dynamic approach involves the use of a source table to drive the SSIS Extract process and designate the tables to extract.


The Source Table SSIS utilizes is dbo.SalesSource and its schema is below:

CREATE TABLE [dbo].[SalesSource] (
[SalesSourceID] [int] IDENTITY(1,1) NOT NULL,
[SourceTableName] [varchar](64) NOT NULL,
[ExtractProcessedFlag] [int] default (0) NULL,
[ExtractProcessingOrder] [int] NULL,
[RowsExtracted] [int] NULL,
[DateTimeExtracted] [datetime] NULL
)



Initially, we have 2 source files to work with so, we start out inserting the following two rows into our dbo.SalesSource table (Notice we really only need to supply values for 3 columns). The SSIS package updates the ExtractProcessedFlag, RowsExtracted and DateTimeExtracted columns upon sucessful completion.








The ExtractProcessingOrder column allows us to specify the order in which we would like to extract data from in our Source Tables. The reason this was required was that our source data may very well contain duplicate data. We specify the order in which we extract and load and have our SSIS packages flag the duplicate rows based on a business key and keep only the most recent data from our duplicates. (In our scenario, new data trumps old data)



Let’s take a look at our SSIS Solution and then dissect each component:
























The SSIS package uses the following Package Variables.











As we progress and dissect each SSIS task, we will look at each package variable it utilizes. The SSIS package uses a For Each container to cycle through each row present in the Source Table with the CurrentTable package variable holding the current table that needs to be extracted. Notice that we supplied an initial value for the CurrentTable. This way we can preview data in this table from the OLE DB Source connection we will see in the Data Flow task.


The package is fairly simple. It contains a highest level Sequence Container along with an inner For each Container. The For Each Container specifies an ADO Enumerator as its ‘Collection Type.’ The result set that feeds this container is below is produced by the Exec SQL Find Tables to Extract SSIS Task. The SQL Extract code is below:

select SourceTableName
from dbo.SalesSource
where ExtractProcessedFlag = 0
order by ExtractProcessingOrder


Notice that we are only extracting source tables where the ExractProcessedFlag is set to 0. The SSIS package sets this flag column to 1 when a source table is extracted. This EXEC SQL Task specifies the ResultSet Package Variable to ‘hold’ the results of the above select statement. (ResultSet is of type object)

The For Each Container Specifies a ‘For each ADO Enumerator’ as the Collection type and points the ResultSet package variable as the result set holder. ( Notice the setting for the Enumerator and the ADO object Source variable choice)





















Also, the CurrentTable package variable is specified in the Variable Mappings tab. This allows our CurrentTable variable to essentially hold the current table being processed each time we cycle through the For Each loop. We'll see this package variable used in the expression that assigns a value to the SQLExtractCommand package variable.





















The Data Flow task specifies that the SQL to extact data is present in a Package Variable. While this choice if probably not the most common Data Access Mode choice, it does provide us with the flexiblity we need. This package variable has its value dynamically set using Expressions. If you choose a package variable and display its properties (F4 will provide the properties) you will notice that by default the Evaluate As Expression property is False. This was changed to True as is seen below:



















If you click on the the ellipse in the Expression value, you will be provided with the expression builder dialog box. The SQLExtactCommand’s contents are created using the quoted string along with the CurrentTable package variable as seen below. (Notice I did specify square brackets to surrount the table name since the table name may include non-alphabetical or other strange characters)




















I found it helpful to click on Evaluate Expression and then paste the results into a Management Studio window. This way you can execute and validate the actual code.


The Data Flow Task that specifies the SQLExtractCommand package variable is below. You will notice that you are able to view columns and even preview data because we supplied an initial value for our CurrentTable package variable.























The final step in our SSIS package is a EXEC SQL statement that updates the source table to designate the SourceTable as being sucessfully processed.



UPDATE
dbo.SalesSource
SET
ExtractProcessedFlag = 1,
DateTimeExtracted = getdate(),
RowsExtracted = ?
WHERE
SourceTableName = ?


Parameters used in this update statement include the number of rows extracted from the source table along with the source table name.


Conclusion
What we left with is a SSIS package that extracts the tables to process from a source table. As new source tables arrive, we insert the rows into our source table without having to modify our SSIS package.

0 comments: