How do I split flat file data and load into parent-child tables in database?
I have denormalized data (coming from a file) that needs to be imported into parent-child tables. The source data is something like this:
Account#  Name      Membership  Email
101       J Burns   Gold        firstname.lastname@example.org
101       J Burns   Gold        email@example.com
101       J Burns   Gold        firstname.lastname@example.org
227       H Gordon  Silver      email@example.com
350       B Clyde   Silver      firstname.lastname@example.org
350       B Clyde   Silver      email@example.com
What are the pieces, parts, or tactics of SSIS I should use to read the first three columns into a parent table, and the 4th column (Email) into a child table? I have several options for the parent key which I am permitted to take:
- Directly use the Account# as the primary key
- Use a surrogate key generated by SSIS during the import process
- Configure an identity primary key
I'm sure I've listed my primary key options in increasing order of difficulty. I'd be interested in knowing how to do the first and the last option - I'll infer how to achieve the middle option. To emphasize again, I'm interested in a decidedly SSIS solution; I'm looking for an answer that uses the language of SSIS, rather than a procedural, technology neutral answer.
My question is somewhat similar to another SO question, having an answer of vague viability. I'm hoping more detailed guidance could be given. I already know how to solve this problem by creating a "staging" middle-step, where the parent-child separation is actually handled with straight SQL. However, I'm curious about how this can be done without that kind of middle-step.
It seems to me this kind of import would be so common, that there would be a well-published formulaic way to handle it - a technique that SSIS excels at. As yet, I've not quite seen any straight up answer to this.
Update #1: Based on comments, I've adjusted the sample data to be more obviously denormalized. I also removed "flat" from "flat file," so that semantics don't interfere with the question.
Update #2: I've amplified my interest in a solution spoken in the language of SSIS.
Here is one option to consider for loading parent-child data. It consists of two steps. In the first step, read the source file and write the distinct parent rows to the parent table. In the second step, read the source file again and use a Lookup transformation to fetch the parent key, so that the child rows can be written to the child table with that key.
The following example uses the data provided in the question. It was created using SSIS 2008 R2 and a SQL Server 2008 database.
Create a sample flat file named Source.txt as shown in screenshot #1.
In the SQL Server database, create two tables named dbo.Parent and dbo.Child using the scripts given under the SQL Scripts section. Both tables have an auto-generated identity column.
On the package, add an OLE DB connection to connect to SQL Server and a Flat File connection to read the source file, as shown in screenshot #2. Configure the flat file connection as shown in screenshots #3 - #9.
On the Control Flow tab, place two Data Flow Tasks as shown in screenshot #10.
Inside the data flow task named Parent, place a Flat File source, Sort transformation and an OLE DB destination as shown in screenshot #11.
Configure the flat file source as shown in screenshots #12 and #13. We need to read the flat file source.
Configure the Sort transformation as shown in screenshot #14. We need to eliminate the duplicate values (the Sort transformation's "Remove rows with duplicate sort values" option does this) so that only the unique records are inserted into the parent table dbo.Parent.
Configure the OLE DB destination as shown in screenshots #15 and #16. We need to insert the data into the parent table dbo.Parent.
Inside the data flow task named Child, place a Flat File source, Lookup transformation and an OLE DB destination as shown in screenshot #17.
Configure the flat file source as shown in screenshots #12 and #13. This configuration is the same as that of the flat file source in the previous data flow task.
Configure the Lookup transformation as shown in screenshots #18 - #20. We need to find the parent id in the table dbo.Parent using the key columns present in the file. The key columns here are Account, Name and Membership, which are the columns stored in dbo.Parent. If the file happened to have a single column that uniquely identifies the parent, you could use that column alone to fetch the parent id.
Configure the OLE DB destination as shown in screenshots #21 and #22. We need to insert the Email column along with the parent id into the table dbo.Child.
Screenshot #23 shows data in the tables before the package execution.
Screenshots #24 and #25 show sample package execution.
Screenshot #26 shows data in the tables after the package execution.
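The two data flows described above can be mimicked outside SSIS to check the expected result. What follows is only a rough sketch in Python, not the package itself: SQLite stands in for SQL Server, the `AUTOINCREMENT` columns stand in for the identity columns, and the function name `load_parent_child` is made up for illustration. The first pass plays the role of the Sort transformation with duplicates removed, and the second pass plays the role of the Lookup transformation.

```python
import sqlite3

# Sample rows from the question: (Account, Name, Membership, Email).
SOURCE_ROWS = [
    ("101", "J Burns", "Gold", "firstname.lastname@example.org"),
    ("101", "J Burns", "Gold", "email@example.com"),
    ("101", "J Burns", "Gold", "firstname.lastname@example.org"),
    ("227", "H Gordon", "Silver", "email@example.com"),
    ("350", "B Clyde", "Silver", "firstname.lastname@example.org"),
    ("350", "B Clyde", "Silver", "email@example.com"),
]


def load_parent_child(conn, rows):
    """Hypothetical helper: load rows in two passes, parent first, then child."""
    cur = conn.cursor()
    # Stand-ins for dbo.Parent and dbo.Child (identity -> AUTOINCREMENT).
    cur.execute(
        "CREATE TABLE Parent (ParentId INTEGER PRIMARY KEY AUTOINCREMENT, "
        "Account TEXT, Name TEXT, Membership TEXT)"
    )
    cur.execute(
        "CREATE TABLE Child (ChildId INTEGER PRIMARY KEY AUTOINCREMENT, "
        "ParentId INTEGER, Email TEXT)"
    )

    # Pass 1 ("Parent" data flow): sort on the parent columns and drop
    # duplicates, then insert so the database generates the ParentId values.
    distinct_parents = sorted(set(row[:3] for row in rows))
    cur.executemany(
        "INSERT INTO Parent (Account, Name, Membership) VALUES (?, ?, ?)",
        distinct_parents,
    )

    # Pass 2 ("Child" data flow): read the source again and look up the
    # generated ParentId by (Account, Name, Membership).
    lookup = {
        (account, name, membership): parent_id
        for parent_id, account, name, membership in cur.execute(
            "SELECT ParentId, Account, Name, Membership FROM Parent"
        )
    }
    child_rows = [(lookup[row[:3]], row[3]) for row in rows]
    cur.executemany("INSERT INTO Child (ParentId, Email) VALUES (?, ?)", child_rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load_parent_child(conn, SOURCE_ROWS)

parent_rows = conn.execute(
    "SELECT ParentId, Account FROM Parent ORDER BY ParentId"
).fetchall()
child_counts = conn.execute(
    "SELECT p.Account, COUNT(*) FROM Child c "
    "JOIN Parent p ON p.ParentId = c.ParentId "
    "GROUP BY p.Account ORDER BY p.Account"
).fetchall()
print(parent_rows)
print(child_counts)
```

With the sample data this yields three parent rows and six child rows, each child row carrying the generated id of its parent.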
Hope that helps.
SQL Scripts:

CREATE TABLE [dbo].[Child](
    [ChildId] [int] IDENTITY(1,1) NOT NULL,
    [ParentId] [int] NULL,
    [Email] [varchar](21) NULL,
    CONSTRAINT [PK_Child] PRIMARY KEY CLUSTERED ([ChildId] ASC)
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[Parent](
    [ParentId] [int] IDENTITY(1,1) NOT NULL,
    [Account] [varchar](12) NULL,
    [Name] [varchar](12) NULL,
    [Membership] [varchar](14) NULL,
    CONSTRAINT [PK_Parent] PRIMARY KEY CLUSTERED ([ParentId] ASC)
) ON [PRIMARY]
GO
If the data is sorted and Account# is an integer I would:
Insert the emails into a child table, together with the Account# (add an auto-increment column; it's a best practice):
1  101  firstname.lastname@example.org
2  101  email@example.com
3  101  firstname.lastname@example.org
etc.
Then I would insert the other records to a parent table.
- using Account# as the primary key
- omitting the email addresses
- skipping duplicates (easy if the data is sorted).
If you have a foreign key relationship set up, you will need to do the second step (the parent insert) first, to avoid having any orphan records.
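As a rough illustration of the steps above, here is a minimal Python sketch, assuming the input is already sorted by Account# and that Account# is an integer. The function name `split_sorted_rows` is hypothetical, and the running counter only imitates an auto-increment column. Because `itertools.groupby` groups consecutive rows only, it relies on the same sorted-input assumption that makes skipping duplicates easy.

```python
from itertools import groupby
from operator import itemgetter

# Sample rows from the question, sorted by Account#:
# (Account#, Name, Membership, Email).
SOURCE_ROWS = [
    (101, "J Burns", "Gold", "firstname.lastname@example.org"),
    (101, "J Burns", "Gold", "email@example.com"),
    (101, "J Burns", "Gold", "firstname.lastname@example.org"),
    (227, "H Gordon", "Silver", "email@example.com"),
    (350, "B Clyde", "Silver", "firstname.lastname@example.org"),
    (350, "B Clyde", "Silver", "email@example.com"),
]


def split_sorted_rows(rows):
    """Hypothetical helper: split sorted rows into parent and child row lists."""
    parents = []   # (Account#, Name, Membership), Account# is the primary key
    children = []  # (Id, Account#, Email), Id imitates an auto-increment column
    for account, group in groupby(rows, key=itemgetter(0)):
        group = list(group)
        # The first row of each run supplies the parent; later rows in the
        # run are duplicates as far as the parent columns are concerned.
        _, name, membership, _ = group[0]
        parents.append((account, name, membership))
        for row in group:
            children.append((len(children) + 1, account, row[3]))
    return parents, children


parents, children = split_sorted_rows(SOURCE_ROWS)
# With a foreign key in place, write `parents` before `children`.
for parent in parents:
    print(parent)
for child in children:
    print(child)
```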
My two cents: I don't know what your requirements are but it seems a bit over-normalized. If there is a small limit on the number of email addresses, I would consider adding several email columns to the main table...for speed and simplicity.
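To make the single-table alternative concrete, here is a small Python sketch that folds the source rows into one row per account with a fixed number of email columns. The cap of three (`MAX_EMAILS`) and the function name `to_wide_rows` are assumptions made for illustration only; the real limit would depend on your requirements.

```python
# Sample rows from the question: (Account#, Name, Membership, Email).
SOURCE_ROWS = [
    (101, "J Burns", "Gold", "firstname.lastname@example.org"),
    (101, "J Burns", "Gold", "email@example.com"),
    (101, "J Burns", "Gold", "firstname.lastname@example.org"),
    (227, "H Gordon", "Silver", "email@example.com"),
    (350, "B Clyde", "Silver", "firstname.lastname@example.org"),
    (350, "B Clyde", "Silver", "email@example.com"),
]

MAX_EMAILS = 3  # assumed cap on email columns, not taken from the question


def to_wide_rows(rows, max_emails=MAX_EMAILS):
    """Hypothetical helper: one row per account with Email1..EmailN columns."""
    by_account = {}  # Account# -> [Name, Membership, [emails]]
    for account, name, membership, email in rows:
        record = by_account.setdefault(account, [name, membership, []])
        if len(record[2]) >= max_emails:
            raise ValueError(f"account {account} has more than {max_emails} emails")
        record[2].append(email)

    wide = []
    for account, (name, membership, emails) in by_account.items():
        # Pad unused email columns with None (NULL in the table).
        padded = emails + [None] * (max_emails - len(emails))
        wide.append((account, name, membership, *padded))
    return wide


wide_rows = to_wide_rows(SOURCE_ROWS)
for row in wide_rows:
    print(row)
```

The trade-off is visible in the sketch: any account with more emails than the cap has to be rejected or truncated, which the parent-child layout avoids.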