2012年3月26日星期一

Flat File with Nested Data

I am looking to import data into SQL Server 2005 using SSIS. I want to take data that is contained in a flat file and place it into the various appropriate tables in my system. The flat file contains nested data. For example...

Bob,Smith,555-5555,123~3.33|245~1.99,Active

So I want to build a package that brings in the records as follows

Client Table: First Name, Last Name, Phone, and Status (Bob, Smith, 555-5555, Active)

Order Table: OrderID, Amount (ID 123 @. $3.33 and another row ID 245 @. $1.99). If possible I would also like to tie the orders to the client record that was inserted.

My first question is if SSIS supports nested fields as in my example. Can it break a file by commas, then within a field by other delimiters? If so how do I do this, and if not what is the recommend way to accomplish this sort of task.

My second quesiton is if it can do that, can it tie the Client and Order data on the fly?

Thanks.

Looks like you could get what you want using multicasts, conditional splits, merges/unions, etc...

That's terribly messy source data, and if you can normalize it before bringing it into SSIS, you should do so.|||

The main problem I see you having with correllating a client with an order is not having an ID available. At least you won't have a Client ID until after you insert the client. This limits your ability to insert both within the same Data Flow Task.

One way to solve this is to generate an ID for the client before inserting.

If you had an ID you could then just use the Multicast Shape to send each row to multiple destinations. One destination would be the Client table, and the other would be the Order table. Your Data Flow would then look something like this.

Flat File Source --> Derived Column to add ID --> Multicast --> Client table, Order table

|||

I do not have control of the data feed. It is coming from a provider and therefore is out of my control. I also simplified the data greatly as there are multiple nested fields and nested within nested. So my initial dilema is just how to parse the data. Presently I have looked at two options, but both seem to be non-optimal and trying to figure out a better way to do this.

1. I did this with BizTalk. The problem is BizTalk is horrible on the performance side. I was able to convert the nested files into an XML structure and then send an XML message to a web service using code to import the data. This is just not a good solution.

2. I can import the data once. Then re-run and loop over the columns that can be nested untill all that data is pulled out. This works at a decent speed, but complex, subject to issues with file changes, prone to errors, etc.

So I have not found a way to do this easier... I will look at multicasts, conditional splits, etc but not seeing how i could use them to get the initial file split properly. If you have specific informaiton, links, examples, etc of how i can handle a multi nested flat file please let me know.

Thanks again.

|||

This is easy enough... Bear with me as I try to explain for you:
Bring in your source using the Flat File Source. Delimit on ",". The field names I used were FirstName, LastName, Phone, OrderData, Active. All are strings.
Next, I chose to create a composite key because I'm assuming that first name, last name, and phone number ensures a unique record in your dataset. Creating our own counter on each load may not guarantee uniqueness when inserting into your destination tables. So, throw a derived column transformation onto the data flow. Connect it to your source and create a new column, Key. Its expression is "FirstName+LastName+Phone". I left its datatype as unicode string, although you can cast it back to string if you'd like to fit your destination.
Then, add a multicast to your data flow. Connect it to the derived column transformation.
Next, add a destination for your client table. Hook it into the multicast. Take the FirstName, LastName, Phone, Active, and Key fields.
Add a script component to another output of the multicast transform. Select two fields as input columns, OrderData and Key. Select "Inputs and Outputs". Click on Output 0 and change the property, SynchronousInputID to 0. Next, add three output columns, "order", "key", and "value". Set order to string. Set key to which ever datatype you chose in the derived column above. Set value to numeric (perferrably) and set the precision/scale appropriately. Next up is the script:


Imports System
Imports System.Data
Imports System.Math
Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper
Imports Microsoft.SqlServer.Dts.Runtime.Wrapper
Public Class ScriptMain
Inherits UserComponent
Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
Dim arrayOrder As Array
Dim arrayOrderValues As Array
arrayOrder = Row.OrderData.ToString.Split("|"c)
For Each order As String In arrayOrder
arrayOrderValues = order.Split("~"c)
Output0Buffer.AddRow()
' I'm assuming that there is two and only two values for each order - ordernum, amount
' I perform no error checking of the data
Output0Buffer.order = arrayOrderValues.GetValue(0).ToString()
Output0Buffer.value = Decimal.Parse(arrayOrderValues.GetValue(1).ToString())
' Attach the key from the source row to each of the new rows we create from the nested data
Output0Buffer.key = Row.Key
Next
End Sub
End Class

Next, take the script, and hook it up to your second destination, which is for the order table. Take all three fields.
Done.
--Phil

|||I get where you are going with this and it may be my best solution. I think you answered my one thought, which is that with the given components available in SSIS it is not doable, it must be done with .NET or other code in the script component. I will try your sample code and work with it a little and see how it works. As i stated there is nested within nested, so I assume I can then repeat what you did and multicast a nested field to another .NET script that further processes the record?|||My code splits everything out...

An example source file I used:
Bob,Smith,555-5555,123~3.33|245~1.99,Active
Phil,Brammer,535-3333,347~2.14|671~5.14,Inactive
Test,Name,321-3211,347~2.14|671~5.14|127~1.26|876~4.20,Active

The output from what I documented above:
CLIENT
FirstName LastName Phone Active Key
Bob Smith 555-5555 Active BobSmith555-5555
Phil Brammer 535-3333 Inactive PhilBrammer535-3333
Test Name 321-3211 Active TestName321-3211

ORDER
order key value
123 BobSmith555-5555 3.330
245 BobSmith555-5555 1.990
347 PhilBrammer535-3333 2.140
671 PhilBrammer535-3333 5.140
347 TestName321-3211 2.140
671 TestName321-3211 5.140
127 TestName321-3211 1.260
876 TestName321-3211 4.200|||

I think this will be a solution... atleast for now. I am working with your sample code and testing some throughput... and working with the real data to see if I see problem, but it looks promising. Thank you for your help and suggestions.

One quick question is you mention to set "SynchronousInputID to 0." What is this and why? I am sure I could look it up and find info, but thought it might be easier coming straight from you. Thanks again.

|||Allows you to add rows to the data flow. This sets it to asynchronous.|||This all is working well. Thank you for your help... I think that this will server well to break the flat file feed and get it into the database. The last issue that I think I will face is I will not be using a composite key as you do to tie the records. The table is using an Identity column, so records that are inserted will get a new unique value. Is there an easy way to get the id out and tie it to the other inserts?|||Well, I hate identity columns, so...

You'll have to do a lookup task to lookup a record in the dataflow with it's associated record in the table. HOWEVER, you'll likely have to split the single dataflow I provided into two separate dataflows first.sql

没有评论:

发表评论