Learned a lot about SQL Server Integration Services (SSIS) and ETL packages in the last few days…
I have 36,934 XML files in several subdirectories that needed to be imported into a SQL database so that I could report on the metadata and put together some convincing numbers on the level of indexing for a client project.
The first issue I faced was that Kodak Capture Pro doesn’t output the data in the best XML format. It formats the data like this:
<?xml version = "1.0" encoding = "utf-8"?>
<root>
  <document>
    <field level = "document" name = "Client Number" value = "00000314638"/>
    <field level = "document" name = "Client Short Name" value = "SOME CLIENT"/>
    <field level = "document" name = "IBS Oper Acct ACBS Cust Num" value = "00004437252"/>
    <field level = "document" name = "Document Type" value = "LaserPro Disbursement Request Form"/>
    <field level = "document" name = "Effective Date" value = "02-13-2014"/>
  </document>
</root>
What I needed was for it to be formatted like this:
<?xml version = "1.0" encoding = "utf-8"?>
<root>
  <document>
    <ClientNumber>00000314638</ClientNumber>
    <ClientShortName>SOME CLIENT</ClientShortName>
    <IBSOperAcctACBSCustNum>00004437252</IBSOperAcctACBSCustNum>
    <DocumentType>LaserPro Disbursement Request Form</DocumentType>
    <EffectiveDate>02-13-2014</EffectiveDate>
  </document>
</root>
So, thanks to some help on stackoverflow.com, I was able to create an SSIS XML Task that runs each XML file through an XSLT transform (a sketch of the stylesheet is below). I stored the transformed XML in a variable, which was then passed along to my SSIS Data Flow Task, where I could map the fields to my SQL database table columns and import the data!
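I won’t reproduce my exact stylesheet here, but a minimal XSLT for this kind of transform looks something like the sketch below. The key piece is turning each field’s name attribute into an element name, using translate() to strip out the spaces:

<?xml version="1.0" encoding="utf-8"?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="xml" indent="yes"/>

  <!-- Copy the root and document wrappers through unchanged -->
  <xsl:template match="/root">
    <root><xsl:apply-templates select="document"/></root>
  </xsl:template>

  <xsl:template match="document">
    <document><xsl:apply-templates select="field"/></document>
  </xsl:template>

  <!-- Turn each field's name attribute into an element name,
       stripping spaces so "Client Number" becomes ClientNumber -->
  <xsl:template match="field">
    <xsl:element name="{translate(@name, ' ', '')}">
      <xsl:value-of select="@value"/>
    </xsl:element>
  </xsl:template>
</xsl:stylesheet>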
The next issue I faced was rows where the data was too long and would get truncated, which caused the SSIS package to stop. So, I added error handling that redirected the row data and the error code/description to an error table in my database. This let me go back and correct the issues, but more importantly, it let the SSIS package keep running!
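The error table itself is nothing fancy. Mine isn’t reproduced here, and the table and column names below are just illustrative, but the general shape is something like this:

CREATE TABLE dbo.ImportErrors (
    ErrorID          INT IDENTITY(1,1) PRIMARY KEY,
    RawRowData       NVARCHAR(MAX)  NULL,  -- the redirected row data
    ErrorCode        INT            NULL,  -- error code from the error output
    ErrorDescription NVARCHAR(4000) NULL,  -- description looked up for that code
    LoggedAt         DATETIME NOT NULL DEFAULT GETDATE()
);

With the destination’s error output redirected there, a truncated row just becomes a row in the error table instead of a package failure.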
The last issue was that a few of the XML files were corrupted. Pete pointed me in the direction of writing some VB.NET code to handle that error. What I ended up doing was writing the filename of each invalid XML file to a text file and then, once again, letting the SSIS package continue on. When it was all done, only two files (out of 36,934) were bad. In both cases it was a simple missing “ in the element data that threw off the rest of the file. Easy fix!
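My actual Script Task code isn’t shown here, and the paths below are placeholders, but the per-file check boiled down to a try/catch around loading the document, roughly like this:

Imports System.IO
Imports System.Xml

Module XmlValidation
    ' Sketch of the per-file check that ran inside the SSIS Script Task.
    Function IsValidXml(ByVal fileName As String) As Boolean
        Try
            ' Loading the document forces a full parse; malformed markup
            ' (like a missing quote) throws an XmlException.
            Dim doc As New XmlDocument()
            doc.Load(fileName)
            Return True
        Catch ex As XmlException
            ' Log the bad filename and let the package keep going.
            File.AppendAllText("C:\Temp\BadXmlFiles.txt", fileName & Environment.NewLine)
            Return False
        End Try
    End Function
End Module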
I now have all 36,934 rows imported into my SQL database, which means I can write my SQL Server Reporting Services reports against that table and come up with my metrics!
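The reports still need to be written, but the queries behind them are simple. For example, something like this (the table name is made up for the example) gives a document count per type:

SELECT DocumentType,
       COUNT(*) AS DocumentCount
FROM dbo.DocumentIndex
GROUP BY DocumentType
ORDER BY DocumentCount DESC;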
It’s not fast though. It took 3,596 seconds to import the 36,934 rows of data. That breaks down to 59.93 minutes, or about 1/10th of a second per row. I’m guessing there are some inefficiencies in how I loop through the directories to build the list of XML files, but at the end of the day (literally, it’s now 11:22pm) it was more about getting the data in than worrying about performance. I’d like to go back and do some research on how to improve the speed, but that will be once the fire is put out and the reports are delivered!
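If the bottleneck really is building the file list, one thing I want to try is letting .NET walk the subdirectories in a single call instead of looping by hand. This is untested on this project and the path is a placeholder, but something like:

Imports System.IO

' EnumerateFiles streams results as it walks the tree, instead of
' building the entire 36,934-item list up front the way GetFiles does.
Dim xmlFiles = Directory.EnumerateFiles("C:\Scans", "*.xml", SearchOption.AllDirectories)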