OPENROWSET and BULK INSERT (4)

{Previous Article in Series}
Here are some final notes on this topic. I’m going to describe an import I did recently, and hope this may help you decide where OPENROWSET and BULK INSERT could fit into your development activity and what tools you might use.

I was doing an ad hoc load of some data from a small csv file. There was no corresponding table in the database to store this data, and I don’t like doing work for the sake of it, so my first stop was the “Sql Server 2014 Import and Export Data” wizard. I was hoping it would be a simple matter for the wizard to build a table and read the data into it. It crashed with an error which wasn’t very specific.

At this point I could have started tweaking the settings of the wizard, but I prefer to spend time on the problem rather than on the tool that was supposed to solve the problem. So I went straight on to working with OPENROWSET and creating a format file.

With no database table as a model I had to hand-craft it. This isn’t difficult, provided you know your way past one or two bumps in the road. The non-xml version is just a tab-delimited file, which makes MS Excel a very good editor. You get it the way you want it and just paste the cells into notepad and save it as a *.fmt file. The one I ended up with (after a number of modifications) looked like this:

12.0
40
1 SQLCHAR 0 1 "" 0 "" ""
2 SQLCHAR 0 0 "\",\"" 1 First Latin1_General_CI_AS
3 SQLCHAR 0 0 "\",\"" 2 Middle Latin1_General_CI_AS
4 SQLCHAR 0 0 "\",\"" 3 Last Latin1_General_CI_AS
5 SQLCHAR 0 0 "\",\"" 4 Nickname Latin1_General_CI_AS
6 SQLCHAR 0 0 "\",\"" 5 Email Latin1_General_CI_AS
7 SQLCHAR 0 0 "\",\"" 6 Category Latin1_General_CI_AS
8 SQLCHAR 0 0 "\",\"" 0 DistributionLists ""
...
40 SQLCHAR 0 0 "\"\r\n" 0 Comments ""

You can find detailed references on the format file elsewhere so I’ll just discuss the columns that are relevant here. Column 1 must have consecutive numbers for every field in the source file. Numbers in Column 6 determine the position of each field in the output, and there’s a zero where the field isn’t wanted. Column 7 provides labels for the columns – obviously this is not necessary if the field isn’t going to be returned, but probably useful as a reference. Excel made it easy to populate this column; I pasted the source file headers into a worksheet, then did a Copy/Paste Special into Column 7 with the “Transpose” box checked. Incidentally, the format file should finish with a line break.

I got the delimeters sorted out which initially were “,” for all but the last column, and I was able to retrieve the values in my sql query. I could now see the probable cause of the problem with the import/export wizard: There was a text qualifier on the columns with alphabetical data. The odd thing was that the qualifier was only used where there was a delimiter embedded in the data. I’m pretty sure there’s no way that BULK INSERT can do this kind of row-level processing, and it wouldn’t be very fast if it did.

I decided that the easiest way to deal with this was by writing a Powershell script to add text qualifiers to every row, and in fact to all columns in the file. This solved the problem (you’ll find details of how to build the OPENROWSET query in an earlier article in this series). What I like about Powershell is that it’s just a script; there’s not much temptation to “gold-plate” it and turn it into an application or a library, so I get the ad hoc work done without distraction. If it needs to go into Production, I’ll rewrite it in C# using the script as a crib sheet; otherwise it goes into the toolbox for next time.

The revised delimeters are shown above. I reran the query, and could see another problem. In the first column of my output every value was prefixed by a double quote. This is a problem with any file in which column 1 has text-qualified values. The solution was to add a new field at row 1 of the specifications in the format file. This has a fixed-width of 1 character, and isn’t selected for output. So the double quote drops out.

While we’re looking at Column 4, you’ll see that the other columns are sized as zero. Since the file is tab-delimited, you surely want the BULK driver to use only the delimeters to figure out where the columns start and end. On the other hand, if like me you’re trying to determine the best size for columns in your target table, it might be useful to add some test values to the columns. You’ll get a nice specific error message for any truncated values that not only tells you the row and column numbers, but also the name of the column that’s not wide enough.
As I mentioned in an earlier header, if you have non-character type definitions in your format file, the FIRSTROW setting in the T-Sql OPENROWSET clause won’t prevent errors, but in this case I was using character types for all the columns so my final change was to get rid of the headers with “FIRSTROW = 2”. If you use it, just remember to position this below the BULK specification.

So at all stages of developing this load, by using OPENROWSET I could get pretty good information on why the import was going wrong, and fine-grained control to help me fix it. I was quickly able to specify the destination table, and I have what I need to build a Production solution in SSIS in any of several different ways. So for me, OPENROWSET and BULK INSERT/BCP format files are valuable tools for ETL development.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

This site uses Akismet to reduce spam. Learn how your comment data is processed.