Dougbert.com
Microsoft SQL Server Integration Services stuff by Douglas Laudenschlager.

Excel in Integration Services, Part 2 of 3: Tables and Data Types

This post is the 2nd in a series of 3 planned installments about using Excel data files with Microsoft SQL Server Integration Services.

  1. Connections and components
  2. Tables and data types
  3. Issues and alternatives

Tables

What's a table in Excel?

The source of data in an Excel workbook can be:

  • A worksheet (for example, Sheet1$), which you can recognize in a list because it has the $ appended. In a SQL statement, the name of a worksheet must be delimited (for example, [Sheet1$]) to avoid a syntax error caused by the $ sign. If you use the Query Builder, it automatically adds these delimiters.
  • A named range (for example, MyRange), which you can recognize in a list because it does not have the $ appended.
  • An unnamed range specified by using cell coordinates (for example, Sheet1$A1:C100), which you normally have to specify in a SELECT statement.

Both worksheets and named ranges appear in lists of "tables" in an Excel data source.

Working with Excel tables

When you use the Excel driver to CREATE the table, as you might do for an Excel destination in the SQL Server Import and Export Wizard, the driver creates both a worksheet (MyOutput$) and a named range (MyOutput) with the name that you supply.

Whether you specify a worksheet or a range, the driver reads the contiguous block of cells starting with the first non-empty cell in the upper-left corner of the worksheet or range. Therefore you cannot have empty rows in the source data, or an empty row between the header row and the data rows. Similarly, if you use cell A1 for a worksheet title, you may have to assign a name to your range of data farther down, and use the named range instead of the worksheet name.

When you specify a range, you get an error if the range is not wide enough; that is, if it has fewer columns than you need. However, if the range that you've defined isn't long enough - that is, if it has fewer rows - the driver is intelligent enough not just to continue writing rows, but to extend the range definition as well.

Deleting and reusing

If you try to delete Excel data or tables through the Excel driver (for example, in an Execute SQL task), you'll probably get results that you don't expect. In my most recent testing:

  • A DELETE statement fails with an error.
  • A DROP TABLE statement, with either a named range or a worksheet, "succeeds" and, ironically, deletes the data, but does not delete the range or the worksheet.

It's not easy to re-use a worksheet that you used previously as a destination. Clearing the cell values or rows in Excel that were loaded by the driver does not make those rows "reusable"; the driver still treats the rows as if they contained data. Deleting the physical rows in Excel solves this problem, but if you're using a named range, of course that also shrinks or deletes your range.

If you want to "re-use" a worksheet in a particular format - for example, for a daily export to Excel - the cleanest solution is to create a blank workbook to save as a template, and to use a File System task to copy that workbook into the working directory when needed. Leave the template intact.

Data types

The Excel driver recognizes only a limited set of data types. For example, all numeric columns are interpreted as doubles (DT_R8), and all string columns (other than memo columns) are interpreted as 255-character Unicode strings (DT_WSTR).

Integration Services maps the Excel data types as follows:

  • Numeric – double-precision float (DT_R8)
  • Currency – currency (DT_CY)
  • Boolean – Boolean (DT_BOOL)
  • Date/time – datetime (DT_DATE)
  • String – Unicode string, length 255 (DT_WSTR)
  • Memo – Unicode text stream (DT_NTEXT)

"Hey, columns don't have data types in Excel!" No, and this is the root of the most common issues when using the Excel driver. So the driver assigns data types by sampling several rows of data (typically 8, by default) and picking the data type the represents the majority of the values in each column. In other words, it guesses. If Column B contains 5 strings and 3 numbers, then it's a string column. In a tie (for example, 4 strings and 4 numbers), numeric wins...this is a spreadsheet, after all! (Yes, I bothered to test this.) You can increase the number of rows sampled by increasing the value of TypeGuessRows under the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel registry key.

"I wish I knew where to find more information about the data types." Well, if you have Microsoft Office, it's hidden on your hard drive in the file JETSQL40.CHM. I have a copy of this Help file in both C:\Program Files\Common Files\microsoft shared\OFFICE12\1033 (from Office 2007) and C:\Program Files\Common Files\microsoft shared\OFFICE11\1033 (from Office 2003).

Data type and length conversions

Integration Services does not implicitly convert data types. As a result, you may need to use Data Conversion or Derived Column transformations to convert Excel data explicitly before loading it into a non-Excel destination, or to convert non-Excel data before loading it into an Excel destination. In this case, it may be useful to create the initial package by using the Import and Export Wizard, which warns you about data type differences (in SQL Server 2008) and configures the necessary conversions for you. Some examples of the conversions that may be required include the following:

  • Conversion between Unicode Excel string columns and non-Unicode string columns with specific codepages
  • Conversion between 255-character Excel string columns and string columns of different lengths
  • Conversion between double-precision Excel numeric columns and numeric columns of other types

See also

Although these Books Online topics were also listed in our previous installment on Connections and components, they both discuss how to work with Excel tables, as well as Excel files.

In our next installment...

In Issues and alternatives, we'll take a look at the most common issues when importing from, or exporting to, Excel.

- Doug


Posted Jun 18 2008, 11:24 by dougbert
Filed under:

Comments

Peter wrote re: Excel in Integration Services, Part 2 of 3: Tables and Data Types
on 06-18-2008 9:49 PM

So is there a max value that can be configured for "TypeGuessRows"?  I've tried setting that to some obscenely high values and still end up with char values NULLed out.  Seems like that might be due to the "most values of this type win" behaviour, though.  That seems odd to me - shouldn't the JET driver figure that if you have a mixed column, it moves up to the largest type able to handle it - whether that's nvarchar(255), ntext, or just float?

Short of the IMEX trick, can I just force things to be text or varchar?  Can I default the IMEX setting so it works with a Wizard and I don't have to manually build packages just to handle that piece?  It would be really nice to override whatever JET picks up and force data types on the columns, but I realize that's a JET thing and not an SSIS thing (unfortunately).

Enjoying the series so far - looking forward to Part 3 now.  :-)

(Would love a way to subscribe to the comments somehow.)

dougbert wrote re: Excel in Integration Services, Part 2 of 3: Tables and Data Types
on 06-19-2008 7:43

Peter, Thank you for your ongoing interest.

First, I have enabled "RSS for Comments" in Community Server, but I notice that I only see the option to subscribe when I am logged in to the site. I'll have to explore that CS issue.

You can tell TypeGuessRows to sample ALL rows - I think it's zero but would want to check - but that won't change the behavior of IMEX. I promise to finish Part 3 as soon as possible, but please don't anticipate miracles - some of the Excel driver issues simply don't have a workaround that satisfies everyone, and if they did, they'd be well-known after all these years with little change in the driver.

-Doug

Katrina wrote re: Excel in Integration Services, Part 2 of 3: Tables and Data Types
on 06-19-2008 2:32 PM

Hi Doug,

Your blog topics on Excel have been very helpful.  I'm hoping you might have some insight into this issue which is has me very puzzled.

My project needs to be able to read in multiple excel files from a single directory and process them.  The column mapping needs to be dynamic, but I've put together a solution for that.

The issue I can't seem to get around is that dates don't seem to be handled consistently from file to file under the Excel connection manager.  Sometimes they come back as formatted dates and sometimes they come back as integers.  In either case the output column in my excel source has the WSTR data type.

Because the excel files have column headers I'm using "IMEX=1" in my connection string.  This seems to work just dandy for all columns (currency, numeric, alphanumeric) except for the date columns.  

Sometimes I can preview them in my Excel data flow source and they are integers, and sometimes they are valid dates (mm/dd/yyyy).  My insert function dies if the dates are in integer format (not surprising), and I can't perform a data conversion on the column because the behavior varies from file to file.

Perhaps if I understood why this was happening I could figure out my own work around, but this behavior does have me stumped.

The connection string I'm using is -->  

"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" +  @[User::ExcelFilename] + ";Extended Properties=\"Excel 8.0;HDR=NO;IMEX=1\""

Any insight you could pass along would be very much appreciated.

Katrina

dougbert wrote re: Excel in Integration Services, Part 2 of 3: Tables and Data Types
on 06-21-2008 7:57

Katrina,

Before we talk about the problem with dates, I'm curious about your statement that you're using IMEX=1 because your Excel data has column headers. Normally we would tell the driver this fact by using HDR=Yes; however I see HDR=No in your connection string.

In all the years that I've tried to follow issues with the Excel driver, I've almost never heard about problems with date columns, so I don't have a canned answer. However after a quick test, it looks like IMEX might be contributing to the problem here.

I created an .XLS and entered dates in one of the columns. Then I went through the first steps of the Import and Export Wizard to examine the Edit Mappings dialog box and the autogenerated SQL statement for creating a destination table to hold the data. Whether I used IMEX or not, the dates LOOKED correct in the Preview window. However: When I did NOT use IMEX, my column of dates was recognized as a datetime column; when I DID use IMEX, the driver seemed unable to identify the data type of the column. In fact, data type was blank.

Can you try again without IMEX?

-Doug

Katrina wrote re: Excel in Integration Services, Part 2 of 3: Tables and Data Types
on 06-23-2008 2:52

Hi Doug,

The ultimate goal of this is an SSIS task which can read in Excel files from different clients.  The files contain invoice information which we need to automatically load into our system for further processing.

Unfortunately, I'm not guaranteed any type of consistent formatting in the Excel files I receive.  This includes the header rows which can be either 1,2  or 3 rows (so far).

I keep client profile information on the number of header rows and column positions of the data I want to extract.  I read in the profile information and create a dynamic select statement where I alias the standard rows from the spreadsheet (F1, F2, ... FN) to the appropriate columns in my data flow.  I also use the profile information to strip off the header rows (that's why HDR=NO).  

I'm using IMEX = 1 because I read all the issues associated with with Excel's typing of data, and how unexpected nulls could be inserted into the columns if the data didn't match what was originally guessed at.  I often have data that appears to be numeric (invoice numbers can be numeric for some clients alphanums for others), so I wanted to try to force Excel to always put that into character fields.  Handling the metadata for SSIS is very problematic if it's not consistent, that's the exact problem I'm running into with the date field.  But it could crop up in the invoice field as well (read an excel file in and have it interpret the invoice as numeric for 1 client and text for another, I haven't found a way in SSIS to correctly handle this).

I will try this morning  when I get in to remove "IMEX = 1" from the string, retest my various formats and see if any issues come up.

Oh by the way ... the date problem is very difficult to reproduce.  It usually goes away if I open the file and save it on the test machine, then rerun it.

Do you know if there is an Excel or OleDB forum where I could post this question?  I realize it is rather obscure.

Thx,

Kathy

dougbert wrote re: Excel in Integration Services, Part 2 of 3: Tables and Data Types
on 06-23-2008 7:23

Kathy,

I see that you are not a novice to the challenges of working with the Excel driver.

One challenge is that it has never been clear what factors (other than the raw values) contribute to the driver's decision to classify a column as one data type rather than another. In some cases, Excel cell formatting seems to make a different; at other times, not. In the case of dates, it seems to matter, since formatting is all that distinguishes the underlying numeric value from any other floating point number.

When my searches lead me to Excel forums where users have posted questions about the Excel driver, I notice that the question is often misunderstood and rarely answered correctly - experts in working IN Excel don't necessarily know about driver issues. But they might be able to shed some light on the anomaly where simply opening and re-saving the workbook leads to better results.

Consider trying the SSIS forum and newsgroup first. In the newsgroup (the old DTS newsgroup), SSIS MVP Allan Mitchell has been answering Excel driver questions for many years.

forums.microsoft.com/.../showforum.aspx

www.microsoft.com/.../default.aspx

Here is the Excel "general questions" newsgroup, but your question might quickly get buried...

www.microsoft.com/.../default.aspx

Best wishes,

-Doug

Katrina wrote re: Excel in Integration Services, Part 2 of 3: Tables and Data Types
on 06-23-2008 12:04 PM

Hi Doug,

Thanks for the pointers.  I'm also going to download the latest version of the driver (12.0) to see if that makes any difference.

Katrina

Michael wrote re: Excel in Integration Services, Part 2 of 3: Tables and Data Types
on 06-25-2008 9:03

Hi Doug,

Thanks for the blog and all of the workaround options.  The issue I encounter seems not to fit any previously described.  Using SSIS, I am trying to export data from SQL Server to EXCEL.  I created  a sheet in Excel 2002 with formatted fields I intend to load data into.  Then I created a Connection Manager to that Excel using the Excel Connection Manager. When in the Mappings section of the Excel Destination Editor, I mouse over the Destination Columms to see what SSIS is reading in from the destination Excel file.  In the XLS file itself, the data types of the 5 colums are TEXT, TEXT, NUMBER (0 decimals), NUMBER (0 decimals) and CUSTOM = mm/dd/yyyy.   Regardless of how I set these columns in XLS, SSIS Excel Destination editor reads them in as DT_WSTR.  What gives????  Why does not SSIS recognize the format of the field - there are no data values to compare here as the target is empty, save for 5 column headers.

Another bug in the Excel Destination object is that the target XLS file comes in as only having 1 field (F1) if you do not supply headings on the target file you are connected to.  I've selected WITH AND WITHOUT FIRST COLUMN HEADINGS on the Connection Manager. But both times the Excel Destination Editor only reflects 1 column of F1 to map to.  If I put 5 headers on the target worksheet, I see 5 columns on the Destination Editor.

And finally, if I format said headers with BOLD and SHADING, when the records are insterted from SSIS to EXCEL, the bold and shading carries forward with it AND all the fields are recognized as text and left aligned - even though the FORMAT CELLS option reflects NUMBER(0 decimals).  This BOLD/SHADING issue is discussed at length with no resolution on this MS Technet blog- forums.microsoft.com/.../ShowPost.aspx

Any insight would be greatly appreciated.

Several issues here.

dougbert wrote re: Excel in Integration Services, Part 2 of 3: Tables and Data Types
on 06-25-2008 4:42 PM

Michael,

Issues 1 and 2 are the result of the Excel driver's SAMPLING to learn about the source or destination worksheet:

1) No DATA to sample means the default is text. Most of the time, cell formatting - particularly of empty cells - seems to be ignored by the driver.

2) No COLUMNS means just 1 (F1) by default. I assume that you selected "First row has column names", by which you meant, "I WANT TO save column names into the first row". Unfortunately the driver relies instead on what it finds in the worksheet.

The Bold/Shading issue is one that I had never heard of...I missed that post in the forum. Since the driver ignores far more important aspects of cell formatting, it's really hard to believe that it would not only carry bolding forward, but that bold column headers would cause everything to be formatted as text! I'll have to take a look at that, but have no solution at the present time.

-Doug

dougbert wrote re: Excel in Integration Services, Part 2 of 3: Tables and Data Types
on 06-25-2008 4:46 PM

On second thought, Issue #3 is probably just Excel being helpful by copying the formatting above to newly-added rows below...just as it does when you're entering data manually.

In Excel 2007, this checkbox is titled, "Extend data range formats and formulas". I don't have Excel 2003 installed to find the equivalent option.

-Doug

acbpriya wrote re: Excel in Integration Services, Part 2 of 3: Tables and Data Types
on 06-26-2008 10:15

dougbert,

I have two issues with Excel and SSIS

Sometime the  data is  appending  on the excel sheet. Do you know how to overcome this. I delete the .xls and create the new.xls from a template each time when the SSIS runs.  

Second issue is, one of the dates field ,  is not changing its  datetime format.  I can't even edit the format even though I can delete it and  change.  I want the format mm/dd/yyyy.  do you have any idea on these issue?.

thx,

priya

dougbert wrote re: Excel in Integration Services, Part 2 of 3: Tables and Data Types
on 06-26-2008 7:53 PM

Priya,

I am not sure what you mean by "appending", if you are starting with an empty workbook and worksheet. However, if you started with a sheet into which the driver once loaded data, then it still considers those rows as "used".

Cell formatting is an issue in the worksheet and not an issue with the driver. There are not "read-only" formats in Excel, although it's possible to "protect" a worksheet to prevent certain types of changes. I'm sure that the driver outputs datetime data in the default format for the locale, unless you have already formatted the column otherwise.

-Doug

Add a Comment

(required)  
(optional)
(required)  
Remember Me?
Powered by Community Server (Non-Commercial Edition), by Telligent Systems