Jun
18
2008

Excel in Integration Services, Part 2 of 3: Tables and Data Types

This post is the 2nd in a series of 3 planned installments about using Excel data files with Microsoft SQL Server Integration Services.

  1. Connections and components
  2. Tables and data types
  3. Issues and alternatives

Tables

What's a table in Excel?

The source of data in an Excel workbook can be:

  • A worksheet (for example, Sheet1$), which you can recognize in a list because it has the $ appended. In a SQL statement, the name of a worksheet must be delimited (for example, [Sheet1$]) to avoid a syntax error caused by the $ sign. If you use the Query Builder, it automatically adds these delimiters.
  • A named range (for example, MyRange), which you can recognize in a list because it does not have the $ appended.
  • An unnamed range specified by using cell coordinates (for example, Sheet1$A1:C100), which you normally have to specify in a SELECT statement.

Both worksheets and named ranges appear in lists of "tables" in an Excel data source.

Working with Excel tables

When you use the Excel driver to CREATE the table, as you might do for an Excel destination in the SQL Server Import and Export Wizard, the driver creates both a worksheet (MyOutput$) and a named range (MyOutput) with the name that you supply.

Whether you specify a worksheet or a range, the driver reads the contiguous block of cells starting with the first non-empty cell in the upper-left corner of the worksheet or range. Therefore you cannot have empty rows in the source data, or an empty row between the header row and the data rows. Similarly, if you use cell A1 for a worksheet title, you may have to assign a name to your range of data farther down, and use the named range instead of the worksheet name.

When you specify a range, you get an error if the range is not wide enough; that is, if it has fewer columns than you need. However, if the range that you've defined isn't long enough - that is, if it has fewer rows - the driver is intelligent enough not just to continue writing rows, but to extend the range definition as well.

Deleting and reusing

If you try to delete Excel data or tables through the Excel driver (for example, in an Execute SQL task), you'll probably get results that you don't expect. In my most recent testing:

  • A DELETE statement fails with an error.
  • A DROP TABLE statement, with either a named range or a worksheet, "succeeds" and, ironically, deletes the data, but does not delete the range or the worksheet.

It's not easy to re-use a worksheet that you used previously as a destination. Clearing the cell values or rows in Excel that were loaded by the driver does not make those rows "reusable"; the driver still treats the rows as if they contained data. Deleting the physical rows in Excel solves this problem, but if you're using a named range, of course that also shrinks or deletes your range.

If you want to "re-use" a worksheet in a particular format - for example, for a daily export to Excel - the cleanest solution is to create a blank workbook to save as a template, and to use a File System task to copy that workbook into the working directory when needed. Leave the template intact.

Data types

The Excel driver recognizes only a limited set of data types. For example, all numeric columns are interpreted as doubles (DT_R8), and all string columns (other than memo columns) are interpreted as 255-character Unicode strings (DT_WSTR).

Integration Services maps the Excel data types as follows:

  • Numeric – double-precision float (DT_R8)
  • Currency – currency (DT_CY)
  • Boolean – Boolean (DT_BOOL)
  • Date/time – datetime (DT_DATE)
  • String – Unicode string, length 255 (DT_WSTR)
  • Memo – Unicode text stream (DT_NTEXT)

"Hey, columns don't have data types in Excel!" No, and this is the root of the most common issues when using the Excel driver. So the driver assigns data types by sampling several rows of data (typically 8, by default) and picking the data type the represents the majority of the values in each column. In other words, it guesses. If Column B contains 5 strings and 3 numbers, then it's a string column. In a tie (for example, 4 strings and 4 numbers), numeric wins...this is a spreadsheet, after all! (Yes, I bothered to test this.) You can increase the number of rows sampled by increasing the value of TypeGuessRows under the HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Jet\4.0\Engines\Excel registry key.

"I wish I knew where to find more information about the data types." Well, if you have Microsoft Office, it's hidden on your hard drive in the file JETSQL40.CHM. I have a copy of this Help file in both C:\Program Files\Common Files\microsoft shared\OFFICE12\1033 (from Office 2007) and C:\Program Files\Common Files\microsoft shared\OFFICE11\1033 (from Office 2003).

Data type and length conversions

Integration Services does not implicitly convert data types. As a result, you may need to use Data Conversion or Derived Column transformations to convert Excel data explicitly before loading it into a non-Excel destination, or to convert non-Excel data before loading it into an Excel destination. In this case, it may be useful to create the initial package by using the Import and Export Wizard, which warns you about data type differences (in SQL Server 2008) and configures the necessary conversions for you. Some examples of the conversions that may be required include the following:

  • Conversion between Unicode Excel string columns and non-Unicode string columns with specific codepages
  • Conversion between 255-character Excel string columns and string columns of different lengths
  • Conversion between double-precision Excel numeric columns and numeric columns of other types

See also

Although these Books Online topics were also listed in our previous installment on Connections and components, they both discuss how to work with Excel tables, as well as Excel files.

In our next installment...

In Issues and alternatives, we'll take a look at the most common issues when importing from, or exporting to, Excel.

- Doug

Comments (27) -

Peter

So is there a max value that can be configured for "TypeGuessRows"?  I've tried setting that to some obscenely high values and still end up with char values NULLed out.  Seems like that might be due to the "most values of this type win" behaviour, though.  That seems odd to me - shouldn't the JET driver figure that if you have a mixed column, it moves up to the largest type able to handle it - whether that's nvarchar(255), ntext, or just float?


Short of the IMEX trick, can I just force things to be text or varchar?  Can I default the IMEX setting so it works with a Wizard and I don't have to manually build packages just to handle that piece?  It would be really nice to override whatever JET picks up and force data types on the columns, but I realize that's a JET thing and not an SSIS thing (unfortunately).


Enjoying the series so far - looking forward to Part 3 now.  Smile


(Would love a way to subscribe to the comments somehow.)


Douglas Laudenschlager

Peter, Thank you for your ongoing interest.


First, I have enabled "RSS for Comments" in Community Server, but I notice that I only see the option to subscribe when I am logged in to the site. I'll have to explore that CS issue.


You can tell TypeGuessRows to sample ALL rows - I think it's zero but would want to check - but that won't change the behavior of IMEX. I promise to finish Part 3 as soon as possible, but please don't anticipate miracles - some of the Excel driver issues simply don't have a workaround that satisfies everyone, and if they did, they'd be well-known after all these years with little change in the driver.


-Doug


Katrina

Hi Doug,


Your blog topics on Excel have been very helpful.  I'm hoping you might have some insight into this issue which is has me very puzzled.


My project needs to be able to read in multiple excel files from a single directory and process them.  The column mapping needs to be dynamic, but I've put together a solution for that.


The issue I can't seem to get around is that dates don't seem to be handled consistently from file to file under the Excel connection manager.  Sometimes they come back as formatted dates and sometimes they come back as integers.  In either case the output column in my excel source has the WSTR data type.


Because the excel files have column headers I'm using "IMEX=1" in my connection string.  This seems to work just dandy for all columns (currency, numeric, alphanumeric) except for the date columns.  


Sometimes I can preview them in my Excel data flow source and they are integers, and sometimes they are valid dates (mm/dd/yyyy).  My insert function dies if the dates are in integer format (not surprising), and I can't perform a data conversion on the column because the behavior varies from file to file.


Perhaps if I understood why this was happening I could figure out my own work around, but this behavior does have me stumped.


The connection string I'm using is -->  


"Provider=Microsoft.Jet.OLEDB.4.0;Data Source=" +  @[User::ExcelFilename] + ";Extended Properties=\"Excel 8.0;HDR=NO;IMEX=1\""


Any insight you could pass along would be very much appreciated.


Katrina


dougbert

Katrina,


Before we talk about the problem with dates, I'm curious about your statement that you're using IMEX=1 because your Excel data has column headers. Normally we would tell the driver this fact by using HDR=Yes; however I see HDR=No in your connection string.


In all the years that I've tried to follow issues with the Excel driver, I've almost never heard about problems with date columns, so I don't have a canned answer. However after a quick test, it looks like IMEX might be contributing to the problem here.


I created an .XLS and entered dates in one of the columns. Then I went through the first steps of the Import and Export Wizard to examine the Edit Mappings dialog box and the autogenerated SQL statement for creating a destination table to hold the data. Whether I used IMEX or not, the dates LOOKED correct in the Preview window. However: When I did NOT use IMEX, my column of dates was recognized as a datetime column; when I DID use IMEX, the driver seemed unable to identify the data type of the column. In fact, data type was blank.


Can you try again without IMEX?


-Doug


Katrina

Hi Doug,


The ultimate goal of this is an SSIS task which can read in Excel files from different clients.  The files contain invoice information which we need to automatically load into our system for further processing.


Unfortunately, I'm not guaranteed any type of consistent formatting in the Excel files I receive.  This includes the header rows which can be either 1,2  or 3 rows (so far).


I keep client profile information on the number of header rows and column positions of the data I want to extract.  I read in the profile information and create a dynamic select statement where I alias the standard rows from the spreadsheet (F1, F2, ... FN) to the appropriate columns in my data flow.  I also use the profile information to strip off the header rows (that's why HDR=NO).  


I'm using IMEX = 1 because I read all the issues associated with with Excel's typing of data, and how unexpected nulls could be inserted into the columns if the data didn't match what was originally guessed at.  I often have data that appears to be numeric (invoice numbers can be numeric for some clients alphanums for others), so I wanted to try to force Excel to always put that into character fields.  Handling the metadata for SSIS is very problematic if it's not consistent, that's the exact problem I'm running into with the date field.  But it could crop up in the invoice field as well (read an excel file in and have it interpret the invoice as numeric for 1 client and text for another, I haven't found a way in SSIS to correctly handle this).


I will try this morning  when I get in to remove "IMEX = 1" from the string, retest my various formats and see if any issues come up.


Oh by the way ... the date problem is very difficult to reproduce.  It usually goes away if I open the file and save it on the test machine, then rerun it.


Do you know if there is an Excel or OleDB forum where I could post this question?  I realize it is rather obscure.


Thx,


Kathy


dougbert

Kathy,


I see that you are not a novice to the challenges of working with the Excel driver.


One challenge is that it has never been clear what factors (other than the raw values) contribute to the driver's decision to classify a column as one data type rather than another. In some cases, Excel cell formatting seems to make a different; at other times, not. In the case of dates, it seems to matter, since formatting is all that distinguishes the underlying numeric value from any other floating point number.


When my searches lead me to Excel forums where users have posted questions about the Excel driver, I notice that the question is often misunderstood and rarely answered correctly - experts in working IN Excel don't necessarily know about driver issues. But they might be able to shed some light on the anomaly where simply opening and re-saving the workbook leads to better results.


Consider trying the SSIS forum and newsgroup first. In the newsgroup (the old DTS newsgroup), SSIS MVP Allan Mitchell has been answering Excel driver questions for many years.


forums.microsoft.com/.../showforum.aspx">forums.microsoft.com/.../showforum.aspx


www.microsoft.com/.../default.aspx">www.microsoft.com/.../default.aspx


Here is the Excel "general questions" newsgroup, but your question might quickly get buried...


www.microsoft.com/.../default.aspx">www.microsoft.com/.../default.aspx


Best wishes,


-Doug


Katrina

Hi Doug,


Thanks for the pointers.  I'm also going to download the latest version of the driver (12.0) to see if that makes any difference.


Katrina


Michael

Hi Doug,


Thanks for the blog and all of the workaround options.  The issue I encounter seems not to fit any previously described.  Using SSIS, I am trying to export data from SQL Server to EXCEL.  I created  a sheet in Excel 2002 with formatted fields I intend to load data into.  Then I created a Connection Manager to that Excel using the Excel Connection Manager. When in the Mappings section of the Excel Destination Editor, I mouse over the Destination Columms to see what SSIS is reading in from the destination Excel file.  In the XLS file itself, the data types of the 5 colums are TEXT, TEXT, NUMBER (0 decimals), NUMBER (0 decimals) and CUSTOM = mm/dd/yyyy.   Regardless of how I set these columns in XLS, SSIS Excel Destination editor reads them in as DT_WSTR.  What gives????  Why does not SSIS recognize the format of the field - there are no data values to compare here as the target is empty, save for 5 column headers.


Another bug in the Excel Destination object is that the target XLS file comes in as only having 1 field (F1) if you do not supply headings on the target file you are connected to.  I've selected WITH AND WITHOUT FIRST COLUMN HEADINGS on the Connection Manager. But both times the Excel Destination Editor only reflects 1 column of F1 to map to.  If I put 5 headers on the target worksheet, I see 5 columns on the Destination Editor.


And finally, if I format said headers with BOLD and SHADING, when the records are insterted from SSIS to EXCEL, the bold and shading carries forward with it AND all the fields are recognized as text and left aligned - even though the FORMAT CELLS option reflects NUMBER(0 decimals).  This BOLD/SHADING issue is discussed at length with no resolution on this MS Technet blog- forums.microsoft.com/.../ShowPost.aspx">forums.microsoft.com/.../ShowPost.aspx


Any insight would be greatly appreciated.


Several issues here.


dougbert

Michael,


Issues 1 and 2 are the result of the Excel driver's SAMPLING to learn about the source or destination worksheet:


1) No DATA to sample means the default is text. Most of the time, cell formatting - particularly of empty cells - seems to be ignored by the driver.


2) No COLUMNS means just 1 (F1) by default. I assume that you selected "First row has column names", by which you meant, "I WANT TO save column names into the first row". Unfortunately the driver relies instead on what it finds in the worksheet.


The Bold/Shading issue is one that I had never heard of...I missed that post in the forum. Since the driver ignores far more important aspects of cell formatting, it's really hard to believe that it would not only carry bolding forward, but that bold column headers would cause everything to be formatted as text! I'll have to take a look at that, but have no solution at the present time.


-Doug


dougbert

On second thought, Issue #3 is probably just Excel being helpful by copying the formatting above to newly-added rows below...just as it does when you're entering data manually.


In Excel 2007, this checkbox is titled, "Extend data range formats and formulas". I don't have Excel 2003 installed to find the equivalent option.


-Doug


acbpriya

dougbert,


I have two issues with Excel and SSIS


Sometime the  data is  appending  on the excel sheet. Do you know how to overcome this. I delete the .xls and create the new.xls from a template each time when the SSIS runs.  


Second issue is, one of the dates field ,  is not changing its  datetime format.  I can't even edit the format even though I can delete it and  change.  I want the format mm/dd/yyyy.  do you have any idea on these issue?.


thx,


priya


dougbert

Priya,


I am not sure what you mean by "appending", if you are starting with an empty workbook and worksheet. However, if you started with a sheet into which the driver once loaded data, then it still considers those rows as "used".


Cell formatting is an issue in the worksheet and not an issue with the driver. There are not "read-only" formats in Excel, although it's possible to "protect" a worksheet to prevent certain types of changes. I'm sure that the driver outputs datetime data in the default format for the locale, unless you have already formatted the column otherwise.


-Doug


Neil

Hi Doug,


You mention that you can use an unnamed range in an Excel worksheet.  Is it possible to use this when exporting data from SQL to Excel?


I have an existing excel worksheet that I want to export data to.  I'm having problems with numeric data not being recognised correctly in Excel so I tried writing to a new worksheet using the CREATE TABLE command.  The numeric data then worked OK as it had been specified as such.


I was hoping that I could use the same method to write to an unnamed range in an existing worksheet but the CREATE TABLE method always fails.


There is summary information at the top of the worksheet so I need to start writing my exported data at B14.


Any advice you can give would be much appreciated.


Neil


dougbert

Neil,


I don't believe the CREATE TABLE statement can be used with Excel to create a range at a specified location. By default, CREATE TABLE creates a new Worksheet, and at the same time a new Named Range with the same name (on the new Worksheet).


It seems to me that your options are:


1. Manually, in Excel, create a Named Range that starts at B14.


2. Or, in specifying the destination "table", specify the unnamed range that begins at B14.


If your source data is purely numeric and not mixed text and numeric, Excel should certainly recognize that and treat it as numbers, though it may not FORMAT the numeric data as you'd prefer.


Best wishes,


-Doug


Jenny

Hi,


 I have an issue when load a mixed excel data to table. The most values in the excel look like 2345.73 for some days while a few of them are 2345.74NT in another day so the destination column has to be varchar(). I tried to force the mixed data to Text but it then converted 2345.73 to 2345.729999999999999 which is painful as the data means an account number istead of a numeric amount.


  I tried many ways such as conver function or data type converion in Derived columns such as from DT_R8 to DT_STR, no help.


  The incorrect conversion (2345.73 to 2345.729999999999999) won't happen only if the destination column is type of "Number" such as float or decimal. But it's not acceptable as it means account number with some values like "2345.37NT".


  Very appreicated for any help.


Thanks,


Jenny


dougbert

Jenny,


The issue that you described seems to be one of several that may occur when using Import Mode to import mixed text and numeric data reliably. (The other issue is conversion to exponential or scientific notation.) There's no use discussing why the driver does this or whether it should - it does, and we have to find workarounds, as for its other quirks.


It sounds like your issue with numbers occurs only when the data is NOT an account number and does NOT have the alphabetic characters like "NT" on the end. So it seems that the logic needed to fix your situation is either:


If there are more than 4 characters to the right of the decimal, then


  Round to 2 places


or


If the rightmost 2 characters are not alphabetic, then


  Round to 2 places


Of course this could be implemented in code in a Script transformation, but the better solution if possible would be a Derived Column transformation that uses an SSIS expression to fix the data. Unfortunately I don't use the expression language enough to come up with the syntax instantly, early on a Monday morning after only 1 cup of coffee.


I strongly suggest that you re-post your issue in the large forum of SSIS experts at social.msdn.microsoft.com/.../">social.msdn.microsoft.com/.../threads. A single blogger is not a reliable source of technical support, and by the nature of my job, my knowledge of SSIS is broad but not always deep.


Best wishes,


-Doug


Nik

Hi how can i get the list of the field names of an Excel file for each tab within SSIS?


+ can you email me @ SNikkhah@Live.ca or tell me how to register @ your site/blog


dougbert

Nik,


A blog is not a good place for tech support. You will get an answer more quickly from a larger audience in the SSIS forum at social.msdn.microsoft.com/.../">social.msdn.microsoft.com/.../threads.


If by field names you mean column names in an Excel worksheet table or named range, then here is a pointer in the right direction. You would presumably use a Script task for this.


Basically, your script would:


1.  Connect to the Excel worksheet using OLE DB. (You could extract the file path from the Excel connection in the package, although the script would be creating its own connection.)


2.  Call GetOleDbSchemaTable and ask for the Columns schema table for the desired table.


The C# in http://support.microsoft.com/kb/318452">support.microsoft.com/.../318452 shows GetOleDbSchemaTable but makes it look much more complicated than it really is. Here are the critical lines out of that longer sample, which is most of the code (untested!) for steps #1 and #2 above.


using System.Data;


using System.Data.OleDb;


      private  OleDbConnection cn;


      private String strCn = "Provider=Microsoft.Jet.OLEDB.4.0; Data Source=c:\\<myexcelfile>.xls;Extended Properties=Excel 8.0";


      private DataTable dtColumns;


       String strTable = “<mytablename>”;


       cn = new OleDbConnection(strCn);


               cn.Open();


               dtColumns = cn.GetOleDbSchemaTable(OleDbSchemaGuid.Columns,new Object[]{null,null, strTable, null});


               cn.Close();


Best of luck,


-Doug


Nik

I have a SSIS package that is looping  through a folder and finds Excel files the second loop loops through each Tab or sheet and finally it inserts into a table in SQL, the tab/sheet have the same headers but because the headers are entered manually the user always forgets a field name and sometimes they type in a incorrect name, anyways i need to the list of the field names of each table/Sheet in excel tabs, how can i do that? The code that i have is as mentined


i need to get the list of field names for each Tab/Sheet


thanks


Nik


SNikkhah@live.ca


.......


   Public Sub Main()


       Try


           Dim excelFile As String


           Dim connectionString As String


           Dim excelConnection As OleDbConnection


           Dim tablesInFile As DataTable


           Dim tablesColumn As DataColumn


           Dim ColumnCollection As DataColumnCollection


           Dim strstr As String


           Dim Mycount As Integer


           Dim tableCount As Integer = 0


           Dim tableInFile As DataRow


           Dim currentTable As String


           Dim tableIndex As Integer = 0


           Dim excelTables As String()


           Dim LoopForNumnberOfRealTabs As Integer = 0


           ' IF the Flag "uVar_SourceMultiTabExcel" is TRUE, Then


           ' The system will loop through all tabs to get tab names


           ' ELASE FLASE


           ' we will be using the Fixed tab name that is in the "uVar_ExcelActiveTabName"


           '


           Dim flagMultiTab As String


           flagMultiTab = Dts.Variables("uVar_SourceMultiTabExcel").Value.ToString.ToUpper


           If flagMultiTab = "TRUE" Then


               'uVar_SourceMultiTabExcel, uVar_ExcelActiveTabName


               connectionString = Dts.Variables("uVar_SourceConStr_ConnectionString").Value.ToString


               excelConnection = New OleDbConnection(connectionString)


               excelConnection.Open()


               tablesInFile = excelConnection.GetSchema("Tables") '----("Restrictions")


               'tablesInFile = excelConnection.GetSchema("Columns") '----("Restrictions")


               tableCount = tablesInFile.Rows.Count


               For Each tableInFile In tablesInFile.Rows


                   currentTable = tableInFile.Item("TABLE_NAME").ToString


                   currentTable = currentTable.Replace("'", "")


                   If Right(currentTable, 1) = "$" Then


                       LoopForNumnberOfRealTabs += 1


                       ReDim Preserve excelTables(LoopForNumnberOfRealTabs - 1)


                       excelTables(LoopForNumnberOfRealTabs - 1) = currentTable


                       'Dim tbl As DataTable


                       'Dim Cols As DataColumnCollection


                       'tbl = tbl


                       'Cols = tbl.Columns


                       ''Cols = tableInFile.Table.Columns


                       ''www.devart.com/dotconnect/mysql/docs/MetaData.html">www.devart.com/.../MetaData.html


                       'Dim col As DataColumn


                       'Dim ssstr As String


                       'For Each col In Cols


                       '    ssstr = col.ColumnName


                       'Next


                   End If


               Next


               excelConnection.Close()


               excelConnection = Nothing


           Else ' Else if of ... If flagMultiTab = "TRUE" Then


               ReDim Preserve excelTables(0)


               excelTables(0) = Dts.Variables("uVar_ExcelActiveTabName").Value.ToString()


           End If ' end if of ... If flagMultiTab = "TRUE" Then


           Dts.Variables("uVar_ExcelTabObjectName").Value = excelTables


           Dts.TaskResult = Dts.Results.Success


       Catch ex As Exception


           Dim strEX As String


           strEX = ex.Message.ToString


           Dts.TaskResult = Dts.Results.Failure


       End Try


   End Sub


Nik

how can i convert .......dtColumns = cn.GetOleDbSchemaTable(OleDbSchemaGuid.Columns,new Object[]{null,null, strTable, null});


to VB.net?


have you ever used the excel file tab fields before?


thanks


dougbert

The question about retrieving a list of column names for each worksheet in an Excel file was answered in the following thread in the SSIS forum on MSDN:


social.msdn.microsoft.com/.../a5a8953f-acf1-4424-a021-60e8307dc642">social.msdn.microsoft.com/.../a5a8953f-acf1-4424-a021-60e8307dc642


-Doug


Nik

WOOOOOOOOOOO


thank you very very much, i had the same code but i had problems with it running but anyways i removed and added few codes and it worked


thanks a lot.


Linda
Graeme

Hi,

Thanks for the tips on Excel, very useful, however, I have tried your looping methods on Excel 2007 files, and the Jet driver does not seem to work with them. I have tried the Office 12 driver, but that did not work either. Can you read/list multiple sheets in an Excel 2007 file? Are there any special tricks to it?

Thanks,

Graeme

david

Thanks for this helpful information

hassan

i had the same code but i had problems with it running but anyways i removed and added few codes

Lempster

..."In all the years that I've tried to follow issues with the Excel driver, I've almost never heard about problems with date columns"...

I can only assume that you have never tried using a dynamic SSIS package to create an Excel 2007 workbook and export data into that workbook using an OPENROWSET...INSERT statement then Doug!

Try as I might, I cannot find a way to get datetime data to display as anything other then general text in an Excel 2007 workbook using the above method. It works fine in Excel 97-2003. It wouldn't be a problem if the text data observed the correct date order when sorting or woould allow the cells to be formatted as dates, but they resist all attempts to change them.

Pingbacks and trackbacks (1)+