News
Photos
Articles
Components
Applications
Kleinkunst

.NET - Querying Excel worksheets with LINQ

A while ago I was looking for a way to query Excel worksheets with LINQ. I found a very nice LINQ to Excel Provider developed by K. Jacobson. To use it, you have to implement a strong typed class that represents an Excel worksheet. The class and its properties need to be decorated with special ExcelProvider attributes. Once you have this class it becomes very easy to read data from a given Excel worksheet. And what is even more, this component also supports change tracking and writing changes back to the Excel worksheet. Jacobson's ExcelProvider is really great but I just needed a simpler solution that allowed me to query Excel worksheets in a tool like LINQPad without implementing special classes.

Finally I decided to write my own ExcelProvider. The first step was looking up some examples of reading Excel worksheets with the ADO.NET OleDbDataReader.

Then I started implementing the IExcelRow and IExcelColumn interfaces and internal classes. An ExcelRow object will store a collection of cell values and provide methods to access this data. An ExcelColumn object just holds the definition of a column. The private Load() method in the ExcelProvider class, which will be called in the factory methods, will do almost all the magic. It reads the given Excel file and stores all data in IEnumerable IExcelRow and IExcelColumn collections.

private void Load()
{
  string connectionString = GetConnectionString();
 
  using (OleDbConnection connection = new OleDbConnection(connectionString))
  {
    // Get OleDB connection to Excel (XLS/XLSX) or CSV file
    // and open it with a DataReader.
    connection.Open();
    using (OleDbCommand command = connection.CreateCommand())
    {
      command.CommandText = GetCommandText();
      using (OleDbDataReader reader = command.ExecuteReader())
      {
        // Run through fields and create column objects
        for (int i = 0; i < reader.FieldCount; i++)
        {
          columns.Add(new ExcelColumn(i, reader.GetName(i), reader.GetFieldType(i)));
        }
 
        // Run through records and create rows with cells 
        while (reader.Read())
        {
          ExcelRow newRow = new ExcelRow(columns);
          for (int index = 0; index < reader.FieldCount; index++)
          {
            newRow.AddCell(reader[index]);
          }
          rows.Add(newRow);
        }
      }
    }
  }
}

I wanted to support Excel 2007 XLSX files, older XLS files and ASCII CSV files. Therefore I only needed to change the connection string of the OleDbConnection object and the command string of the OleDbCommand.

private string GetConnectionString()
{
  switch (version)
  {
    case ExcelVersion.Excel2007:
      return string.Format(
        @"Provider=Microsoft.ACE.OLEDB.12.0;Data Source={0};Extended Properties=""Excel 12.0 Xml;HDR=YES""",
        fileName);
    case ExcelVersion.Csv:
      return string.Format(
        @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties=""text;HDR=Yes;FMT=Delimited;""",
        System.IO.Path.GetDirectoryName(fileName));
    default:
      return string.Format(
        @"Provider=Microsoft.Jet.OLEDB.4.0;Data Source={0};Extended Properties=""Excel 8.0;HDR=YES;""",
        fileName);
  }
}
 
private string GetCommandText()
{
  switch (version)
  {
    case ExcelVersion.Csv:
      return string.Format("SELECT * FROM {0}", System.IO.Path.GetFileName(fileName));
    default:
      return string.Format("SELECT * FROM [{0}$]", sheetName);
  }
}

All cell values are stored as Objects. As I wanted to avoid a lot of typecasting, I implemented some generic Get<T> and GetByName<T> methods in the ExcelRow class. These methods will try to convert the given cell value and return it in the given type. The Get<T> method can be called with a column index (starting from 1) but it also supports column headers (A-Z, AA-ZZ, ...). The GetByName<T> method expects a field name which is stored in the first row of the Excel worksheet.

private static T ConvertTo<T>(object data)
{
  try
  {
    if ((data is DBNull) || (data == null))
      return default(T);
    return (T) Convert.ChangeType(data, typeof (T));
  }
  catch
  {
    return default(T);
  }
} 
 
public T Get<T>(int columnIndex)
{
  if (CheckColumn(columnIndex) > -1)
    return ConvertTo<T>(cells[columnIndex - 1]);
  return default(T);
}
 
public T Get<T>(string columnHeader)
{
  int columnIndex = CheckColumn(columnHeader);
  if (columnIndex > -1)
    return ConvertTo<T>(cells[columnIndex]);
  return default(T);
}
 
public T GetByName<T>(string columnName)
{
  int columnIndex = CheckColumnByName(columnName);
  if (columnIndex > -1)
    return ConvertTo<T>(cells[columnIndex]);
  return default(T);
}

This is a class diagram of the ScipBe.Common.Office.ExcelProvider classes:

I've tried to briefly explain some of the most interesting parts of the implementation of this ExcelProvider. Now I would like to demonstrate its features by giving an example of a LINQ query and the result in LINQPad.

This is how the Persons worksheet in the Excel Persons.xlsx file looks like:

The LINQ query in LINQPad demonstrates the different ways of how to access the cell data.

var excel = ExcelProvider.Create(@"D:\Mijn Documents\LINQPad Queries\Linq to Excel\Persons.xlsx", "Persons");
//var excel = ExcelProvider.Create(@"D:\Mijn Documents\LINQPad Queries\Linq to Excel\Persons.xls", 
//    "Persons", ExcelVersion.ExcelXp);
//var excel = ExcelProvider.Create(@"D:\Mijn Documents\LINQPad Queries\Linq to Excel\Persons.csv", 
//    "", ExcelVersion.Csv);
 
excel.Columns.Dump("ExcelProvider : columns");
 
var query = 
  from r in excel.Rows
  select new
  {
    ID =  r[1],
    ID2 = r["A"],
    ID3 = r.Get<int>(1),
    ID4 = r.Get<string>(1),
    ID5 = r.Get<int>("A"),
    ID6 = r.Get<string>("A"),
    ID7 = r.GetByName<int>("ID"),
    ID8 = r.GetByName<string>("ID"),
    FirstName = r[2],
    LastName = r.Get<string>(3),
    Country = r.GetByName<string>("Country"),
    BirthDate = r.Get<DateTime>(5),
    BirthDate2 = r[5].ToString(),
    BirthDate3 = r["E"],
    BirthDate4 = r.GetByName<string>("BirthDate"),
    BirthDate5 = r.Get<DateTime>("E"),
    BirthDate6 = r.GetByName<DateTime>("BirthDate").AddMonths(1),
    Row = r.ToString()
  };
 
query.Dump("ExcelProvider : cell values from rows");

Take a look at one of my older articles Execute queries on Office data with LINQPad on how to add a reference to the ScipBe.Common.Office assembly which contains the ExcelProvider class.

This will be the result in LINQPad:

I hope you like this ExcelProvider implementation and that you can take advantage of this free code. Full sources and a help file can be found in the components section of my website.