Thursday, August 18, 2016

ADO.NET: Reading CSV Files

It is 2016 and generating CSV files is a mature technology (can it really be called a technology?). For example ADO.NET has been able to consume a CSV file and create a corresponding DataTable for at least a dozen years (see the MSDN article from March 5, 2004, Much ADO About Text Files).  This post will cover the correct way to use ADO.NET to access CSV files and transform said files into an ADO.NET DataTable.

How not to do it

The impetus for this blog posting is a project I was assigned which accessed CSV files all the wrong ways. I was taken aback recently with a 2016  era project when I saw it include code to read each line in a text file (System.IO.File.ReadAllLines) and use Array.Split to process the CSV and at the same time to do it in a flawed manner. An example of this travesty against Alan Turing and all things good is as follows:

private static DataTable ReadCSVFile()
{
    string[] rows = File.ReadAllLines(GetCSVFilename());
    string[] columnNames = null;
    DataTable table = new DataTable();

    if (rows.Length == 0)
    {
        return table; // error empty CSV file
    }

    columnNames = rows[0].Split(',');
    foreach (string columnName in columnNames)
    {
        table.Columns.Add(columnName);
    }

    object[] rowValues = new object[columnNames.Length];

    // start from 1 b/c rowindex=0 is the header row of column names
    for (int rowIndex = 1; rowIndex < rows.Length; rowIndex++)
    {
        string rowLine = rows[rowIndex];
        // flaw in code -- any string containing a column will 
        // generate an extra column
        string[] columnValues = rowLine.Split(',');

        if (columnNames.Length != columnValues.Length)
        {
          throw new Exception(
            "Split() failed because there is a comma in the data.");

        }

        for (int columnIndex = 0; 
             columnIndex < columnValues.Length; 
             columnIndex++)
        {
            rowValues[columnIndex] = columnValues[columnIndex];
        }

        table.Rows.Add(rowValues);
    }

    return table; 
}

Pay close attention to the code marked in boldface which is a core flaw in how the CSV values are parsed. This bug is the reason it is better to use established technologies. The  bug show up at line 291 due this text data that contains a column:
  For urgent issues, do not send e-mail

How to access a CSV file using ADO.NET's OLE DB Provider

This article will demonstrated a cleaner way to convert a CSV file into a DataTable. The previously referenced MSDN article that demonstrates CSV file to DataTable code is actually flawed with respect to how connections to CSV files are handled and how the command text related to accessing values is written.  The example presented here will address these issues and said example is as follows:

public static DataTable GetDataTableFromCSV(string folderPath, 
                                            string csvFileName)
{
    DataTable table = new DataTable();
    string connectionText = String.Format(
        "Provider=Microsoft.Jet.OLEDB.4.0;Data Source='{0}';" 
          "Extended Properties='text;HDR=Yes';",
        folderPath);
    string commandText = String.Format(
        "SELECT * FROM [{0}]", csvFileName);

    using (OleDbConnection connection = 
                  new OleDbConnection(connectionText))
    {
        connection.Open();
        using (OleDbDataAdapter command = 
             new OleDbDataAdapter(commandText, connection))
        {
            command.Fill(table);
            command.Dispose();
        }

        connection.Close();
    }
}

The connection text in the previous code includes the folder in which the CSV file resides. The folder path is assigned to the Data Source property of the connection string as follows:
  "Provider=Microsoft.Jet.OLEDB.4.0;Data Source='{0}';" 
    "Extended Properties='text;HDR=Yes';",

Notice that the {0} String.Format parameters is bracketed by single quotes: '{0}'. This single quotes protect the code from any special characters in the folder path that would cause an error in the connection string.

The command text is a SQL SELECT where the CSV filename is the value specified for the FROM clause:
  "SELECT * FROM [{0}]"

Notice that the {0} String.Format parameters is bracketed by square brackets. In a perfect would this would protect the command execution from special characters in the filename.

Accessing the Source Code

Each project used in this application can be found in Github for user SoftwarePronto (https://github.com/softwarepronto/Blog). The solution associated with blog is:

  • ADODotNetReadingCSVFiles

The pertinent projects are:

  • BarbaricCSVAccess
  • CSVToDataTableUsingADODotNet 


Each project includes a sample CSV file that is part of the project and is copied to the build folder. The settings associated with supporting thsi were discussed in a previous posting, "Visual Studio Installer Projects: Installing PDFs and other Content with an Install Project".

No comments :

Post a Comment