Back to Basics - Keep it Simple and Develop Your Sense of Smell - From Linq To CSV
I was working with a friend recently on a side thing they were doing. They wanted to create an "Export" function for some small bit of data and start it from their website.
- You'd hit a URL after logging in
- Some data would come out of a database
- You'd get a .CSV file downloaded
- You could open it in Excel or whatever.
I spoke to my friend and they said it was cool to share their code for this post. This post isn't meant to be a WTF or OMG look at that code; it's meant to talk about some of the underlying issues. There are a few things going on here, and it's not all their fault, but it smells.
- They are using a Page when an IHttpHandler will do.
- Not a huge deal, but there's overhead in making a Page, and they're not using any of the things that make a Page a Page. The call to ClearContents is an attempt at telling the Page to back off. It's easier to just not have a page.
- They're "thrashing" a bit in the Page_Load, basically "programming by coincidence."
- They're not 100% sure of what needs to be done, so they're doing everything until it works. Even setting defaults many times, or calling methods that set properties and then setting those properties again.
- This means a few things. First, HTTP is subtle. Second, the Response APIs are confusing (less so in .NET 4) and it's easy to do the same thing in two ways.
- They're not using using() or IDisposable
- They're cleaning up MemoryStreams and StreamWriters, but if an exception happens, things'll get cleaned up whenever. It's not tight in a "cleaning up after yourself" deterministic (when it needs to be) kind of way.
- They're calling Flush() when it's not really needed, again programming by coincidence. "I think this needs to be done and it doesn't break..."
- Old school data access
- Not bad, per se, but it could be easier to write and easier to read. DataAdapters and DataSets are a hard way to do data access once you've used Linq to SQL, EF or NHibernate.
- Re-Throwing an Exception via "throw ex"
- When you want to re-throw an exception, ALWAYS use a bare throw; or you'll lose your original call stack and it'll be hard to debug your code.
- Not reusable at all
- Now, reuse isn't the end-all, but it's nice. If this programmer wants different kinds of exports, they'll need to extract a lot from the spaghetti.
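The re-throw point is worth a concrete sketch. The class and method names here are made up for illustration:

```csharp
using System;

// Illustrates the difference between "throw ex" and a bare "throw".
public class RethrowDemo
{
    private static void DoWork()
    {
        throw new InvalidOperationException("boom");
    }

    // BAD: "throw ex" resets the stack trace; the DoWork frame
    // that actually failed is lost from the exception.
    public static void RethrowWrong()
    {
        try { DoWork(); }
        catch (Exception ex) { throw ex; }
    }

    // GOOD: a bare "throw" re-throws the same exception with its
    // original stack trace intact.
    public static void RethrowRight()
    {
        try { DoWork(); }
        catch { throw; }
    }
}
```

With the bare throw, a debugger or log still shows DoWork as the origin of the failure; with "throw ex", the trace starts over at the catch block.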
Here's what they came up with first.
Note that this code is not ideal. This is the before. Don't use it.
using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
using System.Web.UI;
using System.Web.UI.WebControls;
using System.IO;
using System.Configuration;
using System.Data.SqlClient;
using System.Data;
namespace FooFoo
{
public partial class Download : System.Web.UI.Page
{
protected void Page_Load(object sender, EventArgs e)
{
System.IO.MemoryStream ms = CreateMemoryFile();
byte[] byteArray = ms.ToArray();
ms.Flush();
ms.Close();
Response.Clear();
Response.ClearContent();
Response.ClearHeaders();
Response.Cookies.Clear();
Response.Cache.SetCacheability(HttpCacheability.Private);
Response.CacheControl = "private";
Response.Charset = System.Text.UTF8Encoding.UTF8.WebName;
Response.ContentEncoding = System.Text.UTF8Encoding.UTF8;
Response.AppendHeader("Pragma", "cache");
Response.AppendHeader("Expires", "60");
Response.ContentType = "text/comma-separated-values";
Response.AddHeader("Content-Disposition", "attachment; filename=FooFoo.csv");
Response.AddHeader("Content-Length", byteArray.Length.ToString());
Response.BinaryWrite(byteArray);
}
public System.IO.MemoryStream CreateMemoryFile()
{
MemoryStream ReturnStream = new MemoryStream();
try
{
string strConn = ConfigurationManager.ConnectionStrings["FooFooConnectionString"].ToString();
SqlConnection conn = new SqlConnection(strConn);
SqlDataAdapter da = new SqlDataAdapter("SELECT * FROM [FooFoo] ORDER BY id ASC", conn);
DataSet ds = new DataSet();
da.Fill(ds, "FooFoo");
DataTable dt = ds.Tables["FooFoo"];
//Create a streamwriter to write to the memory stream
StreamWriter sw = new StreamWriter(ReturnStream);
int iColCount = dt.Columns.Count;
for (int i = 0; i < iColCount; i++)
{
sw.Write(dt.Columns[i]);
if (i < iColCount - 1)
{
sw.Write(",");
}
}
sw.WriteLine();
int intRows = dt.Rows.Count;
// Now write all the rows.
foreach (DataRow dr in dt.Rows)
{
for (int i = 0; i < iColCount; i++)
{
if (!Convert.IsDBNull(dr[i]))
{
string str = String.Format("\"{0:c}\"", dr[i].ToString()).Replace("\r\n", " ");
sw.Write(str);
}
else
{
sw.Write("");
}
if (i < iColCount - 1)
{
sw.Write(",");
}
}
sw.WriteLine();
}
sw.Flush();
sw.Close();
}
catch (Exception Ex)
{
throw Ex;
}
return ReturnStream;
}
}
}
I don't claim to be a good programmer, but I do OK. I went over my concerns with my friend, and suggested first an HttpHandler. I started with Phil's basic abstract HttpHandler (based on my super basic HttpHandler boilerplate). I could certainly have done this by just implementing IHttpHandler, but I like this way. They're about the same number of lines of code. The important part is in HandleRequest (or ProcessRequest).
namespace FooFoo
{
public class ExportHandler : BaseHttpHandler
{
public override void HandleRequest(HttpContext context)
{
context.Response.AddHeader("Content-Disposition", "attachment; filename=FooFoo.csv");
context.Response.Cache.SetCacheability(HttpCacheability.Private);
var dc = new FooFooDataContext();
string result = dc.FooFooTable.ToCsv();
context.Response.Write(result);
}
public override bool RequiresAuthentication
{
get { return true; }
}
public override string ContentMimeType
{
get { return "text/comma-separated-values"; }
}
public override bool ValidateParameters(HttpContext context)
{
return true;
}
}
}
At the time I wrote this, I was writing how I wished the code would look. I didn't have a "ToCsv()" method, but I was showing my friend how I thought the code should be separated. Even better if there was a clean DAL (Data Access Layer) and Business Layer along with a service for turning things into CSV, but this isn't too bad. ToCsv() in this example is a theoretical extension method to take an IEnumerable of something and output it as CSV. I started writing it, but then decided to Google with Bing, and found a decent direction to start with at Mike Hadlow's blog. He didn't include all the code in the post, but it saved me some typing, so thanks Mike!
namespace FooFoo
{
public static class LinqToCSV
{
public static string ToCsv<T>(this IEnumerable<T> items)
where T : class
{
var csvBuilder = new StringBuilder();
var properties = typeof(T).GetProperties();
foreach (T item in items)
{
string line = string.Join(",",properties.Select(p => p.GetValue(item, null).ToCsvValue()).ToArray());
csvBuilder.AppendLine(line);
}
return csvBuilder.ToString();
}
private static string ToCsvValue<T>(this T item)
{
if(item == null) return "\"\"";
if (item is string)
{
return string.Format("\"{0}\"", item.ToString().Replace("\"", "\\\""));
}
double dummy;
if (double.TryParse(item.ToString(), out dummy))
{
return string.Format("{0}", item);
}
return string.Format("\"{0}\"", item);
}
}
}
This creates an extension method that lets me call something.ToCsv() on anything IEnumerable. It'll spin through the properties (yes, I know that could have been a LINQ statement also, but I like a nice ForEach sometimes. Feel free to fix this up in the comments! ;) ) and build up the Comma Separated Values.
At some point, it really should format Dates as {0:u}, but as of now it works identically to the before code and attempts to rectify most of the issues brought up. Of course, one could take something like this as far as they like and make it as robust as needed.
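In practice, calling it looks something like this. Person is a made-up type, and the extension methods are condensed from the listing above so the snippet stands alone:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

namespace FooFoo
{
    public class Person
    {
        public int Id { get; set; }
        public string Name { get; set; }
    }

    public static class LinqToCsvDemo
    {
        // Condensed from the ToCsv()/ToCsvValue() listing above.
        public static string ToCsv<T>(this IEnumerable<T> items) where T : class
        {
            var csvBuilder = new StringBuilder();
            var properties = typeof(T).GetProperties();
            foreach (T item in items)
            {
                csvBuilder.AppendLine(string.Join(",",
                    properties.Select(p => ToCsvValue(p.GetValue(item, null))).ToArray()));
            }
            return csvBuilder.ToString();
        }

        private static string ToCsvValue(object item)
        {
            if (item == null) return "\"\"";
            double dummy;
            if (!(item is string) && double.TryParse(item.ToString(), out dummy))
                return item.ToString(); // numbers: left unquoted
            return "\"" + item.ToString().Replace("\"", "\\\"") + "\""; // everything else: quoted
        }

        public static void Main()
        {
            var people = new List<Person>
            {
                new Person { Id = 1, Name = "Alice" },
                new Person { Id = 2, Name = "Bob, Jr." }
            };
            // Strings are quoted, so the comma in "Bob, Jr." stays in one field.
            Console.Write(people.ToCsv());
        }
    }
}
```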
As Mike points out, you can also do little projections to control the output:
string csv = things.Select(t => new { Id = t.Id, Name = t.Name, Date = t.Date, Child = t.Child.Name }).AsCsv();
Your thoughts, Dear Reader?
Related Links
- Good Exception Management Rules of Thumb
- Cargo-cult programming
- LINQ to CSV using DynamicObject
- Mike Hadlow's LINQ ToCsv()
- Recommended: Open Source "FileHelpers" for importing and exporting Fixed and Delimited files with .NET from Marco Meli
- Eric White on Linq to Text and Linq to CSV
- Reading data FROM CSV files using LINQ
- Think Linq
About Scott
Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.
Someone implement a comment voting system so I can get voted down, as this opinion will likely be very unpopular with Scott's readership :-)
String.Format...
vs
string line = string.Join(",",properties.Select(p => p.GetValue(item, null).ToCsvValue()).ToArray());
(PS Yes, I know LINQ and No, it's not always appropriate in real life)
Coincidentally, this morning I was just about to write a csv export in a WPF app but decided to check my feeds before starting work. Saved me an hour or so.
Now I have some serious ammunition if anyone has an issue with me browsing the web ;-)
Instead of LINQ to CSV do IDataReader to CSV
-Mike
Agreed
o Not a huge deal, but there's overhead in making a Page, and they're not using any of the things that make a Page a Page. The call to ClearContents is an attempt at telling the Page to back off. It's easier to just not have a page.
Agreed
* They're "thrashing" a bit in the Page_Load, basically "programming by coincidence."
Agreed
o They're not 100% sure of what needs to be done, so they're doing everything until it works. Even setting defaults many times, or calling methods that set properties and then setting those properties again.
Agreed
o This means a few things. First, HTTP is subtle. Second, the Response APIs are confusing (less so in .NET 4) and it's easy to do the same thing in two ways.
Agreed
* They're not using using() or IDisposable
Agreed
o They're cleaning up MemoryStreams and StreamWriters, but if an exception happens, things'll get cleaned up whenever. It's not tight in a "cleaning up after yourself" deterministic (when it needs to be) kind of way.
Agreed
o They're calling Flush() when it's not really needed, again programming by coincidence. "I think this needs to be done and it doesn't break..."
I always do this because I've been burned before with a file being closed without its full contents. Supposedly can't happen, but I've seen it in the past.
* Old school data access
DataSets are bad enough, because are you really sure that the contents of what you are selecting actually fits in the web server's memory (for all simultaneous accessors)? Of course you don't know that, so when you get busy your server is going to thrash like crazy and you won't know why.
LINQ is worse, because the objects are heavier and there is tons of Reflection.
Use a DataReader for performance.
o Not bad, per se, but it could be easier to write and easier to read. DataAdapters and DataSets are a hard way to do data access once you've used Linq to SQL, EF or NHibernate.
Easier to write, yes, but LINQ is NEVER easier to read. LINQ ensures that 50% of programmers will NEVER be able to read it. I'm not saying that LINQ doesn't have its place, but you better be sure that all of your programmers will be LINQ-aware. If you aren't 100% certain that that will always be the case, you probably shouldn't use LINQ, especially chains of 8-10 functions in a row.
LINQ is totally not debuggable. It either completely works or gets completely rewritten.
* Re-Throwing an Exception via "throw ex"
o When you want to re-throw an exception, ALWAYS just throw; or you'll lose your current call stack and it'll be hard to debug your code.
Agreed
* Not reusable at all
o Now, reuse isn't the end-all, but it's nice. If this programmer wants different kinds of exports, they'll need to extract a lot from the spaghetti.
Agreed
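Since a couple of commenters suggest a DataReader instead of LINQ, here is a hedged sketch of that approach. The quoting is simplified RFC 4180 style (doubling embedded quotes), and the helper names are made up. The point is that it streams one row at a time instead of materializing objects:

```csharp
using System;
using System.Data;
using System.IO;

// Streams rows straight from an IDataReader to a TextWriter,
// so only one row is ever held in memory.
public static class DataReaderCsv
{
    public static void WriteCsv(IDataReader reader, TextWriter output)
    {
        // Header row from the reader's schema.
        for (int i = 0; i < reader.FieldCount; i++)
        {
            if (i > 0) output.Write(',');
            output.Write(Quote(reader.GetName(i)));
        }
        output.WriteLine();

        // One row at a time; nothing is buffered beyond the current row.
        while (reader.Read())
        {
            for (int i = 0; i < reader.FieldCount; i++)
            {
                if (i > 0) output.Write(',');
                output.Write(reader.IsDBNull(i) ? "" : Quote(reader.GetValue(i).ToString()));
            }
            output.WriteLine();
        }
    }

    private static string Quote(string s)
    {
        // RFC 4180 style: double any embedded quotes.
        return "\"" + s.Replace("\"", "\"\"") + "\"";
    }
}
```

In a handler you would pass the reader from an ExecuteReader() call and context.Response.Output as the writer, so the response streams without ever holding the full result set in RAM.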
if (item is string)
{
return string.Format("\"{0}\"", item.ToString().Replace("\"", "\\\""));
}
Seriously? You are using string.Format to add quotes in a tight loop?
This adds StringBuilder creation overhead to EVERY CSV VALUE! That's a lot of StringBuilders! Use + instead. That makes only 2 strings for each. The original and the final.
if (double.TryParse(item.ToString(), out dummy))
{
return string.Format("{0}", item);
}
Again with the StringBuilders. Store item.ToString() in a variable first, say string itemAsString, and then do the double.TryParse. If it parses, just return itemAsString. If not, return "\"" + itemAsString + "\"".
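That suggestion might look like the following. This is a sketch of the commenter's idea, not a drop-in tested replacement:

```csharp
using System;

public static class CsvValueExtensions
{
    // The commenter's suggestion: call ToString() exactly once and
    // use concatenation instead of string.Format in the hot path.
    public static string ToCsvValue<T>(this T item)
    {
        if (item == null) return "\"\"";

        string itemAsString = item.ToString();

        if (item is string)
            return "\"" + itemAsString.Replace("\"", "\\\"") + "\"";

        double dummy;
        if (double.TryParse(itemAsString, out dummy))
            return itemAsString; // numeric: no quotes needed

        return "\"" + itemAsString + "\"";
    }
}
```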
PRMan - Hm...I F11 into LINQ statements all the time....you can set debug points even on segments of a LINQ statement...why do you say it's not debuggable?
Typically because of the function chaining, which seems to be all the rage.
And again, not trying to be a jerk here, you did a great job of pointing out things to look for when evaluating code, but I'm surprised at the lack of attention to performance aspects of something that could be very problematic depending on the size of the result set.
Now that I'm the one responsible for utilizing resources that roll in and out of projects my opinion has changed.
LINQ is like Regular Expressions. Half the coders get it, the other half is left scratching their heads at the voodoo magic.
In Scott's example, did he really make the code something easier for people to understand - or did he just demonstrate how he could accomplish something with the least amount of code?
Based on the comments from other readers, you've at least given folks something to ponder when choosing LINQ for an otherwise simple solution.
I've decided to treat it as a last resort option. This will make porting the code to Java much easier - Just kidding ;-)
However, my class inherits from CustomTypeDescriptor, where I create dozens of dynamic virtual properties (effectively pivoting my row based data). This is an interesting topic in itself but I won't explain further.
You can't get the virtual properties via reflection, so I wrote a few other similar methods that might help someone :
public static string CustomTypeDescriptorToCsvColumns(CustomTypeDescriptor instance)
{
return string.Join(",", GetTypeDescriptorProperties(instance).Select(p => p.Name.ToCsvValue()).ToArray());
}
public static List<PropertyDescriptor> GetTypeDescriptorProperties(CustomTypeDescriptor instance)
{
List<PropertyDescriptor> properties = new List<PropertyDescriptor>();
foreach (var p in instance.GetProperties())
{
properties.Add(p as PropertyDescriptor);
}
return properties;
}
public static string CustomTypeDescriptorToCsv<T>(this IEnumerable<T> items,string columnHeaderProperty) where T : CustomTypeDescriptor
{
string ret = string.Empty;
if (items.Count() > 0)
{
var csvBuilder = new StringBuilder();
List<PropertyDescriptor> properties = GetTypeDescriptorProperties(items.First());
foreach (T item in items)
{
string line = typeof(T).GetProperties().Where(p => p.Name.Equals(columnHeaderProperty)).First().GetValue(item, null).ToString();
line += string.Join(",", properties.Select(p => p.GetValue(item).ToCsvValue()).ToArray());
csvBuilder.AppendLine(line);
}
ret = csvBuilder.ToString();
}
return ret;
}
1. Any reason for using var vs StringBuilder when specifying csvBuilder's type? I assume it's just to save a few keystrokes since they are functionally equivalent in this use case.
2. The mixing of business logic/data access code is another source of code smell. Do you agree?
As for LINQ -- frankly, if a programmer is unable to read and debug a simple LINQ chain -- whether it includes built in predicates or custom delegates -- he/she should consider a new career path (and so should any developer who's chaining together 8-10 functions in one call). Seriously folks. This is what we do. And, it doesn't get much easier than LINQ. By its very nature it's succinct and concise, as opposed to the linguine-like code that's often written to achieve the same result.
And, regarding LINQ's performance, if one is coding for a high performance back end server application, obviously LINQ is not going to be the tool of choice. However, for the vast majority of business tier web applications, LINQ's performance is more than acceptable. Shoot, the performance of LINQ to XML blows the doors off the XmlDocument xpath calls (3x - 5x faster) that we've used up until now.
You have a collection called 'items'.
1. filter the items that meet a predicate (Where)
2. map the items to something else (Select)
3. ......
4. PROFIT!
BTW, this:
string line = string.Join(",",properties.Select(p => p.GetValue(item, null).ToCsvValue()).ToArray());
is kind of hard to read, for my taste anyway.
But all in all, a big improvement!
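For what it's worth, that line can be split into named steps without changing behavior. A sketch, where Person and the simplified ToCsvValue helper are made up for illustration:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Reflection;

public class ReadableJoin
{
    public class Person
    {
        public int Id { get; set; }
        public string Name { get; set; }
    }

    public static string BuildLine(object item)
    {
        PropertyInfo[] properties = item.GetType().GetProperties();

        // Step 1: project each property value to its CSV text.
        IEnumerable<string> values =
            properties.Select(p => ToCsvValue(p.GetValue(item, null)));

        // Step 2: join the values with commas.
        return string.Join(",", values.ToArray());
    }

    // Simplified stand-in for the post's ToCsvValue(): quote strings only.
    private static string ToCsvValue(object v)
    {
        return v is string ? "\"" + v + "\"" : v.ToString();
    }
}
```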
To those who objected to using LINQ above due to reflection concerns, using a custom DynamicObject implementation allows you to specify the call dispatching and avoid reflection. Also since it is streamed, you don't have to worry with the performance overhead of loading the thing into memory.
For those who object to the code thinking it is harder to read, The plumbing code may be harder, but once that is written, it should make the end implementation code (in LINQ?) easier to consume the plumbing. The same objection can be said for much of the Reactive Framework code out there now. Much of the nasty code is in setting up the event plumbing. Once that is set, consuming it using LINQ becomes much easier.
And for those of you who haven't learned LINQ yet, I could recommend a good book...
using System;
class F {
public static void Main () {
try {
int i = 0;
int j = 3 / i;
} catch {
throw;
}
}
}
compile this with '/debug'
using System;
class F {
public static void Main () {
try {
int i = 0;
int j = 3 / i; // line 6
} catch {
throw; // line 8
}
}
}
quiz: which line does the 1st frame of the the exception reference?
var product = customers
.Select(c => c.Requirements)
.Design()
.Implement()
.Verify()
.Maintain()
.First();
(Um, except of course that the execution of the above statements is actually iterative under the covers.)
Scott, I would prefer not to build the whole output at once in the StringBuilder. I would consider modifying ToCsvValue() to return IEnumerable<string> so that each row can be written immediately to the output stream. The idea here is that we're not building every bit of the results up in memory but getting rid of it as we build each row. Of course this depends on the stream to not do the same dastardly thing, but since we can't see what the stream is doing it must not matter, right? ;-)
I've renamed the method to AsCsvRows, it's called something like this:
foreach (string row in LinqToCSV.AsCsvRows(input))
{
context.Response.WriteLine(row);
}
And while my goal is reduced memory load, not brevity, the code is a little tighter:
public static IEnumerable<string> AsCsvRows<T>(this IEnumerable<T> items)
where T : class
{
var properties = typeof(T).GetProperties();
foreach (T item in items)
{
yield return string.Join(",", properties.Select(p => p.GetValue(item, null).ToCsvValue()).ToArray());
}
}
That's my two cents, anyway. I've left the string manipulation unchanged for comparison.
I started writing it, but then decided to Google with Bing, and found a decent direction to start with at Mike Hadlow's blog.
Is Google with Bing a phrase all MS employees use?
@Scott: Sure, in theory you're correct and it's better and all that, but come on, a DAL for a simple CSV export? Talk about overhead. Sometimes you just need something fast (both in terms of build time and performance) and that usually means deliberately throwing conventions out the window. As long as you realize you'll have to throw it away at some point in the near future, it's not a problem.
On the perf of Linq ... these kinds of things always bring up my rule of thumb that "programmer time is much more expensive than CPU time". That rule needs to be applied with care, but Linq saves so much programming time that I almost always find it worth the runtime cost, except in the most critical loops.
On the other hand - this code does seem to be a bit lazy about using too many individual string allocations in inner loops in ways that are easy to fix.
And this line here is just an outright crime - I am shocked to see someone like Scott use such a thing. Must have been a lapse.
> return string.Format("{0}", item);
Blech.
Otherwise very interesting post - I've always felt like CSV should be supported by the framework as natively as ToList() or ToArray() ...
1. You need to implement some kind of voting for comments since after Stackoverflow I have a +1 (or -1) reflex.
2. So +1 to Mike for mentioning this too :))
3. I agree to use just streams (a DbDataReader and the Response stream so you don't bloat the RAM (leave the buffering worries to the ASP.NET runtime)
4. Maybe the catch(Exception ex) { throw ex; } was a means to obfuscate debugging info :))))))
5. @PRMan : WTF is with the flush? I can't believe it till I see it.. Maybe it was something else..
6. I agree on the string.Format hate. I do hate it and rarely see its benefits
7. Regarding googling it with Bing... Can't we bing it with Bing? Or maybe binging it with Google?! :))))
Now shoot me :P
For better Excel (or SQL Server) support we could add column headers and use ';' as the separator.
Maybe a new isExcelMode parameter on the ToCsv method.
C# CODE:
public static string ToCsv<T>(this IEnumerable<T> items, bool isExcelMode)
where T : class
{
var csvBuilder = new StringBuilder();
var properties = typeof(T).GetProperties();
var separator = isExcelMode ? ";" : ",";
if (isExcelMode) //HEADER COLUMNS
csvBuilder.AppendLine(string.Join(separator, properties.Select(p => p.Name.ToCsvValue(isExcelMode)).ToArray()));
foreach (T item in items)
{
string line = string.Join(separator, properties.Select(p => p.GetValue(item, null).ToCsvValue(isExcelMode)).ToArray());
csvBuilder.AppendLine(line);
}
return csvBuilder.ToString();
}
private static string ToCsvValue<T>(this T item, bool isExcelMode)
{
var columnDelimiter = String.Empty;
if (!isExcelMode)
columnDelimiter = "\"";
if (item == null)
return String.Format("{0}{0}", columnDelimiter);
if (item is string)
{
return string.Format("{0}{1}{0}", columnDelimiter,
!isExcelMode ? item.ToString().Replace("\"", "\\\"") : item.ToString());
}
double dummy;
if (double.TryParse(item.ToString(), out dummy))
{
return string.Format("{0}", item);
}
return string.Format("{0}{1}{0}", columnDelimiter, item);
}
Tested with Excel 2003.
Bye
For an example, see http://codeplex.com/exporter
public static void ExportToCSV<T>(this IQueryable<T> list, string filename, string[] exclude )
You can see the full example here: http://www.codecapers.com/post/export-to-csv-using-reflection-and.aspx
Peter: Agreed. That's sloppy. You may have noticed I kind of slapped it together. ;)
Ignat: I'll check it out! As well as CSVHelper
Michael Ceranski: Nice...
OO was invented in the 70s. List comprehensions were invented in... the 70s.
Though I suppose OO is seven-odd years its senior, so we should all be programming in OO and completely ignoring everything else that has come out of Computer Science in the last four decades.
Just because Microsoft was late to the party doesn't mean the latest technology doesn't have a sound theoretical base or the benefit of decades of research and experience to draw on. Quite the contrary.
I really like the .vssettings of the editor you are using. Can you please share it, or let me know how to create interleaved colors in the editor?
Regards,
Ravi.
But Google with Bing seems to be too much an attempt to deteriorate the "Google Brand" on your very visible blog. Not trying to be too much a groupie, but while maybe? all in fun, it seems a bit malicious.
Such an approach would concentrate on the data rather than procedural code, provide reusable components, provide a flexible pipeline architecture, validate against strong types with XML schema (for XML output), be easy to debug with a viewable intermediate data format, and use a declarative model that leaves the underlying code to optimize performance.
It could also avoid the hardcoded values in the examples, as you would parameterize the stored procedure to make it more reusable and responsive to user choice.
I have generated Word 2003 XML documents in a similar way, and that schema is more difficult than the Excel 2003 schema.
Even better would be SQL Server exposing the final page directly through a RESTful interface, though :)
string line = string.Join(",",properties.Select(p => p.GetValue(item, null).ToCsvValue()).ToArray());
properties.Select(....) iterates the properties and creates an array of their values in which the call to string.Join(...) iterates the newly created array and joins the values. Wouldn't the following be better or is what I have below one of those, as the saying goes, "premature optimization is the root of all evil"?
... snip ...
foreach(T item in items) {
bool first = true;
foreach (var p in properties) {
if (!first) csvBuilder.Append(",");
else first = false;
csvBuilder.Append(p.GetValue(item, null).ToCsvValue());
}
csvBuilder.AppendLine();
}
.... snip ....
In effect you replace two inner loops with one; granted it is a little uglier to look at.
http://csharptutorial.com/blog/how-export-a-datatable-to-a-text-file/
Wouldn't it be much easier and, above all, flawless? Or did I miss something?
Here is an example: http://support.microsoft.com/kb/271572