Parsing CSVs and Poor Man's Web Log Analysis with PowerShell
I was messing around with some log files for the podcast and exported a CSV. There are lots of ways to get stats; I was just poking around.
Given a CSV format like this:
"File","Hits","Bandwidth"
"/hanselminutes_0026_lo.wma","78173","163625808"
"/hanselminutes_0076_robert_pickering.wma","24626","-1789110063"
"/hanselminutes_0077.wma","17204","1959963618"
"/hanselminutes_0076_robert_pickering.mp3","15796","-55874279"
"/hanselminutes_0078.wma","14832","-1241370004"
"/hanselminutes_0075.mp3","13685","-1840937989"
"/hanselminutes_0075.wma","12129","1276597408"
"/hanselminutes_0078.mp3","11058","-1186433073"
Notice that there are different file extensions but one logical show: for example, there's an MP3 of Show 76 and a WMA of that same show. I wanted to find out how many downloads each show got, across all formats.
I fired up PowerShell. First I can bring in the CSV file and notice that it's parsed into a PowerShell object list nicely:
PS C:\> import-csv File.csv | Select File,Hits
File Hits
---- ----
/hanselminutes_0026_lo.wma 78173
/hanselminutes_0076_robert_pickering.wma 24626
/hanselminutes_0077.wma 17204
/hanselminutes_0076_robert_pickering.mp3 15796
/hanselminutes_0078.wma 14832
/hanselminutes_0075.mp3 13685
/hanselminutes_0075.wma 12129
/hanselminutes_0078.mp3 11058
Notice that there are headings in this output? PowerShell's Import-Csv is smart enough to infer the headers and positions and split things up. After the first | pipe I don't ever have to think about CSV again.
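If you're curious what Import-Csv actually handed you, pipe it to Get-Member: each row is an object whose properties are named after the CSV's columns. Roughly (output trimmed):
PS C:\> import-csv File.csv | Get-Member -MemberType NoteProperty

   TypeName: System.Management.Automation.PSCustomObject

Name      MemberType   Definition
----      ----------   ----------
Bandwidth NoteProperty System.String Bandwidth=163625808
File      NoteProperty System.String File=/hanselminutes_0026_lo.wma
Hits      NoteProperty System.String Hits=78173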
I'm going to need a new column, though, called Show with the Show number in it. I'll probably glean it from the Filename. Just to get started, I'll make a new "calculated column."
PS C:\> import-csv file.csv | select File, Hits, @{Name="Show";Expression={ $_.File } }
File Hits Show
---- ---- ----
/hanselminutes_0026_lo.wma 78173 /hanselminutes_0026_lo.wma
/hanselminutes_0076_robert_pickering... 24626 /hanselminutes_0076_robert_pickering...
/hanselminutes_0077.wma 17204 /hanselminutes_0077.wma
/hanselminutes_0076_robert_pickering... 15796 /hanselminutes_0076_robert_pickering...
/hanselminutes_0078.wma 14832 /hanselminutes_0078.wma
/hanselminutes_0075.mp3 13685 /hanselminutes_0075.mp3
/hanselminutes_0075.wma 12129 /hanselminutes_0075.wma
/hanselminutes_0078.mp3 11058 /hanselminutes_0078.mp3
/hanselminutes_0077.mp3 11003 /hanselminutes_0077.mp3
/hanselminutes_0074.mp3 6494 /hanselminutes_0074.mp3
Except you can see that while I have a new column called "Show" it's just got the File duplicated. Notice the format for making a new Column:
select File, Hits, @{Name="Show";Expression={ $_.File } }
Inside that { } scriptblock I need to do my work to get a show number.
It's a classic joke, so I'll say it again: "So you've got a problem and you've decided to use Regular Expressions to solve it. Now you've got two problems."
The show naming format has changed a bit over the years: some files have leading zeros, some have names at the end. It's organic. So, in front of the pipeline I add a Regular Expression. You can make one easily in PowerShell by putting [regex] in front of a string. The ; is just a separator between statements; the first statement is simply the assignment of a RegEx to a variable called $re. Later, in my scriptblock, I apply that RegEx to $_.File (remember that $_ is the equivalent of "this" in PowerShell, so $_.File is this.File).
The @{Key1="Value1";Key2="Value2"} format is how you express a Hashtable to PowerShell.
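You can poke at one right at the prompt to see how it hangs together (just a sketch):
PS C:\> $col = @{Name="Show"; Expression={ $_.File }}
PS C:\> $col.Name
Show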
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } }
File Hits Show
---- ---- ----
/hanselminutes_0026_lo.wma 78173 26
/hanselminutes_0076_robert_pickering... 24626 76
/hanselminutes_0077.wma 17204 77
/hanselminutes_0076_robert_pickering... 15796 76
/hanselminutes_0078.wma 14832 78
/hanselminutes_0075.mp3 13685 75
/hanselminutes_0075.wma 12129 75
/hanselminutes_0078.mp3 11058 78
/hanselminutes_0077.mp3 11003 77
/hanselminutes_0074.mp3 6494 74
The RegEx "\d{2}(?=[_.])" says "find me the first two digits that sit immediately to the left of either a '_' or a '.'" It's not foolproof, and it'll break at show 100, but this is a Poor Man's Solution.
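You can sanity-check the pattern at the prompt before trusting it in the pipeline (a quick sketch against one of the filenames above):
PS C:\> $re = [regex]"\d{2}(?=[_.])"
PS C:\> $re.Matches("/hanselminutes_0076_robert_pickering.mp3")[0].Value
76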
Next, I'll sort by Show descending then group all the like shows together.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } |
sort Show -desc | group Show
Count Name Group
----- ---- -----
9 79 {@{File=/Hanselminutes_0079.wma; Hits=27; Show=79},...
12 78 {@{File=/hanselminutes_0078.m4b.torrent; Hits=18; Show=78},...
12 77 {@{File=/hanselminutes_0077.wma.torrent; Hits=52; Show=77},...
18 76 {@{File=/hanselminutes_76.mp3; Hits=1; Show=76}, ...
11 75 {@{File=/hanselminutes_0075_lo.wma; Hits=468; Show=75},...
In this listing I can see via Count that there were 9 different formats of Show 79 downloaded in this time period. However, the good data is trapped in the Group column. All my previous line items are grouped in there. I need to sum up all the Hits for all the downloads of the 9 different formats of Show 79...and do the same for every other show.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } |
sort Show -desc | group Show |
select Name,{($_.Group | Measure-Object -Sum Hits).Sum }
Name ($_.Group | Measure-Object -Sum Hits).Sum
---- ------------------------------------------
79 9205
78 27575
77 29807
76 42798
75 27174
74 13060
73 10532
72 10145
71 9826
70 8745
69 8065
68 8132
67 7024
66 8535
65 13041
Now, I select out Name and then take each item's Group (remember $_ is this, so $_.Group means "the current item's Group"). Each group goes through Measure-Object, which has options like -Sum, -Average, and so on. We grab up the Hits and Sum them.
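Measure-Object is handy all by itself, too. A trivial sketch at the prompt:
PS C:\> (1,2,3,4 | Measure-Object -Sum).Sum
10
PS C:\> (1,2,3,4 | Measure-Object -Average).Average
2.5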
But the Column Header looks silly. I want a pretty name.
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } |
sort Show -desc | group Show |
select Name,
@{Name="Hits";Expression={($_.Group | Measure-Object -Sum Hits).Sum }}
I just make a Hashtable with the Name "Hits" and put the scriptblock in a key called "Expression." Add some more sorting and:
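Tacking sort Hits -desc onto the end handles that sorting; the whole pipeline ends up something like this, and gives the table below:
PS C:\> $re =[regex]"\d{2}(?=[_.])"; import-csv file.csv |
select File, Hits, @{Name="Show";Expression={$re.matches($_.File)[0] } } |
sort Show -desc | group Show |
select Name, @{Name="Hits";Expression={($_.Group | Measure-Object -Sum Hits).Sum }} |
sort Hits -desc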
Name Hits
---- ----
26 78673
76 42798
77 29807
78 27575
75 27174
74 13060
65 13041
73 10532
Now I can see that show 76 got 42798 downloads in this time period and the classic show 26 did nicely with 78673.
Is this representative of the way you SHOULD do things in PowerShell? No, but it's a transcript of a real-life problem being put together at the command line. It took less time than it would take to write a C# program, and it's easily modified ad hoc. Here's what Lee Holmes said he would have done, accomplishing the same thing not as a one-liner but as a proper PowerShell function:
function Get-ShowHits
{
    $regex = '/hanselminutes_(\d+).*'
    $shows = Import-Csv File.csv | Select File,Hits |
        Group { $_.File -replace $regex,'$1' }

    foreach($show in $shows)
    {
        $showOutput = New-Object System.Management.Automation.PsObject
        $showOutput | Add-Member NoteProperty Name $show.Name
        $showOutput | Add-Member NoteProperty Hits ($show.Group | Measure-Object -Sum Hits).Sum
        $showOutput
    }
}

Get-ShowHits | Sort -Desc Hits
Both are great examples of why PowerShell rocks my world. When are you going to try it? I swear, everyone has "been meaning to try PowerShell but I haven't had the time." Just do it.
Just out of interest, what is your 'IDE' for PowerShell, and what add-ons/extensions do you have installed?
.. Ken
So, in line with the final comment in your post, I just did it! But now I am trying for the 1000th time to truly grok regular expressions, sigh...
Cheers,
Robert Porter
BTW, the PowerShell team is doing usability testing on the next version of PowerShell. Sure enough, they have a remoting capability... I wish I could get my hands on it.
I have a command-line program that will output text to the screen. I want to parse values from that output. This seems like one of the most basic operations that PowerShell would thrive at, but I'm having no luck.
To be more concrete, I'm trying to get a list of changesets for a directory in my VSTS source code repository, and then do something with each of them. I'm using the tfpt (TF Power Toys) history command:
tfpt history /recursive /stopafter:3 MyRootFolder
The output (from cmd.exe or powershell) is:
Changeset User Date Comment
--------- ------- --------- -------
1234 someone 11/1/2007 something
1235 someone 11/1/2007 something
1242 someone 11/1/2007 something
I want to get 1234, 1235, 1242 in some kind of data structure that will allow me to loop over them and do something with each changeset.
I started down the road of using a regex: ^\d+
but I cannot figure out how to easily apply it to the output.
I know this has to be easy. Make me look stupid, please.
tfpt history /recursive /stopafter:3 MyRootFolder | select Changeset | foreach-object { dosomething.exe $_ }
where $_ is the changeset number. You could also do:
tfpt history /recursive /stopafter:3 MyRootFolder | foreach-object { dosomething.exe $_.Changeset }
17> $pattern = '^(?<changeset>\d+)\s+(?<user>\w+)\s+(?<date>[^ ]+)\s+(?<comment>.*)$'
18> $changesets = @()
19> tfpt history /recursive /stopafter:3 MyRootFolder | foreach {if ($_ -match $pattern) {$changesets += $matches}}
20> $changesets
Name Value
---- -----
date 11/1/2007
user someone
changeset 1234
comment something
0 1234 someone 11/1/2007 something
date 11/1/2007
user someone
changeset 1235
comment something
0 1235 someone 11/1/2007 something
date 11/1/2007
user someone
changeset 1242
comment something
0 1242 someone 11/1/2007 something
In this example I initialize an empty array called $changesets. Then I do a regex match against each line output by TFPT. The lines that match cause the automatic $matches variable to be filled in with all the named captures; I just append that instance of $matches to the $changesets array. There are other ways to do this, including creating a custom PSObject and then dynamically adding properties to it. However, the approach above is somewhat easier and suits the need, I think.
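For the record, that custom-object alternative would look roughly like this (same $pattern as above; each matching line becomes an object with real properties instead of a hashtable):
tfpt history /recursive /stopafter:3 MyRootFolder | foreach {
    if ($_ -match $pattern) {
        $cs = New-Object System.Management.Automation.PsObject
        $cs | Add-Member NoteProperty Changeset $matches.changeset
        $cs | Add-Member NoteProperty User $matches.user
        $cs | Add-Member NoteProperty Date $matches.date
        $cs | Add-Member NoteProperty Comment $matches.comment
        $cs
    }
}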
BTW I pester the TFS team (Brian Harry, James Manning, etc) for PowerShell support (TFS cmdlets and providers) whenever I get the opportunity. I suggest that you do the same - if you get the chance. :-)
You're so right; I haven't had the time for about a year now!
I always use LogParser from Microsoft to mess with these types of files. It gives you SQL-like expressions, plus the ability to handle large files very quickly and to parse multiple log types (including plain CSV or XML files).
http://www.microsoft.com/downloads/details.aspx?FamilyID=890cd06b-abf8-4c25-91b2-f8d975cf8c07&displaylang=en
I have to admit that the last time I was doing this I couldn't help thinking "why am I not in Powershell", but LogParser has always proven to be a very fast parser for me on large log type files.
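For this particular CSV, the LogParser version would be something along these lines (a sketch from memory; TO_INT forces a numeric rather than alphabetic sort):
C:\> LogParser -i:CSV "SELECT File, TO_INT(Hits) AS Downloads FROM File.csv ORDER BY Downloads DESC"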
That said, it's good to have a few tools in your belt to be able to pull out and do stuff like this. It strikes me that if you make the effort to learn PowerShell, then you are always carrying around a really nice Swiss Army knife.