Unix Fight! - Sed, Grep, Awk, Cut and Pulling Groups out of a PowerShell Regular Expression Capture
There's a wonderful old programmer's joke I've told for years:
"You've got a problem, and you've decided to use regular expressions to solve it.
Ok, now you've got two problems..."
A friend of mine was talking on a social network and said something like:
"That decade I spent in the Windows world stunted my growth. one teeny-tiny unix command grabbed certain values from an XML doc for me."
Now, of course, I took this immediately as a personal challenge and rose up in a rit of fealous jage and defended my employer. Nah, not really, as I worked at Nike on Unix for a number of years and I get the power of sed and awk and whatnot. However, he said XML, and well, PowerShell rocks XML.
Because it's a dynamic language, you can refer to XML nodes just like this:
$a = ([xml](new-object net.webclient).downloadstring("http://feeds.feedburner.com/Hanselminutes"))
$a.rss.channel.item
The first line gets the feed and the second line gets all the items.
However, turns out my friend was actually trying to retrieve values within poorly-formed XML fragments within a larger SQL dump file. There's three kinds of XML. Well-formed, valid, and crap. He was sifting through crap for some values. Basically he had this crazy text file with some fragments of XML within it and wanted the values in-between elements: "<FancyPants>He wants this value</FancyPants>."
Something like this:
grep "<FancyPants>.*<.FancyPants>" test.txt | sed -e "s/^.*<FancyPants/<FancyPants/" | cut -f2 -d">"| cut -f1 -d"<" > fancyresults.txt
I'm old, but I'm not an expert in grep and sed so I'm sure there are ways he could have done it more tersely. There always are, right? With regular expressions, sometimes someone just types $@($*@)$(*@)(@*)@*(%@%# and Shakespeare pops out. You never know.
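For what it's worth, here is one way the Unix version might be tightened up, letting sed do both the filtering and the extracting. Treat it as a sketch under assumptions: at most one value per line, values that never span lines, a made-up sample in place of the real dump, and a function name (extract_fancypants) that I invented for illustration.

```shell
# Hypothetical sample standing in for test.txt: junk with one fragment per line.
sample='INSERT INTO t VALUES (1, "<FancyPants>He wants this value</FancyPants>");
no fragment on this line
junk <FancyPants>another value</FancyPants> more junk'

# extract_fancypants is an illustrative name, not a standard tool.
# -n suppresses sed's default output; the trailing p prints only the lines
# where the substitution succeeded, so no separate grep or cut is needed.
extract_fancypants() {
  sed -n 's/.*<FancyPants>\(.*\)<\/FancyPants>.*/\1/p'
}

printf '%s\n' "$sample" | extract_fancypants
```

Against a real file that would be something like extract_fancypants < test.txt > fancyresults.txt.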
There are also a lot of different ways to do this in PowerShell, but since he used RegExes, who am I to disagree?
First, here's the one-line answer.
cat test.txt | foreach-object {$null = $_ -match '<FancyPants>(?<x>.*)<.FancyPants>'; $matches.x}
But I thought I'd also sort them, remove duplicates...
cat test.txt | foreach-object {$null = $_ -match '<FancyPants>(?<x>.*)<.FancyPants>'; $matches.x} | sort | get-unique
But foreach-object can be aliased as % and get-unique can be just "gu" so the final answer is:
cat test.txt | % {$null = $_ -match '<FancyPants>(?<x>.*)<.FancyPants>';$matches.x} | sort | gu
I think we can agree that they are both hard to read. I still love PowerShell.
Related Links:
- Awesome Visual Studio Command Prompt and PowerShell icons with Overlays
- Hanselminutes Podcast 190: The State of Powershell with Lee Holmes and Jason Shirk
- Download Podcasts with Powershell
- Batch Converting a Directory Tree of Videos Recursively with Handbrake for Streaming to an Xbox360
- Parsing CSVs and Poor Man's Web Log Analysis with PowerShell
About Scott
Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.
gci . *.csproj -rec | select-string "<HintPath>(.*)</HintPath>" | % { $_.Matches } | % { $_.Groups[1].Value }
Someone posts a question asking how to do something on the command line, and the authors try to do it using Unix commands, the Windows command line, and PowerShell. Some of them are really neat.
1. Sort-Object has a -Unique parameter, so you don't actually need Get-Unique there.
2. You can save the ForEach-Object in there by just running replace over every line that matches the pattern:
(gc test.txt) -match '<(FancyPants)>.*</\1>' -replace '.*<FancyPants>|</FancyPants>.*' | sort -u
Ok, it's not much simpler, since you need to filter the lines that match first (akin to the grep and sed from before). Oh, well.
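The same filter-then-strip shape maps back onto the Unix tools fairly directly. A rough sketch, assuming one value per line and using an invented sample and an invented function name (unique_fancypants); LC_ALL=C is only there to make the sort order predictable:

```shell
# Hypothetical sample input with a duplicate value and a non-matching line.
sample='x <FancyPants>beta</FancyPants> y
nothing here
<FancyPants>alpha</FancyPants>
z <FancyPants>beta</FancyPants>'

# unique_fancypants is an illustrative name.
# grep keeps only lines that contain a complete element (like -match),
# sed strips everything up to the opening tag and from the closing tag on
# (like -replace), and sort -u sorts and removes duplicates in one step.
unique_fancypants() {
  grep '<FancyPants>.*</FancyPants>' |
    sed -e 's/.*<FancyPants>//' -e 's/<\/FancyPants>.*//' |
    LC_ALL=C sort -u
}

printf '%s\n' "$sample" | unique_fancypants
```

As with the PowerShell version, the grep step matters: without it, lines with no element would pass through sed untouched and end up in the results.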
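One catch with the one-liner: $matches keeps its value from the last successful match, so a line that doesn't match prints the previous line's value again. Guarding with an if avoids that:
% { if ($_ -match '<FancyPants>(?<x>.*)<.FancyPants>') { $matches.x } }
Instead of using a -match and a -split, or a cat / gc and a -match, you can use select-string: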
select-string "(?<=(FancyPants)>).*(?=</\1)" '.\test.txt' | %{$_.matches[0].value} | sort -u
Though if you'll be maintaining it, the spelled-out version is easier on whoever reads it next:
select-string -pattern "(?<=(FancyPants)>).*(?=</\1)" -path '.\test.txt' | foreach-object {$_.matches[0].value} | sort-object -unique
gawk 'match($0, /.*<FancyPants>(.*)<\/FancyPants>.*/, a) {print a[1]}' test.txt
At work I use it all the time to search through the code:
dir -Recurse -Filter *.cs | ss "something"
If the content of the file is the following (all in one line):
<FancyPants>He wants this value</FancyPants><FancyPants>and this value</FancyPants>
the result of the command would be:
He wants this value</FancyPants><FancyPants>and this value
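That happens because .* is greedy and runs to the last closing tag on the line. In a .NET regex the usual fix is the lazy form, .*?, and on the Unix side one workaround is to match anything except "<" between the tags. A sketch, assuming the values themselves never contain "<" and that your grep supports -o (GNU and BSD grep do, but it isn't in POSIX); each_fancypants is a name I made up:

```shell
# The single-line input from above, with two elements side by side.
line='<FancyPants>He wants this value</FancyPants><FancyPants>and this value</FancyPants>'

# each_fancypants is an illustrative name.
# grep -o prints every match on its own line; [^<]* cannot run past the
# first closing tag, so each element is matched separately.
# sed then deletes the tags, leaving only the values.
each_fancypants() {
  grep -o '<FancyPants>[^<]*</FancyPants>' |
    sed 's/<[^>]*>//g'
}

printf '%s\n' "$line" | each_fancypants
```

This prints the two values on separate lines instead of one merged string.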
$content = [io.file]::ReadAllText('c:\test.txt'); $content -match '(?is)...'; ...
to kick the regex into Singleline mode. It's criminal IMO that Get-Content doesn't have a parameter to force reading the file as a single string. It would be very useful for situations like this where a pattern to be matched can span multiple lines.
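The Unix tools are line-oriented too, so they have the same blind spot. One rough way around it is to flatten the input into a single line before matching. This is a sketch with obvious caveats: it turns the newlines inside a value into spaces, it assumes values contain no "<", it relies on grep -o (GNU and BSD, not POSIX), and it treats the whole file as one line, which may not suit a huge dump. The sample and the name multiline_fancypants are invented for illustration.

```shell
# Hypothetical sample where the first value spans two lines.
sample='junk <FancyPants>spans
two lines</FancyPants> junk
<FancyPants>one line</FancyPants>'

# multiline_fancypants is an illustrative name.
# tr joins everything into one line (newlines become spaces), grep -o then
# pulls out each element, and sed strips the tags.
multiline_fancypants() {
  tr '\n' ' ' |
    grep -o '<FancyPants>[^<]*</FancyPants>' |
    sed 's/<[^>]*>//g'
}

printf '%s\n' "$sample" | multiline_fancypants
```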
Also note that file paths do *not* need to be quoted in PowerShell unless there's a space or a special PowerShell character you need to escape.
Finally, I would use select-string for this, but the result is longer:
select-string '<FancyPants>(?<x>.*)<.FancyPants>' test.txt | %{$_.matches} | %{$_.Groups[1].value} | sort -u
Thanks, as always, for another superb article that schooled me on techniques I will now use and claim as my own.