Unix Fight! - Sed, Grep, Awk, Cut and Pulling Groups out of a PowerShell Regular Expression Capture

August 02, 2011. Posted in PowerShell

There's a wonderful old programmers' joke I've told for years:

"You've got a problem, and you've decided to use regular expressions to solve it.

Ok, now you've got two problems..."

A friend of mine was talking on a social network and said something like:

"That decade I spent in the Windows world stunted my growth. One teeny-tiny Unix command grabbed certain values from an XML doc for me."

Now, of course, I took this immediately as a personal challenge and rose up in a rit of fealous jage and defended my employer. Nah, not really, as I worked at Nike on Unix for a number of years and I get the power of sed and awk and whatnot. However, he said XML, and well, PowerShell rocks XML.

Because it's a dynamic language, you can refer to XML nodes just like this:

$a = ([xml](new-object net.webclient).downloadstring("http://feeds.feedburner.com/Hanselminutes"))
$a.rss.channel.item

The first line gets the feed and the second line gets all the items.

However, it turns out my friend was actually trying to retrieve values from poorly-formed XML fragments inside a larger SQL dump file. There are three kinds of XML: well-formed, valid, and crap. He was sifting through crap for some values. Basically he had this crazy text file with some fragments of XML in it and wanted the values in between the element tags: "<FancyPants>He wants this value</FancyPants>".

Something like this:

grep "<FancyPants>.*<.FancyPants>" test.txt | sed -e "s/^.*<FancyPants/<FancyPants/" | cut -f2 -d">"| cut -f1 -d"<" > fancyresults.txt

I'm old, but I'm not an expert in grep and sed, so I'm sure there are ways he could have done it more tersely. There always are, right? With regular expressions, sometimes someone just types $@($*@)$(*@)(@*)@*(%@%# and Shakespeare pops out. You never know.
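
For what it's worth, here is one way the grep/sed/cut chain might collapse into a single sed call. This is only a sketch: extract_fancy is a made-up helper name, the sample lines are invented stand-ins for test.txt, and it assumes at most one FancyPants element per line.

```shell
# Hypothetical helper: print the text between <FancyPants> and </FancyPants>
# for every input line that contains the pair. Reads the files named as
# arguments, or standard input when none are given.
extract_fancy() {
  # -n suppresses sed's default output; the trailing p prints only the
  # lines where the substitution succeeded, so no separate grep is needed.
  sed -n 's/.*<FancyPants>\(.*\)<\/FancyPants>.*/\1/p' "$@"
}

# Made-up sample data standing in for test.txt.
printf '%s\n' \
  'INSERT INTO t VALUES (1, "<FancyPants>He wants this value</FancyPants>");' \
  'a line with no XML at all' \
  'junk <FancyPants>and this one</FancyPants> more junk' |
  extract_fancy
```

With the sample input this prints "He wants this value" and "and this one", one per line. Running extract_fancy test.txt > fancyresults.txt would play the role of the longer pipeline.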

There's also a lot of different ways to do this in PowerShell, but since he used RegExes, who am I to disagree?

First, here's the one-line answer.

cat test.txt | foreach-object {$null = $_ -match '<FancyPants>(?<x>.*)<.FancyPants>'; $matches.x}

But I thought I'd also sort them, remove duplicates...

cat test.txt | foreach-object {$null = $_ -match '<FancyPants>(?<x>.*)<.FancyPants>'; $matches.x} | sort | get-unique
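
On the Unix side, the same sort-and-dedupe step is usually a single extra command. A rough sketch, where unique_fancy is a hypothetical helper name and the sample lines are invented:

```shell
# Hypothetical helper: extract the FancyPants values, then sort them and
# drop duplicates in one step with sort -u.
unique_fancy() {
  sed -n 's/.*<FancyPants>\(.*\)<\/FancyPants>.*/\1/p' "$@" | sort -u
}

# Made-up sample input containing a repeated value.
printf '%s\n' \
  '<FancyPants>pear</FancyPants>' \
  '<FancyPants>apple</FancyPants>' \
  'no match on this line' \
  '<FancyPants>pear</FancyPants>' |
  unique_fancy
```

With the sample input this prints "apple" and then "pear".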

But foreach-object can be aliased as % and get-unique can be just "gu" so the final answer is:

cat test.txt | % {$null = $_ -match '<FancyPants>(?<x>.*)<.FancyPants>';$matches.x} | sort | gu

I think we can agree that they are both hard to read. I still love PowerShell.

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

Comment on this post [15]

August 02, 2011 1:47

cat itself is an alias for get-content, which also has the standard alias "gc" - you can shave off one more character :)

Manoj Mahalingam

August 02, 2011 1:47

don't forget select-string

gci . *.csproj -rec | select-string "<HintPath>(.*)</HintPath>" | % { $_.Matches } | % { $_.Groups[1].Value }

fschwiet

August 02, 2011 1:47

In a similar vein, if you've never checked out the Command Line Kung Fu blog, I highly recommend it.

Someone posts a question asking how to do something on the command line, and the authors try to do it using Unix commands, the Windows command line, and PowerShell. Some of them are really neat.

m0nastic

August 02, 2011 2:08

Please note that the regex you posted is not for Shakespeare, but for Samuel Taylor. The regex for Shakespeare has an additional pair of parentheses and a question mark.

LasseVK

August 02, 2011 2:55

There are a few things you can use to simplify this:

1. Sort-Object has a -Unique parameter, so you don't actually need Get-Unique there.
2. You can save the ForEach-Object in there by just running replace over every line that matches the pattern:


(gc test.txt) -match '<(FancyPants)>.*</\1>' -replace '.*<FancyPants>|</FancyPants>.*' | sort -u

Ok, it's not much simpler, since you need to filter the lines that match first (akin to the grep and sed from before). Oh, well.

Johannes

August 02, 2011 3:06

One comment that this crowd has not mentioned...maybe because it is not fact but rather opinion, and this crowd is smart enough to only speak of facts. But...IMO the PowerShell version is much easier to read and therefore (theoretically) easier to maintain.

ebohling

August 02, 2011 4:53

It's a great quote - I use it myself - but credit for it goes to Jamie Zawinski: http://regex.info/blog/2006-09-15/247

JohnW

August 02, 2011 7:57

A few other quotes from jwz (outside of jwz.org and @jwz on twitter, of course) can be found @ http://en.wikiquote.org/wiki/Jamie_Zawinski

James Manning

August 02, 2011 14:59

The original example will need to be de-duplicated because $matches is not reset on the lines where -match finds nothing, so after a match is found, every line will return something. It's better to have

  % {If ($_ -match '<FancyPants>(?<x>.*)<.FancyPants>') { $matches.x}}

Instead of using a -match and a -split, or a cat / gc and a -match, you can use select-string:

select-string "(?<=(FancyPants)>).*(?=</\1)" '.\test.txt' | %{$_.matches[0].value} | sort -u

Though if you'll be maintaining it, spell it out in full:

select-string -pattern "(?<=(FancyPants)>).*(?=</\1)" -path '.\test.txt' | foreach-object {$_.matches[0].value} | sort-object -unique

James O'Neill

August 02, 2011 15:19

Using gawk gives you access to the regex groups on Linux:

gawk 'match($0, /.*<FancyPants>(.*)<\/FancyPants>.*/, a) {print a[1]}' test.txt

Dean Jones

August 02, 2011 16:50

Select-String can also be abbreviated to "ss".

At work I use it all the time to search through the code:

dir -Recurse -Filter *.cs | ss "something"

Andreas

August 02, 2011 18:35

This doesn't work if the desired XML node appears more than once on the same line.

If the content of the file is the following (all on one line):
<FancyPants>He wants this value</FancyPants><FancyPants>and this value</FancyPants>

the result of the command would be:
He wants this value</FancyPants><FancyPants>and this value

Buma
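
The greedy match described in the comment above also bites the Unix pipelines, since .* runs to the last closing tag on the line. One possible workaround, sketched here under some assumptions: the grep in use supports -o (GNU and BSD grep do, but it is not in POSIX), and the values never contain a "<" character. The helper name extract_all_fancy is made up for illustration.

```shell
# Hypothetical helper: print every FancyPants value from standard input,
# even when several elements share one line.
# Assumes the values themselves contain no '<' character.
extract_all_fancy() {
  # grep -o prints each match on its own line; [^<]* cannot run past the
  # next tag, so adjacent elements are matched separately. sed then strips
  # the surrounding tags from each match.
  grep -o '<FancyPants>[^<]*</FancyPants>' |
    sed 's/<FancyPants>\(.*\)<\/FancyPants>/\1/'
}

# The one-line input from the comment above.
printf '%s\n' \
  '<FancyPants>He wants this value</FancyPants><FancyPants>and this value</FancyPants>' |
  extract_all_fancy
```

With that input the sketch prints "He wants this value" and "and this value" on separate lines.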

August 03, 2011 1:01

@Buma, true. In that case you'd want something like

$content = [io.file]::ReadAllText('c:\test.txt'); $content -match '(?is)...'; ...

to kick the regex into Singleline mode. It's criminal IMO that Get-Content doesn't have a parameter to force reading the file as a single string. It would be very useful for situations like this where a pattern to be matched can span multiple lines.

Also note that file paths do *not* need to be quoted in PowerShell unless there's a space or a special PowerShell character you need to escape.

Finally, I would use select-string for this, but the result is longer:

select-string '<FancyPants>(?<x>.*)<.FancyPants>' test.txt | %{$_.matches} | %{$_.Groups[1].value} | sort -u

Keith Hill

August 03, 2011 22:50

Good grief man, when do you ever sleep?

Thanks, as always, for another superb article that schooled me on techniques I will now use and claim as my own.

Stuart Thompson

August 28, 2011 15:27

Great article; you have indeed covered the topic in detail with samples. I have also blogged about my experience as "10 examples of grep command in unix". Let me know what you think of it. Thanks.

javarevisited.blogspot.com
