The Weekly Source Code 9 - WideFinder Edition
In my ongoing quest to read source code to become a better developer, I now present the ninth in an infinite number of posts in a weekly series called "The Weekly Source Code." Here's some source I'm reading this week that I enjoyed.
Ya, I know this one is just 4 days after the last one, but I was having too much fun and couldn't wait until Wednesday. Plus, it's a new week so poop on you.
Last month Tim Bray shared his experiences writing a program that does, well, here's Joe Cheng's succinct description. Tim calls the project WideFinder.
The Wide Finder challenge is to write a program that:
1. Scans logfiles for hits on blog articles
2. which are counted and
3. sorted with the
4. top 10 most popular being printed to stdout. It should also
5. be about as elegant and concise as Tim's Ruby version and
6. its performance should scale with additional CPU cores.
And this is done on a fairly large log file of about 250 megs. While item #6 is the most interesting, many folks are focusing on item #5. Either way, it's a heck of a lot more interesting problem than FizzBuzz and worth adding to your interview arsenal.
I encourage you to go check out Tim's site as he's continued to list the sources that he finds most interesting. As a primarily C# programmer who's always trying to stretch out of my comfort zone, here's what I've found interesting, in the order I found them interesting.
- Don Box's Naive Implementation in C# 3.0 - Apparently this is the kind of code Don can write after two beers. Notice the use of yield to make this "LINQ over a text file of CR/LF strings." That's one of those write-it-constantly-over-and-over-again helper methods that makes me wonder why it wasn't just included.
```csharp
static void Main(string[] args)
{
    var regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)");

    var grouped = from line in ReadLinesFromFile(@"C:\temp\bray.txt")
                  let match = regex.Match(line)
                  where match.Success
                  let url = match.Value
                  group url by url;

    var ordered = from g in grouped
                  let count = g.Count()
                  orderby count descending
                  select new { Count = count, Key = g.Key };

    foreach (var item in ordered.Take(10))
        Console.WriteLine("{0}: {1}", item.Count, item.Key);
}

// LINQ-compatible streaming I/O helper
public static IEnumerable<string> ReadLinesFromFile(string filename)
{
    using (StreamReader reader = new StreamReader(filename))
    {
        while (true)
        {
            string s = reader.ReadLine();
            if (s == null)
                break;
            yield return s;
        }
    }
}
```
- Another C# take, this one with a compiled, culture-invariant regex, a custom LineReader, and a group capture for just the article path:

```csharp
using (new QuickTimer("Total time"))
{
    IEnumerable<string> data = new LineReader(args[0]);
    Regex regex = new Regex(@"GET /ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/([^ ]+) ",
        RegexOptions.Compiled | RegexOptions.CultureInvariant);

    var result = from line in data
                 let match = regex.Match(line)
                 where match.Success
                 group match by match.Groups[1].Value into grp
                 orderby grp.Count() descending
                 select new { Article = grp.Key, Count = grp.Count() };

    foreach (var v in result.Take(10))
        Console.WriteLine("{0}: {1}", v.Article, v.Count);
}
```
- An F# version that streams the file as a lazy sequence and pipes it through Seq combinators:

```fsharp
#light
open System.Text.RegularExpressions
open System.IO
open System.Text

let regex = new Regex(@"GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+)", RegexOptions.Compiled)

let seqRead fileName =
    seq { use reader = new StreamReader(File.OpenRead(fileName))
          while not reader.EndOfStream do
              yield reader.ReadLine() }

let query fileName =
    seqRead fileName
    |> Seq.map (fun line -> regex.Match(line))
    |> Seq.filter (fun regMatch -> regMatch.Success)
    |> Seq.map (fun regMatch -> regMatch.Value)
    |> Seq.countBy (fun url -> url)
```

And here's the code to call it:

```fsharp
for result in query @"file.txt" do
    let url, count = result
    printfn "%d: %s" count url
```
- Tim's original Ruby version, the conciseness benchmark the others are measured against:

```ruby
counts = {}
counts.default = 0

ARGF.each_line do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
  puts "#{counts[key]}: #{key}"
end
```
- A straightforward Java version, with no LINQ-style sugar to lean on:

```java
public class WideFinder {
    public static void main(String[] args) throws IOException {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        Pattern p = Pattern.compile(
            "GET /ongoing/When/\\d\\d\\dx/(\\d\\d\\d\\d/\\d\\d/\\d\\d/[^ .]+) ");

        BufferedReader in = new BufferedReader(new InputStreamReader(
            new FileInputStream(args[0]), "US-ASCII"));
        String line = null;
        while ((line = in.readLine()) != null) {
            Matcher m = p.matcher(line);
            if (m.find()) {
                String key = m.group();
                Integer currentCount = counts.get(key);
                counts.put(key, (currentCount == null ? 1 : (currentCount + 1)));
            }
        }
        in.close();

        List<Entry<String, Integer>> results =
            new ArrayList<Map.Entry<String, Integer>>(counts.entrySet());
        Collections.sort(results, new Comparator<Entry<String, Integer>>() {
            public int compare(Entry<String, Integer> o1, Entry<String, Integer> o2) {
                return o2.getValue().compareTo(o1.getValue());
            }
        });

        for (int i = 0; i < 10; i++) {
            System.out.println(results.get(i));
        }
    }
}
```
- A Common Lisp version built on CL-PPCRE:

```lisp
(defun run (&rest logs)
  (let ((counts (make-hash-table :test #'equal)))
    (dolist (filename logs)
      (with-open-file (stream filename :direction :input :external-format :latin-1)
        (loop for line = (read-line stream nil stream)
              until (eq line stream)
              do (cl-ppcre:register-groups-bind (match)
                     ("GET /ongoing/When/\\d{3}x/(\\d{4}/\\d{2}/\\d{2}/[^ .]+) " line)
                   (incf (gethash match counts 0))))))
    (loop for key being the hash-keys of counts
          collect key into keys
          finally (map nil
                       #'(lambda (x) (format t "~D: ~A~%" (gethash x counts) x))
                       (subseq (sort keys #'>
                                     :key #'(lambda (x) (gethash x counts)))
                               0 10)))))
```
A good way to understand other languages (programming or human) is to read the same story in each of these languages and compare them. Tim's problem serves that purpose well!
Oh, and if you want to see why we program in Managed Code, check out the C version.
Feel free to send me links to cool source that you find hasn't been given a good read.
About Scott
Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.
Speaking of yield, I'd be interested to find anyone using it to seriously implement continuations in a client application. Or really, for anything more complicated than tic tac toe.
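(Not from the post, but as a toy answer to Rick's question: Ruby's `Enumerator` is the closest analog this roundup has to C#'s `yield return` as a suspend/resume mechanism. The block below runs only as far as each `next` call demands, pausing at every `yielder <<` and resuming where it left off. The `line_chunks` name is mine, purely for illustration.)

```ruby
# Sketch: an externally-driven, resumable sequence via Enumerator.
# The block body suspends at each `yielder <<` until the caller asks
# for the next value -- the same suspend/resume shape as C#'s yield.
def line_chunks(lines, size)
  Enumerator.new do |yielder|
    lines.each_slice(size) { |chunk| yielder << chunk }
  end
end

e = line_chunks(("a".."f").to_a, 2)
p e.next  # => ["a", "b"]  (produced on demand)
p e.next  # => ["c", "d"]  (execution resumed where it paused)
```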
-Rick (currently dodging plumes of smoke at SoCal TechFest)
```powershell
PS> $pattern = 'GET /ongoing/When/\d{3}x/(?<url>\d{4}/\d{2}/\d{2}/[^ .]+)'
PS> Get-Content *.log | Foreach {if ($_ -match $pattern){$matches.url}} | Group | Sort Count -desc | Select -first 10
```
I'm not claiming my blog is a paragon of visual goodness... it most definitely isn't. But it is at least readable in Google Reader.
Thanks for all the good work you do!
PLINQ is somewhat compelling, but the algorithm here is just reading a file sequentially to count matches and then doing a sort. I'm not sure that either of those phases can be efficiently parallelized.
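(A hedged sketch, not from the post, of how the counting phase does decompose: split the lines into chunks, count each chunk into its own hash (the "map"), then merge the per-chunk hashes (the "reduce"). `count_chunk` and `parallel_count` are names I made up. Caveat: in MRI Ruby the GIL keeps these threads from running the regex matching truly in parallel; the point is the decomposition, which ports directly to processes or JRuby threads.)

```ruby
# Same pattern as Tim's Ruby version, with the trailing space.
PATTERN = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

# "Map": count one chunk of lines into a private hash -- no shared state.
def count_chunk(lines)
  counts = Hash.new(0)
  lines.each { |line| counts[$1] += 1 if line =~ PATTERN }
  counts
end

# "Reduce": run one thread per chunk, then merge the partial hashes.
def parallel_count(lines, workers = 4)
  slice = (lines.size / workers.to_f).ceil
  threads = lines.each_slice(slice).map do |chunk|
    Thread.new { count_chunk(chunk) }
  end
  threads.map(&:value).reduce(Hash.new(0)) do |total, partial|
    partial.each { |key, n| total[key] += n }
    total
  end
end

# The final sort over the merged hash is cheap: it touches one entry
# per distinct URL, not one per log line.
# top10 = parallel_count(lines).sort_by { |_, n| -n }.first(10)
```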
Are there any test log files floating around for this? I took a look at some of the pages you linked to but couldn't find any.