Xml and the Nametable
I got a number (about a dozen) of emails about my use of the NameTable in my recent XmlReader post. Charles Cook tried it out and noticed about a 10% speedup. I also received a number of poo-poo emails that said "use XPath" or "don't bother" and "the performance is good enough."
Sure, if that works for you, that's great. Of course, always measure before you make broad statements. That said, here's a broad statement. Using an XmlReader will always be faster than the DOM and/or XmlSerializer. Always.
Why? Because what do you think is underneath the DOM and inside of XmlSerialization? An XmlReader of course.
For documents larger than about 50k, you're looking at at least one order of magnitude faster when plucking a single value out. When grabbing dozens of values, the advantage increases.
Moshe is correct in pointing out that a nice middle ground perf-wise is the XPathReader (for a certain subset of XPath). There are a number of nice XmlReader implementations that fill the space between XmlTextReader and XPathDocument by providing more-than-XmlReader functionality.
BTW, I would also point out that an XmlReader is what I call a "cursor-based pull implementation." While it's similar to the SAX parsers in that it exposes the infoset rather than the angle brackets, it's not SAX.
Now, all that said, what was the deal with my Nametable usage? Charles explains it well, but I will expand. You can do this if you like:
    XmlTextReader tr =
        new XmlTextReader("http://feeds.feedburner.com/ScottHanselman");
    while (tr.Read())
    {
        if (tr.NodeType == XmlNodeType.Element && tr.LocalName == "enclosure")
        {
            while (tr.MoveToNextAttribute())
            {
                Console.WriteLine(String.Format("{0}:{1}",
                    tr.LocalName, tr.Value));
            }
        }
    }
The tr.LocalName == "enclosure" check in the if statement does a string compare as you look at each element. Not a big deal, but it adds up over hundreds or thousands of executions when spinning through a large document.
The NameTable is used by XmlDocument, XmlReader(s), XPathNavigator, and XmlSchemaCollection. It's a table that maps a string to an object reference. This is called "atomization" (think atom, think small). If these classes see "enclosure" more than once, they use the one object reference rather than keeping n copies of the "enclosure" string internally.
It's not exactly like a Hashtable, as the NameTable will return the object reference if the string has already been atomized.
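To make the atomization behavior concrete, here is a minimal sketch. The class and method names (AtomizationDemo, SameAtom) are made up for illustration; only NameTable.Add and NameTable.Get are real System.Xml calls. It builds two equal but distinct string objects and shows that the table hands back one shared reference for both.

```csharp
using System;
using System.Xml;

public static class AtomizationDemo
{
    // Hypothetical helper: true when the table returns one shared
    // reference for both inputs.
    public static bool SameAtom(XmlNameTable table, string a, string b)
    {
        object first = table.Add(a);   // atomizes the string if it is new
        object second = table.Add(b);  // returns the existing atom if the text matches
        return Object.ReferenceEquals(first, second);
    }

    public static void Main()
    {
        string literal = "enclosure";
        // Built at run time, so it is a different object with the same text.
        string built = new string("enclosure".ToCharArray());
        Console.WriteLine(Object.ReferenceEquals(literal, built)); // False

        var table = new NameTable();
        Console.WriteLine(SameAtom(table, literal, built));        // True

        // Get only looks up; it returns null for a name that was never added.
        Console.WriteLine(table.Get("title") == null);             // True
    }
}
```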
    XmlTextReader tr =
        new XmlTextReader("http://feeds.feedburner.com/ScottHanselman");
    object enclosure = tr.NameTable.Add("enclosure");
    while (tr.Read())
    {
        if (tr.NodeType == XmlNodeType.Element &&
            Object.ReferenceEquals(tr.LocalName, enclosure))
        {
            while (tr.MoveToNextAttribute())
            {
                Console.WriteLine(String.Format("{0}:{1}",
                    tr.LocalName, tr.Value));
            }
        }
    }
The easiest way, IMHO, to think about it is this:
- If you know that you're going to look for an element or attribute with a specific name within any System.Xml class that has an XmlNameTable, preload or warn the parser that you'll be watching for these names.
- When you do a comparison between the current element or attribute and your target, use Object.ReferenceEquals. Instead of a string comparison, you'll just be asking "are these the same object" - which is about the fastest thing that the CLR can do.
- Yes, you can use == rather than Object.ReferenceEquals, but the latter makes it totally clear what your intent is, while the former is more vague.
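The steps above extend naturally to more than one name. As a rough sketch (the EnclosureScanner class, its method name, and the sample XML are invented for illustration and read from an in-memory string rather than the feed), this preloads both an element name and an attribute name, then uses only reference comparisons inside the loop:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class EnclosureScanner
{
    // Hypothetical helper: collects the url attribute of every <enclosure> element.
    public static List<string> ReadEnclosureUrls(string xml)
    {
        var urls = new List<string>();
        using (var tr = new XmlTextReader(new StringReader(xml)))
        {
            // Preload every name we intend to look for.
            object enclosure = tr.NameTable.Add("enclosure");
            object url = tr.NameTable.Add("url");

            while (tr.Read())
            {
                if (tr.NodeType == XmlNodeType.Element &&
                    Object.ReferenceEquals(tr.LocalName, enclosure))
                {
                    while (tr.MoveToNextAttribute())
                    {
                        if (Object.ReferenceEquals(tr.LocalName, url))
                        {
                            urls.Add(tr.Value);
                        }
                    }
                }
            }
        }
        return urls;
    }

    public static void Main()
    {
        string xml =
            "<rss><channel><item>" +
            "<enclosure url=\"http://example.com/a.mp3\" length=\"123\" type=\"audio/mpeg\" />" +
            "</item></channel></rss>";
        foreach (string u in ReadEnclosureUrls(xml))
        {
            Console.WriteLine(u); // http://example.com/a.mp3
        }
    }
}
```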
This kind of optimization makes a big perf difference (~10% depending) when using an XmlReader. It makes less of one when using an XPathDocument because you are using Select(ing)Nodes in a loop.
Stealing Charles' words: "...because it involves very little extra code it is perhaps an optimization worth making prematurely."
Even the designers agree: "...using the XmlNameTable gives you enough of a performance benefit to make it worthwhile especially if your processing starts to spans multiple XML components in a piplelining scenario and the XmlNameTable is shared across them i.e. XmlTextReader->XmlDocument->XslTransform."
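The shared-table scenario in that quote might look something like the sketch below, which covers only the XmlTextReader->XmlDocument hop. The class name, method name, and sample XML are assumptions for illustration; the constructors that accept an XmlNameTable are real System.Xml overloads. I'd expect names atomized by the reader to be reusable for reference comparisons against the document's nodes, but measure and verify in your own code.

```csharp
using System;
using System.IO;
using System.Xml;

public static class SharedNameTableDemo
{
    // Hypothetical helper: loads xml into an XmlDocument that uses the
    // same XmlNameTable as the reader feeding it.
    public static XmlDocument LoadWithSharedTable(string xml, XmlNameTable table)
    {
        var doc = new XmlDocument(table);
        using (var reader = new XmlTextReader(new StringReader(xml), table))
        {
            doc.Load(reader);
        }
        return doc;
    }

    public static void Main()
    {
        var table = new NameTable();
        object item = table.Add("item"); // atomize the name once, up front

        XmlDocument doc = LoadWithSharedTable("<rss><item/><item/></rss>", table);

        // The document reports the very table we handed it.
        Console.WriteLine(Object.ReferenceEquals(doc.NameTable, table)); // True

        // Count <item> children by reference rather than by string compare.
        int count = 0;
        foreach (XmlNode node in doc.DocumentElement.ChildNodes)
        {
            if (Object.ReferenceEquals(node.LocalName, item))
            {
                count++;
            }
        }
        Console.WriteLine(count);
    }
}
```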
Oleg laments: "...that something needs to be done to fix this particular usage pattern of XmlReader to not ignore great NameTable idea."
Conclusion: The NameTable is there for a reason, no matter what System.Xml solution you use. This is the correct and useful pattern, and not using it is just silly. If you're going to develop a habit, why not make it a best-practice habit?
About Scott
Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.
Presumably this (the nametable stuff) is the same as string interning.
Thanks.
if (Object.ReferenceEquals(reader.NamespaceURI,foo))...
I put all the calls to NameTable.Add() in one place (some constructor, usually), and that becomes a nice reference to what parts of the XML you are actually touching in that code.
Thanks Scott.
BTW. I must be getting pretty close to being an automated bot because these captcha strings are really starting to become difficult to read. Pretty soon I'll be hangin out with all the Spam comments ;) .