Stripping Out Empty XmlElements in a Performant Way and the Bus Factor
We have a system that uses templates to create XML. Something like:
<root>
<foo>{CUSTOMTEMPLATETHING1}</foo>
<bar>{CUSTOMTEMPLATETHING2}</bar>
</root>
And the result might be:
<root>
<foo>text content</foo>
<bar></bar>
</root>
Notice that <bar> has "" as its content. For a string that's OK, but for a DateTime or Decimal, not so much. In those cases (and arguably for strings, when String.IsNullOrEmpty is your primary semantic need) it would be preferable, for the XmlSerializer and any other consumers, to have those empty elements stripped out.
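To make the failure concrete, here's a minimal sketch, with a hypothetical Order type standing in for one of the templated documents, of the XmlSerializer choking on an empty element bound to a Decimal:

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

// Hypothetical stand-in for one of the templated document types.
public class Order
{
    public string Foo;
    public decimal Bar;
}

public static class EmptyElementDemo
{
    // Returns true if the XML deserializes cleanly into an Order.
    public static bool Deserializes(string xml)
    {
        XmlSerializer xs = new XmlSerializer(typeof(Order));
        try
        {
            xs.Deserialize(new StringReader(xml));
            return true;
        }
        catch (InvalidOperationException)
        {
            // XmlSerializer wraps the underlying FormatException
            // ("" is not a valid decimal) in an InvalidOperationException.
            return false;
        }
    }
}
```

Deserializes returns false for `<Order><Foo>x</Foo><Bar></Bar></Order>`, because the empty `<Bar>` can't be parsed as a decimal, and true once the empty element is stripped, with Bar simply left at its default value.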
So, we created what we called the Rectifier. You can feel free to ponder the root or roots of the word. The early versions of the Rectifier used an uber-regular expression to strip out these tags from the source string. This system returns a full XML Document string, not an XmlReader or IXPathNavigable.
I heard a cool quote yesterday at the Portland NerdDinner while we were planning the CodeCamp.
"So you've got a problem, and you've decided to solve it with Regular Expressions. Now you've got two problems."
Since the sizes of the documents we passed through this system were between 10k and 100k, the performance of the RegEx, especially when it's compiled and cached, was fine. We didn't give it a thought for years. It worked, and it worked well. It looked like this:
private static Regex regex = new Regex(@"\<[\w-_.: ]*\>\<\!\[CDATA\[\]\]\>\</[\w-_.: ]*\>|\<[\w-_.: ]*\>\</[\w-_.: ]*\>|<[\w-_.: ]*/\>|\<[\w-_.: ]*[/]+\>|\<[\w-_.: ]*[\s]xmlns[:\w]*=""[\w-/_.: ]*""\>\</[\w-_.: ]*\>|<[\w-_.: ]*[\s]xmlns[:\w]*=""[\w-/_.: ]*""[\s]*/\>|\<[\w-_.: ]*[\s]xmlns[:\w]*=""[\w-/_.: ]*""\>\<\!\[CDATA\[\]\]\>\</[\w-_.: ]*\>",RegexOptions.Compiled);
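For reference, wiring an expression like that up is only a couple of lines. This is a trimmed-down sketch, not the production Rectifier: it keeps just the plain `<tag></tag>` and `<tag/>` alternatives and ignores the CDATA and xmlns cases:

```csharp
using System;
using System.Text.RegularExpressions;

public static class StringRectifier
{
    // Trimmed-down version of the production pattern: matches
    // <tag></tag> pairs and self-closing <tag/> elements only.
    private static Regex regex = new Regex(
        @"\<[\w-_.: ]*\>\</[\w-_.: ]*\>|<[\w-_.: ]*/\>",
        RegexOptions.Compiled);

    public static string Rectify(string xml)
    {
        // Every match is an empty element; replace it with nothing.
        return regex.Replace(xml, string.Empty);
    }
}
```

Note that Replace makes a single pass over the string, so a parent that becomes empty only after its children are removed survives: `<root><baz/></root>` rectifies to `<root></root>`.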
Stuff like this has what I call a "High Bus Factor." That means if the developer who wrote it is hit by a bus, you're screwed. It's nice to create a solution that anyone can sit down and start working on, and this isn't one of those.
Then, lately, some folks started pushing larger amounts of data through this system, in excess of 1.5 megs, and this Regular Expression started taking 4, 8, even 12 seconds to finish on these giant XML strings. We'd hit the other side of the knee of the exponential performance curve that you see with string processing like this.
So, Patrick had the idea to use XmlReaders and create an XmlRectifyingReader or XmlPeekingReader: basically a fake reader that wraps a real reader internally and would "peek" ahead to see if we should skip empty elements. It's a complicated problem when you consider nesting, CDATA sections, attributes, namespaces, etc. Because XmlReaders are forward-only, you have to hold a lot of state as you move forward, since there's no way to back up. We gave up on this idea, since we wanted to fix this in a day, but it remains, in our opinion, a cool idea we'd like to try. We wanted to do something like xs.Deserialize(new XmlRectifyingReader(new StringReader(inputString))). But the real issue was performance, not elegance.
Then we figured we'd do an XmlReader/XmlWriter thing like:
using(StringWriter strw = new StringWriter())
{
    XmlWriter writer = new XmlTextWriter(strw);
    XmlReader reader = new XmlTextReader(new StringReader(input));
    reader.Read();
    RectifyXmlInternal(reader, writer); //This is US
    reader.Close();
    writer.Close();
    return strw.ToString();
}
We still have the unfortunate overhead of the strings, but string-in/string-out was the previous interface, so we need, for now, to maintain it. So, we read in the XML, atom by atom, storing little bits of state, and write out only those tags that we figure aren't empty. We read in a bit, write out a bit, and so on. It's recursive, maintaining depth, and it's iterative as we go over siblings. The Attribute class is the best we could come up with to store everything about an attribute as we find it. We tried to grab the attributes as strings, or as one big string, but the XmlReader doesn't support that coarse a style.
private class Attribute
{
    public Attribute(string l, string n, string v, string p)
    {
        LocalName = l;
        Namespace = n;
        Value = v;
        Prefix = p;
    }

    public string LocalName = string.Empty;
    public string Namespace = string.Empty;
    public string Value = string.Empty;
    public string Prefix = string.Empty;
}
internal static void RectifyXmlInternal(XmlReader reader, XmlWriter writer)
{
    int depth = reader.Depth;
    while (!reader.EOF)
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Text:
                writer.WriteString(reader.Value);
                break;
            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
                writer.WriteWhitespace(reader.Value);
                break;
            case XmlNodeType.EntityReference:
                writer.WriteEntityRef(reader.Name);
                break;
            case XmlNodeType.XmlDeclaration:
            case XmlNodeType.ProcessingInstruction:
                writer.WriteProcessingInstruction(reader.Name, reader.Value);
                break;
            case XmlNodeType.DocumentType:
                writer.WriteDocType(reader.Name,
                    reader.GetAttribute("PUBLIC"), reader.GetAttribute("SYSTEM"),
                    reader.Value);
                break;
            case XmlNodeType.Comment:
                writer.WriteComment(reader.Value);
                break;
            case XmlNodeType.EndElement:
                if (depth > reader.Depth)
                    return;
                break;
        }
        if (reader.IsEmptyElement || reader.EOF)
            return;
        else if (reader.IsStartElement())
        {
            string name = reader.Name;
            string localName = reader.LocalName;
            string prefix = reader.Prefix;
            string uri = reader.NamespaceURI;
            ArrayList attributes = null;
            if (reader.HasAttributes)
            {
                attributes = new ArrayList();
                while (reader.MoveToNextAttribute())
                    attributes.Add(new Attribute(reader.LocalName, reader.NamespaceURI, reader.Value, reader.Prefix));
            }
            bool CData = false;
            reader.Read();
            if (reader.NodeType == XmlNodeType.CDATA)
            {
                CData = true;
            }
            if (reader.NodeType == XmlNodeType.CDATA && reader.Value.Length == 0)
            {
                reader.Read();
            }
            if (reader.NodeType == XmlNodeType.EndElement && reader.Name.Equals(name))
            {
                //The element was empty; skip it without writing anything
                reader.Read();
                if (reader.Depth < depth)
                    return;
                else
                    continue;
            }
            writer.WriteStartElement(prefix, localName, uri);
            if (attributes != null)
            {
                foreach (Attribute a in attributes)
                    writer.WriteAttributeString(a.Prefix, a.LocalName, a.Namespace, a.Value);
            }
            if (reader.IsStartElement())
            {
                if (reader.Depth > depth)
                    RectifyXmlInternal(reader, writer);
                else
                    continue;
            }
            else
            {
                if (CData)
                    writer.WriteCData(reader.Value);
                else
                    writer.WriteString(reader.Value);
                reader.Read();
            }
            writer.WriteFullEndElement();
            reader.Read();
        }
    }
}
The resulting "rectified" (empty-element-stripped) XML is byte-for-byte identical to the XML created by the original Regular Expression, so we succeeded in keeping compatibility. The performance on small strings of XML, less than 100 bytes, is about 2x slower because of all the overhead. However, as the size of the XML approaches the middle of the bell curve that represents our typical size (10k to 100k), this technique overtakes Regular Expressions in a big way. Initial tests are between 7x and 10x faster in our typical scenario. When the XML gets to 1.5 megs, this technique can process it in sub-second times. So, the Regular Expression behaves in an O(c^n) way, and this technique (scary as it is) behaves more like O(n log(n)).
This lesson taught me that manipulating XML as if it were a string is often easy and quick to develop, but manipulating the infoset with really lightweight APIs like the XmlReader will almost always make life easier.
I'd be interested in hearing Oleg or Kzu's opinions on how to make this more elegant and performant, and if it's even worth the hassle. Our dream of an XmlPeekingReader or XmlRectifyingReader to do this all in one pass remains...
About Scott
Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.
I don't think this is specific to regex. Sometimes procedural code is just plain better and simpler; the same rule applies to incredibly byzantine XSLT, for example.
<xsl:template match="*[not(node())]" />
XSLT isn't dead Scott... you just gotta know how to use it :)
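That one-line template only does something in the context of an identity transform; a minimal sketch of a complete stylesheet, in the same single-pass spirit as the regex, would be:

```xml
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <!-- Identity template: copy attributes and nodes through unchanged -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  <!-- Swallow any element with no child nodes at all -->
  <xsl:template match="*[not(node())]"/>
</xsl:stylesheet>
```

Like the regex, a single pass won't remove a parent that only becomes empty after its empty children are stripped.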
XIncludingReader from XInclude.NET exposes a virtual xml:base, and it took overriding a couple dozen methods!
A much simpler solution would be to make use of Helena Kupkova's XmlBookmarkReader [1]. It allows you to set a named "bookmark", read ahead, and get back to the bookmarked place. Here is a sketch (neither optimized nor well tested):
public class RectifyingXmlTextReader : XmlBookmarkReader
{
    //Add more constructors as needed
    public RectifyingXmlTextReader(TextReader reader)
        : base(new XmlTextReader(reader)) {}

    public override bool Read()
    {
        bool baseRead = base.Read();
        if (baseRead && NodeType == XmlNodeType.Element)
        {
            //Skip empty elements
            if (IsEmptyElement)
                return Read();
            base.SetBookmark("foo");
            bool skipMe = false;
            //Read ahead
            while (base.Read())
            {
                //First end element tag - stop and skip current element
                if (NodeType == XmlNodeType.EndElement)
                {
                    skipMe = true;
                    break;
                }
                //Empty text or CDATA - keep reading ahead just in case
                else if ((NodeType == XmlNodeType.Text ||
                    NodeType == XmlNodeType.CDATA) && Value.Length == 0)
                {
                    skipMe = false;
                }
                //Anything else - non-empty content
                else
                {
                    skipMe = false;
                    break;
                }
            }
            if (skipMe)
            {
                base.RemoveBookmark("foo");
                return Read();
            }
            else
            {
                base.ReturnToAndRemoveBookmark("foo");
            }
        }
        return baseRead;
    }
}
It would be interesting to benchmark it, I think it should be pretty fast.
[1] http://msdn.microsoft.com/library/en-us/dnxmlnet/html/XmlBkMkRead.asp
If it's a one liner, I say use it. What were your results?