We have a system that uses templates to create XML. Something like:
<root>
  <foo>{CUSTOMTEMPLATETHING1}</foo>
  <bar>{CUSTOMTEMPLATETHING2}</bar>
</root>
And the result might be:
<root>
  <foo>text content</foo>
  <bar></bar>
</root>
Notice that <bar> has "" as its content. For a string that's OK, but for a DateTime or Decimal, not so much. In those cases (and arguably for strings too, when String.IsNullOrEmpty is your primary semantic need) it would be preferable, for the XmlSerializer and any other consumers, to have those empty elements stripped out entirely.
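To make the failure concrete, here's a minimal sketch of what happens when the XmlSerializer meets an empty element mapped to a DateTime. The Root type and its members are hypothetical stand-ins for our template output, not our actual types:

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

public class Root
{
    public string Foo;
    public DateTime Bar;
}

public class Demo
{
    public static void Main()
    {
        XmlSerializer xs = new XmlSerializer(typeof(Root));

        // An empty <Bar></Bar> can't be parsed as a DateTime; Deserialize
        // throws an InvalidOperationException wrapping the real parse error.
        string xml = "<Root><Foo>text content</Foo><Bar></Bar></Root>";
        try
        {
            xs.Deserialize(new StringReader(xml));
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine("Failed: " + ex.InnerException.Message);
        }

        // With the empty element stripped out, deserialization succeeds
        // and Bar simply keeps its default value.
        string stripped = "<Root><Foo>text content</Foo></Root>";
        Root r = (Root)xs.Deserialize(new StringReader(stripped));
        Console.WriteLine(r.Foo);
    }
}
```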
So, we created what we called the Rectifier. You can feel free to ponder the root or roots of the word. The early versions of the Rectifier used an uber-regular expression to strip out these tags from the source string. This system returns a full XML Document string, not an XmlReader or IXPathNavigable.
I heard a cool quote yesterday at the Portland NerdDinner while we were planning the CodeCamp.
"So you've got a problem, and you've decided to solve it with Regular Expressions. Now you've got two problems."
Since the sizes of the documents we passed through this system were between 10k and 100k, the performance of the Regex, especially when compiled and cached, was fine. We didn't give it a thought for years. It worked, and it worked well. It looked like this:
private static Regex regex = new Regex(@"\<[\w-_.: ]*\>\<\!\[CDATA\[\]\]\>\</[\w-_.: ]*\>|\<[\w-_.: ]*\>\</[\w-_.: ]*\>|<[\w-_.: ]*/\>|\<[\w-_.: ]*[/]+\>|\<[\w-_.: ]*[\s]xmlns[:\w]*=""[\w-/_.: ]*""\>\</[\w-_.: ]*\>|<[\w-_.: ]*[\s]xmlns[:\w]*=""[\w-/_.: ]*""[\s]*/\>|\<[\w-_.: ]*[\s]xmlns[:\w]*=""[\w-/_.: ]*""\>\<\!\[CDATA\[\]\]\>\</[\w-_.: ]*\>",RegexOptions.Compiled);
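In use, stripping is just a Replace over the whole document string. For example (a hypothetical driver, not our production call site):

```csharp
string input = "<root><foo>text content</foo><bar></bar><baz/></root>";

// Each alternative in the pattern matches one flavor of empty element:
// <bar></bar>, <baz/>, empty CDATA, with or without xmlns attributes.
string rectified = regex.Replace(input, string.Empty);
// rectified == "<root><foo>text content</foo></root>"
```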
Stuff like this has what I call a "High Bus Factor." That means if the developer who wrote it is hit by a bus, you're screwed. It's nice to create a solution that anyone can sit down and start working on, and this isn't one of those.
Then, lately, some folks started pushing larger amounts of data through this system, in excess of 1.5 megs, and this Regular Expression started taking 4, 8, even 12 seconds to finish on these giant XML strings. We'd hit the other side of the knee of the exponential performance curve that you see with string processing like this.
So, Patrick had the idea to use XmlReaders and create an XmlRectifyingReader or XmlPeekingReader: basically a fake reader that had a reader internally and would "peek" ahead to see if we should skip empty elements. It's a complicated problem when you consider nesting, CDATA sections, attributes, namespaces, etc. And because XmlReaders are forward-only, you have to hold a lot of state as you move forward, since there's no way to back up. We gave up on this idea, since we wanted to fix this in a day, but it remains, in our opinion, a cool idea we'd like to try. We wanted to do something like xs.Deserialize(new XmlRectifyingReader(new StringReader(inputString))). But the real issue was performance, not elegance.
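For the curious, the delegating-reader pattern we had in mind looks something like this skeleton. XmlRectifyingReader is our invented name, and the peeking/skipping logic in Read() is exactly the hard part we never wrote:

```csharp
using System.IO;
using System.Xml;

// A sketch only: it forwards everything to the base XmlTextReader and
// marks the one spot where empty-element skipping would have to happen.
// The real difficulty is that a forward-only reader can't back up, so
// Read() would need to buffer look-ahead state here.
public class XmlRectifyingReader : XmlTextReader
{
    public XmlRectifyingReader(TextReader input) : base(input) { }

    public override bool Read()
    {
        bool more = base.Read();
        // TODO: peek ahead; if the current element is empty (or holds
        // only an empty CDATA section), advance past it before returning.
        return more;
    }
}
```

With that in place, the dream call would be xs.Deserialize(new XmlRectifyingReader(new StringReader(inputString))), and the rectifying would happen in one pass as the serializer pulls nodes.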
Then we figured we'd do an XmlReader/XmlWriter thing like:
using (StringWriter strw = new StringWriter())
{
    XmlWriter writer = new XmlTextWriter(strw);
    XmlReader reader = new XmlTextReader(new StringReader(input));
    reader.Read();
    RectifyXmlInternal(reader, writer); //This is US
    reader.Close();
    writer.Close();
    return strw.ToString();
}
We still have the unfortunate overhead of the strings, but string-in/string-out is what the previous interface was, so we need, for now, to maintain it. So, we read in the XML, atom by atom, storing little bits of state, and write out only those tags that we figure aren't empty. We read in a bit, write out a bit, etc. It's recursive, maintaining depth, and it's iterative as we go over siblings. The Attribute class is the best we could come up with to store everything about an attribute as we find them. We tried to grab the attributes as strings, or as one big string, but the XmlReader doesn't support that coarse a style.
private class Attribute
{
    public Attribute(string l, string n, string v, string p)
    {
        LocalName = l;
        Namespace = n;
        Value = v;
        Prefix = p;
    }

    public string LocalName = string.Empty;
    public string Namespace = string.Empty;
    public string Value = string.Empty;
    public string Prefix = string.Empty;
}
internal static void RectifyXmlInternal(XmlReader reader, XmlWriter writer)
{
    int depth = reader.Depth;
    while (!reader.EOF)
    {
        // Pass non-element nodes straight through to the writer
        switch (reader.NodeType)
        {
            case XmlNodeType.Text:
                writer.WriteString(reader.Value);
                break;
            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
                writer.WriteWhitespace(reader.Value);
                break;
            case XmlNodeType.EntityReference:
                writer.WriteEntityRef(reader.Name);
                break;
            case XmlNodeType.XmlDeclaration:
            case XmlNodeType.ProcessingInstruction:
                writer.WriteProcessingInstruction(reader.Name, reader.Value);
                break;
            case XmlNodeType.DocumentType:
                writer.WriteDocType(reader.Name,
                    reader.GetAttribute("PUBLIC"), reader.GetAttribute("SYSTEM"),
                    reader.Value);
                break;
            case XmlNodeType.Comment:
                writer.WriteComment(reader.Value);
                break;
            case XmlNodeType.EndElement:
                // We've risen back above the depth we started at; unwind
                if (depth > reader.Depth)
                    return;
                break;
        }
        if (reader.IsEmptyElement || reader.EOF)
            return;
        else if (reader.IsStartElement())
        {
            // Remember everything about this element before advancing past it,
            // because a forward-only reader can't back up
            string name = reader.Name;
            string localName = reader.LocalName;
            string prefix = reader.Prefix;
            string uri = reader.NamespaceURI;
            ArrayList attributes = null;
            if (reader.HasAttributes)
            {
                attributes = new ArrayList();
                while (reader.MoveToNextAttribute())
                    attributes.Add(new Attribute(reader.LocalName, reader.NamespaceURI, reader.Value, reader.Prefix));
            }
            bool CData = false;
            reader.Read();
            if (reader.NodeType == XmlNodeType.CDATA)
            {
                CData = true;
            }
            // An empty CDATA section counts as empty content; skip past it
            if (reader.NodeType == XmlNodeType.CDATA && reader.Value.Length == 0)
            {
                reader.Read();
            }
            // An immediate matching EndElement means the element was empty,
            // so we never write it out at all
            if (reader.NodeType == XmlNodeType.EndElement && reader.Name.Equals(name))
            {
                reader.Read();
                if (reader.Depth < depth)
                    return;
                else
                    continue;
            }
            // The element has content, so now it's safe to write its start tag
            writer.WriteStartElement(prefix, localName, uri);
            if (attributes != null)
            {
                foreach (Attribute a in attributes)
                    writer.WriteAttributeString(a.Prefix, a.LocalName, a.Namespace, a.Value);
            }
            if (reader.IsStartElement())
            {
                // Child elements: recurse one level deeper
                if (reader.Depth > depth)
                    RectifyXmlInternal(reader, writer);
                else
                    continue;
            }
            else
            {
                if (CData)
                    writer.WriteCData(reader.Value);
                else
                    writer.WriteString(reader.Value);
                reader.Read();
            }
            writer.WriteFullEndElement();
            reader.Read();
        }
    }
}
The resulting "rectified" (empty-element stripped) XML is byte-for-byte identical to the XML created by the original Regular Expression, so we succeeded in keeping compatibility. The performance on small strings of XML, less than 100 bytes, is about 2x slower, because of all the overhead. However, as the size of the XML approaches the middle of the bell curve that represents the typical size (10k to 100k), this technique overtakes Regular Expressions in a big way. Initial tests are between 7x and 10x faster in our typical scenario. When the XML gets to 1.5 megs, this technique can process it in sub-second times. So, the Regular Expression behaves in an O(c^n) way, while this technique (scary as it is) behaves more like O(n log n).
This lesson taught me that manipulating XML as if it were a string is often easy and quick to develop, but manipulating the infoset with really lightweight APIs like the XmlReader will almost always make life easier.
I'd be interested in hearing Oleg or Kzu's opinions on how to make this more elegant and performant, and if it's even worth the hassle. Our dream of an XmlPeekingReader or XmlRectifyingReader to do this all in one pass remains...