We have a system that uses templates to create XML. Something like:
<root>
  <foo>{CUSTOMTEMPLATETHING1}</foo>
  <bar>{CUSTOMTEMPLATETHING2}</bar>
</root>
And the result might be:
<root>
  <foo>text content</foo>
  <bar></bar>
</root>
Notice that <bar> has "" as its content. For a string that's OK, but for a DateTime or Decimal, not so much. In those cases (and arguably for strings too, when String.IsNullOrEmpty is your primary semantic need) it would be preferable, for the XmlSerializer and any other consumers, to have those empty elements stripped out entirely.
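To make the failure concrete, here's a minimal sketch of what happens when the XmlSerializer meets an empty element mapped to a DateTime. The Root type and its members are hypothetical stand-ins for our template output, not our actual types:

```csharp
using System;
using System.IO;
using System.Xml.Serialization;

public class Root
{
    public string Foo;
    public DateTime Bar;
}

public class Demo
{
    public static void Main()
    {
        XmlSerializer xs = new XmlSerializer(typeof(Root));

        // An empty <Bar></Bar> can't be parsed as a DateTime; Deserialize
        // throws an InvalidOperationException wrapping the real parse error.
        string xml = "<Root><Foo>text content</Foo><Bar></Bar></Root>";
        try
        {
            xs.Deserialize(new StringReader(xml));
        }
        catch (InvalidOperationException ex)
        {
            Console.WriteLine("Failed: " + ex.InnerException.Message);
        }

        // With the empty element stripped out, deserialization succeeds
        // and Bar simply keeps its default value.
        string stripped = "<Root><Foo>text content</Foo></Root>";
        Root r = (Root)xs.Deserialize(new StringReader(stripped));
        Console.WriteLine(r.Foo);
    }
}
```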
So, we created what we called the Rectifier. You can feel free to ponder the root or roots of the word. The early versions of the Rectifier used an uber-regular expression to strip out these tags from the source string. This system returns a full XML Document string, not an XmlReader or IXPathNavigable.
I heard a cool quote yesterday at the Portland NerdDinner while we were planning the CodeCamp.
"So you've got a problem, and you've decided to solve it with Regular Expressions. Now you've got two problems."
Since the sizes of the documents we passed through this system were between 10k and 100k, the performance of the Regex, especially when compiled and cached, was fine. We didn't give it a thought for years. It worked, and it worked well. It looked like this:
private static Regex regex = new Regex(@"\<[\w-_.: ]*\>\<\!\[CDATA\[\]\]\>\</[\w-_.: ]*\>|\<[\w-_.: ]*\>\</[\w-_.: ]*\>|<[\w-_.: ]*/\>|\<[\w-_.: ]*[/]+\>|\<[\w-_.: ]*[\s]xmlns[:\w]*=""[\w-/_.: ]*""\>\</[\w-_.: ]*\>|<[\w-_.: ]*[\s]xmlns[:\w]*=""[\w-/_.: ]*""[\s]*/\>|\<[\w-_.: ]*[\s]xmlns[:\w]*=""[\w-/_.: ]*""\>\<\!\[CDATA\[\]\]\>\</[\w-_.: ]*\>",RegexOptions.Compiled);
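In use, stripping is just a Replace over the whole document string. For example (a hypothetical driver, not our production call site):

```csharp
string input = "<root><foo>text content</foo><bar></bar><baz/></root>";

// Each alternative in the pattern matches one flavor of empty element:
// <bar></bar>, <baz/>, empty CDATA, with or without xmlns attributes.
string rectified = regex.Replace(input, string.Empty);
// rectified == "<root><foo>text content</foo></root>"
```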
Stuff like this has what I call a "High Bus Factor." That means if the developer who wrote it is hit by a bus, you're screwed. It's nice to create a solution that anyone can sit down and start working on, and this isn't one of those.
Then, lately, some folks started pushing larger amounts of data through this system, in excess of 1.5 megs, and this Regular Expression started taking 4, 8, even 12 seconds to finish on these giant XML strings. We'd hit the other side of the knee of the exponential performance curve that you see with string processing like this.
So, Patrick had the idea to use XmlReaders and create an XmlRectifyingReader or XmlPeekingReader: basically a fake reader that had a reader internally and would "peek" ahead to see if we should skip empty elements. It's a complicated problem when you consider nesting, CDATA sections, attributes, namespaces, etc. And because XmlReaders are forward-only, you have to hold a lot of state as you move forward, since there's no way to back up. We gave up on this idea, since we wanted to fix this in a day, but it remains, in our opinion, a cool idea we'd like to try. We wanted to do something like xs.Deserialize(new XmlRectifyingReader(new StringReader(inputString))). But the real issue was performance, not elegance.
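For the curious, the delegating-reader pattern we had in mind looks something like this skeleton. XmlRectifyingReader is our invented name, and the peeking/skipping logic in Read() is exactly the hard part we never wrote:

```csharp
using System.IO;
using System.Xml;

// A sketch only: it forwards everything to the base XmlTextReader and
// marks the one spot where empty-element skipping would have to happen.
// The real difficulty is that a forward-only reader can't back up, so
// Read() would need to buffer look-ahead state here.
public class XmlRectifyingReader : XmlTextReader
{
    public XmlRectifyingReader(TextReader input) : base(input) { }

    public override bool Read()
    {
        bool more = base.Read();
        // TODO: peek ahead; if the current element is empty (or holds
        // only an empty CDATA section), advance past it before returning.
        return more;
    }
}
```

With that in place, the dream call would be xs.Deserialize(new XmlRectifyingReader(new StringReader(inputString))), and the rectifying would happen in one pass as the serializer pulls nodes.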
Then we figured we'd do an XmlReader/XmlWriter thing like:
using (StringWriter strw = new StringWriter())
{
    XmlWriter writer = new XmlTextWriter(strw);
    XmlReader reader = new XmlTextReader(new StringReader(input));
    reader.Read();
    RectifyXmlInternal(reader, writer); //This is US
    reader.Close();
    writer.Close();
    return strw.ToString();
}
We still have the unfortunate overhead of the strings, but string-in/string-out is what the previous interface was, so we need, for now, to maintain it. So, we read in the XML, atom by atom, storing little bits of state, and write out only those tags that we figure aren't empty. We read in a bit, write out a bit, etc. It's recursive, maintaining depth, and it's iterative as we go over siblings. The Attribute class is the best we could come up with to store everything about an attribute as we find them. We tried to grab the attributes as strings, or as one big string, but the XmlReader doesn't support that coarse a style.
private class Attribute
{
    public Attribute(string l, string n, string v, string p)
    {
        LocalName = l;
        Namespace = n;
        Value = v;
        Prefix = p;
    }

    public string LocalName = string.Empty;
    public string Namespace = string.Empty;
    public string Value = string.Empty;
    public string Prefix = string.Empty;
}
internal static void RectifyXmlInternal(XmlReader reader, XmlWriter writer)
{
    int depth = reader.Depth;
    while (!reader.EOF)
    {
        // Pass non-element nodes straight through to the writer
        switch (reader.NodeType)
        {
            case XmlNodeType.Text:
                writer.WriteString(reader.Value);
                break;
            case XmlNodeType.Whitespace:
            case XmlNodeType.SignificantWhitespace:
                writer.WriteWhitespace(reader.Value);
                break;
            case XmlNodeType.EntityReference:
                writer.WriteEntityRef(reader.Name);
                break;
            case XmlNodeType.XmlDeclaration:
            case XmlNodeType.ProcessingInstruction:
                writer.WriteProcessingInstruction(reader.Name, reader.Value);
                break;
            case XmlNodeType.DocumentType:
                writer.WriteDocType(reader.Name,
                    reader.GetAttribute("PUBLIC"), reader.GetAttribute("SYSTEM"),
                    reader.Value);
                break;
            case XmlNodeType.Comment:
                writer.WriteComment(reader.Value);
                break;
            case XmlNodeType.EndElement:
                // We've risen back above the depth we started at; unwind
                if (depth > reader.Depth)
                    return;
                break;
        }
        if (reader.IsEmptyElement || reader.EOF)
            return;
        else if (reader.IsStartElement())
        {
            // Remember everything about this element before advancing past it,
            // because a forward-only reader can't back up
            string name = reader.Name;
            string localName = reader.LocalName;
            string prefix = reader.Prefix;
            string uri = reader.NamespaceURI;
            ArrayList attributes = null;
            if (reader.HasAttributes)
            {
                attributes = new ArrayList();
                while (reader.MoveToNextAttribute())
                    attributes.Add(new Attribute(reader.LocalName, reader.NamespaceURI, reader.Value, reader.Prefix));
            }
            bool CData = false;
            reader.Read();
            if (reader.NodeType == XmlNodeType.CDATA)
            {
                CData = true;
            }
            // An empty CDATA section counts as empty content; skip past it
            if (reader.NodeType == XmlNodeType.CDATA && reader.Value.Length == 0)
            {
                reader.Read();
            }
            // An immediate matching EndElement means the element was empty,
            // so we never write it out at all
            if (reader.NodeType == XmlNodeType.EndElement && reader.Name.Equals(name))
            {
                reader.Read();
                if (reader.Depth < depth)
                    return;
                else
                    continue;
            }
            // The element has content, so now it's safe to write its start tag
            writer.WriteStartElement(prefix, localName, uri);
            if (attributes != null)
            {
                foreach (Attribute a in attributes)
                    writer.WriteAttributeString(a.Prefix, a.LocalName, a.Namespace, a.Value);
            }
            if (reader.IsStartElement())
            {
                // Child elements: recurse one level deeper
                if (reader.Depth > depth)
                    RectifyXmlInternal(reader, writer);
                else
                    continue;
            }
            else
            {
                if (CData)
                    writer.WriteCData(reader.Value);
                else
                    writer.WriteString(reader.Value);
                reader.Read();
            }
            writer.WriteFullEndElement();
            reader.Read();
        }
    }
}
The resulting "rectified" (empty-element stripped) XML is byte-for-byte identical to the XML created by the original Regular Expression, so we succeeded in keeping compatibility. The performance on small strings of XML, less than 100 bytes, is about 2x slower, because of all the overhead. However, as the size of the XML approaches the middle of the bell curve that represents the typical size (10k to 100k), this technique overtakes Regular Expressions in a big way. Initial tests are between 7x and 10x faster in our typical scenario. When the XML gets to 1.5 megs, this technique can process it in sub-second times. So, the Regular Expression behaves in an O(c^n) way, while this technique (scary as it is) behaves more like O(n log n).
This lesson taught me that manipulating XML as if it were a string is often easy and quick to develop, but manipulating the infoset with really lightweight APIs like the XmlReader will almost always make life easier.
I'd be interested in hearing Oleg or Kzu's opinions on how to make this more elegant and performant, and if it's even worth the hassle. Our dream of an XmlPeekingReader or XmlRectifyingReader to do this all in one pass remains...