Postprocessing AutoClosed SGML Tags with the SGMLReader

February 14, 2006 Comment on this post [5] Posted in XML | Bugs

Chris Lovett's SgmlReader is an interesting and complex piece of work. It's more complex than my brain can hold, which is good, since he wrote it and not I. It's able to parse SGML documents like HTML. However, it derives from XmlReader, so it tries (and succeeds) to look like an XmlReader. As such, it auto-closes tags. Remember that SGML doesn't have to have closing tags. Specifically, it doesn't need closing tags on primitive/simple types.

Sometimes I need to parse an OFX 1.x document, a financial format that is SGML, like this:

<OFX>
<SIGNONMSGSRQV1>
<SONRQ>
<DTCLIENT>20060128101000
<USERID>654321
<USERPASS>123456
<LANGUAGE>ENG
  <FI>
   <ORG>Corillian
   <FID>1001
  </FI>
<APPID>MyApp
<APPVER>0500
</SONRQ>
...etc...

Notice that ORG and DTCLIENT and all the other simple types have no end tags, but complex types like FI and SONRQ do have end tags. The SgmlReader class attempts to automatically insert end tags (to close the element) as I use the XmlReader.Read() method to move through the document. However, he can't figure out where the right place for an end tag is until he sees an end element go by. Then he says, oh, crap! There's </FI>! I need to empty my stack of start elements in reverse order. This is lovely for him, but gives me a document that looks (in memory) like this:

<OFX>
<SIGNONMSGSRQV1>
<SONRQ>
  <DTCLIENT>20060128101000
  <USERID>654321
   <USERPASS>123456
     <LANGUAGE>ENG
      <FI>
          <ORG>Corillian
           <FID>1001</FID>
          </ORG>
       </FI>
    </LANGUAGE>
    </USERPASS>
  </USERID>
</DTCLIENT>
...etc...

...which totally isn't the structure I'm looking for. I could write my own SgmlReader that knows more about OFX, but really, who has the time? So, my buddy Paul Gomes and I did this.

NOTE: There's one special tag in OFX called MSGBODY that is a simple type but always has an end tag, so we special-cased that one. Notice also that we did all this WITHOUT changing the SgmlReader. It's just passed into the method as "reader."

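// Needs: using System; using System.Collections; using System.Diagnostics;
//        using System.Xml; using Sgml;
protected internal static void AutoCloseElementsInternal(SgmlReader reader, XmlWriter writer)
{
    // MSGBODY is the one simple type that always brings its own end tag.
    object msgBody = reader.NameTable.Add("MSGBODY");

    // Name of the most recently written start tag.
    object previousElement = null;
    // Simple elements we closed ourselves; the reader will try to close them again later.
    Stack elementsWeAlreadyEnded = new Stack();

    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                previousElement = reader.LocalName;
                writer.WriteStartElement(reader.LocalName);
                break;
            case XmlNodeType.Text:
                if (!String.IsNullOrEmpty(reader.Value))
                {
                    writer.WriteString(reader.Value.Trim());
                    // Text means a simple type: close it right away, except for MSGBODY.
                    if (previousElement != null && !previousElement.Equals(msgBody))
                    {
                        writer.WriteEndElement();
                        elementsWeAlreadyEnded.Push(previousElement);
                    }
                }
                else
                {
                    // An empty text node is unexpected here.
                    Debug.Assert(false, "big problems?");
                }
                break;
            case XmlNodeType.EndElement:
                // Reference comparison relies on the reader handing back
                // atomized names from its NameTable.
                if (elementsWeAlreadyEnded.Count > 0
                    && Object.ReferenceEquals(elementsWeAlreadyEnded.Peek(),
                       reader.LocalName))
                {
                    // Synthetic end tag for an element we already closed: swallow it.
                    elementsWeAlreadyEnded.Pop();
                }
                else
                {
                    // A real end tag (FI, SONRQ, MSGBODY, ...): write it out.
                    writer.WriteEndElement();
                }
                break;
            default:
                // Everything else is copied through as-is.
                writer.WriteNode(reader, false);
                break;
        }
    }
}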

We store the name of the most recently written start tag. When we see a node of type XmlNodeType.Text, we write the text, immediately write out our own EndElement, and push that start tag's name on a stack. Then, when we notice the SgmlReader starting to auto-close and send us synthetic EndElements, we ignore each one whose name is at the top of our own stack, popping it as we go. Any other EndElement is a real one from the document, so we write it out.
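To watch that bookkeeping run without an SgmlReader in the room, here's a minimal sketch. The Node struct, the Repair method, and the Example method are all made up for illustration: they fake the stream of start/text/end nodes an auto-closing reader would hand us. It also compares names by value rather than by reference, since these strings don't come out of a NameTable.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class AutoCloseSketch
{
    // Hypothetical stand-in for one node from an auto-closing reader.
    // Kind is "start", "text", or "end".
    public struct Node
    {
        public string Kind, Name, Value;
        public Node(string kind, string name, string value)
        {
            Kind = kind; Name = name; Value = value;
        }
    }

    public static string Repair(IEnumerable<Node> nodes)
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            string previousElement = null;          // last start tag written
            var alreadyEnded = new Stack<string>(); // simple elements we closed ourselves

            foreach (Node node in nodes)
            {
                switch (node.Kind)
                {
                    case "start":
                        previousElement = node.Name;
                        writer.WriteStartElement(node.Name);
                        break;
                    case "text":
                        writer.WriteString(node.Value.Trim());
                        // Text marks a simple type: close it now, except MSGBODY.
                        if (previousElement != null && previousElement != "MSGBODY")
                        {
                            writer.WriteEndElement();
                            alreadyEnded.Push(previousElement);
                        }
                        break;
                    case "end":
                        if (alreadyEnded.Count > 0 && alreadyEnded.Peek() == node.Name)
                            alreadyEnded.Pop();        // synthetic close: swallow it
                        else
                            writer.WriteEndElement();  // real close: write it
                        break;
                }
            }
        }
        return output.ToString();
    }

    // The FI fragment as the reader would present it: FID nested inside ORG.
    public static string Example()
    {
        var nodes = new List<Node>
        {
            new Node("start", "FI", null),
            new Node("start", "ORG", null),
            new Node("text", null, "Corillian"),
            new Node("start", "FID", null),
            new Node("text", null, "1001"),
            new Node("end", "FID", null),
            new Node("end", "ORG", null),
            new Node("end", "FI", null),
        };
        return Repair(nodes);
    }
}
```

AutoCloseSketch.Example() should come back as <FI><ORG>Corillian</ORG><FID>1001</FID></FI>, with ORG and FID as siblings again.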

The resulting OFX document now looks like this:

<OFX>
<SIGNONMSGSRQV1>
<SONRQ>
  <DTCLIENT>20060128101000</DTCLIENT>
  <USERID>654321</USERID>
  <USERPASS>123456</USERPASS>
  <LANGUAGE>ENG</LANGUAGE>
  <FI>
   <ORG>Corillian</ORG>
   <FID>1001</FID>
  </FI>
  <APPID>MyApp</APPID>
  <APPVER>0500</APPVER>
</SONRQ>
...etc...

...and we can deal with it just like any other XML fragment, in our case just allowing it to continue along its way in the XmlReader/XmlWriter pipeline.

Thanks to Craig Andera for the reminder about Object.ReferenceEquals(). It's nicer than elementsWeAlreadyEnded.Peek() == (object)reader.LocalName.
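Why a reference comparison is safe here at all: an XmlReader hands out element names that have been atomized in its NameTable, so equal names are the same string object. A small sketch of that, using a plain NameTable rather than the SgmlReader's own (the AtomizedNames class and its Compare method are made up for illustration):

```csharp
using System;
using System.Xml;

public static class AtomizedNames
{
    // Returns { equal by value, same object before Add, same object after Add }.
    public static bool[] Compare()
    {
        var names = new NameTable();
        string atomized = names.Add("MSGBODY");

        // Built at run time, so it is a separate string object with the same characters.
        string copy = new string("MSGBODY".ToCharArray());

        bool sameValue = atomized == copy;
        bool sameObjectBefore = Object.ReferenceEquals(atomized, copy);

        // Add returns the instance already stored for an equal name.
        bool sameObjectAfter = Object.ReferenceEquals(atomized, names.Add(copy));

        return new[] { sameValue, sameObjectBefore, sameObjectAfter };
    }
}
```

Compare() should return true, false, true: the two strings are equal by value, are different objects at first, and become the same object once the copy has gone through the table.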

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

February 15, 2006 1:56

(To avoid 'potentially dangerous request' warning, i've changed pointy brackets to the far less sexy square brackets)

i see -- so if in the example, there was text straight after [SONRQ] then you would assume that [SONRQ] is a 'simple' element type, and give it a closing tag straight after the text.
Hence:
[SONRQ]Some text
[DTCLIENT]...etc...

would become:
[SONRQ]Some text[/SONRQ]
[DTCLIENT]...etc...

Which is unlikely i imagine, but common in olde worldy html (i.e. not xhtml), for example:

[p]Fred[br]Jack[/p]

Would become something like:

[p]Fred[/p][br]Jack[/br][/p]

(hence, poorly formed and quite wrong)

am i readin this right??

February 15, 2006 8:41

Yes, given the code I've written, you're totally right. However, that's against the OFX spec and would qualify as an invalid document and be kicked out later by the validation step. Remember, I'm using the SGML reader to parse OFX, not HTML.

Scott Hanselman

February 15, 2006 14:20

do you have a feeling that there's something kind of 'hammering in screws' about 'turning sgml into xml'?

i do love a bit of xml-esoterica though!!

i'm guessing that all the 'simple' elements are mandatory.
because otherwise i think you'd choke on something like this:

(in this example, the ORG tag has no value...)

[FI]
[ORG]
[FID]1001
[/FI]...etc...

which would be mis-interpreted as:
[FI]
[ORG]
[FID]1001[/FID]
[/ORG]
[/FI]...etc...

i've never seen real-live SGML in the wild before, only its good little children, html and xml.

i'm really surprised that it allows these open tags.
Does the ofx have some kind of dtd/schema like document, that says "the following elements are complex (and can contain elements) while these other elements are simple, hence require no closing tags"? If so, maybe it's safest to use the spec for choosing when to close a tag?

cheers
lb

secretGeek

February 16, 2006 10:17

Heh, the OFX DTDs are dodgy, but I'll take a look again. They have XSDs, but they don't reflect reality.

Scott Hanselman

February 16, 2006 12:54

And to think, our financial data is being passed around in this format. Don't I feel safe. ;)

Haacked
