Postprocessing AutoClosed SGML Tags with the SGMLReader
Chris Lovett's SgmlReader is an interesting and complex piece of work. It's more complex than my brain can hold, which is good, since he wrote it and not I. It's able to parse SGML documents like HTML. However, it derives from XmlReader, so it tries (and succeeds) to look like an XmlReader. As such, it auto-closes tags. Remember that SGML doesn't have to have closing tags on every element. In the documents I care about, the closing tags are left off the primitive/simple types.
Sometimes I need to parse an OFX 1.x document, a financial format that is SGML and looks like this:
<OFX>
<SIGNONMSGSRQV1>
<SONRQ>
<DTCLIENT>20060128101000
<USERID>654321
<USERPASS>123456
<LANGUAGE>ENG
<FI>
<ORG>Corillian
<FID>1001
</FI>
<APPID>MyApp
<APPVER>0500
</SONRQ>
...etc...
Notice that ORG, DTCLIENT, and all the other simple types have no end tags, but complex types like FI and SONRQ do. The SgmlReader class attempts to automatically insert end tags (to close the element) as I use the XmlReader.Read() method to move through the document. However, he can't figure out where the right place for an end tag is until he sees an end element go by. Then he says, oh, crap! There's </FI>! I need to empty my stack of start elements in reverse order. This is lovely for him, but gives me a document that looks (in memory) like this:
<OFX>
  <SIGNONMSGSRQV1>
    <SONRQ>
      <DTCLIENT>20060128101000
        <USERID>654321
          <USERPASS>123456
            <LANGUAGE>ENG
              <FI>
                <ORG>Corillian
                  <FID>1001</FID>
                </ORG>
              </FI>
            </LANGUAGE>
          </USERPASS>
        </USERID>
      </DTCLIENT>
...etc...
...which totally isn't the structure I'm looking for. I could write my own SgmlReader that knows more about OFX, but really, who has the time? So, my buddy Paul Gomes and I did this.
NOTE: There's one special tag in OFX called MSGBODY that is a simple type but always has an end tag, so we special cased that one. Notice also that we did all this WITHOUT changing the SgmlReader. It's just passed into the method as "reader."
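// Needs: using System; using System.Collections; using System.Diagnostics; using System.Xml; using Sgml;
protected internal static void AutoCloseElementsInternal(SgmlReader reader, XmlWriter writer)
{
    // Atomize "MSGBODY" so it can be compared against the reader's atomized names.
    object msgBody = reader.NameTable.Add("MSGBODY");
    object previousElement = null;
    // Simple elements we have already closed ourselves; SgmlReader will try to close them again later.
    Stack elementsWeAlreadyEnded = new Stack();
    while (reader.Read())
    {
        switch (reader.NodeType)
        {
            case XmlNodeType.Element:
                // Remember the most recent start tag in case text follows it.
                previousElement = reader.LocalName;
                writer.WriteStartElement(reader.LocalName);
                break;
            case XmlNodeType.Text:
                if (String.IsNullOrEmpty(reader.Value) == false)
                {
                    writer.WriteString(reader.Value.Trim());
                    // Text means the previous element was a simple type: close it now.
                    // MSGBODY is the exception, since it always has its own end tag.
                    if (previousElement != null && !previousElement.Equals(msgBody))
                    {
                        writer.WriteEndElement();
                        elementsWeAlreadyEnded.Push(previousElement);
                    }
                }
                else Debug.Assert(false, "big problems?");
                break;
            case XmlNodeType.EndElement:
                // A synthetic end tag for something we already closed: swallow it.
                if (elementsWeAlreadyEnded.Count > 0
                    && Object.ReferenceEquals(elementsWeAlreadyEnded.Peek(),
                                              reader.LocalName))
                {
                    elementsWeAlreadyEnded.Pop();
                }
                else
                {
                    // A real end tag for a complex type: write it out.
                    writer.WriteEndElement();
                }
                break;
            default:
                writer.WriteNode(reader, false);
                break;
        }
    }
}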
We store the name of the most recently written start tag. If we then write out a node of type XmlNodeType.Text, we immediately write out our own EndElement and push that start tag's name on a stack. Later, when the SgmlReader starts to auto-close and sends us synthetic EndElements, we ignore each one whose name is at the top of our stack (and pop it). Any other EndElement is a real one that closes a complex type, so we write it out.
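To make the bookkeeping easier to follow, here is a minimal sketch that replays a hand-written list of reader events through the same three rules, without SgmlReader in the picture at all. The AutoCloseSketch class, the Node struct, and the Replay method are hypothetical names invented for this illustration, and the event list is my assumption of what the reader hands back for the FI block, based on the description above. It also compares names with plain string equality rather than Object.ReferenceEquals, since there is no NameTable here.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

public static class AutoCloseSketch
{
    // One recorded reader event: the node type plus the element name or text value.
    public struct Node
    {
        public XmlNodeType Type;
        public string Text;
        public Node(XmlNodeType type, string text) { Type = type; Text = text; }
    }

    // Replays the events through the same rules as AutoCloseElementsInternal.
    public static string Replay(IEnumerable<Node> nodes)
    {
        var output = new StringWriter();
        var settings = new XmlWriterSettings { OmitXmlDeclaration = true };
        using (XmlWriter writer = XmlWriter.Create(output, settings))
        {
            string previousElement = null;
            var alreadyEnded = new Stack<string>();
            foreach (Node node in nodes)
            {
                switch (node.Type)
                {
                    case XmlNodeType.Element:
                        previousElement = node.Text;
                        writer.WriteStartElement(node.Text);
                        break;
                    case XmlNodeType.Text:
                        writer.WriteString(node.Text.Trim());
                        // Text follows a simple element, so close it right away.
                        if (previousElement != null && previousElement != "MSGBODY")
                        {
                            writer.WriteEndElement();
                            alreadyEnded.Push(previousElement);
                        }
                        break;
                    case XmlNodeType.EndElement:
                        if (alreadyEnded.Count > 0 && alreadyEnded.Peek() == node.Text)
                            alreadyEnded.Pop();          // synthetic close: ignore it
                        else
                            writer.WriteEndElement();    // real close of a complex type
                        break;
                }
            }
        }
        return output.ToString();
    }

    public static void Main()
    {
        // Assumed event order for <FI><ORG>Corillian<FID>1001</FI>.
        Node[] events =
        {
            new Node(XmlNodeType.Element, "FI"),
            new Node(XmlNodeType.Element, "ORG"),
            new Node(XmlNodeType.Text, "Corillian\n"),
            new Node(XmlNodeType.Element, "FID"),
            new Node(XmlNodeType.Text, "1001\n"),
            new Node(XmlNodeType.EndElement, "FID"),
            new Node(XmlNodeType.EndElement, "ORG"),
            new Node(XmlNodeType.EndElement, "FI"),
        };
        // Prints <FI><ORG>Corillian</ORG><FID>1001</FID></FI>
        Console.WriteLine(Replay(events));
    }
}
```

If you drop the "Corillian" text event from the list, FID ends up nested inside ORG, so this trick leans on every simple element actually carrying a value.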
The resulting OFX document now looks like this:
<OFX>
<SIGNONMSGSRQV1>
<SONRQ>
<DTCLIENT>20060128101000</DTCLIENT>
<USERID>654321</USERID>
<USERPASS>123456</USERPASS>
<LANGUAGE>ENG</LANGUAGE>
<FI>
<ORG>Corillian</ORG>
<FID>1001</FID>
</FI>
<APPID>MyApp</APPID>
<APPVER>0500</APPVER>
</SONRQ>
...etc...
...and we can deal with it just like any other Xml Fragment, in our case, just allowing it to continue along its way in the XmlReader/XmlWriter Pipeline.
Thanks to Craig Andera for the reminder about Object.ReferenceEquals(). It's nicer than elementsWeAlreadyEnded.Peek() == (object)reader.LocalName.
About Scott
Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.
i do love a bit of xml-esoterica though!!
i'm guessing that all the 'simple' elements are mandatory.
because otherwise i think you'd choke on something like this:
(in this example, the ORG tag has no value...)
[FI]
[ORG]
[FID]1001
[/FI]...etc...
which would be mis-interpreted as:
[FI]
[ORG]
[FID]1001[/FID]
[/ORG]
[/FI]...etc...
i've never seen real-live SGML in the wild before, only its good little children, html and xml.
i'm really surprised that it allows these open tags.
Does the ofx have some kind of dtd/schema like document, that says "the following elements are complex (and can contain elements) while these other elements are simple, hence require no closing tags". If so, maybe it's safest to use the spec for choosing when to close a tag?
cheers
lb
i see -- so if in the example, there was text straight after [SONRQ] then you would assume that [SONRQ] is a 'simple' element type, and give it a closing tag straight after the text.
Hence:
[SONRQ]Some text
[DTCLIENT]...etc...
would become:
[SONRQ]Some text[/SONRQ]
[DTCLIENT]...etc...
Which is unlikely i imagine, but common in olde worldy html (i.e. not xhtml), for example:
[p]Fred[br]Jack[/p]
Would become something like:
[p]Fred[/p][br]Jack[/br][/p]
(hence, poorly formed and quite wrong)
am i readin this right??