How to load HTML into mshtml.HTMLDocumentClass with UCOMIPersistFile and my ignorance
What a weird one. I'm looking at the source for NDoc.Document.HtmlHelp2.Compiler.HtmlHelpFile. It uses the Microsoft.mshtml interop Assembly to load an HTML file into the HTMLDocumentClass for easy parsing.
Its code looks like this (DOESN'T WORK):
// Needs: using System; using System.IO; using System.Runtime.InteropServices; using mshtml;
private HTMLDocumentClass GetHtmlDocument( FileInfo f )
{
    HTMLDocumentClass doc = new HTMLDocumentClass();
    UCOMIPersistFile persistFile = (UCOMIPersistFile)doc;
    persistFile.Load( f.FullName, 0 );
    int start = Environment.TickCount;
    // Busy-wait for the body to show up; nothing in this loop pumps messages.
    while( doc.body == null )
    {
        if ( Environment.TickCount - start > 10000 )
        {
            throw new Exception( string.Format( "The document {0} timed out while loading", f.Name ) );
        }
    }
    return doc;
}
I went searching as it was taking up 100% CPU for an hour and never completed. Now I know why! :)
What's weird is this: the only way I could get it to work (as IPersistFile is loading on another thread) was with this change (NOW IT WORKS):
private HTMLDocumentClass GetHtmlDocument( FileInfo f )
{
    HTMLDocumentClass doc = new HTMLDocumentClass();
    UCOMIPersistFile persistFile = (UCOMIPersistFile)doc;
    persistFile.Load( f.FullName, 0 );
    int start = Environment.TickCount;
    // Wait for the load to finish, running the message pump on every pass.
    while( doc.readyState != "complete" )
    {
        System.Windows.Forms.Application.DoEvents();
        if ( Environment.TickCount - start > 10000 )
        {
            throw new Exception( string.Format( "The document {0} timed out while loading", f.Name ) );
        }
    }
    return doc;
}
When I Reflector into DoEvents() I can see that it's doing more than a Sleep(0) (yield), it's actually running the message pump. Am I missing something? Apparently IPersistFile needs the message pump? Well, it works, but it's gross.
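If I were to tidy this up, I'd probably pull the "pump and wait with a timeout" loop into a small helper. What follows is only a rough sketch under a few assumptions: the names PumpedWait and Until are made up for illustration, it assumes a C# compiler new enough to have Func, Action, and lambdas, and the pump is passed in as a delegate so the helper itself knows nothing about MSHTML or Windows Forms.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper, not part of NDoc or the .NET Framework.
static class PumpedWait
{
    // Loops until isDone() returns true, calling pump() on every pass so
    // that queued work (for example window messages) gets a chance to run.
    // Throws TimeoutException if the condition is still false after timeout.
    public static void Until( Func<bool> isDone, Action pump, TimeSpan timeout )
    {
        Stopwatch clock = Stopwatch.StartNew();
        while ( !isDone() )
        {
            pump();
            if ( clock.Elapsed > timeout )
            {
                throw new TimeoutException( "Condition was not met within " + timeout );
            }
        }
    }
}
```

With that in place, the loop in GetHtmlDocument could shrink to something like PumpedWait.Until( () => doc.readyState == "complete", System.Windows.Forms.Application.DoEvents, TimeSpan.FromSeconds( 10 ) ). It is still the same trick, just with the grossness in one place.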
NDoc, in beta 1.3, has gotten rid of the above implementation (which was a stopgap measure).
In another NDoc project, however (the one that prepares our user's guide for publishing online), we still do a similar thing. In that case the utility creates a hidden window (which gets us a message pump) and does the work in the context of that window. This allows the utility to be invoked from a console app.
Does the Application.DoEvents() implementation work in a console app? I wouldn't think it would unless you called Application.Run at some point prior to its invocation.
Don, it does actually work (amazingly) from a console app (NAnt, specifically). If you Reflector DoEvents, it sets up a ThreadContext and message pump if one doesn't exist. I did see this code in the 1.3 tree, though. Are you saying it will be removed in lieu of the NativeHtmlHelp project? When will it be removed?
The original HTML 2 documenter used an MS utility to convert the NDoc-generated CHM to HxS-compatible HTML. We then post-processed the HTML to add some additional metadata in order to support VS.NET integration. That's where the above code came from.
The NativeHtml2 documenter generates the html directly (via xslt) so there's no need to parse the output as html.
We're hoping to get a refresh of our beta out in the next week or so and have a finalized version of 1.3 a couple of weeks after that.
I didn't realize DoEvents would create a message pump for you. Interesting. As it stands, and AFAIK, there is no way to use MSHTML as an HTML parser without a message pump, so your implementation is probably the best way to go.
My background in multithreading has more to do with monitoring and control and stems from Win32. One would never guess that MSHTML.HTMLDocument requires a window.