Scott Hanselman

How to load HTML into mshtml.HTMLDocumentClass with UCOMIPersistFile and my ignorance

June 25, 2004. Posted in PowerShell.

What a weird one.  I'm looking at the source for NDoc.Document.HtmlHelp2.Compiler.HtmlHelpFile.  It uses the Microsoft.mshtml interop Assembly to load an HTML file into the HTMLDocumentClass for easy parsing.

Its code looks like this (DOESN'T WORK):

private HTMLDocumentClass GetHtmlDocument( FileInfo f )
{
  HTMLDocumentClass doc = null;
  try
  {
    doc = new HTMLDocumentClass();
    UCOMIPersistFile persistFile = (UCOMIPersistFile)doc;
    persistFile.Load( f.FullName, 0 );
    int start = Environment.TickCount;
    while( doc.body == null ) 
    {
      if ( Environment.TickCount - start > 10000 )
      {
        throw new Exception( string.Format( "The document {0} timed out while loading", f.Name ) );
      }
    }
  }
}

I went searching as it was taking up 100% CPU for an hour and never completed.  Now I know why! :)

What's weird is this: the only way I could get it to work (as IPersistFile is loading on another Thread) was with this change (NOW IT WORKS):

private HTMLDocumentClass GetHtmlDocument( FileInfo f )
{
  HTMLDocumentClass doc = null;
  try
  {
    doc = new HTMLDocumentClass();
    UCOMIPersistFile persistFile = (UCOMIPersistFile)doc;
    persistFile.Load( f.FullName, 0 );
    int start = Environment.TickCount;
    while( doc.readyState != "complete" )
    {
      System.Windows.Forms.Application.DoEvents();
      if ( Environment.TickCount - start > 10000 )
      {
        throw new Exception( string.Format( "The document {0} timed out while loading", f.Name ) );
      }
    }
  }
}

When I Reflector into DoEvents() I can see that it's doing more than a Sleep(0) (yield); it's actually running the message pump.  Am I missing something?  Apparently IPersistFile needs the message pump?  Well, it works, but it's gross.
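
For what it's worth, the wait-with-timeout loop used in both versions can be pulled into a small helper that takes the "pump" step as a parameter. This is only a sketch: Poll and WaitUntil are hypothetical names (not part of NDoc or the .NET Framework), and it relies on lambdas and Func/Action, which are newer than the C# this post was written against.

```csharp
using System;
using System.Diagnostics;

static class Poll
{
    // Hypothetical helper, not from NDoc or the framework.
    // Re-checks condition() until it is true or the timeout elapses.
    // pump() runs between checks, so the caller chooses how to yield
    // (for example Application.DoEvents on a thread that must pump messages).
    public static bool WaitUntil(Func<bool> condition, TimeSpan timeout, Action pump)
    {
        Stopwatch elapsed = Stopwatch.StartNew();
        while (!condition())
        {
            if (elapsed.Elapsed > timeout)
            {
                return false; // timed out; the caller decides whether to throw
            }
            pump();
        }
        return true;
    }
}
```

Assuming the calling thread is allowed to pump messages, the loop above would become something like Poll.WaitUntil(() => doc.readyState == "complete", TimeSpan.FromSeconds(10), Application.DoEvents), with the caller throwing if it returns false.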

June 25, 2004 13:45
I ran into that one a few days ago in an external webpage thumbnail maker. Maybe it’s because it is running as STA? Window messages won’t get handled on another thread because there is only one thread.
June 25, 2004 15:49
A better solution is to call this method from a worker thread and do a Thread.Sleep(20) from System.Threading inside the while loop.
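
A minimal sketch of the worker-thread idea, with one caveat: the worker is marked STA, and whatever runs on it still has to pump messages itself if the component needs a message pump (a bare Thread.Sleep does not pump). StaRunner is a hypothetical helper name, and the sketch assumes Windows and a runtime that has Thread.SetApartmentState.

```csharp
using System;
using System.Threading;

static class StaRunner
{
    // Hypothetical helper: runs work() on a dedicated STA thread,
    // blocks until it finishes, and returns its result.
    public static T Run<T>(Func<T> work)
    {
        T result = default(T);
        Exception error = null;

        Thread worker = new Thread(() =>
        {
            try { result = work(); }
            catch (Exception ex) { error = ex; } // hand the failure back to the caller
        });
        worker.SetApartmentState(ApartmentState.STA); // must be set before Start()
        worker.Start();
        worker.Join();

        if (error != null)
        {
            throw new InvalidOperationException("Work on the STA thread failed.", error);
        }
        return result;
    }
}
```

The caller would pass the document-loading method as work, for example StaRunner.Run(() => GetHtmlDocument(f)).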
June 25, 2004 18:36
It's not IPersistFile that needs the message pump, but rather MSHTML.HTMLDocument's implementation of it that requires a message pump. I suspect that the reason for this is that MSHTML exists for IE, and is not a generic HTML parsing mechanism.

NDoc, in beta 1.3, has gotten rid of the above implementation (which was a stopgap measure).

In another NDoc project however (the one that prepares our user's guide for publishing online) we still do a similar thing. In this case however that utility creates a hidden window (which gets us a message pump) and does the work in the context of that window. This allows the utility to be invoked from a console app.

Does the Application.DoEvents() implementation work in a console App? I wouldn't think it would unless you called Application.Run at some point prior to its invocation.
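
The hidden-window utility mentioned above is not shown in this thread, so what follows is only a rough sketch of the same general idea (give the thread a real message loop, then poll from inside it), not NDoc's actual code. It assumes Windows Forms is available and that the calling thread is not already running a message loop. PumpedWait and RunUntil are made-up names, and the sketch has not been verified against MSHTML here.

```csharp
using System;
using System.Windows.Forms;

static class PumpedWait
{
    // Hypothetical helper: runs a message loop on the calling thread until
    // isDone() returns true or the timeout elapses. Returns true on success.
    public static bool RunUntil(Func<bool> isDone, TimeSpan timeout)
    {
        bool completed = false;
        DateTime deadline = DateTime.UtcNow + timeout;

        using (ApplicationContext context = new ApplicationContext())
        using (Timer poll = new Timer())
        {
            poll.Interval = 20; // milliseconds between checks
            // A Windows Forms Timer raises Tick from the thread's message loop,
            // so isDone() is evaluated on the same thread that is pumping.
            poll.Tick += delegate
            {
                completed = isDone();
                if (completed || DateTime.UtcNow > deadline)
                {
                    poll.Stop();
                    context.ExitThread(); // ends the Application.Run call below
                }
            };
            poll.Start();
            Application.Run(context); // pumps messages until ExitThread is called
        }
        return completed;
    }
}
```

A console app could call PumpedWait.RunUntil(() => doc.readyState == "complete", TimeSpan.FromSeconds(10)) right after persistFile.Load, provided its Main is marked [STAThread].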
June 25, 2004 20:58
Roland, I hear what you're saying, but this is more complicated than just adding a thread. The problem, as Don points out, is that MSHTML's implementation of IPersistFile requires a window message pump, and DoEvents (in .NET) is the only way to start one up.

Don, it does actually work (amazingly) from a console app (NAnt, specifically). If you Reflector DoEvents, you can see that it sets up a ThreadContext and message pump if one doesn't exist. I did see this code in the 1.3 tree though. Are you saying it will be removed in favor of the NativeHtmlHelp project? When will it be removed?
June 26, 2004 9:11
It's going to be removed (well left in CVS but not part of NDoc) as soon as we release 1.3.

The original HTML 2 documenter used a MS utility to convert the NDoc-generated CHM to HxS-compatible HTML. We then post-processed the HTML to add some additional metadata in order to support VS.NET integration. That's where the above code came from.

The NativeHtml2 documenter generates the html directly (via xslt) so there's no need to parse the output as html.

We're hoping to get a refresh of our beta out in the next week or so and have a finalized version of 1.3 a couple of weeks after that.

I didn't realize DoEvents would create a message pump for you. Interesting. As it stands and AFAIK there is no way to use MSHTML as an HTML parser without a message pump, so your implementation is probably the best way to go.
June 26, 2004 11:17
Okay, thanks for the lesson :-)

My background on multi-threading has more to do with monitoring and control and stems from Win32. One would never guess that MSHTML.HTMLDocument requires a window.
