Scott Hanselman

Is the Library at Alexandria burning every day? How do we cluster the cluster?

March 07, 2007. Posted in Musings

Stuart discovered splogs today and Jeff learned to lower his blog's bandwidth. Hard learned lessons both, but both got me thinking.

Splogs: If you look at SplogSpot, their weekly splog dump XML file is 56 MB this week. I guess if you filled a library with 90% pron and 10% content (or 99% and 1%) you'd have a pretty interesting library. Does it make things hard to find? Sure, especially if the goo is mixed in alongside the good stuff.

Distribution of Responsibility: Jeff's starting to distribute his content. Images here, feed there, markup here. Ideally his images would be referenced relatively in the markup and stored locally, and he'd rewrite the URLs of those images on the way out, be they hosted on S3, Akamai, or Flickr.

Aside: The Rails guys are definitely ahead of the .NET folks on this stuff, with things like asset_host, and gems that support hosting of files at S3 and elsewhere. Distribution of content and load is a good thing, but only if you can turn it off at any time, and easily. Every external dependency you add is a potential weak point in your content delivery - and content permanence - strategy.
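As a rough sketch of the rewrite-on-the-way-out idea, assuming markup that references images with root-relative `src` attributes: the function below prefixes those paths with an external host when one is configured, and leaves the markup untouched when it isn't. The name `rewrite_image_urls`, the `asset_host` parameter, and the example host are made up for illustration; this is not how Rails' asset_host or any particular blog engine actually does it.

```python
import re

# Matches src="/..." or src='/...' inside an <img> tag. Only root-relative
# paths qualify; absolute (http://...) and protocol-relative (//...) URLs
# are left alone. A regex is enough for a sketch; a real pipeline would
# more likely use an HTML parser.
_IMG_SRC = re.compile(r'(<img\b[^>]*?\bsrc=)(["\'])(/(?!/)[^"\']*)\2', re.IGNORECASE)


def rewrite_image_urls(markup, asset_host=None):
    """Prefix root-relative image paths with asset_host.

    Passing None (or an empty string) is the "off switch": the markup
    comes back unchanged and every image is served locally again.
    """
    if not asset_host:
        return markup
    host = asset_host.rstrip("/")

    def _prefix(match):
        attr, quote, path = match.group(1), match.group(2), match.group(3)
        return f"{attr}{quote}{host}{path}{quote}"

    return _IMG_SRC.sub(_prefix, markup)
```

Because the stored markup never changes, dropping an external host is a one-line configuration change rather than a content migration.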

I went looking for something yesterday and found it, I thought, on an old broken-down Tripod.com site. When I got there, however, it was just the text; the links to CSS, some JavaScript and, more importantly, the images were long gone.

Broken images on a web site are the equivalent of broken windows in a building; fix them, or they mark the beginning of the end. - Me.
(Call back to old partially-related-but-not-really-but-he'll-tell-me-it-is Atwood post :P )

Which leads us to the Day's Questions:

  • Is the addition of splogs to the Global Index representative of a watering-down of content? Does the proliferation of content-free MySpace pages increase the internet's usefulness, or decrease it?
  • Does the breaking apart of "atoms" of content - like this post, for example - into "quarks" of text, images, styles, etc, all hosted at different locations, affect its permanence and, forgive me, historical relevance?

I would propose that in both cases, there are emerging problems. Spam and Splogs must exist because there are eyeballs looking at them. Otherwise they (the evil-doers) would stop, right?

Breaking apart content into multiple delivery channels at different hosts helps to offset the cost to host the content. Right now the bandwidth costs for hosting this blog are covered by advertising because I update the blog regularly.

But, if I stopped adding new content, I'd stop getting advertisers, then I'd stop paying the bandwidth bill and the blog would rot. Folks might stumble upon the rotting carcass of this blog in some far-flung theoretical future (like two years from now...WAY out there in Internet Time, people) and find only text, no images, broken JavaScript and wonder if a library burned? How is content permanence possible? If I don't pay my DNS bill, the site disappears. If my ISP goes out of business, the site disappears. If Flickr goes out of business, many photo links on this site disappear. Is it reasonable to depend on these external services?

When the Library at Alexandria was at its peak, apparently 100 scholars lived and worked there. In the time it took to read this sentence, I'm sure 100 MySpacers have joined up. Not exactly scholars, but you get the idea. Things are moving fast, and they aren't lasting long. Some might argue that Wikipedia itself isn't "scholarly" and lowers the bar as well, although I find it useful more often than not. Either way, there's a crapload of information out there with 20% of the planet adding new content every day.

Alexandria failed because it had no geo-located redundancy. Like the vast majority of human knowledge, it wasn't clustered. The internet, on the other hand, is a cluster in more ways than one. But is it useful and is its usefulness permanent?

If I may mix my metaphors, is the future of the Internet a worldwide library like Alexandria at its peak, or are we doomed to collectively search a Bathroom Wall for the wisdom of the ages?

I don't know if the flash-mob that is Digg qualifies as a good filter.

About Scott

Scott Hanselman is a former professor, former Chief Architect in finance, now speaker, consultant, father, diabetic, and Microsoft employee. He is a failed stand-up comic, a cornrower, and a book author.

March 07, 2007 7:38
(This is a non-web developer speaking on a late night of what has been a tough last 6 months)

Would it be feasible to implement (in the sense "easy to build, edit and manage") a web site where all URLs are local, but underneath, many of those links are redirected non-permanently to temporary alternatives determined on the spot based on availability, with a fallback system that may revert to returning the real local resource or, even better, a static resource globally available "a la" archive.org, probably datetime stamped?

Here's my idea: You build a web site, totally local, but with the proper handlers already plugged. Then you only have to turn a switch "on" somewhere to have your resources spread to the proper alternatives (images here, RSS feeds there, CSS here, etc), but also "sent" to the archive collective for stamping. With that switch on, the handlers would also start redirecting (non-permanent, important) to the proper alternatives. If all alternatives disappear for a given resource, it returns the local copy (and may queue an action to find a new alternative).
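The fallback behaviour described above can be sketched in a few lines. This is only an illustration under stated assumptions: `resolve`, the `alternatives` list, and the `is_available` callback are hypothetical names, and how availability is actually checked (a HEAD request, a health table, something else) is left to the caller.

```python
from typing import Callable, Iterable, Tuple

LOCAL = "local"
REDIRECT = "redirect"


def resolve(local_path: str,
            alternatives: Iterable[str],
            is_available: Callable[[str], bool]) -> Tuple[str, str]:
    """Pick where a request for local_path should be served from.

    Returns ("redirect", url) for the first reachable alternative, or
    ("local", local_path) when no alternative is configured or reachable.
    """
    for url in alternatives:
        if is_available(url):
            # The handler would answer with a temporary redirect (HTTP 302
            # or 307), never a permanent 301, so that clients keep asking
            # the local URL and the choice can change on the next request.
            return REDIRECT, url
    # Every alternative is gone (or the switch is off): serve the local copy.
    return LOCAL, local_path
```

With the switch "off", `alternatives` is simply empty and everything is served locally; queueing a search for a new alternative would hang off the final `return`.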

Now, this does not address the case where the local host itself disappears. Replace "local URLs" with "redirector URLs". Today we have URL shrinkers that let you replace any server's URL with a synonym on a single server. Let's invent the same thing, but for resources in time: not just a simple redirection, but a fallback system.

From now on, everybody builds web sites using URLs in the form "http://spreadster.com/hanselman.com/blog/themes/zenGarden2/PubComputerZen_Final.jpg?ts=20070306", or "http://spreadster.com/hanselman.com/feed.rss?ts=latest", etc. Then we let that service handle the proper redirect, which may as well return to our own server (yes, with a way to prevent a loop :) ) most of the time. I have an idea how this service would know what to return/where to redirect for each URL, but I'll stop here.
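To make the URL shape concrete, here is a small sketch of how such a redirector might pull the origin host, the resource path, and the `ts` value out of a request. The service is imaginary, so everything here is an assumption: `ResourceRequest` and `parse_redirector_url` are made-up names, and treating a missing `ts` as "latest" is a guess rather than something stated above.

```python
from typing import NamedTuple
from urllib.parse import parse_qs, urlsplit


class ResourceRequest(NamedTuple):
    origin_host: str  # e.g. "hanselman.com"
    path: str         # path on the origin, with a leading slash
    timestamp: str    # "latest" or a YYYYMMDD string, as in the examples


def parse_redirector_url(url: str) -> ResourceRequest:
    """Split a redirector URL into origin host, origin path and timestamp."""
    parts = urlsplit(url)
    # The first path segment names the origin; the rest is the path there.
    origin_host, _, rest = parts.path.lstrip("/").partition("/")
    if not origin_host:
        raise ValueError("redirector URL must name an origin host")
    # Assumption: a missing ts parameter means the newest copy.
    ts = parse_qs(parts.query).get("ts", ["latest"])[0]
    return ResourceRequest(origin_host, "/" + rest, ts)
```

For example, `parse_redirector_url("http://spreadster.com/hanselman.com/feed.rss?ts=latest")` gives the origin `hanselman.com`, the path `/feed.rss`, and the timestamp `latest`.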

It's fun to be a non-web developer. You may write stupid ideas, you can always say "well, it's not my bag anyway". But if the idea isn't that stupid, wanna start Web 3.0, the distributed era?
March 08, 2007 0:27
You can get an account at Crystal Tech for $8 a month that uses .NET 2.0 and SQL Server. Don't know why you are worried about hosting costs unless, as usual, IDKWTHFITA.
March 08, 2007 1:07
John, the couple hundred gigs a month I use would probably be noticed by an $8 host. Bandwidth isn't free.
March 09, 2007 7:05
It's funny that splogs kind of address some of this, at least for the basic content. Like it or not, your content is being mirrored.

There are some external services that do a little bit of mirroring - short term, there's the Google archive, and longer term there's archive.org (http://www.archive.org/web/web.php). I think that a lot of the information that's being posted these days has a pretty short expiration date, anyhow - how necessary is it for us to read your (then cutting edge) Everett tips now, or in 5 years? Hopefully, most of the enduring information will be found and stored before it fades. Plus, as storage keeps growing, we may end up archiving just about everything as archive.org does. I've got all the files from my previous computers in subfolders on newer computers because storage capacity keeps growing so quickly; will we do that with the entire internet as we go?
