Good background on the upcoming Leonids meteor storm.
Always propitious, Tom writes:
If the hot topic of the moment happens to be "Anthrax in violin varnish," then when I type those words, some crawl begins to sniff that thread - first among the bloggers I know and read all the time, then extending out to the great blogging ocean beyond. It does this without my having to tell it to. Then when I want to see what everyone has written about this topic, I click, and a cloud of threads from all the blogs comes captured in a snapshot array, duly attributed with links, inside some page or realm so that it's there, somewhat collated, just as whatever I wrote in my blog on that same topic is sniffable by anyone else.
The thing is, the words "Anthrax in violin varnish" do not constitute a unique identifier. URLs, on the other hand, almost do. That's why daypop and blogdex use URLs as the basis for determining who's talking about the same thing. Words are too fluid. Is "Anthrax on violins" the same as "Anthrax in violin varnish"? Software will be hard pressed to decide. Yet this is what humans do well, and this is why blogs are important: because they harness a multitude of human linguistic processing units (that's you and me) to work on these very un-binary questions of meaning. Go the other way, towards full automation, and you wind up talking about XML and the semantic web. And then the whole thing dies because writing is too tedious if you have to make it machine processable.
People have to do the work. We have to be the filter. That's blogging. You have to do the crawl yourself. "...first among the bloggers I know and read all the time, then extending out to the great blogging ocean beyond..." This is exactly what happens already, without any additional technology, when you're tuned into blogspace. You're the linguistic engine. By keeping up with your own corner of the world wide web (parts of which keep up with other corners which contain parts that keep up with still other corners, etc...) you are doing the crawl. And there is no better machine to do it. Blog on.
I've been thinking that all the hits weblogs get from search engines usually don't connect the searcher with the information sought, because in most cases, by the time the searcher comes along, whatever the search engine saw on a weblog has already been pushed off into the archives. I wish google would spider my archives and not my main page. I could probably set up robots.txt to create this outcome, but google ranks results with an algorithm that pays attention to how many other pages link to yours, and since most people link to your main page, having google spider only your archives (which it sees as different pages from your main page) would probably hurt your search result positioning.
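A robots.txt along these lines could express the idea - just a sketch, assuming the main page is also reachable at /index.html and the archives live under /archives/ (neither path is confirmed here), and noting that Allow is a nonstandard extension that Google honors:

```
# Keep Googlebot off the ever-changing front page,
# but let it index the stable archive pages.
User-agent: Googlebot
Disallow: /index.html
Allow: /archives/
```

One catch: the original robots.txt standard has no way to disallow only the bare "/" URL without disallowing the whole site, so this only helps if the crawler also fetches the main page under an explicit filename.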
A different idea I had would be to look at the referring page when a page here is requested. If the request is coming from google (or another search engine), you could parse the referring URL, extract the search phrase that was entered into google, and feed that into the search engine here to bring up the requested page, but with only those posts that mention the search phrase. Maybe the top of the page could be a standard explanation like: "I see you are looking for something specific. I've tried to provide you just that information. If you'd like to see this page as it would normally appear, click here."
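The referrer-parsing step can be sketched in a few lines of Python - a hypothetical helper (the function name is mine), assuming the search engine passes the query in a `q` parameter, which Google does:

```python
from urllib.parse import urlparse, parse_qs

def search_phrase_from_referrer(referrer):
    """Pull the search phrase out of a search-engine referrer URL.

    Returns the decoded phrase, or None if the referrer doesn't look
    like a search engine we recognize.
    """
    parsed = urlparse(referrer)
    if "google" not in parsed.netloc:
        return None
    # parse_qs decodes '+' and %XX escapes back into plain text
    terms = parse_qs(parsed.query).get("q")
    return terms[0] if terms else None

print(search_phrase_from_referrer(
    "http://www.google.com/search?q=anthrax+in+violin+varnish"))
# → anthrax in violin varnish
```

The extracted phrase would then be handed to the site's own search to filter the requested page down to matching posts.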
At least that way all my "antrax symptons" searchers would find what they are, errr, looking for.
Oh yeah, I know, "that won't scale" - but not everybody is trying to scale. Why not take advantage of unpopularity by building in more features than you could for a high-traffic site?