Last updated: 4/29/01 3:15pm

O.K., first draft. Revisions forthcoming. Comments welcome.

It might be the case that I can't really explain what I want to. Perhaps we'll have to wait a few days until I have some examples of what I am talking about. Fine. This document will still be here. Also, one of my main goals was to make the new system function in a way that is backwards compatible with the present system. So if the stuff I'm talking about here (some of the new features) is of no use to you, or even seems counter to your needs, you can just ignore it all and the system will continue to work as the present one does. Needless to say, all your old data is compatible as well, and will be transferred over for you. With these things in mind, let's dig in.

The problem is data retrieval. You could state this a lot of different ways, but I'm looking for a very short statement about digital media content on the web, and for me it all boils down to data retrieval. This is obvious in the case of building inventory systems, or databases holding climate data, or gene data, or anything like that: how do you build a system that enables people to retrieve the data they want? But it's equally true for web "publishing" ventures, whether the publishing is textual or graphic. It's all about enabling retrieval. When you have a small amount of data, retrieval questions are trivial. As the amount grows, the solutions become harder. I think we've all put enough bytes up on the web to be able to extrapolate. What if we keep up our pace of data creation? What will our database look like in 5 years? 10? Here's a clue: it will be freakin' huge.

Weblogs are a first attempt at structuring a retrieval system for the sort of data we're talking about. And they are very successful in their simplicity. Just put the most recent content at the top of the page, and scroll slightly older content down the screen, and then eventually off into some sort of dated archive. Brilliant. A regular reader just hits the page and reads down until they hit something they've read before. Update complete.

The problem, of course, is that not everything is chronological. Or even if it can be forced into that configuration, the retrieval strength of the system degrades rapidly for older content. How do you find a particular piece of information that you posted a long time ago? The most straightforward way is to just do a search through the entire text of all posts. This is still possible now, but as the database grows it will quickly become too time-consuming. Eventually it will just be impossible. So what to do? How do we enable better retrieval?

The basic answer most people are giving involves metadata. Metadata is information about information. A summary is metadata. An id number is metadata. The idea is just that the metadata is smaller than the data, and therefore facilitates searching. A simple form of metadata is just to store a list of keywords connected to each particular post. Searching all the posts might be impractical, but just searching the metadata keywords can be quite efficient. Find the keyword, and by association you find the longer post. Great.
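
To make that concrete, here's a tiny sketch (in Python, purely for illustration; none of these names come from the actual site code) of a keyword index. The point is that retrieval only touches the small index, never the full text of the posts.

    # A handful of posts, each carrying a short list of keywords as metadata.
    posts = {
        101: {"text": "A long review of dinner at Prune...", "keywords": ["prune", "restaurant", "nyc"]},
        102: {"text": "Notes on setting up the AirPort base station...", "keywords": ["apple", "airport", "hardware"]},
        103: {"text": "A long post about a Linux install...", "keywords": ["linux", "software"]},
    }

    # Build the metadata index once: keyword -> ids of the posts that carry it.
    index = {}
    for post_id, post in posts.items():
        for kw in post["keywords"]:
            index.setdefault(kw, []).append(post_id)

    # Retrieval now searches the (small) index, not the full text of every post.
    print(index.get("airport", []))   # -> [102]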

And metadata can go way beyond that. Our posts are stored in a database which you can imagine as a large stack of papers. Each paper in this stack would represent a record, and each record shares a similar structure. Imagine that each sheet of paper is divided into several boxes. Into each box goes a particular type of data. Maybe there is a large box for the post itself, and then smaller boxes for the author's name, the date of the post, a list of keywords, a page it is associated with, a position on the page, a pointer to where the comments for this post would be found, etc.... The key is that this structure is set beforehand, and is the same for each record. Now when we stack these pages on top of each other (that is, when we put all these records into the database) the database software is very good at searching through all the records and pulling out just the ones that match certain criteria. We tell the database: give me all records where author = x, or all records where page = y and author = x, or all records where page = y, author = x, and date > a and date < b. Every time you request a page on this site a rather large call like this is made to the database, which spits out just the results we want. As the stream of results comes back we wrap it inside some rather crude HTML formatting and send it out across the web to a browser.
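
If it helps to see it written out, here is the same idea as a little sketch (Python again, and again the field names and values are invented for the example, not taken from the real database):

    from datetime import date

    # Every record has the same structure: a big box for the post, smaller boxes for the rest.
    records = [
        {"post": "Dinner at Prune last night...", "author": "x", "page": "sustenance",
         "date": date(2001, 4, 12), "keywords": ["prune", "nyc"]},
        {"post": "New AirPort base station arrived...", "author": "jim", "page": "jim",
         "date": date(2001, 4, 20), "keywords": ["apple", "airport"]},
    ]

    # "Give me all records where page = y, author = x, and date > a and date < b."
    def query(page, author, after, before):
        return [r for r in records
                if r["page"] == page and r["author"] == author
                and after < r["date"] < before]

    for r in query(page="jim", author="jim", after=date(2001, 4, 1), before=date(2001, 5, 1)):
        print("<p>" + r["post"] + "</p>")   # wrap the results in some crude HTML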

The much ballyhooed XML (eXtensible Markup Language) boils down to much the same thing in a very different guise. The main idea, as with databases, is to store the information you want (the post itself) along with an associated array of smaller data which helps to define the data itself. It's something like putting a bunch of handles on our object (our posts) so that we can better grab it from whatever direction we want. Already it should be clear how much more powerful this is than relying on only a chronological storage and retrieval method. Take the Sustenance page for example. Often we write posts about specific restaurants. Fine. These appear in chronological order on the page, which is good for people who constantly read the page. But imagine now if we had a structure of key-value pairs (which could be fields in each record in the database, or similarly could be built into a specialized restaurant version of XML) which specified, say, the Name, Country, State/Region, Address, Phone Number, Price Range, etc., of each restaurant. Now if we were looking to retrieve a post about a particular restaurant, or just looking to see if a post exists for a particular restaurant, we'd have lots of ways to find it. Not just by name (which is good, because to a computer 71CFF, and 71 Clinton, and 71 Clinton Fresh Food, and 71 Clinton Fresh Foods are all different restaurants) but also by location, phone number, etc. The more metadata you have, the more ways you have to find the post you are looking for. This sounds pretty good, don't you think?
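
Just to make the "handles" idea concrete, a restaurant record in some hypothetical restaurant flavor of XML might look like the snippet below (the tag names and the phone number are invented for the example), and once it exists you can grab the post by any handle you like:

    import xml.etree.ElementTree as ET

    # A made-up restaurant record; the exact tags don't matter, the handles do.
    doc = """
    <restaurant>
      <name>71 Clinton Fresh Food</name>
      <country>US</country>
      <region>NY</region>
      <address>71 Clinton St, New York</address>
      <phone>212-555-0171</phone>
      <price_range>moderate</price_range>
      <post>Great meal there last Friday...</post>
    </restaurant>
    """

    rec = ET.fromstring(doc)
    # Find the post by a handle other than the name -- here, the phone number.
    if rec.findtext("phone") == "212-555-0171":
        print(rec.findtext("name"), "->", rec.findtext("post"))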

The problem is that I don't think this will really work. I've talked about this before, but the main problem is that for metadata to really be effective you have to specify your categories (the keys in the key-value pairs) very rigidly. And in practice I've found that our posts don't fall into very rigid categories. In one sense there is no way around this problem, and really it may not be a problem at all. Perhaps it is the case that humans are good at a kind of fuzzy thinking where useful connections can be made between apparently disparate items. To use the classic example, my love is not really a red rose, but much of what we take to be our humanity would not exist if you didn't understand the connection I'm suggesting. That's a little off target, but maybe you see what I'm saying. Computers aren't very good at stuff like that, but that makes them very good at something we are not: finding exact matches in extremely large data sets.

Anyway, to get back on track, the new system is taking a slightly different approach to metadata. The one-sentence explanation is this: instead of data coupled with categorizing key-value pairs of metadata, I want to encourage multiple instances of the same data located in different positions within a hierarchical, tree-like structure. The answer is not metadata, but multiposition. I'll try to explain.

Computer file systems can be thought of as trees. This is a technical name, and not so accurate, as they more closely resemble upside-down trees. The base of a file tree, called the root (and symbolized as: /) is located at the top. From here the tree cascades down, with any number of nodes underneath the top (root) node, and in turn any number of nodes below any of these second-level nodes. A node above another node is a "parent," and the lower node is the "child." Your computer's file system is arranged like this, as is the file system on the server where all our data is stored. The web (which is basically a giant global file system) is the same way, except instead of being represented as a hierarchical graph, the path to any file is written out in linear order. http://www.digitalmediatree.com/ is the location of the root of our filesystem. Anything that follows this in the URL is called the 'path' and this represents a particular journey from the root, down through the tree to a specific file. So http://www.digitalmediatree.com/jim/index.html takes you down from the root into a child node called 'jim' and then down again from /jim to the file called 'index.html'. Trees are fantastic structures for storing data. They enable very fast retrieval of information.
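
If a picture helps, here is a toy version of that walk (Python, and nothing to do with how the real server does it): the tree is just nodes that contain other nodes, and a path is a trip from the root down through them, one segment at a time.

    # A toy file tree: each node maps a child's name to another node.
    root = {
        "jim": {
            "index.html": {},
            "slog": {"computers": {}, "personal": {}},
        },
        "sustenance": {},
    }

    def resolve(tree, path):
        """Walk from the root down through the tree, one path segment at a time."""
        node = tree
        for part in path.strip("/").split("/"):
            node = node[part]          # a KeyError here is the tree's version of a 404
        return node

    # /jim/slog goes from the root, down into 'jim', then down again into 'slog'.
    print(resolve(root, "/jim/slog"))   # -> {'computers': {}, 'personal': {}}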

But only if you design your tree well. A good design follows one simple rule: the more data you have to store, the deeper you want your hierarchies. If you have a lot of data, and your tree is only one level deep, then you basically have no structure at all, and this is no help for retrieval. We could just name every file uniquely and put them all into the same directory right under root. Then every file would be of the form http://www.digitalmediatree.com/xxxx.html where xxxx would be a unique name. In this case, to find a file the system has to search through the filenames of every file on the site. Making even one more level (which is basically what we have now) greatly helps matters. By looking for a file in /jim instead of, as above, in / we significantly cut the size of our search space.

My idea now hinges upon extending our hierarchies to a greater depth. This is what I mean when I have been saying lately that we are moving away from a web page system, and toward a web site system. I want everyone's page to expand into a whole site, and the expansion happens not out or up, but down. Deep file hierarchies, coupled with lots of crossposting (that's the multiposition data replication I talked about above) might be a workable system.

Let's take my page as an example. I'm going to set it up so that /jim is a rather static page with, as Ethel says, "...ostensibly more permanent stuff...." Contact information, a little bit about how my part of the site is set up, a little background on what I am doing, plus links to the other pages that will be underneath. And the main page underneath will be my blog, at say, /jim/slog. (As the system grows there will be other pages on this level as well, like say, /jim/email and /jim/calendar and /jim/photos.) But, and here's where we're getting to the meat, underneath /jim/slog will be a whole array of subpages. I'm imagining these will be like /jim/slog/computers (and then /jim/slog/computers/hardware, /jim/slog/computers/software, /jim/slog/computers/software/linux, /jim/slog/computers/hardware/apple/airport), /jim/slog/personal (/jim/slog/personal/travel/2000/amsterdam, /jim/slog/personal/nyc/sites), /jim/slog/science (/jim/slog/science/space/photos), and on and on and on. Creating new pages is ridiculously simple in the new system, and better, I've done away with the cumbersome draw.php3?global=0:1:1:x:x: URL system, which I now see was merely an attempt to replicate the power of the hierarchical file system that is already in place. I want the URLs to be as pretty as possible. Perhaps this is a leftover from Cam's "what's up with that funky back end" comment. But there are still numerous problems to be addressed.

These problems are what I have been working hard on solving. The first key, I think, is the built-in crossposting ability of the system. For each page you can specify any number of other pages that will be connected to it for crossposting. Then when you go to make a post you see the usual textarea posting box, and underneath that a list of checkboxes representing all the other pages that the page you are posting to is connected to. Leaving them checked will duplicate the post onto all of the pages in question. There are no rules about which pages can be crossposted to, but the main idea is what I have been calling "percolation". You post to the most specific page possible, and let the post percolate up onto the other pages. As I build my hierarchies (/jim/slog, /jim/slog/computers, /jim/slog/computers/hardware, etc...) I make each higher page (up to /jim/slog) a crossposting page for the ones below it. Then when I want to post, I go to the deepest possible page and post there. All the pages above are automatically updated to carry the post as well. I've done everything I could to make the crossposting no extra effort.
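
For anyone who likes to see these things spelled out, here is a rough sketch of percolation (Python, purely illustrative; in the real system the connected pages are whatever checkboxes you've set up for each page, not automatically every ancestor):

    def ancestors(page):
        """The pages above a page, e.g. /jim/slog/computers -> /jim/slog, /jim."""
        parts = page.strip("/").split("/")
        return ["/" + "/".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]

    pages = {}   # page -> ids of the posts that appear on it

    def post(post_id, page, unchecked=()):
        """Post to the deepest page and let it percolate up, skipping unchecked boxes."""
        for p in [page] + ancestors(page):
            if p not in unchecked:
                pages.setdefault(p, []).append(post_id)

    # Post 42 percolates all the way up from the linux page ...
    post(42, "/jim/slog/computers/software/linux")
    # ... while post 43 stops at /jim/slog/computers because the higher boxes were unchecked.
    post(43, "/jim/slog/computers/software/linux", unchecked=("/jim/slog", "/jim"))
    print(pages["/jim/slog"])             # -> [42]
    print(pages["/jim/slog/computers"])   # -> [42, 43]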

So now, if someone is interested in computers, but doesn't care about the occasional post about my personal life, they can just follow along at /jim/slog/computers. Similarly, someone looking for information about a trip I took can start their search on /jim/slog/travel, and right off the bat they are almost there. Where databases and XML try to attack the problem by breaking the search space down at the time of the search, I'm talking about breaking the search space down at the time of data entry. What we're building is something like an "anticipatory database," in that the tree structure you define is in anticipation of certain data retrieval needs. A post to Sustenance about a dinner at Prune could be posted to /sustenance/restaurants/us/newyork/newyorkcity, and would thereby show up on /sustenance, /sustenance/restaurants, /sustenance/restaurants/us/newyork, etc. So instead of metadata which helps to define what the post is about, we have multiposition, where the multiple positions a post occupies in the hierarchy help define what the post is about. I think this could be a powerful system, but there are still some problems.

One bad thing about deep hierarchies is that the path to a particular file can become quite long. This is very bad for someone who has to type in the whole name. One way I've tried to solve this problem is by making the system try hard to locate the file you want even with an incomplete URL. As long as you specify enough information to escape ambiguity, your file will be found. Typing in http://www.digitalmediatree.com/jim/linux will find the page /jim/slog/computers/software/linux. Building that algorithm was one of the early successes that got me motivated to build the rest. We'll see how well it works in practice as we go.
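
I won't walk through the actual code, but the flavor of it is something like this sketch (Python, and a simplification of what the system really does): given all the full paths on the site, an abbreviated URL matches if its segments appear, in order, somewhere in a full path, and it only resolves if exactly one page matches.

    all_pages = [
        "/jim/slog/computers/software/linux",
        "/jim/slog/computers/hardware/apple/airport",
        "/jim/slog/personal/travel/2000/amsterdam",
    ]

    def find(short):
        """Resolve an abbreviated URL, as long as it's unambiguous."""
        wanted = short.strip("/").split("/")
        hits = []
        for page in all_pages:
            parts = iter(page.strip("/").split("/"))
            # every typed segment has to appear in the full path, in order
            if all(seg in parts for seg in wanted):
                hits.append(page)
        return hits[0] if len(hits) == 1 else None   # None: not found, or ambiguous

    print(find("/jim/linux"))   # -> /jim/slog/computers/software/linux
    print(find("/jim/slog"))    # -> None (ambiguous: all three pages match)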

Another problem was the usual distinction between folders and files. In regular file trees, a folder doesn't hold the data itself, but does hold the child pages that fall below it, while a file actually contains the information, but doesn't itself contain other child pages. In the new system here, files and folders are the same thing. There aren't any .html or .php3 file type suffixes. Everything is the same. You might think of it as folders where the outside of the folder is the same as a regular page that might be inside the folder. I needed this because I think it's important for the data to be replicated at all levels. So instead of having a folder at /sustenance/restaurants/newyork/newyorkcity that holds files like /71Clinton, /prune, /veritas, the folder itself contains and displays (just like a regular page) a copy of all the information present on all those lower pages, arranged either chronologically like a blog, or alphabetically on a particular keyword. One of the main ideas is to enable both specific retrieval and browsing (i.e., non-specific retrieval). If you're looking for Prune, fine, go right to /sustenance/restaurants/newyork/newyorkcity/prune (although you'd only have to type /sustenance/prune to be sure to get it), but if you only know you're looking for something in New York City, then just go up a level and start browsing through all the listings. We're just putting the data, ahead of retrieval time, into as many places as possible. If someone knows a lot about what they're looking for, they'll end up with a tiny search space from which to retrieve information; if they don't know much, then they get a larger search space to browse through. The key is that the info already exists in all these locations.
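
Here's roughly what that looks like from the inside (another illustrative Python sketch with made-up data, not the real code): a folder-page displays its own posts plus everything filed anywhere below it, sorted chronologically or alphabetically.

    # page -> ids of the posts filed directly on it (a page and a "folder" are the same thing)
    pages = {
        "/sustenance/restaurants/newyork/newyorkcity": [],
        "/sustenance/restaurants/newyork/newyorkcity/prune": [7],
        "/sustenance/restaurants/newyork/newyorkcity/71clinton": [8],
    }
    posts = {
        7: {"title": "Prune", "date": "2001-03-02", "text": "Brunch at Prune..."},
        8: {"title": "71 Clinton Fresh Food", "date": "2001-04-10", "text": "Dinner at 71 Clinton..."},
    }

    def render(page, order="chronological"):
        """Show a page's own posts plus everything filed on the pages below it."""
        ids = []
        for p, post_ids in pages.items():
            if p == page or p.startswith(page + "/"):
                ids.extend(post_ids)
        key = "date" if order == "chronological" else "title"
        return [posts[i]["title"] for i in sorted(ids, key=lambda i: posts[i][key])]

    # Browsing the city-level folder turns up everything underneath it.
    print(render("/sustenance/restaurants/newyork/newyorkcity"))
    # -> ['Prune', '71 Clinton Fresh Food']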

This is sort of out of place here, but in case you're wondering I'll just mention that no, the whole post is not actually stored multiple times (which would be a huge waste of space). Each page (each file/folder) just contains a list of post id numbers. So the post itself is stored once in the database with a unique id, and then this id is associated with x number of pages. And this is important not just for space reasons, but also so we don't screw up what I think is one of the most important aspects of our site: the ability to have a front page that lists, in real time, the location of all new content. Obviously if stuff is crossposted all over the place you will see a lot of duplicate information on the front page (1 new post here might be the same as 1 new post on some other page). Because each crosspost is, at heart, just a copy of the same post, reading a post in one location registers as you having read that post at every location.
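
In sketch form (Python, invented names again), the bookkeeping looks something like this: the post body lives in one place, the pages hold lists of ids, and "read" is tracked against the id, so reading a crosspost anywhere clears it everywhere.

    # The post body is stored exactly once, keyed by its id ...
    posts = {17: "A long post about the AirPort base station..."}

    # ... and each page just holds a list of ids (the crossposts).
    pages = {
        "/jim/slog": [17],
        "/jim/slog/computers": [17],
        "/jim/slog/computers/hardware/apple/airport": [17],
    }

    # What a reader has seen is tracked per post id, not per page.
    read_by = {"ethel": set()}

    def read(user, post_id):
        print(posts[post_id])
        read_by[user].add(post_id)

    def unread_count(user, page):
        return sum(1 for pid in pages[page] if pid not in read_by[user])

    read("ethel", 17)                           # she reads it on the airport page...
    print(unread_count("ethel", "/jim/slog"))   # -> 0, so it's marked read on /jim/slog too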

And hopefully (I'm pretty sure) there are other applications as well. I know one thing I am going to do is specifically NOT percolate up every post. Sometimes I have something to post about Linux, but it doesn't really warrant a spot on /jim/slog, so in that case I'd post it to /jim/slog/computers/software/linux, and then only percolate it up to /jim/slog/computers (I'd do this with just one click on the posting page - a click to uncheck crossposting to /jim/slog). In this way /jim/slog will be like a page of pull quotes guiding people to the best of what is contained on the far more detailed pages below. How far up I percolate a particular post will depend on how broad the interest in it is. I'm heavily indebted to the "more like this" system at whump.com for starting me down this thought path. Ours is implemented quite differently, but the basic idea of replicating posts under a bunch of different category headings is the same. This will get us out of the retrieval straitjacket imposed by the chronological-only weblog. Or so I hope. And I haven't even touched on many other aspects (like the expanded table-of-contents-like pages that exist throughout the site, or the ability to make alphabetical instead of chronological pages, etc...). I think the system is rich enough that many different solutions can be made with it. The hard part, as always, is with documentation. Or, that is to say, the hard part is communicating to others what the possibilities are, even when I'm not really sure of the limits myself. I have specific ideas for some of your pages which I will discuss with you all separately. Hopefully as we go, and as people see what we do on other pages, ideas will be sparked and we'll see some novel uses. I think there is considerable potential. Thanks for listening.