Message-Id: <199402160426.XAA06538@wilma.cs.utk.edu>
From: Keith Moore <moore@cs.utk.edu>
To: "Karen R. Sollins" <sollins@lcs.mit.edu>
Subject: Re: caching
In-Reply-To: Your message of "Tue, 15 Feb 1994 19:49:04 EST."
<9402160049.AA02520@zippy.lcs.mit.edu>
Date: Tue, 15 Feb 1994 23:26:33 -0500
Karen writes:
> Good point. Can someone please argue for why we should have some
> support for caching in the URN functional spec, if anyone has a strong
> opinion on this subject? (I certainly can't.) I believe that several
> people, at least, in Houston, felt strongly about this, which is why
> it's in. If there are no opinions about it, can we please have some
> murmuring to remove it.
Not just caching, but transparent replication also.
In a better world, file servers would contain not just the files but also meta-data
about each file; such data could include both the content-type of the file, and a
location-independent file name (LIFN) for that file. When another server mirrored
such a file, it would also get the meta-data and LIFN, and as part of the process
could also update a location database informing it that that particular file was now
available on the mirror also. Of course, URLs would be used to designate file
locations.
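As a rough illustration of the mirroring step just described, here is a minimal Python sketch. Everything in it is an assumption made for the example: the `StoredFile` record, the `mirror_file` helper, the dictionary standing in for the location database, and the URL layout of the mirror. The point it tries to show is only that the mirror copies the bits and meta-data unchanged, keeps the same LIFN, and records its own URL.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class StoredFile:
    """Hypothetical record: a file plus the meta-data kept alongside it."""
    lifn: str          # location-independent file name
    content_type: str  # file-specific meta-data
    data: bytes        # the exact bits


def mirror_file(
    lifn: str,
    origin: Dict[str, StoredFile],
    mirror: Dict[str, StoredFile],
    mirror_base_url: str,
    location_db: Dict[str, List[str]],
) -> str:
    """Copy one file to a mirror and tell the location database about it."""
    original = origin[lifn]
    # Same bits, same meta-data, same LIFN: nothing is converted on the way.
    mirror[lifn] = original
    # Assumed URL layout for the mirror; a real server would choose its own.
    url = f"{mirror_base_url.rstrip('/')}/{lifn}"
    urls = location_db.setdefault(lifn, [])
    if url not in urls:  # registering twice should not duplicate the entry
        urls.append(url)
    return url
```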
Given a LIFN, when a client wanted to fetch a file, it would ask the location
database where the file could be found; the database would respond with a list of
zero or more URLs (and possibly other location-specific information).
(Of course, the client might want to check a "local cache" first to see if it
already had the file with that LIFN; it could then avoid both the database
lookup and the time/bandwidth which would have been required to fetch the file from a
remote site.)
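The client-side lookup above might be sketched roughly as follows. The `LocationDatabase` class, the `fetch_by_lifn` function, and the `fetch_url` callback are all hypothetical names invented for this example; the callback stands in for whatever retrieval protocol the URL calls for.

```python
from typing import Callable, Dict, List, Optional


class LocationDatabase:
    """Hypothetical in-memory stand-in for the location database."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[str]] = {}

    def register(self, lifn: str, url: str) -> None:
        urls = self._locations.setdefault(lifn, [])
        if url not in urls:
            urls.append(url)

    def lookup(self, lifn: str) -> List[str]:
        # Zero or more URLs, as described above.
        return list(self._locations.get(lifn, []))


def fetch_by_lifn(
    lifn: str,
    cache: Dict[str, bytes],
    locations: LocationDatabase,
    fetch_url: Callable[[str], bytes],
) -> Optional[bytes]:
    """Return the file named by `lifn`, checking the local cache first."""
    if lifn in cache:
        # Cache hit: no database lookup and no remote fetch needed.
        return cache[lifn]
    for url in locations.lookup(lifn):
        try:
            data = fetch_url(url)
        except OSError:
            continue  # this copy is unreachable; try the next location
        cache[lifn] = data
        return data
    return None  # no known location could supply the file
```

Keying the cache by LIFN only works if a LIFN always names the same bits, which is the constraint argued for further on.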
A conceptually separate database could, given a LIFN, supply file-specific
meta-information like content-type. (I think of file-specific information as separate
from location-specific information because the former is presumably static, while the
location information changes more frequently. Also, there could be multiple locator
services but the meta-information about a particular file should be consistent no
matter who provides that information.)
Files could also be indexed by LIFN so that a search would return LIFNs instead of
(or perhaps in addition to) file location information.
Sound good so far? Okay, now here are some potential problems with using URNs
(as currently conceived in the collective mindset) as LIFNs:
1) Some insist that URNs should refer to something fuzzier than a specific
pattern of bits. If you ask for a URN, you should get back not a file, but
some version of that resource -- perhaps translated into a different
content-type.
Now, I don't really have a problem with giving servers the ability to translate files
on behalf of their clients. But say I'm not an ordinary client, but instead I'm a
mirror that's trying to replicate a file. Now it's *very* important that I get the
authentic bucket of bits, and not some converted version. (Or, at least if I get the
converted version, I shouldn't use the same LIFN to describe it.) One conversion
from server to client is bad enough; cumulative conversions can cause unacceptable
loss of information.
2) Now say that I am an ordinary client, and I get a file with HTTP or whatever
protocol from a server that translates for me. Maybe this is a file that I get
frequently, yet it doesn't change very often relative to how often I get it (like
maybe the Mosaic home page?). Seems like the ability to do caching would help a lot.
Now if my local machine has a cached copy of a file, which a file server had
translated into a different content-type, how do I know whether this version of the
file is usable for my purposes? I think it was Tim who suggested that the client or
user could make that decision. The problem is that it's very hard to keep track of
enough information to make that decision. It depends heavily on the kind of data and
the types of conversion being made.
A GIF file of line art might be converted to JPEG without too much loss, but a GIF of
a photograph so converted might be useless. I can reasonably convert a 16-bit 44kHz
audio file into g.721 if the program material didn't actually have a wide frequency
range, but if the same material were originally sampled at 8000 samples/sec mu-law,
translating it into g.721 might render the material too painful to listen to.
For caching to work right, we also need a way to know whether the cached copy of the
file is still valid. Our file retrieval protocols don't have a way to do this, but
there's no reason why the system that keeps up with file locations and file
meta-information can't keep track of the current versions of documents also. This
might be done with a time-to-live mechanism, or maybe what's needed is two kinds of
name for a file -- a "resource" name and a LIFN. Then the client could ask the
server "is this LIFN still the current version of this resource?" and perhaps save
the overhead of fetching the whole file.
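A minimal Python sketch of how the two ideas could combine: a time-to-live to avoid asking too often, and a "resource" name mapped to its current LIFN so the client can ask whether its copy is still current. The `VersionService`, `CacheEntry`, and `cached_copy_if_valid` names are hypothetical, and the TTL policy shown is only one possible choice.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional


class VersionService:
    """Hypothetical service mapping a resource name to its current LIFN."""

    def __init__(self) -> None:
        self._current: Dict[str, str] = {}

    def publish(self, resource: str, lifn: str) -> None:
        self._current[resource] = lifn

    def is_current(self, resource: str, lifn: str) -> bool:
        # "Is this LIFN still the current version of this resource?"
        return self._current.get(resource) == lifn


@dataclass
class CacheEntry:
    lifn: str          # which exact bits we hold
    data: bytes
    checked_at: float  # when the entry was last known to be current


def cached_copy_if_valid(
    resource: str,
    cache: Dict[str, CacheEntry],
    versions: VersionService,
    ttl_seconds: float,
    now: Optional[float] = None,
) -> Optional[bytes]:
    """Return the cached bytes if still usable, else None (caller refetches)."""
    entry = cache.get(resource)
    if entry is None:
        return None
    now = time.time() if now is None else now
    if now - entry.checked_at < ttl_seconds:
        return entry.data  # within time-to-live: trust it without asking
    if versions.is_current(resource, entry.lifn):
        entry.checked_at = now  # cheap check succeeded; no refetch needed
        return entry.data
    del cache[resource]  # stale: a newer LIFN now names this resource
    return None
```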
The point here is that, even if a server is willing to convert content-types on
behalf of a client, a LIFN must always be understood to refer to the same collection
of bits. URNs, as currently defined, are not so constrained and are useless as
LIFNs.
Of course, there's still a need for some kinds of URN that don't refer to a
particular bucket of bits -- like the one for the current weather map. Maybe the
best way to solve this is to have both LIFNs and URNs, or maybe we can just add a
distinguishing mark to a URN to say "this is a LIFN". But however we do it, we need
LIFNs.
Right now, the three biggest problems facing the MetaWeb are scalability,
scalability, and scalability.
The first problem is that the services it offers aren't scalable. Resource names are
currently tied to locations; so any popular server gets the crap beaten out of it
because there's nowhere else to go. Of course there are various mirroring packages,
but they just take care of getting the files propagated; they don't actually keep
track of where all of the copies of the files are. Services like archie, veronica,
etc. have a hard time keeping their data current, presumably because the data
collection is so intensive.
The second problem is the lack of scalability of resource location. The Web is too
big now (and has been so for some time) to just browse around pointing and clicking
to find things; and the search services I've tried don't work too well. (Using them
reminds me of an adventure game.)
The third problem is the lack of scalability of maintenance. Too much data must be
maintained by hand. There are lots of documents and databases out there that point
to URLs that are obsolete. Tools exist to identify such conditions, but what's
needed is the ability to *track* changes in locations of files. And search databases
need to be automatically maintained as changes take place.
I think we need to build a system that provides for (a) transparent mirroring of
files, (b) one-stop-shopping to find locations for a file, (c) effective and useful
local caching, and (d) search databases that are automatically maintained and kept
up-to-date.
Actually, I've always thought that's what URNs were meant for. But all of the
discussion about what they should look like, instead of what they are to be used for,
makes me think that we really don't agree on what URNs are supposed to be used for.
If this is true, then trying to describe what URNs *look like* is a bit premature.
Keith