Message-Id: <9312030150.AA09349@interval.interval.com>
Date: Thu, 2 Dec 1993 17:50:01 -0800
To: uri@bunyip.com
From: winograd@interval.com (Terry Winograd)
Subject: URN functionality from URLs
The recent traffic on URL++s and URNs prompted me to pull together some
thoughts on the URN/URL issue, which I am sending along here in a pre-draft
version. The first part is a reflection on the issues, and the rest a
proposal for a mechanism that meets the goals for URNs. I assume the
reader is familiar with the discussion on URNs, URLs, etc. from the
uri@bunyip.com list, IETF meetings, etc. The paper makes
contextually-dependent references to recent traffic on the uri list, which
would have to be fleshed out if it were to become an independent document.
----
Stable Network File URLs as a mechanism for uniform naming
Terry Winograd
Early draft version 12/2/93 -- Fire away!
1. INTRODUCTION
There are two different starting points for understanding what a URL really
is, and these lead to quite different intuitions. The first is the
classical notion of "citation" which is most obviously evident when URLs
appear in the reference list of a document or message. That is, the URL
serves to identify an existing object in the information world, and to
provide the reader with the information needed to find that object and read
(or view, hear, etc.) it.
In the beginning there were network protocols based on a straightforward
use of hierarchical file systems and URLs (though they weren't called that
yet) which referred to objects stored in those systems from a citation
perspective. The objects these protocols could deal with were directories
and files, and the protocol structure mirrored this directly. FTP is the
classic example but Gopher doesn't go far from it.
The second starting point is the generic notion of a server request, in
which the URL specifies a server name and some kind of string that the
server can interpret to provide some kind of response. This is the
direction that has been proposed in some of the recent messages about
"URL++" from Peter and Karen.
With the development of HTTP and corresponding uses in the Web, people
realized that the same "file pointer" notation that worked for
citation-like use could also be adapted as a general request protocol.
That is, something of the form <URL:HTTP://host/filename> could be
thought of as <URL:HTTP://host/arbitrary command> and only the host server
needed to know how the command actually related to existing files, files
generated on the fly, commands to take other actions, etc. This has been
used creatively, and the web is now full of traffic in URLs that do not
point to anything much resembling files in a hierarchical file system
(search queries, forms, x-y coordinates, etc.). Peter and Karen are
proposing to generalize and regularize this.
The first (citation-like) view is primarily STATIC, and is based on
practices that have evolved over centuries for use with print media (which
are indeed quite static). The second (query-like) view is DYNAMIC and is
much more akin to information retrieval and database query protocols, where
the "same request" is assumed to return different "answers" at different
times, rather than being a "pointer" to some information object like a
file.
One of the most interesting things about the burgeoning world of network
information systems is the clever interplay between these two perspectives.
The web is quickly moving away from the static version (which is the
original hypertext vision and I assume was its original motivation) with an
ever-expanding vocabulary of dynamic query types. Systems like Gifford's
Semantic File System adapt the syntax of static file pointers to encode
dynamic database queries, and so forth.
The fact that there is no uniform semantics defined for URLs makes it
possible to do all these things in mix-and-match fashion. But not
everything is happy in this Garden of Eden. The ability to treat a URL as
"a server name plus whatever the server wants to do with the rest of the
string" gives great flexibility but sacrifices stability. Anyone who has
spent any time in the web knows that a URL may or may not correspond to a
file that exists now, or a server that is willing to respond. As we start
to build linked information structures that are intended to stay around for
a while or be comfortable for people who aren't network-savvy, we need some
of the characteristics of citations (uniqueness, longevity,
server-independence, etc. as outlined by Larry and Karen in their
"Specification of Uniform Resource Names"). Hence the desire to create the
URN.
2. DESIGNING A URN
The key problem in designing a URN structure is understanding the
connection between URNs and access. A URL is inherently present-oriented.
The operational semantic grounding of
<URL:HTTP://pcd.stanford.edu/courses/cs247.ps> is "Whatever the host
pcd.stanford.edu will send you when you give it the request string
'courses/cs247.ps' ". There are no guarantees that you will get the same
thing twice from the same request (for an example, try
<URL: HTTP://www.cis.ohio-state.edu:84/>), that you won't get the same
thing from many different requests, etc.
On the other hand, the semantic grounding of a URN such as
<URN://ISBN/0-201-11297-3> is past-oriented: it refers to "The unique item
to which the ISBN has previously assigned the number 0-201-11297-3"
regardless of where it is or what is happening now (or even whether the
ISBN organization still exists, or the item is out of print, or...)
There is no direct operational way to access something based on a fact
about its past. There needs to be an indirect link -- some kind of index
or server that can use a record of the fact about the past (in this case
the naming act by an authority) to find a corresponding present-oriented
access identifier. This indirection needs to satisfy several constraints:
1) It can't create bottlenecks. For example, if every reference to every
on-line document requires a connection to nic.merit.edu to look up the URL
for the URN, we are in trouble.
2) It can't depend on the ongoing existence of the naming authority, or the
willingness of that authority to provide real-time network services. It
needs to be robust in the face of unreliable or uncooperative servers.
3) It should be as low-overhead as possible for the 90% case. Even though
there may be some lookups that require complex search and lookup chains,
the "normal" functioning should minimize the number of network actions
needed to retrieve the information resource corresponding to a URN.
Furthermore, if it is to be effectively introduced on the net, a URN
mechanism needs to satisfy a further property, incrementality:
4) Reference based on URN-like naming needs to be introduced into the net
in a way that allows existing software to continue working (pretty well)
with minimal changes, and for there to be a smooth incremental path to
higher functionality.
3. A PROPOSED MECHANISM
The following proposal is a first attempt to satisfy these criteria.
Rather than postulate a separate URN object, it is based on adding another
protocol (a virtual protocol) to the ones already used in URLs (ftp,
gopher, http, etc.). It might make sense to call this the "URN protocol"
but I don't want to get tied up in whether it actually corresponds to what
we have been calling URNs. The desire is to have it satisfy the needed
functionality, not fit the previous mold. So I am calling it the STAble
Network File protocol (STANF).
The basic idea is that there be a class of URLs that look more or less like
the current ones that refer to files (with ftp:, gopher:, http:, etc.),
where the protocol name indicates that the server provides additional
services and guarantees: it is kind of a file pointer with a pedigree.
Before detailing the structure, I will list some of the underlying
assumptions:
1) There will continue to be a well-defined and appropriately-managed space
of host names, which guarantees things like uniqueness of hosts and the
availability of routing information. This may end up being an extension of
the DNS, or may change or incorporate other forms, but since it needs to be
done for many other protocols and purposes, I simply assume we can
piggyback on it as it evolves. For the purposes of this proposal I will
simply use internet domain names, adding a new class of "virtual" names as
explained below.
To be specific, we can assume that there is some collection of strings
which correspond to hosts in a many-one fashion (each string refers to only
one host at a given time); that every host from which we want information
has at least one of these strings as a name; and that given one of these
strings it is possible to establish connections to that host.
WE CAN USE THIS NAME SPACE AS THE SPACE OF NAMING AUTHORITIES FOR
INFORMATION OBJECTS. Note that "naming authority" coincides with a host
name, not a host. That is, if the same ip address (or other form of
network address) has multiple names, each is a separate naming authority.
If a name moves to a different host, the authority goes with the name. Of
course we must deal gracefully with cases where a name that is valid at one
time becomes invalid at some later time, or is assigned to a host that is
unrelated to the one originally having the name.
2) Every server is able to maintain a hierarchical file system with
one-to-one or many-to-one naming. That is, we can assume that at a given
moment there will not be two different objects with the same name (where
name is the complete file path), but there may or may not be two different
names for the same object. Also this structure is stable over time. That
is, unless people choose to make changes, a file with a given name will be
identically retrievable with that same name in the future. This simply
means that the decision to modify, rename or delete a file is a
human/administrative one, not something forced by the underlying storage
system. These properties are taken for granted in all standard file
systems such as UNIX, Macintosh, etc. This means that if we can count on a
collection of files to be administered in a consistent way so that the same
object is not given two distinct names, WE CAN USE ORDINARY FULL-PATH FILE
NAMES AS UNIQUE IDENTIFIERS WITHIN THE CONTEXT OF A HOST.
Given these two assumptions, a URL of the form <URL:STANF://host/file
path> has the basic properties of a URN. Its uniqueness and stability can be
assured by the naming authority, in this case, the administration of the
host. In the 90% case, this URL will also be effective as an immediate
access: the host and file path can be used with ftp, gopher, http, etc. to
retrieve the file. <<we should probably designate one of these as the
standard file access method. -- note that STANF: sits on top of them,
rather than being yet another way to get the contents of a file>>
To get the required properties of stability and source-independence, we
need some constraints on the server and an extra mechanism or two.
The first constraint is that the files that can be accessed by a STANF: URL
need to be treated specially by the administrator. Two files with
different names need to be distinct objects (with respect to the uniqueness criteria
of that naming authority) and files should not be changed, moved, or
deleted without taking special actions to leave appropriate stubs (see
below). My vision is that in general there will be a separate subtree in
the file hierarchy (which we will label "STANF/") used for these files, and
that they will be cross-linked (with aliases, symbolic links, or whatever)
to the normal file hierarchies. So the STANF/ subtree can be administered
more carefully than the rest of the files on the system. Also, its
branching structure may be quite different, since it is
not intended to facilitate human browsing. Human browsable hierarchies
tend to be relatively deep, with few nodes at each branch and with
informative directory and file names. Name space hierarchies will tend to
be flat and opaque. For example, in organizing the internet drafts, we
might create a "browsable" hierarchy, with pointers like:
<URL:FTP://nic.merit.edu/internet-drafts/ietf/uri/url-01.txt>
There would also be a "pure-name" directory for the naming authority
"nic.merit.edu" with names like:
<URL:STANF://nic.merit.edu/DR19930714BERN>
The two preceding URLs might refer to the same file (if it had been set up
that way), and the STANF: URL would always refer to the same file as:
<URL:FTP://nic.merit.edu/STANF/DR19930714BERN>
The STANF: URL differs from the FTP: URL in the expectations for how it
will be treated in the future, and how it will be indexed at other sites
(see below). The administrator is free to move things around in the
browsable hierarchy at will, as long as the links are maintained so that
names in the STANF/ subtree point stably at a constant file.
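As a minimal sketch of the equivalence between the STANF: and FTP: forms, the following assumes the <URL:...> wrapper has already been stripped and that FTP is the access method chosen as standard. The function names parse_stanf and stanf_to_ftp are made up for illustration and are not part of any existing software.

```python
def parse_stanf(url):
    """Split 'STANF://host/name' into (host, name).

    Assumes the '<URL:' ... '>' wrapper has already been removed.
    """
    scheme, sep, rest = url.partition("://")
    if sep == "" or scheme.upper() != "STANF":
        raise ValueError("not a STANF URL: " + url)
    host, _, name = rest.partition("/")
    if not host or not name:
        raise ValueError("STANF URL needs a host and a name: " + url)
    return host, name


def stanf_to_ftp(url):
    """Rewrite a STANF URL as the FTP URL into the STANF/ subtree of the naming host."""
    host, name = parse_stanf(url)
    return "FTP://" + host + "/STANF/" + name


print(stanf_to_ftp("STANF://nic.merit.edu/DR19930714BERN"))
```

Only the scheme and the STANF/ path prefix change; the host and the name pass through untouched, which is what lets an existing client handle the new URLs with a small patch.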
When a file is deleted, it needs to be replaced with an explicit marker
indicating so, which, when possible, points to a host that still has a copy
of that file. This use of indirection requires two things: a protocol for
indicating where to look, and a way of accessing files from one host that
were named on another. The mechanism for giving an indirect pointer is
discussed below. Here we describe the mechanism for allowing one server to
provide files that "belong" to another.
Note the assumption here that every file has a "primary naming source"
which has a host domain name that is forever associated with that file. If
you use the host named in a STANF: URL directly for access, you are going
to the host that was the naming authority. But in many cases (caching,
dead hosts, replication, bottleneck hosts, etc.) you will need to go to
some other server. This can be based on a simple file-naming convention.
Let's assume that I know somehow that the internet drafts (or at least some
of them) are being mirrored or cached on pcd.stanford.edu, which has an
HTTP server. For some reason (locality, accessibility, performance
characteristics...) my client looks for them first locally rather than
connecting directly to the naming authority. Then when the client gets the
URL:
<URL:STANF://nic.merit.edu/DR19930714BERN>
instead of treating it as equivalent to
<URL:FTP://nic.merit.edu/STANF/DR19930714BERN>
which would access the naming host, it treats it as equivalent to:
<URL:HTTP://pcd.stanford.edu/NNA/nic.merit.edu/DR19930714BERN>
where NNA is a standard marker for Nonlocal Naming Authority. The server
at pcd.stanford.edu can handle this in various ways. It might be set up to
simply maintain subtrees in an NNA directory for each of the other servers
it caches for, or (more powerfully) a pre-processor in the server notices
the "/NNA/.." path and does some smarter things, like normalizing host
names, dynamically looking for files not already cached, etc.
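The client-side rewrite to a nonlocal server can be sketched as follows. This is an illustration only: the function name stanf_to_nna is hypothetical, the mirror host is assumed to be known to the client by some other means, and HTTP is assumed as the access method because that is what the example server runs.

```python
def stanf_to_nna(url, mirror_host, access="HTTP"):
    """Rewrite 'STANF://authority/name' to go through a mirror's NNA subtree.

    mirror_host is a host the client already believes holds copies of files
    named by the authority; how the client learns this is outside this sketch.
    """
    scheme, sep, rest = url.partition("://")
    if sep == "" or scheme.upper() != "STANF":
        raise ValueError("not a STANF URL: " + url)
    authority, _, name = rest.partition("/")
    if not authority or not name:
        raise ValueError("STANF URL needs an authority and a name: " + url)
    # The naming authority becomes the first path component under NNA/.
    return access + "://" + mirror_host + "/NNA/" + authority + "/" + name


print(stanf_to_nna("STANF://nic.merit.edu/DR19930714BERN", "pcd.stanford.edu"))
```

The naming authority survives in the rewritten path, so the mirror can serve files for many authorities without name collisions.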
With this indirection structure, we satisfy the bottleneck and longevity
constraints: That is, the fact that a particular host serves as the naming
authority for some file does not mean that you need to access that host to
get to it. You can access any host that maintains (or can generate on
demand) NNA files that include the one you want. Of course, this opens up
the problem of URN lookup -- how do you know who has it?
There can be different levels of sophistication. A "smart" client, when
presented with a STANF: URL, first checks with the local cache to see if it
has the file, or has an entry indicating a host other than the naming host,
which should be checked first (details of protocols for this are discussed
below). Failing to find it, it tries the naming host directly. Failing
that, it tries some kind of "resource location" service which can combine
massive indexing (like Archie and Veronica) with heuristic discovery
techniques (like NetFind) to locate some server that has the file. This is
of course expensive, but it is tried only as a last resort.
A "dumb" client would simply try the named host directly, and failing that
either fail, or, if it is slightly smarter, go to the resource location
service. Note that the 90% case is the one where the simplest thing simply
works (the original naming host is accessible and can provide either the
file itself or another URL that points to it). The only cases in which
access to the original naming host fails correspond to the ones in which
the current usage of URLs now gives an error message with no hope of doing
anything more.
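The difference between the "smart" and "dumb" clients comes down to which lookups are tried and in what order. A rough sketch, in which every function is a stand-in invented for illustration (real lookups would involve network access):

```python
def resolve(url, steps):
    """Try each (label, lookup) pair in order and return the first hit.

    Each lookup takes the URL and returns a result, or None if it has nothing.
    Returns (label, result), or None when every step fails.
    """
    for label, lookup in steps:
        result = lookup(url)
        if result is not None:
            return label, result
    return None


# Stand-in lookups for illustration only.
def local_cache(url):
    return None  # pretend the cache has no entry

def naming_host(url):
    return None  # pretend the naming host is unreachable

def resource_location(url):
    # Pretend an index service found a mirror holding the file.
    return "HTTP://pcd.stanford.edu/NNA/nic.merit.edu/DR19930714BERN"


SMART = [("cache", local_cache),
         ("naming host", naming_host),
         ("resource location", resource_location)]
DUMB = [("naming host", naming_host)]

print(resolve("STANF://nic.merit.edu/DR19930714BERN", SMART))
print(resolve("STANF://nic.merit.edu/DR19930714BERN", DUMB))
```

Because the expensive resource location step sits last in the list, it is reached only when the cheaper steps have failed. A client can be upgraded simply by adding steps to its list.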
This mechanism also satisfies the incrementality constraint.
All that is needed for clients to use the new URLs is a patch that converts
<URL:STANF://x/...> into <URL:FTP://x/STANF/...> <<assuming we pick FTP as
the standard file transfer>> Files that are stored under the STANF
convention will simply be accessed as files, as long as they stay available
on their naming host. The cases where this fails are cases where existing
clients are already lost. Clients can then be updated to add the "dumb"
way of looking for files whose naming hosts won't give them, and then to
smarter versions.
Servers can start out by simply creating the new directories and managing
them appropriately, along with doing the simple URL conversion like that
described in the previous paragraph on URLs they receive as requests. They
can incrementally add more sophisticated capacities (see below).
In general, the choice of when to produce a STANF: URL instead of a
conventional one will be guided by a desire to provide long-term stability
for the pointer, and a willingness to pay the price, in terms of keeping
track of it for the indefinite future. So the incrementalism is also on a
URL by URL basis. For "work in progress" it is likely that it would not
be worth the trouble, but once something is "released" then it would be
important to convert to STANF: URLs. There is a whole culture of what it
means to responsibly provide stable information on the net, which has yet
to evolve.
4. SOME FURTHER DETAILS
1) Virtual hosts
The use of host names as naming authorities makes the simple case simple,
but leads to some problems in cases where a naming authority does not
correspond to a server. For example, nobody may be willing or able to set
up an "ISBN.com" server to send out documents (or even indirect URL
pointers) for everything it indexes. In some cases, naming authorities may
predate the on-line system altogether (e.g., if we were to actually use
ISBN numbers). This can be easily handled by assigning pseudo-domain names
to any naming authority that does not want to use a real host name. These
"virtual host" names would be assigned by whatever mechanisms are used to
assign network host names by the IANA or its successors. For example we
might have "ISBN.vir" which would appear in URLs such as:
<URL:STANF://ISBN.vir/0-201-11297-3>
The .vir would key the client immediately to use one of the indirect access
modes described above since it obviously can't directly access the naming
authority host. For example it might convert it based on local knowledge
of who can provide ISBN pointers to:
<URL:HTTP://library.stanford.edu/NNA/ISBN.vir/0-201-11297-3>
The server at library.stanford.edu might handle this request by doing a
database lookup and returning a URL to some other server on the net that
stores the particular document. In this kind of case, the responsibility
for making files available on the net would be separated from the authority
for naming them (e.g., a consortium of libraries might work together to
provide access to the ISBN-named documents.)
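A client's handling of virtual hosts might look roughly like the sketch below. The function names and the table of known providers are hypothetical; the text above says only that the client uses "local knowledge" of who can provide pointers, so a plain dictionary stands in for that knowledge here. FTP is assumed as the direct access method for real hosts.

```python
def is_virtual(authority):
    """True when the naming authority is a pseudo-domain ending in .vir."""
    return authority.lower().endswith(".vir")


def choose_access_url(url, known_providers):
    """Pick an access URL for a STANF URL.

    known_providers maps a virtual authority name to a host believed to
    serve its names. Returns None when a virtual authority has no known
    provider, since there is no naming host to fall back on.
    """
    scheme, sep, rest = url.partition("://")
    if sep == "" or scheme.upper() != "STANF":
        raise ValueError("not a STANF URL: " + url)
    authority, _, name = rest.partition("/")
    if is_virtual(authority):
        provider = known_providers.get(authority)
        if provider is None:
            return None
        return "HTTP://" + provider + "/NNA/" + authority + "/" + name
    # A real host name: go straight to the naming host.
    return "FTP://" + authority + "/STANF/" + name


providers = {"ISBN.vir": "library.stanford.edu"}
print(choose_access_url("STANF://ISBN.vir/0-201-11297-3", providers))
```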
2) Standard indirection
One of the primary reasons for instability in existing file systems is the
need to get rid of things in order to free up space (net news illustrates
this problem spectacularly). In general, keeping around links can be far
less costly than keeping the content. There needs to be a standard way
that a request from a URL returns a further URL rather than the actual
contents <<I believe HTTP supports something like this now but don't know
the details>>. So if we don't have room for all our technical reports then
a request using:
<URL:STANF://pcd.stanford.edu/techreports/19751225WIN.PS>
might return a string containing the URL
<URL:FTP://archives.u-stor-it.com/stanford/pcd/tech/19751225Win.ps>
A convention is needed here <<magic characters at the beginning?>> to
distinguish this kind of marker from real contents. There are several
types of indirection:
1) "Here's a URL to follow." Note that the server can be set up to do
this based on some kind of pattern match on the file name in the request
URL (file system subtree, file type or extension, or something more
sophisticated), without actually storing a stub for each one. But for
simplicity a first implementation could simply store a stub file containing
the indirection string (like the stub now left for internet drafts that
have expired). The returned URL might be another STANF:, in which case the
same information object was given names by two different authorities. More
often it will be an access-based (present-oriented) URL.
2) "Here's a host that might respond to the original URL". This implies
the use of the STANF/NNA convention discussed above. There might be ways
to say "go here in general for things I am the naming authority for" which
could be remembered for future requests, rather than responding with one of
these indirections on each request.
3) "I don't have it, or know where it is, even though I should". This is a
polite error message indicating that the URL is valid and the host is the
naming authority, but for some reason the host can't provide more
information. This is bound to happen, as hosts are not going to be
administered in accord with the conventions 100% of the time. In this case
it may be fruitful for the client to go searching for a host that has the
file, where in a normal error case one is more likely to assume the URL
didn't correspond to something that could be accessed at all.
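To make the three kinds of indirection concrete, here is one way a client might tell them apart from real contents. The marker strings below are pure invention for the sake of the example, since the convention for marking an indirection is left open above; any real scheme would have to be agreed on separately.

```python
# Hypothetical first-line markers; nothing in this proposal fixes these strings.
MARKERS = {
    "STANF-URL:": "follow-url",   # type 1: here's a URL to follow
    "STANF-HOST:": "try-host",    # type 2: a host that might respond
    "STANF-UNKNOWN": "unknown",   # type 3: valid name, but no information
}


def classify_response(body):
    """Return (kind, value) for a response body.

    kind is 'follow-url', 'try-host', 'unknown', or 'contents'.
    value is the URL or host for the first two kinds, an empty string for
    'unknown', and the unmodified body for 'contents'.
    """
    first_line = body.split("\n", 1)[0].strip()
    for marker, kind in MARKERS.items():
        if first_line.startswith(marker):
            return kind, first_line[len(marker):].strip()
    return "contents", body


stub = "STANF-URL: FTP://archives.u-stor-it.com/stanford/pcd/tech/19751225Win.ps\n"
print(classify_response(stub))
```

A client receiving 'unknown' would know the name is valid and could go on to search elsewhere, which an ordinary error would not tell it.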
----
More possibilities <<to be developed later>>:
a. Given that the hierarchy is being treated specially (administratively)
there is an opportunity to constrain it to follow more consistent
conventions about file typing, variant naming, versions, etc. which could
in turn be used by clients that know about them. As a simple example, we
could effectively restrict file names to a small subset of characters,
limited length, etc. as useful.
b. The protocol could be augmented with a general meta-information protocol
and the special status of the hierarchy would be suited to providing the
right kinds of information (for example, see the paper on IAFA templates).
It is potentially much easier to make the provision of additional
information a standard condition of putting something into the special
STANF hierarchy than trying to get it done for every file accessible by
FTP, Gopher, HTTP, etc.
c. The issue of "sameness" of files has been sidestepped here in the
standard way -- the semantics of uniqueness is defined by the person who
puts files into the STANF hierarchy on a host. I believe the basic
approach of this proposal can be expanded to deal with standard
"invariance" relations between files that differ in some ways (e.g.,
character set, formatting, language,...) but are the same in others. Thus
it would be possible to provide a STANF: URL, plus further information
(e.g., "I want 70dpi resolution, JPEG encoded"). But that's for another
discussion.
--t
--------------------------------------------
Terry Winograd, Professor of Computer Science, Stanford University
1993 address:
Interval Research winograd@interval.com
1801 Page Mill Road 415/354-0854
Palo Alto, CA 94304 Fax: 415/354-0872
Long-term address:
Stanford University winograd@cs.stanford.edu
Computer Science Dept. 415/723-2780
Stanford, CA 94305-2140 Fax: 415/724-7411