Re: Current URN syntax

Roy T. Fielding (fielding@avron.ICS.UCI.EDU)
Thu, 27 Oct 1994 09:42:31 -0700

To: uri@bunyip.com
Subject: Re: Current URN syntax
In-Reply-To: Your message of "Wed, 26 Oct 1994 12:00:46 CDT."
<199410270120.VAA03944@lysithea.lcs.mit.edu>,
Date: Thu, 27 Oct 1994 09:42:31 -0700
From: "Roy T. Fielding" <fielding@avron.ICS.UCI.EDU>
Message-Id: <9410270942.aa27557@paris.ics.uci.edu>

I guess before I start replying to everyone's kind comments, I should
explain what appears to be a misunderstanding. I am not asking for
a rigid syntax for all URIs. I am simply asking for a compatible
syntax -- one that allows the following to be true:

1) Any URI can be extracted from a structured document (one where
delimiters may be explicit and required) without knowing a priori
whether the URI is a URL or a URN (i.e. they share a common data
type and can be delimited in the same way).

2) All URIs start with a common <scheme>: syntax (i.e. the first
part starts with /^[a-z0-9\-\+\.]+:/ -- a perl regexp) which
determines how the rest of the URI can be parsed.

Furthermore, in cases where we are inventing a new URI:

3) If the URI is hierarchical in nature AND there is a choice as
to what character to use to separate components, use "/".

4) If the URI includes parameter info AND there is a choice
as to how the parameters are to be represented, use ";name=value"
and place it after the hierarchical part.

5) If the URI includes trailing query info AND there is a choice as
to how the query info is to be separated from the rest, use "?".

Is this too much to ask? This is a serious question, because the first
two constraints allow URIs to be used within systems like WWW
without changing existing software.

The latter three simply make sense and allow for greater reuse (of both
software and brain cells).
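
A minimal sketch, in Python, of constraints 2 through 5: it checks the
common <scheme>: prefix with the regexp from point 2 and then splits the
remainder on "?", ";name=value" and "/". The function name split_uri is
hypothetical, and the ordering (parameters after the hierarchical part,
query last) is an assumption drawn from points 4 and 5, not a specified
grammar.

```python
import re

# The regexp from point 2, carried over from Perl to Python unchanged.
SCHEME_RE = re.compile(r"^[a-z0-9\-\+\.]+:")

def split_uri(uri):
    """Split a URI into scheme, path segments, parameters, and query.

    Returns None if the string does not start with <scheme>:.
    """
    match = SCHEME_RE.match(uri)
    if match is None:
        return None
    scheme = match.group(0)[:-1]          # drop the trailing ":"
    rest = uri[match.end():]

    # Point 5: trailing query info is separated by "?".
    rest, _, query = rest.partition("?")

    # Point 4: ";name=value" parameters follow the hierarchical part.
    hier, *raw_params = rest.split(";")
    params = dict(p.partition("=")[::2] for p in raw_params)

    # Point 3: hierarchical components are separated by "/".
    segments = [s for s in hier.split("/") if s]
    return {"scheme": scheme, "segments": segments,
            "params": params, "query": query}
```

For example, split_uri("isbn:0-13-949876-1") yields the scheme "isbn" and a
single segment, without the caller needing to know whether the string is a
URN or a URL.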

Karen Sollins wrote:

> Once we have decided on a syntax and any other constraints on URNs, if
> there is any chance that the same string might be recognized as a URN
> and a URL, we had better label it with something that indicates which
> it is.

Why? If the same string is both a URL and a URN, how should it be labelled?
Further, should the label be a part of the string or a part of the context
in which the string is obtained? Finally, is the action to be performed
on the string determined by the type of string, by the context in which
the string was obtained, by the action desired by the user, or some
combination of the above?

For example, let's consider the different ways I can refer to your message.
[I will use WWW constructs here only for convenience -- the actual actions
are generic to any information system.]

Assuming I have a convenient URC server containing:

URN: msgid:199410270120.VAA03944@lysithea.lcs.mit.edu
URL: http://www.acl.lanl.gov/URI/archive/uri-94q4.messages/97.html
URL: file://localhost/usr/fielding/Mail/uri/inbox/3

Then supposedly I could do something like this:

<A href="msgid:199410270120.VAA03944@lysithea.lcs.mit.edu">
or
<A href="http://www.acl.lanl.gov/URI/archive/uri-94q4.messages/97.html"
URN="msgid:199410270120.VAA03944@lysithea.lcs.mit.edu">
or
<A URI="(msgid:199410270120.VAA03944@lysithea.lcs.mit.edu|
file://localhost/usr/fielding/Mail/uri/inbox/3|
http://www.acl.lanl.gov/URI/archive/uri-94q4.messages/97.html)">

Note, however, that these imply a particular method (or purpose) -- GET.
What if what I really want to refer to is the URC associated with this URI?
We certainly don't want a different identifier for every purpose. We want
something like

<A method="URC" href="msgid:199410270120.VAA03944@lysithea.lcs.mit.edu">

Note, however, that it is just as valid to want

<A method="URC" href="file://localhost/usr/fielding/Mail/uri/inbox/3">

Similarly, we may not need URCs if we have the message stored "nearby" and
our application is "smart enough". For instance, let's assume we handle all
our mail, news, web, gopher, etc., via some super-browser called GILA.
GILA sees something like
<A href="msgid:199410270120.VAA03944@lysithea.lcs.mit.edu">
get selected and does the following:

1. Search through internal cache for that message, return it if found.
2. Search through all saved messages (news and mail) for that msgid,
return it if found.
3. Send a request for that msgid to the local news server to see if
it is still around.
4. Send a request to the local URC server/search facility to see if
we can find a URC associated with it and, hopefully, at
least one associated URL.

Now this is a fairly complex process and it should be noted that it is
ONLY valid for URNs of type "msgid". The process changes for others, e.g.
<A href="isbn:0-13-949876-1">
could prompt GILA to do

1. Search through internal cache for that ISBN, return it if found.
2. Search through my personal database of "books I own" to see if it
is sitting on my shelf, returning directions on where it can be found.
3. Send a request to the local URC server/search facility to see if
we can find a URC associated with it and if it's available on-line.
4. Send a request to the UCI library's PAC to see if it's available,
perhaps returning a form that would allow automated check-out.
5. Send a request to UC's Melvyl system to see if it can be retrieved
via inter-library loan.
6. Query the local bookstore for availability/pricing info.
7. Send a message to the publisher regarding whether or not it is
still "in print."

Some of the above are fantasies, though I think they are all in line with
what we want URNs to be capable of identifying. The key is that the
resolution process is defined by the Client, not by the URI type, and that
resolution procedures are defined by the scheme name, not by whether the
identifier is a URN or URL.
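
The two step lists above can be sketched as a table of resolution
procedures keyed by scheme name. This is only an illustration: the function
name resolve is hypothetical, and the dictionaries are stand-ins for the
cache, saved-message, and "books I own" lookups of the imagined GILA client.

```python
def resolve(uri, procedures):
    """Run the client's steps for this URI's scheme until one succeeds.

    `procedures` maps a scheme name to an ordered list of callables; each
    takes the full URI and returns a result, or None if it found nothing.
    """
    scheme, sep, _ = uri.partition(":")
    if not sep or scheme not in procedures:
        raise ValueError("no resolution procedure for: " + uri)
    for step in procedures[scheme]:
        result = step(uri)
        if result is not None:
            return result
    return None

# Hypothetical stand-ins for a client's local stores.
cache = {}
saved_mail = {"msgid:199410270120.VAA03944@lysithea.lcs.mit.edu":
              "file://localhost/usr/fielding/Mail/uri/inbox/3"}
books_i_own = {}

# The client, not the URN/URL distinction, decides the steps per scheme.
procedures = {
    "msgid": [cache.get, saved_mail.get],
    "isbn": [cache.get, books_i_own.get],
}
```

Adding a new scheme means adding one entry to the table; nothing in
resolve depends on whether the identifier is called a URN or a URL.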

Compare this to the process of obtaining something identified by a known URL:
<A href="http://www.ics.uci.edu/">
could prompt GILA to do

1. Search through internal cache for that URL, return it if found.
2. Search through local persistent cache and, if found, use its
last-modified date to perform a conditional GET via HTTP.
3. Send "GET http://www.ics.uci.edu/" to regional caching proxy,
looping through 1-3 until getting result or exhausting hierarchy.
4. Send "GET /" to www.ics.uci.edu, port 80.

A strict reading of the URL specification only reveals step 4, because that
specification ignores the fact that URLs are _identifiers_, not just access
methods, and they are quite often utilized as temporary Names.
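
Steps 3 and 4 differ only in what is sent: the proxy receives the full URL,
while the origin server receives just the path. A minimal sketch using
Python's standard urllib.parse; the function name request_target is
hypothetical, and the default port of 80 is taken from step 4.

```python
from urllib.parse import urlsplit

def request_target(url, via_proxy):
    """Return (host, port, request line) for a GET of an http URL."""
    parts = urlsplit(url)
    if parts.scheme != "http":
        raise ValueError("this sketch only handles http URLs")
    if via_proxy:
        # Step 3: the proxy is sent the whole URL; its own address is
        # configured elsewhere, so host and port are left unset here.
        return (None, None, "GET " + url)
    # Step 4: the origin server is sent only the path (and query, if any).
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    return (parts.hostname, parts.port or 80, "GET " + path)
```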

On a side note:
One question I've ignored above is: what does it mean for a URL
(like the http one above) to contain the content of your message when
it is in fact an enhanced version of it, and thus "not equal" to the msgid?
Similarly, how do we refer to a research article which has a single
content, but which may appear in several bound compilations (each with
its own ISBN)? If a search is performed for that article,
our hit list should include those compilations. I think these are issues
for which good solutions may exist in the library community. Is there
a "Contains-URN:" element in URCs?

============================================================================

Daniel LaLiberte <liberte@ncsa.uiuc.edu> wrote:

[a whole bunch of stuff which I agree with] ...

> 2) does not place arbitrary constraints on scalability
>
> Not sure what you are getting at here. Do you mean that it is up
> to the name resolver side to decide whether to delegate subtrees
> to subresolvers?

Yep, that's what I meant. Actually, I'm not sure it can be accomplished
with DNS either, but that's what I would want it to do.

> 3) does not prevent other URN schemes from coexisting
> (i.e. Message-IDs)
>
> Any flat name scheme could fit in with a hierarchical scheme at the
> top level, provided that there is no conflict with the existing top
> level names and with the syntax of the hierarchical naming scheme.
> E.g. dns:isbn/1234 fits right in, although it looks a bit silly. But
> perhaps those provisos are too restrictive.

Right, which is why I wanted to get away from the arbitrary distinctions
between URNs and URLs. For instance,

http://www.ics.uci.edu/
isbn:0-13-949876-1
msgid:9410261700.AA07853@void.ncsa.uiuc.edu
dns:/edu/uci/ics/urn/People/Fielding/Roy/public/vitae

are all distinct naming schemes (i.e. the rules for interpreting
what comes after the FIRST colon depend on the scheme name that
precedes it). Compare the above to

urn:dns:/edu/uci/ics/urn/People/Fielding/Roy/public/vitae

and you will find the presence of "urn:" only obscures things.
It also means we have to decide up-front whether or not "dns:"
represents a true URN, rather than letting that be decided by the client
and/or URC creator. The same problem occurs with "xfn:".

In contrast, there do exist some contexts in which we would like to
distinguish between URNs and URLs. One such
context is within URC templates, which is why the URN: and URL: headers
exist. URN: does not need to be permanently attached to the string.
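
A small sketch of that idea, using the URC entries shown earlier in this
message: the URN:/URL: label is read from the template header and kept
alongside the identifier rather than inside it. The function name parse_urc
is hypothetical, and the one-header-per-line layout is an assumption; no
URC format is specified here.

```python
def parse_urc(text):
    """Return a list of (label, identifier) pairs from "LABEL: value" lines."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split at the FIRST colon only; the identifier keeps its own colons.
        label, sep, identifier = line.partition(":")
        if not sep:
            raise ValueError("missing header label: " + line)
        entries.append((label.strip().upper(), identifier.strip()))
    return entries

urc = """
URN: msgid:199410270120.VAA03944@lysithea.lcs.mit.edu
URL: http://www.acl.lanl.gov/URI/archive/uri-94q4.messages/97.html
URL: file://localhost/usr/fielding/Mail/uri/inbox/3
"""
entries = parse_urc(urc)
```

Each identifier comes out exactly as it would appear in an href, with the
URN/URL distinction carried by the surrounding record.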

> ...
>
> Hmm, for scalability, handling of services must be pushed down to
> clients as much as possible (and then the remainder to be handled by
> remote servers must be manageable). The more different kinds of
> schemes that clients are required to know about, the larger their
> code grows. I suppose market pressures will tend to keep this within
> reasonable bounds though.

Yes, and so will proxy servers. This answers Larry's question about
why in WWW there exist separate scheme_proxy variables instead of just one
proxy for all schemes. New proxies can be invented as required and their
distribution can reflect the needs of a particular scheme, rather than the
total needs of all schemes.
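
A minimal sketch of the per-scheme lookup, assuming the variables are named
<scheme>_proxy (for example http_proxy or gopher_proxy) and hold the
address of the proxy. The function name proxy_for is hypothetical.

```python
import os

def proxy_for(uri, env=None):
    """Return the proxy configured for this URI's scheme, or None."""
    if env is None:
        env = os.environ
    scheme, sep, _ = uri.partition(":")
    if not sep:
        return None
    # One variable per scheme, so each scheme can have its own proxy.
    return env.get(scheme.lower() + "_proxy")
```

With env = {"gopher_proxy": "http://proxy.example/"} (a made-up address), a
gopher: URI is routed to that proxy, while an http: URI has no variable set
and goes direct.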

> ...
> ... Similarly, any given entity is likely to have
> multiple URN names.
>
> This seems like a completely different subject and a big can o' worms.

Yes, but it is a very important can of worms.

> I'll agree that any particular entity might be *referenced* indirectly
> via the URCs of multiple URNs. For example, a URN for the collection
> of all versions of a document might intersect with a URN for the
> collection of all representations of the current version of a
> document. But every URN is still unique and corresponds to a unique
> entity. I like to think of the URC as the object identified by a URN.

Any given entity may be identified by multiple URNs -- this is inevitable
if we are to allow grandfathering of existing name systems.
Consider, for instance, what would happen if some enterprising young
clerk were to assign ISBN numbers to all of the Internet RFCs. Does that
invalidate RFC numbers from being considered URNs? No. How about a book
that is published first on-line and later in hardcopy? Just think of how
many different URN-like identifiers are assigned to you as a person.
Why should IIIA objects be any different?

The important requirement for URNs is that there be a one-to-one
correspondence per URN *scheme*. This needs to be considered when
defining the structure of URCs. It should also be noted that only the
client knows which URN scheme is preferable for any given situation.

We could, however, obtain a one-to-one correspondence between
each URN and each named object, but only if there is a strict partition
of all objects among the naming authorities. Unfortunately, that would
be quite impossible with grandfathering, and unlikely in any case.
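
One way to picture the per-scheme requirement is a registry in which an
entity may carry several URNs, but at most one per scheme, and each URN
names exactly one entity. This is a sketch only; the class name UrnRegistry
and the identifiers used with it are hypothetical.

```python
class UrnRegistry:
    """One-to-one between names and entities within each scheme."""

    def __init__(self):
        self._by_urn = {}      # (scheme, name) -> entity
        self._by_entity = {}   # (scheme, entity) -> name

    def assign(self, urn, entity):
        scheme, sep, name = urn.partition(":")
        if not sep:
            raise ValueError("missing scheme: " + urn)
        if (scheme, name) in self._by_urn:
            raise ValueError(urn + " already names an entity")
        if (scheme, entity) in self._by_entity:
            raise ValueError("entity already has a name in scheme " + scheme)
        self._by_urn[(scheme, name)] = entity
        self._by_entity[(scheme, entity)] = name

    def urns_for(self, entity):
        """All URNs for an entity, at most one per scheme."""
        return sorted(s + ":" + n for (s, e), n in self._by_entity.items()
                      if e == entity)
```

Assigning both "rfc:example-1" and "isbn:example-1" to the same document
succeeds, while a second "rfc:" name for that document is rejected.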

> [more good stuff deleted to save space ...]
>
> URNs are essentially URLs for URCs. The fact that URCs are returned
> does make the use of URNs different from that of URLs, unless we unify
> URCs with documents and identify a URC with a document type. e.g. the
> first line before a URC might be "Content-type: urc/asn.1". Maybe
> that is not a bad idea.

That is why I have, in the past, referred to URNs as a subset of URLs --
because this type of redirection is an implementation detail, not a
fundamental difference between the two types of identifiers. What is
important is not that a URN can be used to retrieve URCs, but rather
that URNs are *much better* for retrieving URCs due to their "longer
life" and one-to-one correspondence.

=========================================================================

I have been an active member of this list for six months, and before that
I read through ALL of the messages on the archive (it took three days to
do so). Although these issues may have been discussed in private, or
decided upon at a particular IETF meeting, they have not been discussed
on the list as a whole. Nor, by any stretch of the imagination, has any
consensus been reached on the nature of URNs and the actual URN syntax.
The reason these issues are coming up now is because this group has recently
shifted its focus from URLs to URN/URCs, not because of any influx of
newbies.

Finally, I feel a need to point out to everyone here that the URN Functional
Requirements document has been submitted as an *Informational* RFC.
If, in the course of attempting to implement URNs, one or more of the stated
"requirements" must be relaxed, then those requirements WILL be re-opened
for discussion. Under no circumstances should those requirements be
considered "written in stone" -- they merely reflect the desired
characteristics of what this group currently considers to be URNs.

......Roy Fielding ICS Grad Student, University of California, Irvine USA
<fielding@ics.uci.edu>
<URL:http://www.ics.uci.edu/dir/grad/Software/fielding>