To: uri@bunyip.com
Subject: Re: Another snapshot of the URL document.
In-Reply-To: Your message of "Sat, 02 Jul 1994 00:19:40 PDT."
             <94Jul2.001942pdt.2760@golden.parc.xerox.com>
Date: Tue, 05 Jul 1994 04:49:12 -0700
From: "Roy T. Fielding" <fielding@simplon.ICS.UCI.EDU>
Message-Id: <9407050449.aa07241@paris.ics.uci.edu>
What follows are my comments/suggestions for the July 2 version of the
URL spec. I have included minor details such as grammar and spelling
errors since I think the document is close to being complete, even though
it still fails to define a uniform syntax for URLs.
> ========================================================================
> INTERNET DRAFT T. Berners-Lee
> Uniform Resource Locators L. Masinter
> Expires January 6, 1995 M. McCahill
> Editors
> July 2, 1994
>
> Uniform Resource Locators (URL)
>
> A Syntax for the Expression of
> Access Information of Objects on the Network
I'm sorry, but I don't like the title (it makes my tongue hurt ;-)
How about:
A Common Syntax for Identifying Information
Objects by Access Method and Network Location
> Status of this memo
>
> This document is an Internet-Draft. Internet-Drafts are
> working documents of the Internet Engineering Task Force
> (IETF), its areas, and its working groups. Note that other
> groups may also distribute working documents as
> Internet-Drafts.
>
> Internet-Drafts are draft documents valid for a maximum of six
> months. Internet-Drafts may be updated, replaced, or obsoleted
> by other documents at any time. It is not appropriate to use
> Internet-Drafts as reference material or to cite them other
> than as a ``working draft'' or ``work in progress.''
>
> To learn the current status of any Internet-Draft, please check
> the 1id-abstracts.txt listing contained in the Internet-Drafts
> Shadow Directories on ds.internic.net, nic.nordu.net,
> ftp.isi.edu, or munnari.oz.au.
>
> This Internet Draft expires January 5, 1995.
>
> 0. Abstract
>
> This document specifies a Uniform Resource Locator (URL), the
> syntax and semantics of formalized information for location and
> access of resources on the Internet.
>
> 1. Introduction
>
> The work is derived from concepts introduced by the World-Wide Web
> global information initiative, whose use of such objects dates
> from 1990 and is described in "Universal Resource Identifiers in
> WWW", RFC 1630.
>
> This document was written by the URI working group of the Internet
> Engineering Task Force. Comments may be addressed to the editor,
> Tim Berners-Lee <timbl@info.cern.ch>, or to the URI-WG
> <uri@bunyip.com>. Discussions of the group are archived at
> <URL:http://www.acl.lanl.gov/URI/archive/uri-archive.index.html>
>
> 2. Recommendations
>
> This section describes the syntax for "Uniform Resource Locators"
> (URLs): that is, basically physical addresses of objects which are
> retrievable using protocols already deployed on the net. The
> generic syntax provides a framework for new schemes for names to be
> resolved using as yet undefined protocols.
URLs do not represent physical addresses, just as simplon.ics.uci.edu is not
the physical address of the machine I'm typing on. I would rewrite it as:
This section describes the syntax for "Uniform Resource Locators"
(URLs): addresses of information objects which are retrievable
using protocols deployed on the net. The generic syntax provides
a framework for new naming schemes to be resolved using as yet
undefined protocols.
> The syntax is described in two parts. First, we give the syntax
> rules of a completely specified name; second, we give the rules
> under which parts of the name may be omitted in a well-defined
> context.
>
> 2.1. URL SYNTAX
>
> A URL consists of a naming scheme specifier followed by a string
> whose format is a function of the naming scheme. A BNF description
> of the URL syntax is given in section 5. URLs are written as
>
> URL:<scheme>:<scheme-specific-part>
Hmmm.. I've always thought of it as an "access scheme" rather than a
"naming scheme", but I guess they are effectively equivalent. If it is
the latter, then there is no difference whatsoever between the URL and URN
syntax. So, I must therefore ask what is the purpose of this document?
> 2.1.1. URL Label
>
> URLs that appear in other streams of data must start with a
> constant prefix "URL:". This prefix is used to identify the URL and
> distinguish it from other possible protocol elements.
Could someone please define "other streams of data"? In my opinion, this
whole idea of prefixing all URLs with "URL:" is just plain silly. I seriously
disagree with the idea of standardizing something which has neither been
proven by implementation nor even legitimized by reasoned argument.
Are there no champions for this cause? If not, save us all a lot of grief
and delete it from the spec.
> 2.1.2. Scheme
>
> Within the URL of a object, the first element is the name of the
> scheme, separated from the rest of the object by a colon. The rest
> of the URL follows the colon in a format depending on the scheme.
Uh, the first sentence does not parse, the second is redundant, and neither
describes the syntax of scheme (and the later BNF for scheme is incorrect).
How about:
The first element of a URL is the scheme name, followed immediately
by a colon. The scheme name defines the specific rules to be used
in parsing the remainder of the URL and, in most cases, the access
method for obtaining the information object identified by the URL.
Existing scheme names are defined in section 3. Section 4 describes
how new schemes can be defined and registered.
The allowed character set for scheme names includes the lowercase
ASCII letters ("a"--"z"), the ASCII digits ("0"--"9"), and the
characters plus ("+"), hyphen ("-"), and underscore ("_"). Characters
in a scheme name must not be encoded (see section 2.3). There is
no defined limit to the length (number of characters) of scheme
names, though brevity is recommended.
Is there any reason to include other characters in the allowed set?
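For what it's worth, the proposed rule is easy to check mechanically. The
following is only a rough Python sketch of the character set suggested above;
the names SCHEME_RE and split_scheme are made up for illustration and appear
nowhere in the draft.

```python
import re

# Proposed scheme-name rule: one or more of a-z, 0-9, "+", "-", "_".
# SCHEME_RE and split_scheme are illustrative names, not part of the draft.
SCHEME_RE = re.compile(r"[a-z0-9+\-_]+")


def split_scheme(url):
    """Return (scheme, rest) for a URL, or raise ValueError.

    The scheme is everything before the first colon. Encoded characters
    ("%" escapes) are rejected because "%" is not in the allowed set.
    """
    scheme, sep, rest = url.partition(":")
    if not sep or not SCHEME_RE.fullmatch(scheme):
        raise ValueError("no valid scheme name in %r" % url)
    return scheme, rest
```

Under this sketch, split_scheme("x-test:abc") is accepted, while an uppercase
or percent-encoded scheme name is rejected.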
> 2.1.3. Scheme Specific Part
>
> The syntax for the rest of the URL varies depending on the scheme.
> However, there is a common syntax shared by many schemes that use
> IP-based protocols to specified Internet hosts.
This section cries out for more definition along the lines Larry proposed
in an earlier e-mail message. In particular, I do not think it is valid
to call these "Uniform" Resource Locators if there is no useful uniformity
across the access schemes. At a minimum, the usefulness of "/" indicating
hierarchy and the full appendix describing relative URLs should be
reinstated. THIS DOCUMENT IS INCOMPLETE WITHOUT THEM. A relative URL
is of universal scope when it is placed within the context of an absolute
base URL (i.e. as a tuple, they meet the URI requirements of global scope).
To ignore that fact is simply irresponsible, since it leaves unspecified an
important quality of URLs.
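To make the tuple idea concrete, here is a small Python sketch that resolves
relative URLs against an absolute base. It leans on urljoin from the Python
standard library, which implements later relative-URL rules rather than
anything in this draft, and the host www.example.org and the helper name
absolute are assumptions for illustration only.

```python
from urllib.parse import urljoin

# Hypothetical base URL; the host and paths are placeholders.
BASE = "http://www.example.org/dir/grad/index.html"


def absolute(relative, base=BASE):
    """Resolve a relative URL in the context of an absolute base URL."""
    return urljoin(base, relative)


# The pair (base, relative) identifies the same object as the result.
examples = {rel: absolute(rel) for rel in ("notes.html", "../faculty/", "/top.html")}
```

Here "../faculty/" resolves to http://www.example.org/dir/faculty/, which only
works because "/" is treated as a hierarchy delimiter.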
>
> 2.2. Encoding Characters within URL components
>
> URLs are represented as a sequence of characters in a limited
> character set. Characters are used to represent the 8-bit byte
> that corresponds to their ASCII encoding.
>
> That character set consists of the alphanumeric characters and some
> printable characters; the set of allowed characters consists of the
> alphanumerics and most printable ASCII characters, with the
> exception of "#" and "%".
I believe that this should also include the characters commonly used
as delimiters for URLs, i.e. "<", ">", and """ (doublequote). If not,
then those characters cannot be reliably used as delimiters.
> Many URL schemes require a representation of arbitrary 8-bit bytes,
> including those whose ASCII characters are not allowed within their
> syntactic component. There is a standard way, known as
> `URL-encoding', to encode bytes that are otherwise disallowed:
> bytes are encoded by representing them as a percent sign "%"
> followed by two hexadecimal digits (0-9, A-F).
>
> The characters space, "#" and "%", even though they are printable
> ASCII, are not allowed within any URL and must be encoded. The
> definition of a scheme will identify which additional characters
> are reserved within the scheme or component.
delete
xxxxxxxxxxx/
> Any allowed character (even an alphanumeric characters) may
> optionally be encoded within the scheme specific part of a URL.
> Note, however, that encoding a reserved character for a particular
> scheme may change the semantics of a URL; thus, encoding characters
> within a URL cannot be be done by gateway agents or other software
> agents.
That should be "encoding reserved characters within a URL cannot be ..."
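A quick Python sketch of why the distinction matters, assuming a scheme in
which "/" is reserved as a path delimiter. The helper names escape and
segments are hypothetical, and unquote comes from the Python standard library
rather than from this draft.

```python
from urllib.parse import unquote


def escape(ch):
    """Encode one ASCII character as "%" followed by two hex digits."""
    return "%{:02X}".format(ord(ch))


def segments(path):
    """Split on literal "/" first, then decode each segment."""
    return [unquote(s) for s in path.split("/")]


# Encoding an unreserved character leaves the meaning unchanged ...
same = segments("a/b") == segments(escape("a") + "/b")

# ... but encoding the reserved "/" turns two segments into one.
changed = segments("a" + escape("/") + "b")
```

The first comparison holds, while the second path yields the single segment
"a/b", so an agent that blindly encodes reserved characters changes what the
URL refers to.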
Insert "it" before "is" in the next line:
> To avoid confusion, is strongly recommended that all `unsafe'
> characters be encoded; that is, all characters except the
> alphanumerics, "$", "-", "_", "@", ".", "&", "+".
>
> 3. Specific Schemes
>
> The mapping for some existing standard and experimental protocols
> is outlined in the BNF syntax definition. Notes on particular
> protocols follow. The schemes covered are:
>
> ftp File Transfer protocol
> http Hypertext Transfer Protocol
> gopher The Gopher protocol
> mailto Electronic mail address
> news USENET news
> nntp USENET news using NNTP access
> telnet Reference to interactive sessions
> wais Wide Area Information Servers
>
> Other schemes may be specified by future specifications. Section 4
> of this document describes how new schemes may be registered.
>
> 3.1. Common Internet Scheme Syntax
>
> While the syntax for the rest of the URL may vary depending on the
> particular scheme selected, URL schemes that involve the direct use
> of an IP-based protocol to a specified host on the Internet use a
> common syntax for the initial part of the scheme-specific data:
>
> //<user>:<password>@<host>:<port>
> //<user>:<password>@<host>:<port>/<url-path>
>
> This initial part starts with a double slash "//" to indicate its
> presence, and continues until the following slash "/", if any.
> Within this section are:
>
> user
> An optional user name. Some schemes (e.g., ftp) allow the
> specification of a user name.
>
> password
> An optional password. If present, it follows the user
> name separated from it by a colon.
>
> The user name (and password), if present, are followed by a
> commercial at-sign "@". Within the user and password field,
> any ":", "@", or "/" must be encoded.
>
> host
> The fully qualified domain name of a network host, or its IP
> address as a set of four decimal digits separated by periods.
>
> port
> The (optional) port number to connect to. Most schemes designate
> protocols that have a default port number. Another port number
> may optionally be supplied, in decimal, separated from the
> host by a colon.
>
> url-path
> The rest of the locator consists of data specific to the
> scheme, and is known as the "url-path". It supplies the
> details of how the specified resource can be accessed.
>
> The url-path is interpreted in a manner dependent on the scheme
> being used.
>
> 3.2. FTP
>
> The FTP URL scheme is used to designate files and directories on
> Internet hosts accessible using the FTP protocol (RFC959).
>
> FTP URLs follow the syntax described in section 3.1. The port
> number, if present, gives the port of the FTP server if not the FTP
> default (23).
FTP is ---> 21
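As an illustration of the common syntax in section 3.1 with the corrected
default, here is a hedged Python sketch. It uses urlsplit from the standard
library, which follows later URL specifications, so treat it as an
approximation; login_parts, DEFAULT_PORTS, and the example.org host names are
assumptions of mine.

```python
from urllib.parse import urlsplit

# Well-known default ports for a few of the schemes listed in section 3.
DEFAULT_PORTS = {"ftp": 21, "telnet": 23, "gopher": 70, "http": 80}


def login_parts(url):
    """Split //<user>:<password>@<host>:<port>/<url-path> into its fields.

    Note: urlsplit keeps the leading "/" on the path, whereas the draft's
    url-path is the part after that slash.
    """
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "user": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": port,
        "url_path": parts.path,
    }
```

For example, login_parts("ftp://ftp.example.org/pub/README") reports port 21,
and an explicit ":2121" after the host overrides it.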
> 3.2.1. FTP Name and Password
>
> A user name and password may be supplied. If no user name or
> password is supplied and one is requested by the FTP server, the
> conventions for "anonymous" FTP are to be used, as followed:
>
> The user name "anonymous" is supplied.
>
> The password is supplied as the Internet e-mail address
> of the end user accessing the resource.
>
> If the URL supplies a user name but no password, and the remote
> server requests a password, the program interpreting the FTP URL
> should request one from the user if the remote FTP server requests
> a password.
Delete the trailing words "if the remote FTP server requests a password";
they repeat the condition already stated earlier in the sentence.
[... skipping stuff I have no comments for ...]
> 4. REGISTRATION OF NAMING SCHEMES
>
> A new naming scheme may be introduced by defining a mapping onto a
> conforming URL syntax, using a new prefix. Experimental prefixes
> may be used by mutual agreement between parties, and must start
> with the characters "x-". The scheme name "URN:" is reserved for
> the work in progress on a scheme for more persistent names.
I disagree with the notion that experimental scheme names should begin
with an "x-". This is a very bad habit -- defining names such that
they can never be truly tested before they are written in stone. Although
this has been the past practice for Internet names, I believe it would
be a serious mistake to continue applying it to names that are not
involved in protocol handshake mechanisms (this comment applies equally
to MIME types, but that bridge has already burned to the ground).
> The Internet Assigned Numbers Authority (IANA) will maintain a
> registry of URL schemes. Any submission of a new URL scheme must
> include a definition of an algorithm for accessing of resources
> within that scheme.
>
Spelling in the next line: "demonstratable" should be "demonstrable".
> URL schemes must have demonstratable utility and operability. One
Wording in the next line: "way" -> "such way", "gateway" -> "a gateway",
and "will provide" -> "provides".
> way to provide a demonstration is via gateway which will provide
> objects in the new scheme for clients using an existing protocol.
> If the new scheme does not locate resources that are data objects,
> the properties of names in the new space must be clearly defined.
>
> It is likewise recommended that, where a protocol allows for
> retrieval by URL, that the client software have provision for being
> configured to use specific gateway locators for indirect access
> through new naming schemes.
>
> 5. BNF for specific URL schemes
** Changes are marked with an asterisk on left margin
++ Additions are marked with a plus on left margin
?? Question marks point out questionable syntax based on prior comments
Note that wpath is still ambiguous.
> This is a BNF-like description of the Uniform Resource Locator
> syntax, using the conventions of RFC822, except that "|" is used to
> designate alternatives. Briefly, literals are quoted with "",
> optional elements are enclosed in [brackets], and elements may
> be preceded with <n>* to designate n or more repetitions of the
> following element; n defaults to 0.
>
?? url = "URL:" unlabelled
> unlabelled = httpaddress | ftpaddress | newsaddress |
> nntpaddress | telnetaddress | gopheraddress |
> waisaddress | mailtoaddress | otheraddress
>
> otheraddress = scheme ":" schemepart
** scheme = 1*[ lowalpha | digit | "+" | "-" | "_" ]
> schemepart = *xchar
>
> login = [ user [ ":" password ] "@" ] hostport
> hostport = host [ ":" port ]
> host = hostname | hostnumber
** hostname = alpha *uchar
> hostnumber = digits "." digits "." digits "." digits
> port = digits
** user = 1*[ uchar | ";" | "?" ]
** password = 1*[ uchar | ";" | "?" ]
>
** ftpaddress = "ftp://" login [ path [ ";type=" ftptype ] ]
** path = "/" [ segment *[ "/" segment ] ]
> segment = 1*[ uchar | "?" | ":" | "@" ]
** ftptype = "a" | "i" | "d"
>
** httpaddress = "http://" hostport [ path [ "?" search ] ]
> search = *[ uchar | ";" | ":" | "@" ]
>
> gopheraddress = "gopher://" hostport [ / [ gtype [ selector
> [ "%09" search [ "%09" gopher+_string ] ] ] ] ]
> gtype = xchar
> selector = *xchar
** gopher+_string = 1*xchar
>
> mailtoaddress = "mailto:" encoded822addr
** encoded822addr = 1*xchar
>
> newsaddress = "news:" grouppart
> grouppart = "*" | group | article
> group = alpha *[ alpha | digit | "-" | "." ]
** article = 1*[ uchar | ";" | "/" | "?" | ":" ] "@" host
>
** nntpaddress = "nntp://" hostport [ "/" group [ "/" digits ] ]
>
> telnetaddress = "telnet://" login [ "/" ]
>
> waisaddress = waisindex | waisdoc
> waisindex = "wais://" hostport [ "/" [ database [ "?" search ] ] ]
> waisdoc = "wais://" hostport "/" database "/" wtype "/" wpath
> database = 1*uchar
> wtype = 1*uchar
** wpath = 1*[ digits "=" *xchar ";"]
++ lowalpha = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
++ "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
++ "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
++ hialpha = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
++ "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
++ "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"
** alpha = lowalpha | hialpha
> digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
> "8" | "9"
> safe = "$" | "-" | "_" | "." | "&" | "+"
** extra = "!" | "*" | "'" | "(" | ")" | "," | "="
> national = "{" | "}" | "|" | "\" | "^" | "~" | "[" | "]"
** punctuation = "<" | ">" | """
> reserved = ";" | "/" | "?" | ":" | "@"
** hex = digit | "A" | "B" | "C" | "D" | "E" | "F"
> escape = "%" hex hex
** unreserved = alpha | digit | safe | extra | national
> uchar = unreserved | escape
> xchar = unreserved | reserved | escape
> digits = 1*digit
>
>
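The character classes at the end of the BNF can be exercised with a short
Python sketch. This is only one reading of the marked-up rules above
(uppercase hex digits only, punctuation excluded from unreserved); the
function name is_xchars and the allow_reserved flag are hypothetical.

```python
import re
import string

# Character classes transcribed from the BNF above.
LOWALPHA = set(string.ascii_lowercase)
HIALPHA = set(string.ascii_uppercase)
DIGIT = set(string.digits)
SAFE = set("$-_.&+")
EXTRA = set("!*'(),=")
NATIONAL = set("{}|\\^~[]")
RESERVED = set(";/?:@")
UNRESERVED = LOWALPHA | HIALPHA | DIGIT | SAFE | EXTRA | NATIONAL

# escape = "%" hex hex, with hex limited to 0-9 and A-F as marked above.
ESCAPE_RE = re.compile(r"%[0-9A-F]{2}")


def is_xchars(text, allow_reserved=True):
    """True if text is *xchar, or *uchar when allow_reserved is False."""
    allowed = UNRESERVED | (RESERVED if allow_reserved else set())
    i = 0
    while i < len(text):
        if text[i] == "%":
            # A "%" is only valid as the start of a complete escape.
            if not ESCAPE_RE.match(text, i):
                return False
            i += 3
        elif text[i] in allowed:
            i += 1
        else:
            return False
    return True
```

With these sets, "<", ">", and the doublequote are rejected everywhere, which
is what lets them serve as delimiters around a URL.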
[...]
APPENDIX A: URLs in Plain Text.
>
This is the only case in which URL: should be recommended as a prefix, i.e.
<URL:xxxx> indicates that xxxx is a URL in plain text.
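A minimal Python sketch of how a reader program might pick such URLs out of
plain text, assuming the URL itself contains no "<", ">", doublequote, or
whitespace (so a URL folded across lines is not handled). WRAPPED_URL_RE and
find_wrapped_urls are names invented for this example.

```python
import re

# Matches <URL:xxxx> and captures xxxx; delimiters and whitespace end the URL.
WRAPPED_URL_RE = re.compile(r'<URL:([^<>"\s]+)>')


def find_wrapped_urls(text):
    """Return every URL that appears as <URL:...> in the given text."""
    return WRAPPED_URL_RE.findall(text)


sample = (
    "Discussions are archived at "
    "<URL:http://www.acl.lanl.gov/URI/archive/uri-archive.index.html> "
    "and drafts live on <URL:ftp://ftp.isi.edu/>."
)
found = find_wrapped_urls(sample)
```

This only works reliably if the delimiter characters are excluded from the
allowed URL character set, as suggested earlier.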
[...]
APPENDIX B: Partial URLs relative to an absolute base URL
That's it (as if that weren't too much already ;-)
....Roy Fielding ICS Grad Student, University of California, Irvine USA
(fielding@ics.uci.edu)
<A HREF="http://www.ics.uci.edu/dir/grad/Software/fielding">About Roy</A>