Re: how to make progress on the URL document

Tim Berners-Lee (timbl@ptpc00.cern.ch)
Wed, 23 Mar 94 18:58:05 +0100

Date: Wed, 23 Mar 94 18:58:05 +0100
From: Tim Berners-Lee <timbl@ptpc00.cern.ch>
Message-Id: <9403231758.AA14263@ptpc00.cern.ch>
To: "Mark P. McCahill" <mpm@boombox.micro.umn.edu>
Subject: Re: how to make progress on the URL document

Mitra and Mark, you ask for diffs. You're not going to like them
because the formatting messes it up quite a lot, but for what it's worth
here it is.

Tim

diff url-spec.txt /pub/www/doc/draft-uri-url-02.txt
2,3c2,3
< draft-ietf-uri-url-03.{ps,txt} URI working Group
< Expires 21 September 1994 21 March 1994

---
> draft-ietf-uri-url-02.{ps,txt}                             CERN
> Expires 1 July 1994                                  1 Jan 1994
8,9c8,9
<                   A Syntax for the Expression of
<              Access Information of Objects on the Network
---
>              A Unifying Syntax for the Expression of
>           Names and Addresses of Objects on the Network
12,23c12
<                          ABOUT THIS DOCUMENT
<                                    

<    This document specifies a Uniform Resource Locator (URL), the
<    syntax and semantics of formalized information for location and
<    access of resources on the Internet.
<
<    This document was written by the URI working group of the Internet
<    Engineering Task Force.  Comments may be addressed to the editor,
<    Tim Berners-Lee <timbl@info.cern.ch>, or to the URI-WG
<    <uri@bunyip.com>. Discussions of the group are archived at
<
<    <http://www.acl.lanl.gov/URI/archive/uri-archive.index.html>
---
> Status of this memo
25,41d13
<    This document is bound by the Requirements Specification in
<    preparation.
<    

<    The work is derived from concepts introduced by the World-Wide Web
<    global information initiative, whose use of such objects dates
<    from 1990 and is described in "Universal Resource Identifiers for
<    the World-Wide Web", RFCXXX.
<
<    This document is available in hypertext form, with links to
<    background information, as:
<
<    <http://info.cern.ch/hypertext/WWW/Addressing/URL/Overview.html>
<
<    .
<
<    STATUS OF THIS MEMO
<

53c25,29
<    Distribution of this document is unlimited.
---
>    Distribution of this document is unlimited.  Please send comments
>    to the author as timbl@info.cern.ch. or to the discussion list
>    ietf-url@merit.edu.
>
> Abstract

54a31,53
>    Many protocols and systems for document search and retrieval are
>    currently in use, and many more protocols or refinements of
>    existing protocols are to be expected in a field whose expansion is
>    explosive.
>
>    These systems are aiming to achieve global search and readership of
>    documents across differing computing platforms, and despite a
>    plethora of protocols and data formats.  As protocols evolve,
>    gateways can allow global access to remain possible. As data
>    formats evolve, format conversion programs can preserve global
>    access. There is one area, however, in which it is impractical to
>    make conversions, and that is in the names and addresses used to
>    identify objects. This is because names and addresses of objects
>    are passed on in so many ways, from the backs of envelopes to
>    hypertext objects, and may have a long life.
>
>    A common feature of almost all the data models of past and proposed
>    systems is something which can be mapped onto a concept of "object"
>    and some kind of name, address, or identifier for that object. One
>    can therefore define a set of name spaces in which these objects
>    can be said to exist.
>
>    Practical systems need to access and mix objects which are part of
56a56
>
58a59,467
>    different existing and proposed systems.
>
>    This paper discusses the requirements on a universal syntax which
>    can be used to encapsulate a name in any registered name space.
>    This will allow names in different spaces to be treated in a common
>    way, even though names in different spaces have differing
>    characteristics, as do the objects to which they refer.
>
>    The universal syntax to objects available using existing protocols,
>    and may be extended with technology.  It makes a recommendation for
>    a generic syntax, and for specific forms for "Uniform Resource
>    Locators" (URLs) of objects accessible using existing Internet
>    protocols.
>
>    The syntax has been in widespread use by World-Wide Web software
>    since 1990.
>
> Terms
>
>    The objects on the network which are to be named and addressed
>    include typically objects which can be retrieved, and objects which
>    can be searched.  There is a great variety of other objects which
>    may support other operations. We imply nothing about the contents
>    of objects in this document. Whereas human-readable documents are
>    currently the center of interest of the field, we envisage all
>    aspects discussed in this paper applying to generalized objects
>    when systems to handle them become available. The "object" is the
>    unit of reference and need not correspond to any unit of storage.
>    We refer to objects which can be searched as "indexes". We
>    emphasize that this is the abstract view of the client, and these
>    objects need not correspond to physical files on computers. We
>    refer to the person who does the retrieval or searching as the
>    user.
>
>    Within this document, we use the term "name" very generally for a
>    string of characters describing an object, whatever its
>    combination of properties mentioned below. (The term usually has a
>    narrower meaning but we needed some term for the universal set.)
>    This uniform syntax applied to a generic name is known as a Uniform
>    Resource Identifier (URI). The term "address" is reserved for a
>    string which specifies a more or less physical location. The term
>    "locator" refers to a URL as here defined. URIs which have a
>    greater persistence than URLs are referred to as URNs.
>

> Characteristics

>

> This section characteristics of various naming schemes, > requirements which some ofexisting schemes meet, and requirements > for the URL scheme itself. URLs, as an introduction of and > background for the Recommendations section.

>

> USES OF NAMES AND ADDRESSES

>

>

>

>

> Berners-Lee 2 >

> A name allows a user, with the help of a "client" program, to > retrieve or operate on objects via a "server" program. A name may > be passed for example:

>

> In communication of any form between two people, to refer to a > document, or part of a document;

>

> As part of the description of a link associated with a hypertext > document;

>

> As part of the result of searching an index.

>

> Some typical requirements on a name which are met to a varying > degree by various schemes are for example that the name is

>

> Persistent A given name will remain valid as long as it > is needed;

>

> Extensible A given naming syntax will remain valid > through the introduction of new protocols and > directory technologies;

>

> Resolvable A name will contain enough information to > allow the document or index to which it > refers to be accessed, perhaps via resolution > into an intermediate, more physical, name.

>

> Unique Each object can only have one such name.

> The fact that two such names are different > implies that the objects to which they refer > are different (in some way).

>

> Unambiguous The fact that two names are identical > implies that the objects named are the same > (in some way).

>

> The syntax discussed is the syntax of one name, be it a lasting > name or a physical address. When a directory server or hypertext > link contains a set of alternative names, then that is beyond the > scope of this syntax. Similarly, a syntax for describing a > compound object is outside the scope of this syntax. The specific > locator name spaces (defined under the umbrella of the general > syntax) each meet the requirements above to a greater or lesser > extent.

>

> CURRENT PRACTICE

>

> Current protocols use many different standards for names. For some > protocols, such as ISO-10163 Search and Retrieve protocol[16], the > names returned in a search are only valid during the session. For > others, such as FTP[9], they are lasting names which may be used > for object retrieval at a later time. Typically, however, they are > not long-lasting names which are independent of the location of the >

>

>

> Berners-Lee 3 >

> object. Such names may be provided using directory servers such as > x.500. They will refer to the registration, however formal or > informal, of a object with a particular organisation or person.

> Both hypertext and manual references rely on long- lasting names.

> Current names are basically location specifiers (addresses). These > may be known as Uniform Resource Locators (URLs). They give the > necessary parts of an address for a reader to access an information > provider using the given protocol, and ask for the object required. > Examples of names used by various protocols include

>

> File Transfer Protocol (Postel 1985): >

> Host name or IP-address

>

> [TCP port]

>

> [user name, password]

>

> Filename

>

> W.A.I.S. (Kahle 1990) >

> Host name or IP-address

>

> [TCP port]

>

> local document id

>

> Gopher (Alberti 1991) >

> Host name or IP-address

>

> [TCP port]

>

> database name

>

> selector string

>

> HTTP (Berners-Lee 1991) >

> Host name or IP-address

>

> [TCP port]

>

> local object id

>

> NNTP (Kantor 1986) >

> NNTP group >

> Group name

>

> NNTP article >

>

>

> Berners-Lee 4 >

> Host name

>

> unique message identifier

>

> Prospero links (Neuman 1992) >

> Host name or IP address

>

> [UDP port]

>

> Host specific object name

>

> [version]

>

> [identifier]*

>

> x.500 distinguished name >

> Country

>

> Organisation

>

> Organisational unit

>

> Person

>

> Local object identifier

>

> Other systems with their own naming schemes include BITNET > "LISTSERV" application, FTAM file retrieval, SQLnetTM remote > database search, proprietary distributed file systems, etc. > Conventional syntax for writing these addresses involve various > forms of punctuation to separate these parts. This sometimes, but > not always, allows the naming scheme to be deduced from the > punctuation. For example, a name of the form > xxx.yyy.zz.edu:/pub.aa.bb.cc often implies anonymous FTP access. > However, there is no well-defined algorithm for parsing an > arbitrary name, as there is no common syntax.

>

> EXPANDABILITY

>

> There will necessarily be a phase during which lasting names will > become more common, as the deployment of directory services > increases to the point where every user has direct or indirect > access to one. Even then, however, one can envisage more than one > competing directory system, and cases in which physical names are > still required. A directory service takes a lasting name and > reduces it to a physical address (or set of addresses) which, > though less useful for lasting reference, is the only way to > actually retrieve the object. An addressing syntax is required > which will be able to encompass existing physical address spaces, > and be extendible to any future protocols. This requires that it > contain an identifier for the protocol in use. The format of the >

>

>

> Berners-Lee 5 >

> rest of the address will necessarily depend to a certain extent on > the protocol.

>

> RELEVANCE

>

> The life of a name is limited by any information contained within > it which may become prematurely invalid. It is therefore necessary > to limit the contents of a name to the information required for the > operations above. Other extraneous information about the object > (its size, data format, authorisation details, etc.) may in general > change with time and should not be part of the name. One might > expect such information to be part of the "header" of a object, and > for protocols to allow the header information to be retrieved > independently of the objects themselves. Any physical address may > be subject to change with time: hence we encourage the move to > lasting names and directory services.

>

> UNIQUENESS

>

> Clearly one requires unambiguous names in the sense that one name > should refer to only one logical object. This is the case with all > the addressing schemes in use, whether they are directory systems > or physical addresses. (The internet addresses all rely on the > domain name (Mockapetris 1987) of the host to achieve this). > However, given that names can be translated, many apparently > different names may lead to the same object. Any object may > therefore be referred to by many names. One needs to be able to > know whether two objects, retrieved through different paths, are > in fact the same object. It is suggested that each object have a > unique "official" name. This name could be stored in the object in > some representations, or stored in a database accessible to the > server, for example. Any references within that object should be > parsed in the context of the official name. In the presence of a

> directory service, the official name will normally be the > registered name of the object. However, a name in any scheme will > do, so long as it is completely specified. On systems which do not > allow the name to be stored (such as anonymous FTP archive sites), > a possible ambiguity will always exist as to whether two similarly > named objects are in fact the same. Note that Internet newsgroup > names are unique world-wide, and news articles carry a unique > message id. In most other cases, however, there is no guarantee > that dereferencing a URL will work, or that if it does the object > it refers to will in fact be the object intended. URLs such as FTP > addresses are transient in that files may be moved and even > replaced by different files of the same name. This disorganisation > may be limited by good server management, but a naming scheme which > is independent also of internet host name is obviously preferable.

>

> READABILITY BY PEOPLE

>

> This requirement has been put forward by several people (Clifford > Lynch, Douglas Engelbart among others), and disputed by others.

> The author's view is that it will be a while before technology and >

>

>

> Berners-Lee 6 >

> standardisation have reached the point at which names and addresses > will be hidden from human beings. As long as they must be written > on the backs of envelopes and "cut and pasted" between workstation > windows, there is a strong need for names to be

>

> Short

>

> Composed of printable (preferably non-white) characters

>

> To a certain extent, understandable by a human being.

>

> STRUCTURE OF NAMES AND ADDRESSES >

> A physical address is required in order for:

>

> The user's program to contact the server;

>

> The server to perform the operation (e.g. search and index, > retrieve a object, or look up the name) and return a result;

>

> The user's program to locate an individual position or element > within a returned object.

>

> This suggests that a name be structured, such that the parts > necessary for these three operations be separate and only used by > those system elements which need those parts. This corresponds to > the basic principle of information hiding. In fact, four parts > are necessary, including the indicator of the naming scheme to be > used:

>

> The naming scheme: a registered identifier for the protocol.

>

> The name of a suitable server. The format of this part must be > well defined. It will depend on the lower-layer protocols in > use. Systems which use widely distributed information, such as > x.500 and NNTP, do not need this part as each client generally > contacts his nearest server (or a particular server).

>

> Information to be passed to the server. This may be private to > the server, as all names may be generated and used by the same > server. This part of the name should be opaque to the client.

>

> Information to be used by the application once the object has > been retrieved. This part is private to the application (or, > more strictly, the data format) and so cannot be defined here.

>

> Both lasting names and physical addresses often share a > hierarchical structure. This follows often from the organisation of > the system. From the naming point of view, it has the advantage > that a reference in one object to another object need not include > that part of the structure which is common to both names.

>

> CHOICES FOR A UNIVERSAL SYNTAX

>

>

>

> Berners-Lee 7 >

> The requirements above leave little room for choice save for the > order and punctuation of the elements of an address. It is only > reasonable for the order of writing of the parts to be consistently > from left to right (or right to left) with increasing specificity.

> Punctuation schemes fall into two categories (Huitema 1991): tagged > schemes in which field are given names, and fields which use > special characters and field order. The latter tend to be more > compact schemes.

>

>

> protocol: aftp host: xxx.yyy.edu path:

>

> /pub/doc/README >

> PR=aftp; H=xx.yy.edu; PA=/pub/doc/README; >

> PR:aftp/xx.yy.edu/pub/doc/README >

> /aftp/xx.yy.edu/pub/doc/README >

> Fig 1. Some alternative tagged and untagged representations

>

> The choice of special symbols for punctuation tends to be a matter > of taste. It is easier to read addresses whose symbols correspond > to those of one's favourite operating system. A variety of symbols > is needed so that when a name is abbreviated it is possible to tell > which parts have been omitted.

>

> The recommendation below uses special characters in order to > achieve a compact name, and uses where possible punctuation symbols > established in the internet or unix community. >

> The choice of escape character for introducing representations of > non-allowed characters also tends to be a matter of taste. An ANSI > standard exists in the C language, using the back-slash character > "\". The use of this character on unix command lines, however, can > be a problem as it is interpreted by many shell programs, and would > have itself to be escaped.

>

> There is a conflict between the need to be able to represent many > characters including spaces within a URL directly, and the need to > be able to use a URL in environments which have limited character > sets or in which certain characters are prone to corruption. This > conflict has been resolved by use of an hexadecimal escaping method > which may be applied to any characters forbidden in a given > context. When URLs are moved between contexts, the set of > characters escaped may be enlarged or reduced unambiguously. >

> The use of multiple white space characters is discouraged in URLs > to be printed or sent by electronic mail. This is because of the > frequent introduction of extraneous white space when lines are > wrapped by systems such as mail, or sheer necessity of narrow > column width, and because of the inter-conversion of various forms >

>

>

> Berners-Lee 8 >

> of white space which occurs during character code conversion and > the transfer of text between applications. >

72c481
<   URL SYNTAX
---
>   FULL FORM  
82,90c491,492
<   PrePrefix
<
<    To be a Uniform Resource Locator as currently defined by the URI
<    working group, the whole string must start with a constant prefix
<    "URL:". Note that to save space in this document, URLs have been
<    quoted throughout without this preprefix.
<
<   Scheme
<
---
>   SCHEME  
>

97,99c499,501
<    Those schemes which refer to internet protocols mostly have a
<    common syntax for the rest of the object name. This starts with a
<    double slash "//" to indicate its presence, and continues until the
---
>    Those schemes which refer to internet protocols have a common
>    syntax for the rest of the object name. This starts with a double
>    slash "//" to indicate its presence, and continues until the
112,116d513
<
<
<
< Berners-Lee                                                          2
<

121c518,522
<
---
>
>
>
> Berners-Lee                                                          9
>

156c557
<    the syntax shall not be used unencoded in a URL.
---
>    the syntax shall not be used in a URL. 
162,167c563,566
<    awkward in a given environment.  Because a % sign always indicates
<    an encoded character, a URL may be made safer simply by encoding
<    any characters considered unsafe, while leaving already encoded
<    characters still encoded.  Similarly, in cases where a larger set
<    of characters is acceptable, % signs can be selectively and
<    reversibly expanded.
---
>    awkward in a given environment.  As a % sign always indicates an
>    encoded character, a URL may be made safer simply by encoding any
>    characters considered unsafe, while leaving already encoded
>    characters still encoded.  
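
The rule in the hunk above (a % sign always introduces an encoded character, so unsafe characters can be encoded without disturbing existing escapes) might be sketched in Python roughly as follows. The name make_safer and the UNSAFE set are illustrative assumptions and appear in neither draft; which characters count as unsafe depends on the environment, and the input is assumed to be ASCII.

```python
import re

# Assumed, illustrative set of "unsafe" characters; the drafts leave
# the actual set to the environment in which the URL is used.
UNSAFE = set(' <>"')

# An existing escape: a percent sign followed by two hexadecimal digits.
_ESCAPE = re.compile(r"%[0-9A-Fa-f]{2}")


def make_safer(url, unsafe=UNSAFE):
    """Encode unsafe characters as %XX, leaving existing %XX escapes alone."""
    out = []
    i = 0
    while i < len(url):
        match = _ESCAPE.match(url, i)
        if match:
            # Already encoded: copy the escape through unchanged.
            out.append(match.group(0))
            i = match.end()
        elif url[i] in unsafe:
            out.append("%%%02X" % ord(url[i]))
            i += 1
        else:
            out.append(url[i])
            i += 1
    return "".join(out)
```

Because existing escapes are copied through untouched, applying make_safer a second time leaves the string unchanged, which is the property the text relies on.
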

170,174d568
<
<
<
< Berners-Lee                                                          3
<
176c570
<    hexadecimal or base 64 would be more appropriate.)
---
>    hex or base 64 would be more appropriate.)  
177a572,574
>    The same considerations apply to mapping local fragment identifiers
>    onto the fragmentid part of a URL.
>
179a577,580
>
>
> Berners-Lee                                                          10
>

182c583
<    protocols follow. The schemes covered are
---
>    protocols follow. 
184,208c585,593
<    http                    Hypertext Transfer Protocol
<
<    ftp                     File Transfer protocol
<
<    gopher                  The Gopher protocol
<
<    mailto                  Electronic mail address
<
<    mid                     Message identifiers for electronic mail
<
<    cid                     Content identifiers for MIME body part
<
<    news                    Usenet news
<
<    nntp                    Usenet news for local NNTP access only
<
<    prospero                Access using the prospero protocols
<
<    telnet, rlogin and tn3270
<                            Reference to interactive sessions
<
<    wais                    Wide Area Information Servers
<
<    The schemes for x.500, network management database and whois++ have
<    not been specified and may be the subject of further study.
---
>   HTTP  
>
>    The HTTP protocol specifies that the path is handled transparently
>    by those who handle URLs, except for the servers which de-reference
>    them. The path is passed by the client to the server with any
>    request, but is not otherwise understood by the client.  The
>    fragmentid part is not sent with the request. The search part, if
>    present, is sent. Spaces in URLs should be escaped for transmission
>    in HTTP.
210,214d594
<    The url: prefix is reserved for use in encoding a Uniform Resource
<    Name when that has been developed by the IETF working group.
<
<    New schemes may be registered at a later time.
<
218,223c598,603
<    file system of the given host. The FTP protocol is used, as defined
<    in RFC957 or any successor. The port number, if present, gives the
<    port of the FTP server if not the FTP default. (A client may in
<    practice use local file access to retrieve objects which are
<    available though more efficient means such as local file open or
<    NFS mounting, where this is available and equivalent).
---
>    file system of the given host. The FTP protocol is used. The port
>    number if given gives the port of the FTP server if not the FTP
>    default. (A client may in practice use local file access to
>    retrieve objects which are available though more efficient means
>    such as local file open or NFS mounting, where this is available
>    and equivalent). 

225,232c605
<   User name and password
<
<    The syntax allows for the inclusion of a user name and even a
<
<
<
< Berners-Lee                                                          4
<
---
>     The syntax allows for the inclusion of a user name and even a
236,237c609
<    is "anonymous" and the password the user's Internet-style mail
<    address .
---
>    is "anonymous" and the password the user's mail address. 

239,242c611,620
<    Where possible, this mail address should correspond to a usable
<    mail address for the user, and preferably give a DNS host name
<    which resolves to the IP address of the client. Note that servers
<    currently vary in their treatment of the anonymous password.
---
>    The adoption of a unix-style syntax involves the conversion into
>    non-unix local forms by either the client or server. Some non-unix
>    servers do this, but clients wishing to access sites which do not
>    have unix-style naming will need certain algorithms to enable
>    other file systems to be identified and treated. Client software
>    may also have to be flexible in terms of the sequence of FTP
>    commands used with different varieties of server. In view of a
>    tendency for file systems to look increasingly similar, it was felt
>    that the URL convention should not be weighed down by extra
>    mechanisms for identifying these cases.

244,296d621
<   Path
<
<    The FTP protocol allows for a sequence of CWD commands (change
<    working directory) prior to a RETR (retrieve) which actually
<    accesses a file. The arguments of any CWD commands are successive
<    segment parts of the URL, and the filename argument to the RETR
<    command is the final segment of the URL path.
<
<   Note
<
<    In the case in which the file system of the server is known or
<    guessed by the client, the path may possibly be converted into a
<    filename.  This may (in some cases) allow the file to be retrieved
<    in one RETR command with no CWD command.  In the case of unix, the
<    filename will in fact look the same as the URI path. This must NOT
<    be taken to indicate that the URL is a unix filename.  In
<    practice, as many FTP servers in fact have or emulate unix file
<    systems, it may in fact be time-efficient to attempt first a direct
<    retrieval guessing unix syntax, and, if that fails, to attempt the
<    official sequence of succession of directory changes followed by a
<    RETR command.
<
<    There is no common hierarchical model to the FTP protocol, so if a
<    directory change command has been given, it is impossible in
<    general to deduce what sequence should be given to navigate to
<    another directory for a second retrieval, if the paths are
<    different.  The only reliable algorithm is to disconnect and
<    reestablish the control connection.  However, if no directory
<    changes have been made, but direct retrieval has been done, then
<    the control connection may be kept.  Another possible
<    uninvestigated method is to use CDUP on the trial assumption of a
<    hierarchical structure to return to a point in common between the
<    first and second URLs.
<
<    (This note previously read: "The adoption of a unix-style syntax
<    involves the conversion into non-unix local forms by either the
<    client or server. Some non-unix servers do this, but clients
<    wishing to access sites which do not have unix-style naming will
<    need certain algorithms to enable other file systems to be
<    identified and treated. Client software may also have to be
<    flexible in terms of the sequence of FTP commands used with
<    different varieties of server. In view of a tendency for file
<
<
<
< Berners-Lee                                                          5
<
<    systems to look increasingly similar, it was felt that the URL
<    convention should not be weighed down by extra mechanisms for
<    identifying these cases." )
<
<   Data type
<
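
The "Path" text in the hunk above maps each leading segment of the URL path to a CWD command and the final segment to a RETR. A minimal Python sketch of that mapping follows; the function name ftp_commands is hypothetical, and decoding %XX escapes in each segment before sending it is an assumption of this sketch rather than something the quoted text states.

```python
from urllib.parse import unquote


def ftp_commands(url_path):
    """Return the FTP command sequence implied by a URL path.

    Leading segments become CWD arguments; the final segment is the
    filename argument to RETR.
    """
    segments = url_path.strip("/").split("/")
    if segments == [""]:
        raise ValueError("URL path names no file")
    # Assumption: %XX escapes are decoded per segment, after splitting.
    *directories, filename = (unquote(s) for s in segments)
    return ["CWD " + d for d in directories] + ["RETR " + filename]
```

For the path /pub/doc/README this yields CWD pub, CWD doc, RETR README. The single-RETR shortcut mentioned in the note is deliberately left out, since it depends on guessing the server's file system.
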

303c628
<    but it is outside the scope of this paper.
---
>    but it outside the scope of this paper. 
305,328c630
<    An FTP URL may specify the method by which an object is to be
<    retrieved.  Two of the modes correspond to the FTP "Data Types"
<    ASCII and IMAGE for the retrieval of a document, as specified in
<    FTP by the TYPE command.  One mode indicates directory access.
<
<    The data type is specified by a suffix to the URL separated by an
<    unencoded exclamation mark (ASCII 21 hex).  Possible suffixes are:
<
<    !I                      Use FTP image (I) mode to perform data
<                            transfer.
<
<    !A                      Use FTP ASCII (A) mode to perform data
<                            transfer
<
<    !D                      Use FTP directory list commands to read
<                            directory
<
<    [suggestion: tenex. reference?]
<
<   Transfer Mode
<
<    Stream Mode is always used.
<
<   HTTP
---
>   NEWS  
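
The "Data type" text removed in the hunk above describes a suffix introduced by an unencoded exclamation mark. A short Python sketch of splitting off that suffix is given here; the function name split_ftp_type and the mode labels are illustrative assumptions, and only the three suffixes listed in the hunk are recognised.

```python
# Suffixes as listed in the draft text; the labels on the right are
# names chosen for this sketch, not terms from the draft.
FTP_TYPE_SUFFIXES = {"!I": "image", "!A": "ascii", "!D": "directory"}


def split_ftp_type(url):
    """Split an FTP URL into (url_without_suffix, mode or None)."""
    base, mark, suffix = url.rpartition("!")
    if mark and ("!" + suffix) in FTP_TYPE_SUFFIXES:
        return base, FTP_TYPE_SUFFIXES["!" + suffix]
    # No recognised suffix: an encoded "%21" is ordinary data, not a separator.
    return url, None
```

Because only a literal "!" is treated as the separator, a file name containing an encoded exclamation mark (%21) passes through untouched.
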

330,343c632,633
<    The HTTP protocol specifies that the path is handled transparently
<    by those who handle URLs, except for the servers which de-reference
<    them. The path is passed by the client to the server with any
<    request, but is not otherwise understood by the client.  The
<    fragmentid part is not sent with the request. The search part, if
<    present, is sent. Spaces and control characters in URLs must be
<    escaped for transmission in HTTP.
<
<   GOPHER
<
<    Gopher selector strings may contain any characters other than tab,
<    return, or linefeed, so it is important to encode all disallowed
<    characters and encode any space characters so these characters are
<    not altered during transport of the URL.  Note that since gopher
---
>    The news locators refer to either news group names or article
>    message identifiers which must conform to the rules of RFC 850.  A
347c637
< Berners-Lee                                                          6
---
> Berners-Lee                                                          11
349,357c639,642
<    selector string are opaque and in many cases map to  native file
<    system of the gopher server, so encoding of disallowed characters 

<    in the selector string is to map to binary codes rather than ISO
<    character sets.  In other words, the "%" character followed by two
<    hexadecimal digits is used to encode binary data. Clients shall
<    not interpret gopher selector strings. While many Gopher servers
<    map to Unix file systems, you cannot assume that "/" characters
<    imply a hierarchy since Gopher servers on non-Unix file systems may
<    use the "/" as part of a file name.
---
>    message identifier may be distinguished from a news group name by
>    the presence of the commercial at "@" character. These rules imply
>    that within an article, a reference to a news group or to another
>    article will be a valid URL (in the partial form). 
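
The distinction drawn in the hunk above (a message identifier contains "@", a news group name does not) can be sketched in a few lines of Python. The function name classify_news_locator and the example locators in the note below are made up for illustration.

```python
def classify_news_locator(locator):
    """Return 'article' for a message identifier, 'group' for a group name.

    Follows the rule quoted above: the presence of the commercial at
    character "@" marks a message identifier.
    """
    return "article" if "@" in locator else "group"
```

Under this rule a locator such as comp.infosystems.www is treated as a group, and one such as 1234@example.org as an article.
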

359,361c644,645
<
<
<    The format of a gopher URL is:
---
>    A news URL may be dereferenced using NNTP or using any other
>    protocol for the conveyance of usenet news articles. 

363,508c647 < 1. A single-character field to denote the Gopher type of the < resource to which the URL refers.

<

< 2. The gopher selector string. Note that some gopher selector < strings begin with a copy of the gopher type character, in which < case that character will occur twice consecutively. Also note < that the gopher selector string may be an empty string since < this is how gopher clients refer to the top-level directory on < a gopher server.

<

< 3. An encoded tab character (%09) to seperate the gopher < selector string from the optional search string (see 4 below).

<

< 4. If the URL does not refer to a Gopher+ item and if there is < no gopher search string then parts 3, 4, 5, and 6 of the URL < are optional

<

< 4.) The gopher search string. If the URL refers to a search to < be submitted to a gopher search engine, the search string is < required. Otherwise this is an empty string.

<

< 5.) A question mark [suggestion: an encoded tab character < (%09)] to seperate the gopher search string from the optional < gopher+ string (see 6 below). [suggestion: Note that if the URL < refers to a gopher+ item and does not have a gopher search < string, there will be two encoded tab characters in a row.]

<

< 6.) The Gopher+ string. Gopher+ strings consist of a one or more < characters and are used to represent information required for < retrieval of the Gopher+ item. Gopher+ items may have alternate < views, arbitrary sets of attributes, and may have electronic < forms associated with them. To accomodate the various Gopher+ < objects, the Gopher+ string in the URL must accomodate a < mapping of the information a Gopher+ client sends to the server. < This makes this section a bit long since we basically cover the < entire Gopher+ protocol here.

<

< When a Gopher server returns a directory listing to a client, < Gopher+ items are tagged with either a "+" (denoting gopher+ items) <

<

<

< Berners-Lee 7 <

< or a "?" (denoting items which have a +ASK form associated with < them). A Gopher+ string which is only a "+" refers to the default < view (data representation) of the item. To retrieve this item a < gopher+ client should send

<

< a_gopher_selector<tab>+<cr><lf> <

< to the gopher+ server. <

< Note that items which have a +ASK associated with them (i.e. < Gopher+ items tagged with a "?") require the client to fetch the < item's +ASK attribute to get the form definition, and then ask the < user to fill out the form and return the user's responses along < with the selector string to retrieve the item. Gopher+ clients < know how to do this but depend on the "?" tag in the gopher+ item < description to know when to handle this case. The "?" is used in < the Gopher+ string to be consistent with Gopher+ protocol's use of < this symbol. <

< To refer to the Gopher+ attributes of an item, the Gopher+ string < might consist of "!" or "$". "!" refers to all of a gopher+ < item's attributes. "$" refers to all the item attributes for all < items in a Gopher directory. To retrieve an item or directory's < attributes, a gopher client will send:

<

< a_gopher_selector<tab>!<cr><lf> <

< for items or

<

< a_gopher_selector<tab>$<cr><lf> <

< for directories to the gopher+ server. <

< To refer to specific attributes, the Gopher+ string is < "!attribute_name" or "$attribute_name". For example, to refer to < the attribute containing the abstract of an item, the Gopher+ < string would be "!+ABSTRACT". To refer to several attributes, < clients send the server the attribute names separated by spaces so < it is necessary to separate the attribute names with coded spaces. < To retrieve a collection of item attributes specified with a < gopher+ string of "!+ABSTRACT%20+SMELL" a gopher client would send

<

< a_gopher_selector<tab>!+ABSTRACT +SMELL<cr><lf> <

< to the gopher server. <

< Gopher+ allows for optional alternate data representations < (alternate views) of items. To retrieve a Gopher+ alternate view, < the gopher+ client sends the appropriate view and language < identifier (found in the item's +VIEW attribute). To refer to a < specific Gopher+ alternate view, the URL's Gopher+ string would be < in the form "+view_name%20language_name". For example, a gopher+ < string of "+application/postscript%20Es_ES" refers to the Spanish <

< language postscript alternate view of a gopher+ item. To retrieve < this alternate view the client would send

<

< a_gopher_selector<tab>+application/postscript Es_ES<cr><lf> <

< to the gopher server. <

< The gopher+ string for a URL that refers to an item referenced by < an ASK form filled out with specific values is essentially a coded < version of what the client sends to the server. The gopher+ string < will be of the form

<

< +%091%0D%0A+-1%0D%0Aask_item1_value%0D%0Aask_item2_value%0D%0A.%0D%0A

<

< To retrieve this item, the gopher client sends:

<

< a_gopher_selector<tab>+<tab>1<cr><lf>
< +-1<cr><lf>
< ask_item1_value<cr><lf>
< ask_item2_value<cr><lf>
< .<cr><lf>
<

< to the gopher server. <

< For a really complex example, consider a URL that refers to an < alternate view of an item that is referenced with a filled-out < Gopher +ASK form. The gopher+ string will be of the form:

<

<

< +view_name%20language_name%091%0D%0A+-1%0D%0Aask_item1_value%0D%0Aask_item2_value%0D%0A.%0D%0A

<

< To retrieve this item, the gopher client sends:

<

< a_gopher_selector<tab>+view_name language_name<tab>1<cr><lf>
< +-1<cr><lf>
< ask_item1_value<cr><lf>
< ask_item2_value<cr><lf>
< .<cr><lf>
<

< to the gopher server.
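
The mapping described above, from the encoded Gopher+ string in the URL to what the client sends, can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name gopher_plus_request is hypothetical, the selector is taken to be already decoded, and Latin-1 is an assumed byte encoding that the draft does not specify.

```python
from urllib.parse import unquote


def gopher_plus_request(selector: str, gplus: str) -> bytes:
    """Return the bytes a client might send for a selector and an
    encoded Gopher+ string (hypothetical helper, not from the draft)."""
    # Undo the %xx escapes: %20 -> space, %09 -> tab, %0D%0A -> CR LF.
    # unquote() leaves a literal "+" alone, which matters here because
    # "+" is a meaningful Gopher+ character.
    decoded = unquote(gplus, encoding="latin-1")
    request = selector + "\t" + decoded
    # Filled-out ASK forms already end in CR LF; simple requests do not.
    if not request.endswith("\r\n"):
        request += "\r\n"
    return request.encode("latin-1")


print(gopher_plus_request("a_gopher_selector", "!+ABSTRACT%20+SMELL"))
```

With the ASK-form string shown earlier, the decoded request already ends in the ".<cr><lf>" terminator, so nothing is appended.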

<

< Summary: gopher+ string part of Gopher URL

---
>     Note1: 

510,621c649 <

<

< To refer to an item which has an ASK form associated with it where < the intent is to allow the user to enter values into the form as < part of the retrieval process:

<

< %3F [was: ?]

<

< To refer to all or specific attributes of a gopher item:

<

< ![attribute_name][%20attribute_name][%20attribute_name]... <

<

< To refer to all or specific attributes of a gopher directory:

<

< $[attribute_name][%20attribute_name][%20attribute_name]... <

<

< To refer to the content of a gopher+ item (including an item < referred to by specific values in a filled-out ASK form):

<

< +[view_name[%20language_name]][%091%0D%0A+-1%0D%0Aask_item1_value%0D%0Aask_item2_value...%0D%0A.%0D%0A]
<

<

<

< Overall summary and examples <

<

< The general format of a Gopher URL path referring to a gopher type < "T" item is:

<

< gopher://host [port]/T[gopher_selector]%09[search_string]?[gopher+_string]
<

<

< Examples: <

< An example of a URL pointing to a gopher type 0 item (a document) < is:

<

< gopher://host [port]/0a_gopher_selector <

<

< An example of a URL pointing to a gopher type 7 item (a search < engine) where the string foobar is to be submitted to the search < engine is:

<

< gopher://host [port]/7a_gopher_selector%09foobar <

<

< An example of a URL pointing to a Gopher+ type 0 item (a document) < is:

<

< gopher://host [port]/0a_gopher_selector%09%09some_gplus_stuff <

<

< An example of a URL pointing to a Gopher+ type 0 (document) item's < attribute information is:

<

< gopher://host [port]/0a_gopher_selector%09%09! <

<

< An example of a URL pointing to a Gopher+ document's Spanish < postscript representation is:

<

< gopher://host [port]/0a_gopher_selector%09%09+application/postscript%20Es_ES
<

< . <
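
As a rough sketch of how the example URLs above are assembled, the Python below joins the parts in order. It rests on assumptions: gopher_url is a hypothetical name, the port is written as "host:port", and the Gopher+ string is preceded by %09 as in the examples (the general format line shows "?" instead). The Gopher+ string is passed in already encoded.

```python
from urllib.parse import quote


def gopher_url(host, gtype, selector="", search=None, gplus=None, port=None):
    """Assemble a Gopher URL from its parts (illustrative only)."""
    hostport = host if port is None else f"{host}:{port}"
    # Type character first, then the selector with unsafe characters escaped.
    url = f"gopher://{hostport}/{gtype}{quote(selector, safe='/')}"
    if search is None and gplus is None:
        return url  # plain item: the remaining parts are optional
    # Encoded tab, then the (possibly empty) search string.
    url += "%09" + quote(search or "", safe="")
    if gplus is not None:
        # Second encoded tab, then the already-encoded Gopher+ string.
        url += "%09" + gplus
    return url


print(gopher_url("host", "7", "a_gopher_selector", search="foobar"))
print(gopher_url("host", "0", "a_gopher_selector", gplus="!"))
```

A Gopher+ item with no search string ends up with two encoded tabs in a row, as the suggestion in part 5 notes.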

< MAILTO <

< This allows a URL to specify an RFC822 addr-spec mail address.

< Note that use of % , for example as used in forming a gatewayed < mail address, requires conversion to %25 in a URL. <

< This semantics may be considered to be that the object referred to < by the mailto: URL is the set of messages sent to or from that < address. There is no algorithm to retrieve this set, but the SMTP < protocol allows messages to be added to it, and any given user may < be aware of a subset of its members. <

< NEWS <

< The news locators refer to either news group names or article < message identifiers which must conform to the rules for a < Message-Id of RFC 1036 (Horton 1987). A message identifier may be < distinguished from a news group name by the presence of the < commercial at "@" character. These rules imply that within an < article, a reference to a news group or to another article will be < a valid URL (in the partial form).

<

< A news URL may be dereferenced using NNTP (RFC977, Kantor 86) (The < ARTICLE by message-id command ) or using any other protocol for the < conveyance of usenet news articles, or by reference to a body of < news articles already received.

<

< Note1:

<

< Among URLs the "news" URLs are anomalous in that they are

---
>    Among URLs the news: URLs are anomalous in that they are
629,630c657,658
<       Note 2:
<       

---
>     Note 2:
>     

634,638d661 <

<

<

< Berners-Lee 11 <

641,643c664,666
< Suggested subject of study in conjunction with NNTP working group.

< Further extension possible may be to allow the naming of subject < threads as addressable objects.

---
>    Suggested subject of study in conjunction with NNTP WG.  Further
>    extension possible may be to allow the naming of subject threads as
>    addressable objects. 

645,646c668,669
< NNTP
<

---
>   NNTP
>   

650,651c673
< message identifier. In all other cases the "news" scheme should be
< used.

---
>    message identifier.
655d676
<    The NNTP protocol must be used. 

657,661c678,684
< Note1.
<

< This form of URL is not of global accessability, as typically NNTP < servers only allow access from local clients. Note that the < article numbers within groups vary from server to server.

---
>     Note1.
>     

> This form of URL is not of global accessiablity, as typically NNTP > servers only allow access from local clients. This form or URL > should not be quoted outside this local area. It should not be > used within news articles for wider circulation than the one > server.

663,668c686,699
< This form or URL should not be quoted outside this local area. It
< should not be used within news articles for wider circulation than
< the one server. This is a local identifier for a resource which is
< often available globally, and so is not recommended except in the
< case in which incomplete NNTP implementations on the local server
< force its adoption.

---
>   WAIS  

>

> The current WAIS implementation public domain requires that a > client know the "type" of a object prior to retrieval. This value > is returned along with the internal object identifier in the search > response. It has been encoded into the path part of the URL in >

>

>

> Berners-Lee 12 >

> order to make the URL sufficient for the retrieval of the object. > Within the WAIS world, names do not of course not need to be > prefixed by "wais:" (by the partial form rules).

679c710
< version number. If present, the version number is separated from

---
>    version number. If present, the version number is seperated from
681c712
<    zero zero), this being an escaped string terminator (null).
---
>    zero zero), this being an escaped string terminator (null). 

683c714
< access method and are not represented as Prospero URLs.

---
>    access method and are not represented as Prospero URLs. 

684a716,740
> GOPHER

>

> The first character of the URL path part (after the initial single > slash) is a single-character "type" field which is that used by the > Gopher protocol. The rest of the path is the "selector string", > with disallowed characters encoded. Note that some selector strings > begin with a copy of the gopher type character, in which case that > character will occur twice consecutively in the URL. If the type > character and selector are omitted, the type defaults to "1". > Gopher links which refer to non-Gopher protocols are represented > directly as URLs of the underlying access method and are not > represented as Gopher URLs.

>

> MAILTO >

> This allows a URL to specify an RFC822 addr-spec mail address.

> Note that use of % , for example as used in forming a gatewayed > mail address, requires conversion to %25 in a URL. >

> This semantics may be considered to be that the object referred to > by the mailto: URL is the set of messages sent to or from that > address. There is no algorithm to retrieve this set, but the SMTP > protocol allows messages to be added to it, and any given user may > be aware of a subset of its members.

>

691a748,749
> this is a less desirable, though currently common, solution.

>

695c753
< Berners-Lee 12

---
> Berners-Lee                                                          13
697c755,762
<    this is a less desirable, though currently common, solution.
---
>   X500  

>

> The mapping of x500 names onto URLs is not defined here. A decision > is required as to whether "distinguished names" or "user friendly > names" (ufn), or both, should be allowed. If any punctuation > conversions are needed from the adopted x500 representation (such > as the use of slashes between parts of a ufn) they must be defined. > This is a subject for study.

699c764
< WAIS

---
>   WHOIS  

701,707c766,770
< The current WAIS implementation public domain requires that a
< client know the "type" of a object prior to retrieval. This value
< is returned along with the internal object identifier in the search
< response. It has been encoded into the path part of the URL in
< order to make the URL sufficient for the retrieval of the object.
< Within the WAIS world, names do not of course need to be prefixed
< by "wais:" (by the partial form rules).

---
>    This prefix describes the access using the "whois++" scheme in the
>    process of definition. The host name part is the same as for other
>    IP based schemes. The path part can be either a whois handle for a
>    whois object, or it can be a valid whois query string. This is a
>    subject for further study. 

708a772,775
> NETWORK MANAGEMENT DATABASE

>

> This is a subject for study.

>

712,715c779,785
< conforming URL syntax, using a new prefix. Experimental prefixes
< may be used by mutual agreement between parties, and must start
< with the characters "x-". The scheme name "urn:" is reserved for
< the work in progress on a scheme for more persistent names.

---
>    conforming URL syntax, using a new scheme identifier. Experimental
>    scheme identifiers may be used by mutual agreement between parties,
>    and must start with the characters "x-".  The scheme name "urn:" is
>    reserved for the work in progress on a scheme for more persistent
>    names.  Therefore URNs (Names) and URLs (Locators)  be
>    distinguishable. An object which is either a URL or a URN is known
>    as a URI (Identifier).
731c801
<    retrieval by URL, that the client software have provision for being
---
>    retrieval by URI, that the client software have provision for being
735c805
< BNF for specific URL schemes
---
> BNF syntax
737a808,812
> 

>

>

> Berners-Lee 14 >

739,742c814,817
< [brackets] indicate optional parts. Spaces are represented by the
< word "space", and the vertical line character by "vline". Single
< letters stand for single letters. All words of more than one letter
< below are entities described somewhere in this description.

---
>    [brackets]  indicate optional parts.  Spaces are representated by
>    the word "space", and the vertical line character by "vline".  

> Single letters stand for single letters. All words of more than one > letter below are entities described somewhere in this description.

744,745c819,820
< The current IETF URI working group preference is for the
< prefixedurl production. (Nov 1993. July 93: url).

---
>    The current IETF URI working group prefereence  is for the
>    prefiexedurl production. (Nov 1993. July 93: url).
749,754c824
<    characters do not appear in any productions and therefore may not
< 

<

<

< Berners-Lee 13 <

---
>    characters fo not appear in any productions and therefore may not
769c839
<                          | mailtoaddress  | midaddress | cidaddress 

---
>                          | mailtoaddress  

778c848
< ftpaddress f t p : / / login / path [ ! ftptype ]

---
>   ftpaddress              f t p : / / login / path 

786,789d855
< midaddress m i d : addr-spec

<

< cidaddress c i d : content-identifier

<

799a866,870 >

>

>

> Berners-Lee 15 >

808,812d878 <

<

<

< Berners-Lee 14 <

839,840d904
< ftptype A | I | D

<

851c915
< path void | segment [ / path ]

---
>   path                    void |  xpalphas  [  / path ]   

853,854d916
< segment xpalphas

<

862,865d923 <

< gtype xalpha

<

< xalpha alpha | digit | safe | extra | escape

869c927
< Berners-Lee 15

---
> Berners-Lee                                                          16
870a929,932
>   gtype                   xalpha   

>

> xalpha alpha | digit | safe | extra | escape

>

885c947
< digit 0 |1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

---
>                           0 |1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9   

889c951
< extra " | ' | ( | ) | : | ; | , | space

---
>   extra                   ! | * | " |  ' | ( | ) | : | ; | , | space  

891,892d952
< reserved ! | *

<

910,911d969
< (end of URL BNF)

<

920,923c978,980
< A URL-related security threat is that it is sometimes possible to
< construct a URL such that an attempt to perform a harmless
< idempotent operation such as the retrieval of the object will in
< fact cause a possibly damaging remote operation to occur. The

---
>    The use of URLs containing passwords is clearly unwise.
>    

> Conclusion
927c984,985
< Berners-Lee 16

---
> 

> Berners-Lee 17
929,938c987,994
< unsafe URL is typically constructed by specifying a port number
< other than that reserved for the network protocol in question. The
< client unwittingly contacts a server which is in fact running a
< different protocol. The content of the URL contains instructions
< which when interpreted according to this other protocol cause an
< unexpected operation. An example has been the use of gopher URLs
< to cause a rude message to be sent via an SMTP server. Caution
< should be used when using any URL which specifies a port number
< other than the default for the protocol, especially when it is a
< number within the reserved space.

---
>    A need has been demonstrated, and a number of requirements have
>    been stated for uniform resource locators (URLs). A scheme has been
>    proposed which builds on existing conventions to define a syntax
>    for URLs.  This scheme has been in serious use by World-Wide Web
>    (W3) initiative since 1991.  Adoption of the scheme in
>    correspondence, standards and software will ease the use of
>    references to on-line information in a flexible way as the coming
>    information age arrives.
940,948d995
<    Care should be taken when URLs contain embedded encoded delimiters
<    for a given protocol (for example,  CR and LF characters for telnet
<    protocols) that these are not unencoded before transmission.  This
<    would violate the protocol but could be used to simulate an extra
<    operation or parameter, again causing an unexpected and possible
<    harmful remote operation to be performed.
<    

< The use of URLs containing passwords is clearly unwise. <

968c1015
< Amsterdam IETF and refined in net discussion.

---
>    Amsterdam IETF and refined in net discussion.
970,972d1016
<    The draft 03 includes changes made at Houston in Nov 93, and on the
<    net before Seattle March 1994.
<    

977c1021
< Wrappers for URIs in plain text

---
> Fragment-id  

979c1023,1027
< This section does not formally form part of the URL specification .

---
>    This represents a part of, fragment of, or a sub-function within,
>    an object or object. Its syntax and semantics are defined by the
>    application responsible for the object, or the specification of the
>    content type of the object. The only definition here is of the
>    allowed characters by which it may be represented in a URL.  

981c1029,1039
< URIs, including URLs, will ideally be transmitted though protocols

---
>    The fragment-id follows the URL of the whole object from which it
>    is separated by a hash sign (#).  If the fragment-id is void, the
>    hash sign may be omitted: A void fragment-id with or without the
>    hash sign means that the URL refers to the whole object.
>    

> While this hook is allowed for identification of fragments, the > question of addressing of parts of objects, or of the grouping of > objects and relationship between contined and containing objects, > is not addressed by this object. >

> This object does not address the question of objects which are
985c1043
< Berners-Lee 17

---
> Berners-Lee                                                          18
986a1045,1111
>    different versions of a "living" object, nor of expressing the
>    relationships between different versions and the living object.
>    

> Partial form

>

> In a certain limited set of cases, generally within a certain > application, it may be useful to pass only a section of the URL. > Within a object whose URL is well defined, the URL of another > object may be given in abbreviated form, where parts of the two > URLs are the same. This allows objects within a group to refer to > each other without requiring the space for a complete reference, > and it incidentally allows the group of objects to be moved > without changing any references. This is not discussed in detail > here, it is only mentioned so that the characters required by the > technique be reserved for that purpose. It must be emphasised that > when a reference is passed in anything other than a well controlled > context, the full form must always be used.

>

> The partial form relies on a property of the URL syntax that > certain characters ("/") and certain path elements ("..", ".") have > a significance reserved for representing a hierarchical space, and > must be recognised as such by both clients and servers.

>

> A partial form can be distinguished from a full form in that a full > form must have a colon and that colon must occur before any slash > characters. >

> The rules for the use of a partial name are:

>

> If the scheme parts are different, the whole absolute locator > must be given. Otherwise, the scheme is omitted, and:

>

> If the host and/or port parts are the different, the host, port > name and all the rest of the locator must be given.

>

> If the access and host parts are the same, then the path may be > given in absolute (fully qualified) or relative form. Within the > path:

>

> If a leading slash is present, the path is absolute. Otherwise, > a relative path is interpreted as follows:

>

> The last part of the path of the context locator (anything > following the rightmost slash) is removed, and the given partial > URL appended in its place.

>

> Within the result, all occurrences of "xxx/../" or "/." are > recursively removed, where xxx, ".." and "." are complete path > elements.

>

> Note: If a path of the context locator end in slash, partial URLs > will be treated differently to their treatment with respect to the > same path without a slash. Using a trailing slash on a directory >

>

>

> Berners-Lee 19 >

> name is not therefore recommended. The signifcance of a trailing > slash may be considered as that of the locator of a file with void > name within that directory. >
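
The path rules for the partial form can be sketched as below. This is a minimal Python illustration of the steps as stated, covering only the path part (scheme and host comparison are left out). The name resolve_partial_path is hypothetical, and a ".." with nothing left to remove is simply kept, a case the text does not cover.

```python
def resolve_partial_path(context_path: str, partial: str) -> str:
    """Resolve a partial path against the path of the context locator."""
    # A leading slash means the given path is already absolute.
    if partial.startswith("/"):
        return partial
    # Drop anything following the rightmost slash of the context path,
    # then append the partial URL in its place.
    base = context_path[: context_path.rfind("/") + 1]
    segments = []
    for seg in (base + partial).split("/"):
        if seg == ".":
            continue  # remove "/." elements
        if seg == ".." and segments and segments[-1] not in ("", ".."):
            segments.pop()  # remove "xxx/../"
        else:
            segments.append(seg)
    return "/".join(segments)


print(resolve_partial_path("/a/b/c", "../d"))
print(resolve_partial_path("/a/b", "d"))
print(resolve_partial_path("/a/b/", "d"))
```

The last two calls show the effect described in the note: with a trailing slash on the context path the result is /a/b/d, without it /a/d.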

> Wrappers for URIs in plain text >

> This section does not formally form part of the URL specification. >

> URIs, including URLs, will ideally be transmitted though protocols
1005,1006c1130,1133
< Yes, Jim, I found it under <ftp://info.cern.ch/pub/www/doc> but
< you can probably pick it up from <ftp://ds.internic.net/rfc>.

---
>                 Yes, Jim, I found it under <ftp://info.cern.ch/pub> bu
> t
>                 you can probably pick it up from <ftp://ds.internic.ne
> t/rfc>.
1009d1135
< 

1022,1024c1148,1150
< December 1991, as updated from time to time,

< <ftp://info.cern.ch/pub/www/doc/http-spec.txt < >

---
>                          December 1991, 

> <ftp://info.cer > n.ch/pub/www/doc/http-spec.txt>

1029a1156,1160 >

>

>

> Berners-Lee 20 >

1040,1047d1170 <

<

<

< Berners-Lee 18 <

< Horton (1987) M. Horton, R. Adams, "Standard for < interchange of USENET messages", Internet RFC < 1036 , 12/01/1987.

1062c1185
< transmission of news" , Internet RFC-977,

---
>                          transmission of news", Internet RFC-977,
1066,1068d1188
<   Kunze, 1994            J. Kunze, Requirements for URLs, to be
<                          published. 

<

1092,1094d1211
< Sollins 1994 K. Sollins and L. Masinter, Requirements for
< URNs, to be published.

<

1097d1213
< Performance Systems International, Inc.

1101c1217
< Berners-Lee 19

---
> Berners-Lee                                                          21
1102a1219
>                          Performance Systems International, Inc. 

1109,1112c1226,1228 < . <

< AUTHOR'S ADDRESS

<

---
> Author's address  

>

>

1122a1239 >

1126d1242 <

1160c1276
< Berners-Lee 20

---
> Berners-Lee                                                          22