An alternative approach to URC/URT syntax

Olle Jarnefors (ojarnef@admin.kth.se)
Tue, 26 Oct 93 12:59:44 +0100

Date: Tue, 26 Oct 93 12:59:44 +0100
Message-Id: <9310261159.AA05259@mercutio.admin.kth.se>
From: Olle Jarnefors <ojarnef@admin.kth.se>
To: uri@bunyip.com
Subject: An alternative approach to URC/URT syntax

Summary: A simple and clean syntax for URCs is proposed, where
each URI (URL, URN, URD, or URF) is closed by ";" and the whole
URC is wrapped in a <> pair. URLs and URNs would have the syntax
label":"value";", URDs and URFs the syntax label"="value";".
Groups of related URIs are delimited by nestable pairs of
"BEGIN*;" and "END*;". Language tags (opened by ".") and data
representation tags (opened by "/"), which incorporate a
character set indication in the case of plain text, can be
included in the label part of a URI. In all allowed character
sets most low bytes must have their ASCII meaning. Also high
bytes are allowed. ESC, ";", and "%" should be encoded by "%1B",
"%;", and "%%".

This is an outline of an alternative syntax for URCs/URTs,
including the syntax of the label part of URIs.

I use the UR* terms in this way:
1) URL (Uniform Resource Locator) in its usual meaning.
2) URN (Uniform Resource Name) in its usual meaning.
3) URD (Uniform Resource metaData descriptor): a labelled
description of one property of a resource.
4) URF (Uniform Resource Fragment descriptor): a description
identifying a certain part of a resource.
5) URI (Uniform Resource Identifying element): an element of the
identification of a resource. Is either a URL, a URN, a URD
or a URF.
6) URC (Uniform Resource Citation): an aggregate of URIs
identifying a resource or a set of related resources.

Goals:
+ simple syntax
+ good human readability with preserved unambiguity
+ as few and clean restrictions as possible on the format of
data to be included in URCs
+ good handling of language and coded character set issues
+ facilities for grouping related URIs in a complicated URC.

A simple example:

<ftp://nic.nordu.net/internet-drafts/draft-ietf-uri-url-01.txt;
URN:IANA:0::drafts:1204;
Author=Tim Berners-Lee;
Title=Uniform Resource Locators;
Title.sv/Text/Plain/charset:SEN_850200_B=Likformiga resursl{gesgivare;
Language=en;
Version=01;>

The different URIs are thus closed by the character ";". (If a
URI needs to contain a ";", that character must be specially
encoded.) URNs start with "URN:", URLs with some other
identifier followed by ":". URDs consist of a label and a
value, separated by "=". The whole thing is surrounded by a <>
pair.
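
The basic rules just described can be sketched in Python as
follows. This is only an illustration, not part of the proposal:
the function names split_elements and classify are made up here,
the sketch assumes that an encoded ";" is written "%;" (as in the
summary), that white space around an element can be dropped, and
that URL and URN labels carry no tags. Only the three element
kinds described so far are recognized.

```python
import re

# A label starts with an ASCII letter, then letters, digits or "-".
IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9-]*")


def split_elements(urc):
    """Split a <>-wrapped URC into its ';'-terminated elements."""
    text = urc.strip()
    if not (text.startswith("<") and text.endswith(">")):
        raise ValueError("a canonical URC is wrapped in a <> pair")
    body = text[1:-1]
    elements, current, i = [], [], 0
    while i < len(body):
        ch = body[i]
        if ch == "%" and i + 1 < len(body):
            # Keep "%" and the next character together, so that an
            # encoded ";" is not mistaken for a terminator.
            current.append(body[i:i + 2])
            i += 2
        elif ch == ";":
            elements.append("".join(current).strip())
            current = []
            i += 1
        else:
            current.append(ch)
            i += 1
    if "".join(current).strip():
        raise ValueError("every element must be closed by ';'")
    return elements


def classify(element):
    """Return (kind, label, value) where kind is URL, URN or URD."""
    match = IDENTIFIER.match(element)
    if match is None:
        raise ValueError("no leading identifier in %r" % element)
    name = match.group()
    rest = element[match.end():]
    if rest.startswith(":"):
        kind = "URN" if name == "URN" else "URL"
        return kind, name, rest[1:]
    # Assumption: anything else with "=" is a URD; the label is
    # everything before the first "=".
    label, sep, value = element.partition("=")
    if not sep:
        raise ValueError("neither ':' nor '=' after the label")
    return "URD", label, value


example = "<URN:IANA:0::drafts:1204;\nAuthor=Tim Berners-Lee;\nVersion=01;>"
for element in split_elements(example):
    print(classify(element))
```

Looking only at the character after the leading identifier keeps
the sketch short, but it is a simplification: a URD value may
itself contain ":" and a URL value may contain "=".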

Some nice consequences of this syntax:

+ It makes the syntax of an isolated URL or URN a special case
of a URC. Only one wrapper, that for URCs, is needed.

+ There is no need for special handling of line breaks, such as
the "start each continuation line with linear white space"
rule of RFC822. Nothing prevents dividing a long URD into
multiple lines in the most obvious way.

+ "<" and ">" need no special treatment in URI values.

+ The significance of line breaks and multiple spaces can be
defined differently for different types of URLs or URDs.

+ Language and coded character set are specified for each URD
separately.

I assume that URDs start with a field type name, indicating what
data about the resource is given by the value part. Like MIME
content types, all field types should have a publicly available
specification and a name registered with IANA. Experimental or
private fields can also be used, if their names start with "X-".
Different systems for representing bibliographic data may be
used. All field names in such a system should have the same
prefix, e.g. "IAFA-".

Two things complicate the syntax of URDs:

1) Language: Often the same unit of information about a resource
can be available in variants, translated to natural languages
other than the original language. To keep the label part of
these URDs different, the URD field name in the translated
URDs is followed by "." and a language tag. I see no need for
a language tag in the original, untranslated URD.

The language tag consists of a language code from ISO 639,
optionally followed by "-" and a language variant specifier.
The language code will thus consist of two international
letters a-z; after the forthcoming revision of ISO 639 also
three letter codes will be available. The language variant
specifier may consist of two letters, in which case it is an
ISO 3166 country code, or three or more letters, in which
case it denotes a special form of the indicated language and
should either be registered with IANA or be a private
specifier, starting with the letter "x".

2) Coded character set: It must be possible to use graphic
characters outside ASCII in URD values. (I don't think that
is necessary for the label part of the URD, though.) It is
also desirable that more powerful text representation methods
than "plain text" can be used, such as TeX for mathematical
formulas. The text representation method, including coded
character set, should be specified after the field name and
any language tag by a representation tag. (It can be omitted,
if the representation method is plain text in the ASCII coded
character set.)

This tag consists of a slightly adjusted MIME content type
specification. E.g. "/Text/Plain/charset:ISO-8859-1"
corresponds to the MIME header
"Content-Type: Text/Plain; charset=ISO-8859-1".

It should be pointed out that the introduction of character
set tagging means that two URCs which consist of different
byte sequences can still be exactly the same URC. This is
difficult to exemplify in an ASCII-only email message, but
suppose that

# stands for the octet value hex 7B, which represents the
character a-diaeresis in the Swedish 7-bit character set
SEN_850200_B, and

$ stands for the octet value hex E4, which represents
the same letter in ISO-8859-1.

Then

<Author/Text/Plain/charset:SEN_850200_B=Olle J#rnefors;>

will be exactly the same URC as

<Author/Text/Plain/charset:ISO-8859-1=Olle J$rnefors;>

The MIME standard includes two content-transfer-encoding
methods, Quoted-Printable and Base64, to be used when the
email transport protocol can't transport bytes > hex 7F
reliably, as in classical SMTP. I think that bytes > hex 7F
should also be allowed in URIs. To transport such URIs by
email, if they contain non-ASCII characters, a modern
protocol like 8-bit extended SMTP, or MIME over classical
SMTP, must be used. This is an acceptable restriction in my
opinion.

There are still problems with the more powerful multi-byte
character sets. ISO 10646 and Unicode use two-byte
representations of the ASCII characters that contain the NUL
byte, a byte which is problematic for many existing text
handling programs. Also, the representations of several
characters contain the byte hex 3B, which in ASCII represents
";", thereby corrupting the URC syntax. (This is true for
L-cedilla, Cyrillic l, and others.)

The UTF-2 encoding of 10646 and Unicode, which uses the
octets 00-7F only to represent the corresponding ASCII
characters and encodes the other two-byte 10646 characters by
sequences of two or three bytes, is probably the best means
to allow for use of these character sets in URCs. It is not
totally unproblematic, though, since it uses the high control
characters in the C1 area (bytes hex 80-9F) for representing
non-ASCII characters. These bytes are difficult or impossible
to display in the 8-bit ISO-standardized character sets used
in modern Unix systems.

I think that a workable solution would be to have this
restriction on coded character sets that can be used in the
value part of URIs:

The bytes hex 00-1F shall represent the same control
characters as in ASCII. All graphic characters with a
special significance in URC syntax, i.e.
< ; > : = . / * A-Z a-z 0-9 -
shall be represented by the same bytes as in ASCII.

Coded character sets that satisfy this requirement include
7-bit ISO 646 character sets, 8-bit ISO 8859 character sets,
the ISO-2022-JP character set used in email in Japan, and the
UTF-2 encoded form of ISO 10646/Unicode.

To make URIs robust, it is probably wise to prescribe the
following special encodings in the value part of URIs:
- Line breaks should be represented by an ASCII CR LF pair,
like in RFC 822 messages.
- other LF characters: %0A
- other CR characters: %0D
- NUL, very vulnerable: %00
- ESC, important in some two-byte character sets: %1B
- "%", otherwise unrepresentable: %%
- ";", otherwise unrepresentable: %;
Other sequences of "%" and two hexadecimal digits, if
occurring, should be interpreted as the corresponding byte.
Any other occurrence of "%" should be regarded as a data
error.

With these restrictions on non-ASCII character set use, it
should always be possible to treat URCs as text data rather
than binary data in Internet protocols.
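
The special encodings listed above can be sketched as a pair of
Python functions working on bytes. This is an illustration under
assumptions, not a definitive implementation: the names
encode_value and decode_value are invented here, the input to the
encoder is assumed to mark real line breaks as CR LF already (so
any lone CR or LF counts as an "other" CR or LF), and the encoder
always emits upper-case hexadecimal digits.

```python
HEX_DIGITS = b"0123456789ABCDEFabcdef"


def encode_value(raw):
    """Apply the proposed %-encodings to the bytes of a URI value."""
    out = bytearray()
    i = 0
    while i < len(raw):
        if raw[i:i + 2] == b"\r\n":
            out += b"\r\n"              # a line break stays a CR LF pair
            i += 2
            continue
        byte = raw[i]
        if byte == 0x25:                # "%"
            out += b"%%"
        elif byte == 0x3B:              # ";"
            out += b"%;"
        elif byte in (0x00, 0x0A, 0x0D, 0x1B):
            out += b"%%%02X" % byte     # NUL, lone LF, lone CR, ESC
        else:
            out.append(byte)
        i += 1
    return bytes(out)


def decode_value(encoded):
    """Undo encode_value; any other use of "%" is a data error."""
    out = bytearray()
    i = 0
    while i < len(encoded):
        byte = encoded[i]
        if byte != 0x25:
            out.append(byte)
            i += 1
            continue
        following = encoded[i + 1:i + 2]
        pair = encoded[i + 1:i + 3]
        if following in (b"%", b";"):
            out += following
            i += 2
        elif len(pair) == 2 and all(c in HEX_DIGITS for c in pair):
            out.append(int(pair.decode("ascii"), 16))
            i += 3
        else:
            raise ValueError("stray '%%' at offset %d" % i)
    return bytes(out)


sample = b"50% off; line one\r\nline two"
print(encode_value(sample))
print(decode_value(encode_value(sample)) == sample)
```

Working on bytes rather than characters matches the way the
restrictions are stated, and leaves the question of which coded
character set is in use to the representation tag.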

Besides URLs, URNs, and URDs, URCs may contain "structors", i.e.
elements whose sole purpose is to build a structure inside the
URC. Syntactically they consist of a keyword, the character "*",
an optional comment, and the final ";". Presently I'm satisfied
with two structors, BEGIN*; and END*; . They can be used to
group related URIs in a block structure with several levels of
nesting.
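
How a program might turn such structors into a nested structure
can be sketched in Python. This is only an illustration: the name
build_groups and the dictionary layout are invented here, the
input is assumed to be a list of elements already separated at
the closing ";", and white space around "*" is tolerated. The
comment after END* is treated as a pure comment and is not
compared with the comment after the matching BEGIN*.

```python
import re

# keyword, "*", optional comment
STRUCTOR = re.compile(r"(BEGIN|END)\s*\*(.*)", re.DOTALL)


def build_groups(elements):
    """Nest elements according to BEGIN*/END* structors."""
    root = {"comment": None, "items": []}
    stack = [root]                      # innermost open group is last
    for element in elements:
        match = STRUCTOR.fullmatch(element.strip())
        if match is None:
            # An ordinary URL, URN, URD or URF joins the current group.
            stack[-1]["items"].append(element)
        elif match.group(1) == "BEGIN":
            group = {"comment": match.group(2).strip(), "items": []}
            stack[-1]["items"].append(group)
            stack.append(group)
        else:
            if len(stack) == 1:
                raise ValueError("END* without a matching BEGIN*")
            stack.pop()
    if len(stack) != 1:
        raise ValueError("BEGIN* without a matching END*")
    return root


tree = build_groups([
    "Author=Tim Berners-Lee",
    "BEGIN * Current version",
    "Version=1",
    "END * Current version",
])
print(tree)
```

A stack is enough here because groups can only nest, never
overlap.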

A more complex example:

<URN:IANA:0::drafts:1204;
Author=Tim Berners-Lee;
Title=Uniform Resource Locators;
BEGIN * Current version;
Version=1;
BEGIN * Postscript;
Content-type=PostScript;
ftp://nic.nordu.net/internet-drafts/draft-ietf-uri-url-01.ps;
END * Postscript;
BEGIN * Official;
Content-type=Text/Plain;
Charset=US-ASCII;
ftp://nic.nordu.net/internet-drafts/draft-ietf-uri-url-01.txt;
END * Official;
END * Current version;
BEGIN * Older;
Version=0;
Content-type=Text/Plain;
Charset=US-ASCII;
ftp://nic.nordu.net/internet-drafts/draft-ietf-uri-url-00.txt;
ftp://ds.internic.net/internet-drafts/draft-ietf-uri-url-00.txt;
END * Older;
Fragment=Section:Abstract;>

Here I have also included a URF in the URC, the last element. To
be able to effectively specify a certain fragment of e.g. a
document, one has to both specify a _structuring method_
employed in the document such as its division into sections and
subsections, and one or several _parts_ of this structure which
constitute the fragment.

The URF has in principle the same syntax as a URD, using the
reserved field name "Fragment". The value part of the URF
consists of an identifier of the structuring method, the
character ":", and a list of one or more parts in the structure.
One structuring method, "Byte", is applicable to most resources.
The method "Section" should be applicable to most documents.
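
Separating the two halves of a URF value is simple enough to show
in a few lines of Python. The function name parse_fragment is
invented for this sketch. Since the format of the list of parts
is not defined here, the sketch returns it as one unsplit string
rather than guessing at a separator.

```python
def parse_fragment(value):
    """Split a URF value into (structuring method, list of parts)."""
    method, sep, parts = value.partition(":")
    if not sep or not method or not parts:
        raise ValueError("expected <method>:<parts>, got %r" % value)
    # The parts are left as-is; their syntax depends on the method.
    return method, parts


print(parse_fragment("Section:Abstract"))
```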

A final comment is that what I have described here is only a
proposed _canonical_ form of URCs.

When used internally by programs or in communication protocols,
other forms of URCs may be used, which e.g. drop the outer
<> pair. Also, URCs translated to EBCDIC will not be in the
canonical form. What's important is that there exist unambiguous
rules for computing the canonical form in such cases.

When displaying URCs for humans, or printing them, it's often
best to make another deviation from the canonical form of the
URC by using the _actual_ non-ASCII characters represented in
the URDs. Then the coded character set used in the URD, i.e. the
way of representing the characters in it, is immaterial, and the
representation tag after the field name should be omitted.

A 12-rule BNF syntax for Uniform Resource Citations, described
by the RFC822 metanotation:

1 URC = "<" *(URI ";") ">"

2 URI = URL_URN / URD_URF / structor

3 URL_URN = scheme_id ["/" MIME_value] ":" value
; Is a URN iff <scheme_id> = "URN". The
; <MIME_value> indicates the coded character
; set of <value>, if other than ASCII.

4 scheme_id = identifier ; Ex.: URN ftp gopher html wais

5 URD_URF = field_name ["." lang_tag] ["/" repr_tag] "=" value
; Is a URF iff <field_name> = "Fragment".

6 field_name = identifier ; Ex.: Content-Type IAFA-Author

7 lang_tag = identifier ; Ex.: en en-gb en-us eng sv swe jpn

8 repr_tag = MIME_type "/" MIME_subtype *("/" MIME_attribute ":" MIME_value)

9 structor = ("BEGIN" / "END") "*" [comment]
; Nestable. Corresponding "BEGIN *" and "END *"
; delimit a group of related URL_URNs, URD_URFs,
; and subgroups.

10 value = * graphic_character_except_semicolon

11 identifier = ASCII-letter * (ASCII-letter / digit / "-")

12 comment = * graphic_character_except_semicolon
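
As an illustration of how the label part of a URD (rules 5 to 8)
could be taken apart, here is a small Python sketch. It follows
the prose and the examples, where the language tag is opened by
"." and the representation tag by "/". The names parse_label and
content_type are invented here, and the sketch assumes that MIME
types, attributes and values contain no "/" themselves.

```python
import re

IDENT = r"[A-Za-z][A-Za-z0-9-]*"        # rule 11
LABEL = re.compile(
    r"(?P<field>%s)(?:\.(?P<lang>%s))?(?:/(?P<repr>.+))?" % (IDENT, IDENT)
)


def parse_label(label):
    """Split a URD label into field name, language tag and repr tag."""
    match = LABEL.fullmatch(label)
    if match is None:
        raise ValueError("not a valid URD label: %r" % label)
    result = {
        "field": match.group("field"),
        "lang": match.group("lang"),
        "type": None,
        "subtype": None,
        "params": {},
    }
    if match.group("repr"):
        pieces = match.group("repr").split("/")
        if len(pieces) < 2 or not pieces[0] or not pieces[1]:
            raise ValueError("repr tag needs a type and a subtype")
        result["type"], result["subtype"] = pieces[0], pieces[1]
        for piece in pieces[2:]:
            attribute, sep, value = piece.partition(":")
            if not sep or not attribute:
                raise ValueError("expected attribute:value in %r" % piece)
            result["params"][attribute] = value
    return result


def content_type(parsed):
    """Render the repr tag as the corresponding MIME header, if any."""
    if parsed["type"] is None:
        return None
    header = "Content-Type: %s/%s" % (parsed["type"], parsed["subtype"])
    for attribute, value in parsed["params"].items():
        header += "; %s=%s" % (attribute, value)
    return header


parsed = parse_label("Title.sv/Text/Plain/charset:SEN_850200_B")
print(parsed)
print(content_type(parsed))
```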

--
Olle Jarnefors, Royal Institute of Technology, Stockholm <ojarnef@admin.kth.se>