Black Boxes

Ramin Firoozye (rpa@netcom.com)
Mon, 18 Oct 93 14:16:37 PDT

From: rpa@netcom.com (Ramin Firoozye)
Message-Id: <9310182116.AA00661@netcom4.netcom.com>
Subject: Black Boxes
To: uri@bunyip.com (URI Mailing List)
Date: Mon, 18 Oct 93 14:16:37 PDT

Fellow URI's...

I have been following the URI debate with a great deal of interest, being
involved in the design and development of a distributed digital library.

What I find interesting is the degree of effort spent on the syntax
of the URL's vs. URN's, and the interesting technique of mixing
internal properties of an identifier with external issues such as how
it is to be used.

I think the whole URI thing is begging for clarification and simplification.
The original goal of being able to uniquely name something has subtly
shifted to being able to find something, and then again to being able to
identify something if you don't have its name. Mingled into this
is the tiptoeing around existing schemes and backward compatibility issues.

Now I don't mean to cast stones without proposing alternatives. I think
it's important to come up with something that works. But I think trying
to throw the kitchen sink onto the heap is counter-productive. Nobody
is going to use the damn thing if it's too complicated and tries to be
the universal Sword of Damocles (had enough cliches? :-)

Having naming domains is a good step. Once you're inside that domain,
the rules for the contents are specific to that domain. I like that.
Having further breakdowns into subdomains is also fine, because it
partitions the name-space and reduces the chances of collision. But
debates on %20 vs. quotes and wrappers vs. terminating triple-colons
are like arguing about the color of internal engine components: irrelevant
and esoteric...

As I mentioned, I'm not into tossing stones casually. I have had to deal
with an interesting problem which will help illustrate my suggestions.

The problem, briefly, was supporting file operations across different
platforms. This was to be implemented as a set of C++ class-libraries that
encapsulated the basic functionality. Different file syntaxes and semantics
had to be tackled and somehow "hidden" from view. The variations in syntax
are well known:

Unix- /home/users/test/The Note
VMS- DISK0:[test,user]NOTENAME.TXT;1
DOS- C:\TEST\NOTE.TXT
Macintosh- Disk::TestDirectory:Note File

On the surface, you had to deal with the disparate filesystem syntaxes of
slashes vs. backslashes vs. colons, etc. Once you dug in deeper, it
became obvious that naming was only the tip of the iceberg. The operations
you wanted to perform on the files and the contents became relevant as well.

Having to do a class library forces one into abstractions. In this case,
one single abstraction that would handle all the possible uses for the files
was not feasible. Also, backward compatibility with existing filesystem
functions was key, since some OS'es allowed functions specific to their
filesystem which could not be universally abstracted.

Sounds familiar?

The solution was elusive and took a long time to flesh out but it basically
involved the following scenario:

- Design a universal "canonical" naming scheme that can handle all the
variations in the naming systems. This could be some sort of a 'tagged'
property list, or a unique name-space designator with a local name
attached. This "name" was internal to the class library. It could be
used for external storage directly, but only if it was passed back
directly to the class library. In other words, you weren't meant to see
it, but could use it by storing it in a file and returning it to the
class library verbatim. The user would not be allowed to manually
construct and decompose the contents. The syntax would be "hidden."
You effectively have an "internal" and an "external" representation.

- Design a set of external interfaces to this canonical form.
They mapped between the internal and the many external representations.
These interfaces allowed the name to be decomposed into its parts. If you
wanted the file name, you passed the canonical value and asked for the
name component. On some systems, the information was not directly part of
the name, and required additional operations, but this was all hidden
from the user. They didn't care how it was done. They just wanted it done.

The added benefit of this split in internal and external representation
was that standard cross-platform mapping schemes would be "freebies" here.
Converting from a Mac filename to a DOS filename was a two-step process
of going to canonical form and back out to the DOS form.

- Another class library was devised to handle a basic set of operations
on the files: read, write, create, delete, etc. Operations on
collections of names, such as directory operations, were put in another class.
Pretty soon, you had a good set of functions that were portable across
all platforms. The local class libraries hid the internal
implementation.

A higher-level abstraction allowed file contents to be created and indexed.
This allowed files to be exchanged between systems regardless of their
contents. The higher-level classes depended on the lower-level classes
for locating files, opening, and reading the contents. The high-level
classes would hide the actual file content structure from the application.

- What about local OS functions? What if you wanted to do a custom
function that could only be done on a given file-system? Well, here's
where the canonical vs. local form came in handy. For each class, there
were methods to allow the object being operated on to be addressed as
a local object. For example, if you wanted to get the size of the resource
fork of a file under MacOS, you converted the internal form into a
local FileSpec and called standard Mac toolbox calls to get the information.
The rule, of course, was that if you started messing around inside the
files using local functions, the higher level functions would fall out
of sync and become invalidated. Caveat Emptor.
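
The naming half of the scheme above can be sketched roughly as follows. This
is only an illustration, not the original class library: the CanonicalName
struct, the from_*/to_* converters, and file_name() are all names made up
here, only the Unix and DOS forms are covered, and the DOS side ignores 8.3
name limits and falls back to an assumed drive "C" when no volume is known.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical canonical form: a volume plus a list of path components.
// Application code is meant to treat it as opaque and go through the
// converter functions below instead of picking it apart by hand.
struct CanonicalName {
    std::string volume;                   // empty when the platform has none
    std::vector<std::string> components;  // directories, then the file name
};

// Split text on one separator character, skipping empty pieces.
static std::vector<std::string> split(const std::string& text, char sep) {
    std::vector<std::string> parts;
    std::string piece;
    std::istringstream in(text);
    while (std::getline(in, piece, sep)) {
        if (!piece.empty()) parts.push_back(piece);
    }
    return parts;
}

// Join pieces with one separator character between them.
static std::string join(const std::vector<std::string>& parts, char sep) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i > 0) out += sep;
        out += parts[i];
    }
    return out;
}

// External -> internal.
CanonicalName from_unix(const std::string& path) {
    return CanonicalName{"", split(path, '/')};
}

CanonicalName from_dos(const std::string& path) {
    CanonicalName name;
    std::string rest = path;
    if (path.size() >= 2 && path[1] == ':') {  // drive letter, e.g. "C:"
        name.volume = path.substr(0, 1);
        rest = path.substr(2);
    }
    name.components = split(rest, '\\');
    return name;
}

// Internal -> external.
std::string to_unix(const CanonicalName& name) {
    return "/" + join(name.components, '/');  // the volume is dropped here
}

std::string to_dos(const CanonicalName& name) {
    // Assumption: default to drive C when the source form had no volume.
    const std::string drive = name.volume.empty() ? "C" : name.volume;
    return drive + ":\\" + join(name.components, '\\');
}

// Ask the canonical form for one of its parts: here, the file name.
std::string file_name(const CanonicalName& name) {
    assert(!name.components.empty());
    return name.components.back();
}
```

With these in place, a cross-platform conversion is the two-step trip
described above, e.g. to_dos(from_unix(path)), and supporting another
platform means adding one more from_*/to_* pair rather than touching the
callers.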

How does this relate to the URI issue?

The goal is to come up with a standard naming scheme that can satisfy a large
number of uses. A name carries a lot of excess information with it.
Some want to hide this away as external "citations," whereas others want
to embed it inside the identifier. Should URL's support multi-byte characters?
Internally, YES. Externally, it depends on the external form. The English
language scheme can translate from Kanji names to English names by supporting
phonetic spelling from Kanji to the canonical form, and from the internal
form to English. What about WAIS doc-id's vs. URI's? What about
URI's vs. filenames for FTP or Gopher? Again, you should be able to go
from internal form to any of these external forms consistently.

The canonical form may be designed to be efficient in storage (perhaps
an index into a table) but inefficient in speed (requires extra lookups).
Or it might be efficient for speed (index-tagged fields) but inefficient in
storage (extra data for unused fields). It may even be that there
are two internal forms and routines to translate between the two. One
would be used for fast in-memory access, the other for storage in a file
or transmission across a line.
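
To make the two-internal-forms idea concrete, here is a small sketch. The
names ExpandedName, CompactName, and NameTable are invented for illustration,
as are the three fields. The expanded form keeps every field in place for
fast access. The compact form is just an index into a table, which costs an
extra lookup to get back out. The sketch also assumes field values never
contain a newline, since that is used to build the lookup key.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// "Fast" form (hypothetical): every field is expanded in place.
struct ExpandedName {
    std::string scheme;
    std::string host;
    std::string path;
};

// "Compact" form (hypothetical): an index into a shared table.
using CompactName = std::size_t;

class NameTable {
public:
    // Expanded -> compact. Identical names share one table slot.
    CompactName compact(const ExpandedName& name) {
        // Assumption: no field contains '\n', so the key is unambiguous.
        const std::string key =
            name.scheme + '\n' + name.host + '\n' + name.path;
        auto found = index_.find(key);
        if (found != index_.end()) return found->second;
        entries_.push_back(name);
        const CompactName id = entries_.size() - 1;
        index_[key] = id;
        return id;
    }

    // Compact -> expanded. This is the extra lookup the compact form costs.
    ExpandedName expand(CompactName id) const {
        assert(id < entries_.size());
        return entries_[id];
    }

    std::size_t size() const { return entries_.size(); }

private:
    std::vector<ExpandedName> entries_;
    std::unordered_map<std::string, CompactName> index_;
};
```

Note that a bare index only means something to whoever holds the same table,
so a form meant for transmission would have to carry or agree on the table
as well.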

The point is, the internals of this form should be irrelevant to users.
It would be relevant to those coming up with naming schemes and responsible
for converting to<->from the canonical form. The abstraction also calls for
supporting "common" high-level operations that would need to be performed
on the objects. Operations like create, delete, open, close, etc... would
be implemented in a standard form. And if any local service needs to
perform "local" operations, fine. Convert the canonical to a local form
and let the service do with it as it pleases.
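
As a rough sketch of what "common operations plus a local escape hatch" might
look like as an interface: everything below is hypothetical, including the
ObjectStore and MemoryStore names and the "mem:" local form. A plain string
stands in for the canonical form, and an in-memory map stands in for a real
service so the example stays self-contained.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Stand-in for the opaque canonical form (an assumption for this sketch).
using CanonicalId = std::string;

// The standard part: operations every service agrees to provide.
class ObjectStore {
public:
    virtual ~ObjectStore() = default;
    virtual void create(const CanonicalId& id, const std::string& contents) = 0;
    virtual std::string read(const CanonicalId& id) const = 0;
    virtual void remove(const CanonicalId& id) = 0;
    virtual bool exists(const CanonicalId& id) const = 0;

    // The escape hatch: hand back whatever name the local service
    // understands, so service-specific calls can be made directly.
    virtual std::string local_form(const CanonicalId& id) const = 0;
};

// One possible implementation, kept entirely in memory.
class MemoryStore : public ObjectStore {
public:
    void create(const CanonicalId& id, const std::string& contents) override {
        assert(!id.empty());
        objects_[id] = contents;
    }
    std::string read(const CanonicalId& id) const override {
        auto it = objects_.find(id);
        if (it == objects_.end()) {
            throw std::out_of_range("no such object: " + id);
        }
        return it->second;
    }
    void remove(const CanonicalId& id) override { objects_.erase(id); }
    bool exists(const CanonicalId& id) const override {
        return objects_.count(id) != 0;
    }
    std::string local_form(const CanonicalId& id) const override {
        return "mem:" + id;  // made-up local syntax for this sketch
    }

private:
    std::map<CanonicalId, std::string> objects_;
};
```

Callers written against ObjectStore never see how a given service stores
things. Only code that asks for local_form() ties itself to one service.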

In a nutshell, my point is: The interface is the standard, not the
implementation. Let's black box this sucker...

One might argue that all this converting back and forth is inefficient and
slows things down. But remember, if you make the system consistent and
the routines complete, you can convert once into internal form and stay with
it until you need to convert it back out into external form. You can do the
bulk of your operations behind the scenes using the internal form.
And by picking the speed vs. space version of the canonical form, you can
optimize internal operations as well.

Anyway, y'all get the general idea. The point I was trying to make was that
all this discussion of representation and syntax is valid but misplaced.
If you need a standard representation for printing URI's in bibliographic
form, you need the Printing<->Canonical functions. If you want %20,
call it "Printing Form #2" and pop in the right converter.
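
For instance, two interchangeable printing forms might look like the sketch
below. The function names are made up, a plain string stands in for the
canonical form, the set of characters left unescaped is an arbitrary choice
for illustration, and the quoted form assumes the name contains no quote
characters.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Printing form #1 (hypothetical): wrap the name in double quotes.
// Assumption: the name itself contains no double quotes.
std::string to_quoted_form(const std::string& canonical) {
    return "\"" + canonical + "\"";
}

// Printing form #2 (hypothetical): write anything outside a small safe
// set as %XX, so a space comes out as %20.
std::string to_percent_form(const std::string& canonical) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : canonical) {
        const bool safe = std::isalnum(c) || c == '/' || c == '.' ||
                          c == '-' || c == '_' || c == ':';
        if (safe) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        }
    }
    return out;
}

// Printing form #2 back to the canonical string.
std::string from_percent_form(const std::string& printed) {
    std::string out;
    for (std::size_t i = 0; i < printed.size(); ++i) {
        const bool escaped =
            printed[i] == '%' && i + 2 < printed.size() + 1 &&
            i + 2 <= printed.size() - 1 &&
            std::isxdigit(static_cast<unsigned char>(printed[i + 1])) &&
            std::isxdigit(static_cast<unsigned char>(printed[i + 2]));
        if (escaped) {
            out += static_cast<char>(
                std::stoi(printed.substr(i + 1, 2), nullptr, 16));
            i += 2;  // skip the two hex digits just consumed
        } else {
            out += printed[i];
        }
    }
    return out;
}
```

Which form gets printed is then a matter of which converter you call. The
canonical value underneath does not change.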

Lest I be accused of ++'ism, I am not proposing this be done in C++
(although there should be a simple C++ class library for those of us into
OOze). The basic concept is that you standardize the interface, not the
internal implementation.

To promote early adoption of the interface, I would also suggest
that a standard portable interface library be created and the source made
freely available to all users. The source code would be under the auspices of
some standards body like IETF or OSI, and would be the official interface
to the URI scheme.

'nuff said.

Comments and jeers encouraged.

Cheers,
Ramin.

-- 
Ramin Firoozye' 
rp&A Inc. - San Francisco, California
Internet: rpa@netcom.COM - CIS: 70751,252
--