Re: <URL:...> considered harmful

Tim Berners-Lee (timbl@quag.lcs.mit.edu)
Thu, 15 Sep 94 15:06:16 -0400

Date: Thu, 15 Sep 94 15:06:16 -0400
From: Tim Berners-Lee <timbl@quag.lcs.mit.edu>
Message-Id: <9409151906.AA00511@quag.lcs.mit.edu>
To: Chris Weider <clw@mocha.bunyip.com>
Subject: Re: <URL:...> considered harmful

> From: Chris Weider <clw@mocha.bunyip.com>
> Since I am the one who proposed the wrapper in the first place, let me
> state why I think we *still* need something like this, and suggest some
> possibilities for a modified wrapper now that we threw out the URL:
> prefix at the last IETF.

Leaving the URL wrapper thrown out is a good thing.
The URL spec is too important to wait for our wooly deliberations
over the plain text wrapper. So irrespective of the spec, let
us deliberate.

> We still need a way to distinguish a URL in plain text. Using a
> scheme-based recognition technique, which looks for a valid scheme and
> then extracts the rest of the line (or the rest of the line up to the
> next white space) has several problems. They are:
> Scheme recognition. The number of new schemes will constantly increase.
> Thus, without a generic wrapper, sites which have not installed the
> latest set of schemes into their extraction tool will not be able to
> correctly identify valid URLs embedded into text. A *human* might be
> able to, if they are familiar with all the schemes, but there will
> still be many that are missed by an automated scheme.
> (I'm disregarding here the actual resolution of the URL).

Agreed. Although in fact <[a-z0-9.]*:[a-zA-Z0-9/_.+etc]*>
will work fine without the "URL:".
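
As a rough illustration of the pattern above, here is a minimal Python
sketch. The message leaves the path character class open ("etc"), so
the set used below is an assumption, and the name find_wrapped is
hypothetical. The sketch also uses "+" rather than "*" so that an empty
scheme or an empty remainder is not accepted.

```python
import re

# Assumed character classes: the "etc" in the message is not spelled out,
# so the path characters listed here are a guess, not part of any spec.
WRAPPED = re.compile(r"<([a-z0-9.]+:[A-Za-z0-9/_.+@~%?=&#-]+)>")

def find_wrapped(text):
    """Return the contents of every <scheme:rest> wrapper found in text."""
    return WRAPPED.findall(text)

sample = "See <http://info.cern.ch/> or mail <mailto:timbl@w3.org>, but not <b>."
print(find_wrapped(sample))
# prints ['http://info.cern.ch/', 'mailto:timbl@w3.org']
```

Because nothing in the pattern names a particular scheme, a scheme the
tool has never heard of is picked up just as well as a familiar one.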

We have to be wary of looking for something which will
work 100% of the time, as we can *never* have that, because
we can *never* exclude any syntax from cropping up
elsewhere in plain text. There is always the possibility
of ambiguity, so we will always technically have a heuristic,
even though the <> convention means it works in all but
pathological cases.

> Line length. The proposals I've seen for the X.500 URL will require far

Agreed -- a point for <>.

> Human recognition. What's my current algorithm? Look for colons and
> then scan the surrounding text hoping to recognize some URL format?
> I think that we can be substantially more friendly than that.

Human recognition is not really a problem: the amazing brains we
have use context, from which it is clear that something is a
reference, and a lot more besides. Also, humans recognize mail
addresses as such in practice, so the *human* recognition of
the existence of the URL is not a problem, I feel. There is
a delimiter problem, which Dan rightly points out, especially
with trailing punctuation, which is why the <> are useful.
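
The trailing-punctuation problem can be shown with a small Python
sketch. Both patterns below are hypothetical scanners written for this
illustration, not taken from any existing tool: one runs from the
scheme to the next white space, the other relies on the <> delimiters.

```python
import re

# Hypothetical scanners, for illustration only.
BARE = re.compile(r"[a-z0-9.]+:\S+")              # scheme, colon, run to white space
WRAPPED = re.compile(r"<([a-z0-9.]+:[^<>\s]+)>")  # same idea, delimited by <>

bare_text = "The spec is at http://info.cern.ch/spec."
wrapped_text = "The spec is at <http://info.cern.ch/spec>."

bare_hits = BARE.findall(bare_text)
wrapped_hits = WRAPPED.findall(wrapped_text)

print(bare_hits)     # prints ['http://info.cern.ch/spec.']  (full stop swallowed)
print(wrapped_hits)  # prints ['http://info.cern.ch/spec']
```

The bare scanner cannot tell whether the final "." belongs to the URL
or to the sentence; with the wrapper the question does not arise.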

> So, having said that, let me propose a solution. I freely admit that my
> suggested wrapper doesn't fit into the 'sgml'ish flavor of HTML.
> So. Two suggestions that may fit better...
>
> 1: Highly recommending the anchor syntax (with surrounding <A> and </A>)
> for all URLs quoted in free text. This allows the immediate display of
> any text based document (with the appropriate semantics) through Mosaic.

I have a serious problem with this, in that you *can't*
use SGML syntax in plain text. Yes, I know Mosaic does it but it
is very sloppy practice. Text is plain or is SGML.
If it's SGML you have to escape all "<" and "&" in the text
to something else. As a markup language for
slightly enriched but human-readable text, SGML stinks.

If you want to use SGML, just pipe the thing through
text2html.sed and make it real text/html, and then the
software will be able to handle it in a well defined way.
If we want to use plain text, though, we should keep it
easy to write and easy to read without extra tools.
A halfway house will be untenable and we will be cursing
it henceforth.
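
To make the escaping requirement concrete, here is a minimal Python
sketch of turning plain text into well-formed markup. It is a stand-in
written for illustration, not the text2html.sed script itself; the
function name text_to_html and the choice to wrap everything in a
single <pre> element are assumptions.

```python
import html

def text_to_html(plain):
    """Wrap plain text in <pre>, escaping the characters SGML treats as markup."""
    # quote=False: only &, < and > are escaped, which is enough for
    # element content (quotes matter only inside attribute values).
    return "<pre>\n" + html.escape(plain, quote=False) + "\n</pre>"

print(text_to_html("if a < b && c: see <http://info.cern.ch/>"))
```

After conversion the <> around the URL are no longer markup but escaped
text, which is the point: a document is either plain text or real
text/html, and the same "<" cannot safely play both roles.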

> 2: The development of a new tag, call it URI, for example,
> <uri ref="http:blah/blah/blah"> and highly recommending its use. This is
> perhaps less general, but is a fairly useful hack in my opinion, and
> allows all types of references to be placed inside.

Again, the objection to being SGMLish but simpler.
Believe me, I tried this route a long time ago. :-}

> In either case, I hope that I've convinced you of the necessity of a
> wrapper. Tools are already being developed to take (for example) e-mail
> and extract the URLs: if we can make their job easier, I think that will
> be a major win.

Yup -- but I want wysiwyg HTML.

> Yes, it does mean that we have to make some changes now, but I
> believe that this will save us a lot of trouble in several years.

Let us *not* change the spec. But let us adopt the convention
in practice. It is only a convention, and we can never enforce
what people put in plain text.

> Chris Weider

Tim

<mailto:timbl@w3.org>