This site contains older material on Eiffel. For the main Eiffel page, see http://www.eiffel.com.

Comp.risks discussion about robots.txt

Many thanks to all the people who responded (including those who disagreed with my points). A few were kind enough to include some nice comments about my real work; thanks.

      NOTE (14 February 1998): I have added a new batch of messages received on the topic. Thanks for continuing the input. I was particularly interested to see the comments by authors of the two pages cited in my second posting. I also find the last message quoted particularly relevant (well, I may be saying this because it reinforces my point).

Bertrand Meyer

My original posting

(See also the second posting.)

robots.txt: ``Here is what I am not telling you.''
Bertrand Meyer (bertrand@eiffel.com)
Sun, 25 Jan 98 15:04:03 PST


Imagine the head of a large corporation who tells a TV interviewer: "Here is
the list of topics that I don't want anyone to know we are discussing with
potential business partners: ...". Or, for that matter, a President who
tells us "these are the young women I would hate you to know I had dealings
with in the past few years: ...". Pretty dumb, wouldn't it be?

Now look at the "Robot Exclusion Standard" (I think that's what it's called)
for Web sites. The need is clear: you may want to exclude some of the pages
on your Web site from consideration by the indexing "robots" -- Yahoo,
AltaVista and the like.  The solution is, how should I say, interesting: you
put at the top level of your site a file conventionally called `robots.txt',
which lists the directories that should not be indexed; well-behaved robots
will check it, and dutifully oblige.
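
To make the convention concrete, here is a minimal sketch in Python of how a
well-behaved robot might consult the file before fetching a page. It uses the
standard library's urllib.robotparser. The rules, the robot name "ExampleBot"
and the example.com URLs are invented for illustration and do not come from
any real site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, made up for this sketch.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /drafts/
Disallow: /cgi-bin/
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text directly, without any network access."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(EXAMPLE_ROBOTS_TXT)

# A polite robot asks before every fetch and skips whatever is disallowed.
allowed = parser.can_fetch("ExampleBot", "http://example.com/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.com/drafts/plan.html")
print(allowed, blocked)
```

Nothing here is enforced by the server: the check happens entirely in the
robot, which is why only well-behaved robots are affected.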

Now whoever thought up that scheme must have been very smart, but the
smartness somehow eludes me. The file must obviously be world-readable, so
anyone can go to the top level of a site and look up `robots.txt' with a
plain browser. This is a good way to find out what the site owner doesn't
want you to know about. You don't see what's in the secret directories, of
course (well, assuming the Webmaster has done his job and made them
non-world-readable) [*], but you see what the secret directories are, and
just that can be quite valuable information.
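
As a rough illustration of how little effort this takes, the following Python
sketch pulls the excluded paths out of the text of a robots.txt file. The
function name and the sample file are assumptions made for this example; the
sample does not reproduce any real site's file.

```python
from typing import List

# Hypothetical file content, invented for this sketch.
EXAMPLE = """\
User-agent: *
Disallow: /partners/   # hypothetical directory name
Disallow: /cgi-bin/
Disallow:
"""


def disallowed_paths(robots_txt: str) -> List[str]:
    """Return every non-empty path named on a Disallow line, in file order."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop trailing comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths


for path in disallowed_paths(EXAMPLE):
    print(path)
```

The output is only a list of names. Whether the directories behind those
names can actually be read depends on the access controls of the server.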

The `robots.txt' scheme as it exists is acceptable if you simply want to
avoid having some of your Web information indexed by the search engines, for
example because it is in draft form or of time-limited value. But it is not
appropriate if your goal is to put on your Web site some secret information
that is only meant for some trusted partners. Yet there is a serious
possibility that unsuspecting companies will misuse the scheme for the
second of these applications.

This is not merely a hypothetical possibility. Just for fun I looked up
`robots.txt' for the Web sites of four or five well-known IT companies;
although regrettably I didn't find out any major scoop, I could see quite
clearly some of the topics those companies do not want others to know they
are working on.

The whole matter is very surprising, as the risk seems rather obvious and it
is not hard to think of alternative techniques that would have avoided it.

Bertrand Meyer, Interactive Software Engineering Inc., makers of ISE Eiffel
(Bertrand.Meyer@eiffel.com), http://eiffel.com

  [* ... at least not without exploiting various security flaws.  PGN]

My second posting

(See also the first posting.)
(This is an answer to the set of messages reproduced below, but it seems to make sense to put it in front.)

I have received a flurry of responses to my article describing the risks
associated with the `robots.txt' convention for excluding search engines
from indexing parts of a Web site.

I apologize for not responding individually to all the people who wrote to
me.  I have put, however, all the answers in a Web page, for the benefit of
anyone who cares to consult them:

  http://eiffel.com/private/meyer/robots.html
  (available Saturday, Jan 31st, 18:00 California time).

The common theme of the answers can be summarized as follows: I was wrong to
criticize the robots.txt design because it is not meant to protect pages,
simply to keep search engines away from pages that are not *worth* indexing,
e.g. because they are of temporary value. To quote one correspondent, Osma
Ahvenlampi:

  > Robots.txt is a way to protect your web server from being overloaded by a
  > dumb robot in a cgi loop, not a security tool. This much should be obvious
  > to anyone capable to be in charge of web site administration.

or, according to Chris Cheyney:

  > Anyone stupid enough to leave a network open and count on the optional
  > robots.txt robot exclusion de-facto standard for security gets (and should
  > get) what he deserves.

Among the people making similar points: Thomas Andrews
(thomaso@andromedia.com), Nelson Minar (nelson@media.mit.edu), John R.
Levine (johnl@iecc.com), Jeremy Nelson (jem@stairways.com.au), Barry
Margolin (barmar@bbnplanet.com), Laurentiu Badea (byte@lmn.pub.ro), Klaus
Johannes Rusch (KlausRusch@atmedia.net). Again, see the Web page for the
details of their comments.

I stand by my original assessment:

1. If every facility was always used as its designers intended, the RISKS
   archives would be noticeably slimmer. Here the possibility of misuse
   seems rather considerable. If you are just a bit absent-minded, isn't it
   natural to use this mechanism to exclude stuff from being indexed and
   hence believe no one will find it? "Stupid", maybe -- but not unlikely.
   After all, the designers of the Mercedes A-Class car could also say
   "anyone stupid enough to swerve violently when an elk crosses the road
   gets (and should get) what he deserves". Unfortunately for them, and
   probably fortunately for most of us, that doesn't pass muster.

2. For anyone who thinks this is just a hypothetical possibility, here is
   the robots.txt file of the site of a major communications company:

     User-agent: *
     Disallow: /bug-navigator # Bug Data
     Disallow: /warp/customer # Registered Users
     Disallow: /kobayashi # Navigation for registered
     Disallow: /cgi-bin # no programs
     Disallow: /pcgi-bin # no programs
     Disallow: /univ-src/ccden # will get content through /univercd
     Disallow: /cpropub/univercd # obsolete

   The first two lines at least suggest to me that this is stuff that the
   company doesn't want publicized -- for security reasons, not because it
   is of temporary value. Were I a "hacker" in the bad sense of the term, I
   would revel in such information, as it would direct my efforts to the
   really juicy bits.

   Here is an extract from another page -- I'll let you guess the URL:

     # o Created this file to prevent indexing of one
     # SME directory.

     User-agent: *

     Disallow: /sparc/SPARCengineUltraAX/oem/
     Disallow: /microelectronics/SPARCengineUltraAX/oem/
     Disallow: /javachip/SPARCengineUltraAX/oem/
     Disallow: /javachips/SPARCengineUltraAX/oem/

     Disallow: /sparc/SPARCengineUltraAX/download/
     Disallow: /microelectronics/SPARCengineUltraAX/download/
     Disallow: /javachip/SPARCengineUltraAX/download/
     Disallow: /javachips/SPARCengineUltraAX/download/

   I can't say for sure, but doesn't some of this look a tad like
   proprietary information?

3. So even if the respondents are right that it is "stupid" to use
   robots.txt in that way, my posting at least draws attention to the risk.
   If it succeeds in making just one Webmaster a bit more careful, it will
   not have been useless.

4. Of course designers cannot always be blamed for misuses of their
   mechanisms. But they should minimize the possibility of misuses. In the
   robots.txt case it seems to me rather wrong to have a conspicuous
   world-readable file that draws attention to *excluded* information.
   (Reminds me of programming languages which implement information hiding
   by making the author of each module list conspicuously, as the first
   thing you read in the module's text, those features which are *not*
   exported!) This draws attention to what should not attract attention. I
   think that a more effective convention would have been to include a
   special marker (META tag?) in HTML files that shouldn't be indexed, and
   a special file (exclude.txt?) in the directories that should not be
   explored at all. Then you would only be able to find that information if
   you already knew where to look. The robots.txt mechanism is a godsend
   for Peeping Toms in search of possible secrets.

(Thanks too to Marc Horowitz (marc@cygnus.com) and Rik Moonen
(rik.moonen@technopol.be) for their comments.)

Bertrand Meyer, Interactive Software Engineering, makers of ISE Eiffel
(Bertrand.Meyer@eiffel.com), http://eiffel.com
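
The per-file marker suggested in point 4 of the posting above can be sketched
as follows. The sketch assumes one existing convention of this kind, a META
tag named "robots" whose content includes "noindex". The posting itself only
says "META tag?", so this may not be exactly what was intended. The class and
function names are made up for the example.

```python
from html.parser import HTMLParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a page carries <meta name="robots" content="...noindex...">."""

    def __init__(self):
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() != "robots":
            return
        directives = [d.strip().lower() for d in attributes.get("content", "").split(",")]
        if "noindex" in directives:
            self.noindex = True


def page_asks_not_to_be_indexed(html_text: str) -> bool:
    """True if the page itself asks robots not to index it."""
    finder = RobotsMetaFinder()
    finder.feed(html_text)
    return finder.noindex


# Hypothetical pages, invented for this sketch.
DRAFT_PAGE = '<html><head><meta name="robots" content="noindex, nofollow"></head><body>Draft</body></html>'
PUBLIC_PAGE = "<html><head><title>Home</title></head><body>Welcome</body></html>"
print(page_asks_not_to_be_indexed(DRAFT_PAGE), page_asks_not_to_be_indexed(PUBLIC_PAGE))
```

With such a marker a robot learns of the request only after it has already
found and fetched the page, so no central list of excluded locations exists.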

From thomaso@andromedia.com Mon Jan 26 15:11:10 1998


Sender: thomaso@andromedia.com
Date: Mon, 26 Jan 1998 15:09:47 -0800
X-Mailer: Mozilla 4.04 [en] (X11; I; SunOS 5.5.1 sun4u)
To: bertrand@eiffel.com, risko@csl.sri.com
Subject: RE: robots.txt: ``Here is what I am not telling you.''
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Length: 639
Status: RO
X-Lines: 18

The "robot exclusion standard" is not meant to hide information.
It is meant to say, "this is a black hole and it will be a real
pain for me and a waste of time for you to try to index all of
this."

For example, you have an online dictionary, but instead of using
CGI normally, you have CGI that looks like:

  /dictionary.cgi/look-up/antidisestablishmentarianism

On the definition page, you might have other links to related
words or links on words in the definition, and the robot could
basically walk your entire dictionary.
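
A small simulation may help picture this. The Python sketch below walks a
made-up link graph breadth-first, once with no exclusions and once while
skipping an excluded prefix. The site layout, the word list and the function
name are assumptions for illustration only.

```python
from collections import deque

# Hypothetical site: every definition page links to related words.
LINKS = {
    "/index.html": ["/about.html", "/dictionary.cgi/look-up/apple"],
    "/about.html": [],
    "/dictionary.cgi/look-up/apple": ["/dictionary.cgi/look-up/fruit"],
    "/dictionary.cgi/look-up/fruit": [
        "/dictionary.cgi/look-up/apple",
        "/dictionary.cgi/look-up/seed",
    ],
    "/dictionary.cgi/look-up/seed": ["/dictionary.cgi/look-up/fruit"],
}


def crawl(start, disallowed_prefixes=()):
    """Breadth-first walk; returns the pages fetched, skipping excluded prefixes."""
    fetched = []
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if any(page.startswith(prefix) for prefix in disallowed_prefixes):
            continue  # a polite robot never requests this page
        fetched.append(page)
        for target in LINKS.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return fetched


everything = crawl("/index.html")
polite = crawl("/index.html", disallowed_prefixes=("/dictionary.cgi",))
print(len(everything), len(polite))
```

In a real dictionary the first walk would touch every entry, while the second
stops at the edge of the excluded area.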

If you want to protect your site from prying eyes, it is best
to use normal web security measures.

=thomas

From nelson@pinotnoir.media.mit.edu Mon Jan 26 15:36:56 1998


Sender: nelson@media.mit.edu
To: bertrand.meyer@eiffel.com
Subject: Re: robots.txt: ``Here is what I am not telling you.''
Date: 26 Jan 1998 18:35:33 -0500
Lines: 24
X-Mailer: Gnus v5.3/Emacs 19.34
Content-Length: 1153
Status: RO
X-Lines: 24

  >The `robots.txt' scheme as it exists is acceptable if you simply want to
  >avoid having some of your Web information indexed by the search engines, for
  >example because it is in draft form or of time-limited value.

Which is exactly what robots.txt was designed to do.

  >But it is not appropriate if your goal is to put on your Web site
  >some secret information that is only meant for some trusted partners.

Of course it's not appropriate for that! If you want to keep something
secret from the public at large, you have to authenticate the
requestor. That's what passwords and host-based authorization are for.
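
As one hedged illustration of host-based authorization, the Python sketch
below admits a request only when the client address falls inside an allowed
network. The network shown is a reserved documentation range and the function
name is invented. A real server would apply such a rule in its own access
configuration rather than in application code like this.

```python
import ipaddress

# Hypothetical allow-list; 192.0.2.0/24 is reserved for documentation.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def host_is_authorized(client_ip: str) -> bool:
    """True if the client address belongs to one of the allowed networks."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in ALLOWED_NETWORKS)


print(host_is_authorized("192.0.2.17"), host_is_authorized("198.51.100.5"))
```

Unlike robots.txt, a check of this kind is made by the server on every
request, whether it comes from a robot or from a person with a browser.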

  >Yet there is a serious possibility that unsuspecting companies will
  >misuse the scheme for the second of these applications.

Really? That seems like a horrible misunderstanding of robots.txt. The
fault isn't with the standard, it's with the people misinterpreting it.

  >The whole matter is very surprising, as the risk seems rather obvious and it
  >is not hard to think of alternative techniques that would have avoided it.

Like what? Remember - it's got to be a non-interactive process that by default allows full access to everything.

From bertrand@eiffel.com Mon Jan 26 15:49:25 1998

Date: Mon, 26 Jan 98 15:49:21 PST
To: bertrand@eiffel.com, risko@csl.sri.com, thomaso@andromedia.com
Subject: RE: robots.txt: ``Here is what I am not telling you.''
Reply-To: Bertrand.Meyer@eiffel.com
Content-Length: 270
Status: RO
X-Lines: 10

If every facility was always used exactly as it is meant
to, computer risks and the volume of mishap reports
on comp.risks would be much lower. But that's not the
way things are. In the robots.txt case, the possibility
of misuse is quite serious.

Best regards,

-- BM

From johnl@iecc.com Mon Jan 26 17:08:55 1998

Date: 27 Jan 1998 01:07:29 -0000
To: bertrand@eiffel.com
Subject: Re: robots.txt: ``Here is what I am not telling you.''
Newsgroups: local.risks
Cc: risks@iecc.com
Content-Length: 1580
Status: RO
X-Lines: 33


  > Now look at the "Robot Exclusion Standard" (I think that's how it's
  > called) for Web sites. The need is clear: you may want to exclude
  > some of the pages on your Web site from consideration by the
  > indexing "robots" ...

I think what we have here is a more familiar risk, that of people
using technology they don't understand.

The robots.txt file is intended to warn web spiders away from parts of
a site that are accessible to the public but are not suitable for
indexing.  Most commonly it keeps the spiders away from the scripts in
/cgi-bin which are intended to be used to process data from forms and
won't do anything useful if walked by a spider.  It works just fine
for that purpose.

There are two reasonable ways to keep people away from stuff on your
web server that you want to keep private.  You can protect parts of
your site with passwords, or you can use hard-to-guess URLs where the
URL itself serves in effect as the password.  Web spiders don't have
extra-sensory perception, and the only way that they find pages is to
follow a link, either from a previously indexed page or from a URL
manually added to the spider's list.  If you have a site or sub-site
with a URL that doesn't occur anywhere else, and isn't one that's
searched automatically (index.html in the top-level directory, mainly)
no spiders will find it.
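
The hard-to-guess URL idea can be sketched in a few lines of Python with the
standard secrets module. The "/private/" prefix and the function name are
assumptions for illustration. The approach only holds up as long as the URL
is never linked or published anywhere.

```python
import secrets


def unguessable_path(prefix: str = "/private/", nbytes: int = 16) -> str:
    """Build a path whose last component is a random, URL-safe token."""
    return prefix + secrets.token_urlsafe(nbytes) + "/"


# Each call yields a different path; hand it only to the intended readers.
path = unguessable_path()
print(path)
```

Here the token plays the role of the password: anyone who learns the URL has
access, so it deserves the same care as a password.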

--
John R. Levine, IECC, POB 727, Trumansburg NY 14886 +1 607 387 6869
johnl@iecc.com, Village Trustee and Sewer Commissioner, http://iecc.com/johnl,
Finger for PGP key, f'print = 3A 5B D0 3F D9 A0 6A A4  2D AC 1E 9E A6 36 A3 47

From jem@stairways.com.au Mon Jan 26 17:40:33 1998


X-Sender: jem@proxy.peter.com.au
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 27 Jan 1998 09:33:18 +0800
To: Bertrand.Meyer@eiffel.com
Subject: re: robots.txt: ``Here is what I am not telling you.''
Content-Length: 1015
Status: RO
X-Lines: 25

  >The `robots.txt' scheme as it exists is acceptable if you simply want to
  >avoid having some of your Web information indexed by the search engines, for
  >example because it is in draft form or of time-limited value. But it is not
  >appropriate if your goal is to put on your Web site some secret information
  >that is only meant for some trusted partners. Yet there is a serious
  >possibility that unsuspecting companies will misuse the scheme for the
  >second of these applications.

If you are attempting to use the Robots.txt files to protect secrets then I
don't really think you understand what a Robots.txt file is for.  The
robots file prevents bots hammering your server into the ground.   It
doesn't stop people from looking through your site.

If you have information you wish to protect then use a password protected
web or FTP site.  With all due respect, only an idiot would try to
'protect' information using a robots.txt file.

Jeremy.

--
Jeremy Nelson.
Stairways Software (http://www.stairways.com/)

From jem@stairways.com.au Mon Jan 26 18:29:55 1998

X-Sender: jem@proxy.peter.com.au
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 27 Jan 1998 10:28:22 +0800
To: Bertrand.Meyer@eiffel.com
Subject: re: robots.txt: ``Here is what I am not telling you.''
Content-Length: 2924
Status: RO
X-Lines: 70

  >The robots.txt facility is poorly designed. It is meant
  >for programs, but draws the attention of *humans* on things
  >that are supposed to be withdrawn from attention.
  >There would have been much better ways to achieve
  >the desired result, e.g. putting the no-indexing request
  >in the files themselves.

The robots file is not designed to withdraw attention from web pages.  It
is designed to stop bots from hitting recursive loops, or exploring areas
of the web which are dynamic or generated on the fly.  It is designed to
counter some of the fundamental limitations of automated information
gathering.

To quote from the noRobots standard:
(http://info.webcrawler.com/mak/projects/robots/norobots.html)

  'In 1993 and 1994 there have been occasions where robots have visited WWW
  servers where they weren't welcome for various reasons. Sometimes these
  reasons were robot specific, e.g. certain robots swamped servers with
  rapid-fire requests, or retrieved the same files repeatedly. In other
  situations robots traversed parts of WWW servers that weren't suitable,
  e.g. very deep virtual trees, duplicated information, temporary
  information, or cgi-scripts with side-effects (such as voting).

  These incidents indicated the need for established mechanisms for WWW
  servers to indicate to robots which parts of their server should not be
  accessed. This standard addresses this need with an operational
  solution.'

No mention is made of any need to hide information from the public.  It is
there for technical reasons to prevent server loading.

I don't disagree that it is possible for people to misunderstand the
purpose and intent of the robots exclusion file and therefore try to 'hide'
secret information using a robots file.  But that doesn't mean that the
standard is broken, it simply means that the people don't understand what
the robot exclusion file is for.

There are a number of ways people can add security to 'secret' documents:

  Don't link it in to the site. If it isn't linked in then the robots can't find it.  And if the robots can't find it then you don't need to add it to the robots.txt file
  Add password security to those web pages
  Don't put it on the web.  Mail it to people
  Mail it to people in an encoded password protected manner
  Mail it to people using a cryptographic system


There are ways to add security and the Robots standard is not it.

I might add that even without the Robots.txt file pointing people to the
'juicy' bits of the web site, I would still expect anything linked in to a
web site to be found.   Humans are much better at trawling the net than
robots are.  If the web pages are there and linked in to the web, you
should have every expectation that people will find them quite
independently of the search engines, robots.txt files or any other
automated mechanism.

  >Best regards,
  > BM

--
Jeremy Nelson.
Stairways Software (http://www.stairways.com/)


From barmar@bbnplanet.com Mon Jan 26 21:40:07 1998

Date: Tue, 27 Jan 1998 00:38:44 -0500
To: bertrand@eiffel.com
Subject: Re: robots.txt: ``Here is what I am not telling you.''
Content-Length: 662
Status: RO
X-Lines: 14

Anyone who thinks robots.txt is some kind of security mechanism is clearly
deluded.  The name clearly indicates that it's just intended to stop
automated web crawlers.  It has no effect on normal browsing, so it can't
possibly be used to keep something secret.

Web crawlers don't do anything that an ordinary browser can't do; they
simply follow all the links on your web pages.  robots.txt is just a way to
tell them nicely "Don't bother with these directories, they're not
interesting for search engines."

--
Barry Margolin, barmar@bbnplanet.com
GTE Internetworking, Powered by BBN, Cambridge, MA
Support the anti-spam movement; see 

From jem@stairways.com.au Mon Jan 26 21:46:11 1998

X-Sender: jem@proxy.peter.com.au
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 27 Jan 1998 13:44:00 +0800
To: Bertrand.Meyer@eiffel.com
Subject: re: robots.txt: ``Here is what I am not telling you.''
Content-Length: 2352
Status: RO
X-Lines: 74

>Here is the content of `robots.txt' for a well-known
>hardware company:
>
>	User-agent: *
>	Disallow: /bug-navigator # Bug Data
>	Disallow: /warp/customer # Registered Users
>	Disallow: /kobayashi # Navigation for registered
>	Disallow: /cgi-bin # no programs
>	Disallow: /pcgi-bin # no programs
>	Disallow: /univ-src/ccden # will get content through /univercd
>	Disallow: /cpropub/univercd # obsolete
>
>To me some of this looks like information that the company
>wants to make accessible to some parties only.

That doesn't change the fact that the robots.txt file is not designed to be
a security feature.

The standard is not incorrect, this use of it is incorrect.  It is like
complaining that Eudora doesn't give you very good Web access.  You may be
able to bend it into retrieving HTML, but then complaining it doesn't do
the job very well is missing the point.

You have a valid warning, but the message is 'robots.txt files are not a
security measure' not 'the robots standard is broken'.

I must say though, there may be a good deal of ignorance out there.  Here
is the robots.txt file for Dell (in its entirety!):

# Please send an email to webmaster@dell.com with subject "Robot Request"

Say what?

I like IBM's robots.txt file:

#Format is:
#       User-agent: 
#       Disallow:  | 
# Flag  Date    By      Reason
# $l1-  950130  epc     finally understood what the file was for!
# $L2=	960909	epc	fixed url since mak moved to Webcrawler...
# $L3=	970811	epc	drop /Stretch/

In particular the first reason supplied 'finally understood what the file
was for!'. :)

Microsoft got it right.

Compaq, Netscape and Apple didn't have one. Digital didn't have one either,
although to be fair, www.altavista.digital.com did.

What a fun game.

Finally, the robots.txt file you listed above could still be entirely
legitimate.  The only one that is vaguely suspicious is the bug-navigator
entry and that could be innocent too.  I went looking to check and see what
was actually in that directory, but I couldn't find the manufacturer in
question.

And, of course, it is simply possible that the person who set up the
robots.txt file thought it was some sort of security measure.

>Best regards,
>
>-- BM

Jeremy.

--
Jeremy Nelson.
Stairways Software 



From adrianh@victoriareal.co.uk Tue Jan 27 01:23:51 1998

X-Sender: adrianh@seagulls.victoriareal.co.uk
Content-Type: text/plain; charset="us-ascii"
Date: Tue, 27 Jan 1998 09:17:35 +0000
To: risks@CSL.sri.com
Subject: robots.txt, potential mis-understanding
Cc: bertrand@eiffel.com
Content-Length: 1720
Status: RO
X-Lines: 42

I think the worries that Bertrand has about the robots.txt stem, in my
opinion, from an incorrect interpretation of its purpose.... which is the
real risk.

Bertrand's comments apply to robots.txt used as a "privacy" tool --- marking
information that should not be accessed by Jo Public.

However the reason the robot exclusion protocol was created was to prevent
web bots wandering into "dangerous" areas which are useless for them to
index. Anyone who is using robots.txt as a form of security is downright
foolish.... and hasn't read the original spec document...

----
In 1993 and 1994 there have been occasions where robots have visited WWW
servers where they weren't welcome for various reasons. Sometimes these
reasons were robot specific, e.g. certain robots swamped servers with
rapid-fire requests, or retrieved the same files repeatedly. In other
situations robots traversed parts of WWW servers that weren't suitable,
e.g. very deep virtual trees, duplicated information, temporary
information, or cgi-scripts with side-effects (such as voting).

These incidents indicated the need for established mechanisms for WWW
servers to indicate to robots which parts of their server should not be
accessed.
----

The real risk? The web has grown so quickly it is, in many cases, being run
by people who do not understand the technologies they are using. (I
obviously don't mean Bertrand here, but the people whose robots.txt files
he discovered.)

Cheers,

Adrian

PS Bertrand, love Eiffel :-)

----
Adrian Howard. adrianh@victoriareal.co.uk. Head Techie. Victoria Real Ltd
URL: http://www.victoriareal.co.uk/ v. +44 1273 774469 f. +44 1273 779960



From bruce.oneel@obs.unige.ch Tue Jan 27 02:52:15 1998

Date: Tue, 27 Jan 1998 11:50:36 +0100
Subject: Robots.txt
Sender: oneel@jedai.unige.ch
To: Bertrand.Meyer@eiffel.com
Cc: risks@CSL.sri.com
X-Mailer: Mozilla 4.04 [en] (X11; I; SunOS 5.5.1 sun4u)
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Length: 2105
Status: RO
X-Lines: 47

Hi,
  I enjoy your writing, and, I saw your recent article in RISKS 19.57
and had a few comments.

  Your comments are of course correct, but, I think that the robots.txt
was intended to be viewed more as "please don't go there" rather than
"you may not go there".  The difference being that areas excluded by
robots.txt are areas we'd prefer you not to index for assorted different
reasons, not, areas we don't want the public to see.  An example from a
previous job (http://heasarc.gsfc.nasa.gov) excluded the /FTP area.  The
/FTP partition was a set of 4 or 5 500 gigabyte to 1TeraByte optical
juke boxes whose directory graphs had cycles. We more than once had
indexing robots get caught in loops and/or seriously degrade the
performance of the jukeboxes by requesting too many files.

  In our case we didn't want this data indexed even though it was all
publicly available because:

1.  It was in a format useful to astronomers, not html.  Most indexing
robots wouldn't understand it anyway.
2.  We already had a custom search engine which understood the data from
an astronomy point of view.
3.  It was a large amount of data.  GSFC is connected by multiple T3
connections so the bandwidth wasn't a problem, rather, the jukeboxes
were slow.  Plus we were trying to help the search engines.  No point in
them copying terabytes of data when they didn't need to.

  If we found that some search engine was pounding on us we tried to
send them a note asking them not to.  If that didn't work we'd
eventually exclude them from accessing us, but, that was a last resort.
Most often we had no problem getting them to agree with us after a brief
note.

  As you state, robots.txt is a really really bad idea to exclude things
from public view.

Thanks!

bruce

--

Bruce O'Neel                       phone:  +41 22 950 91 22 (direct)
INTEGRAL Science Data Centre               +41 22 950 91 00 (switchb.)
Chemin d'Ecogia 16                 fax:    +41 22 950 91 33
CH-1290 VERSOIX                    e-mail: Bruce.Oneel@obs.unige.ch
Switzerland                        WWW:    http://obswww.unige.ch/isdc/

From rik.moonen@technopol.be Tue Jan 27 05:54:04 1998

To: 
Cc: 
Subject: robots.txt: ``Here is what I am not telling you.''
Date: Tue, 27 Jan 1998 14:51:04 +0100
Content-Type: multipart/alternative; boundary="----=_NextPart_000_0006_01BD2B32.FEF88160"
X-Priority: 3
X-Msmail-Priority: Normal
X-Mailer: Microsoft Outlook Express 4.71.1712.3
X-Mimeole: Produced By Microsoft MimeOLE V4.71.1712.3
Content-Length: 2154
Status: RO
X-Lines: 69

Hi,

I enjoy a lot every info in the Risk Digest Volumes.
Also your article about "robots.txt" (25th of January), is a good
illustration of our new technology risks.

But.... look what happens if you try:

http://catless.ncl.ac.uk/robots.txt

Seems like the Risk Digest server is a kind of...risky ?

Best regards

Rik





From Lindsay.Marshall@newcastle.ac.uk Tue Jan 27 05:59:45 1998

Date: Tue, 27 Jan 1998 13:54:39 +0000 (GMT)
Subject: Re: robots.txt: ``Here is what I am not telling you.''
To: rik.moonen@technopol.be
Cc: Bertrand.Meyer@eiffel.com
Content-Type: TEXT/plain; CHARSET=US-ASCII
Content-Length: 862
Status: RO
X-Lines: 26

On 27 Jan, Rik Moonen wrote:
> Hi,
>
> I enjoy a lot every info in the Risk Digest Volumes.
> Also your article about "robots.txt" (25th of January), is a good illustration of our new technology risks.
>
> But.... look what happens if you try:
>
> http://catless.ncl.ac.uk/robots.txt
>
> Seems like the Risk Digest server is a kind of...risky ?

I'm not sure why you think of the robots.txt on Catless as risky! It
doesn't reveal any secrets or anything that I am hiding! Everyone has a
bin directory and /Obituray/garden is the file space for the Virtual
memorial garden which I didn't want indexed. /Images is exactly what it
says - where I keep images and again I couldn't see any point in having
them indexed.

In fact, robots.txt also used to exclude the Risks pages - I can't think
why I took them out now!

L.
--
http://catless.ncl.ac.uk/Lindsay


From byte@lmn.pub.ro Tue Jan 27 07:45:44 1998

Subject: Re: robots.txt: ``Here is what I am not telling you.''
To: risko@csl.sri.com
Date: Tue, 27 Jan 1998 17:30:52 +0200 (EET)
Cc: bertrand@eiffel.com
X-Mailer: ELM [version 2.4ME+ PL15 (25)]
Content-Transfer-Encoding: 7bit
Content-Length: 1328
Content-Type: text/plain; charset=US-ASCII
Status: RO
X-Lines: 30


>
> The `robots.txt' scheme as it exists is acceptable if you simply want to
> avoid having some of your Web information indexed by the search engines, for
> example because it is in draft form or of time-limited value. But it is not
> appropriate if your goal is to put on your Web site some secret information
> that is only meant for some trusted partners. Yet there is a serious
> possibility that unsuspecting companies will misuse the scheme for the
> second of these applications.

Non-public information should be, well, not public. There are many ways
to ensure privacy for web sites, for the case you describe.

The robot can index a file only if it finds a link to it somewhere else,
if there is no link then it doesn't know of its presence (except for /).
So I could make a directory /foryou and email you the url, and the
robots will never find it (assuming there is an html page for /).

As an example, I used robots.txt to exclude a software archive on our web
site, because I noticed robots were happily downloading all that 700Mb of
binary data every time. I'm even curious what they have indexed there...

So I think robots.txt doesn't mean "here is what I don't want you to know",
but "don't bother looking there" instead.

Regards,
--
Laurentiu Badea
mailto:byte@lmn.pub.ro
http://www2.lmn.pub.ro/~byte/

From peter@taronga.com Tue Jan 27 08:49:55 1998

Date: Tue, 27 Jan 1998 10:10:36 -0600
To: Bertrand.Meyer@eiffel.com
Subject: robots.txt
Content-Length: 1083
Status: RO
X-Lines: 20

The optimal solution for keeping well-behaved robots out of a site would
be for the robots to include a tag with their requests indicating they're
a robot, and have the server deny requests that robots shouldn't be targeting.

The problem is that one of the design constraints on the mechanism was that
it work with existing servers, without requiring websites to modify their
code. Since the check has to be done by the server or the client, and there
were many servers and few robots involved, the decision was made to take
the easy way out and modify the clients.
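
For comparison, the server-side check described above might look roughly like
the Python sketch below. It assumes that robots identify themselves in the
User-Agent request header. The marker strings, the protected prefixes and the
function name are all hypothetical.

```python
# Hypothetical markers; a robot that hides its identity would not match.
KNOWN_ROBOT_MARKERS = ("bot", "crawler", "spider")
PROTECTED_PREFIXES = ("/test/", "/cgi-bin/")


def status_for(user_agent: str, path: str) -> int:
    """Return 403 for a self-identified robot on a protected path, else 200."""
    agent = user_agent.lower()
    is_robot = any(marker in agent for marker in KNOWN_ROBOT_MARKERS)
    is_protected = any(path.startswith(prefix) for prefix in PROTECTED_PREFIXES)
    return 403 if is_robot and is_protected else 200


print(status_for("ExampleBot/1.0", "/test/page.html"))
print(status_for("Mozilla/4.04 [en]", "/test/page.html"))
```

The list of protected areas stays on the server in this design, at the cost
of having to change server code or configuration.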

I personally have my servers detect and reroute some of the robots that I'm
aware of (mostly email address gatherers for spammers, sigh), but I also use
a robots.txt file to deny access to my whole "test" server. It works
reasonably well.

In any case, anything that requires privacy should be secured by passwords
and encryption, not by keeping it obscure. While the danger of people
misunderstanding robots.txt is real, I think you're too hard on the designers
given the restrictions they were operating under.


From KlausRusch@atmedia.net Tue Jan 27 15:02:22 1998

Date: Tue, 27 Jan 1998 23:09:25 CET
Reply-To: Klaus Johannes Rusch        
X-Url:     http://www.atmedia.net/KlausRusch/
To: risks@csl.sri.com
Cc: bertrand@eiffel.com (Bertrand Meyer)
Subject: Re: robots.txt: ``Here is what I am not telling you.''
References: <199801262248.OAA00524@chiron.csl.sri.com>
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8BIT
X-Mailer: OS/2 Warp/4.0 LaMail/2.3
Content-Length: 3583
Status: RO
X-Lines: 66

The Robots Exclusion Standard was created to keep robots from indexing content
areas which are likely to cause problems with robots, such as dynamically
generated content, include files, as well as content which is kept for archival
purposes only and should not show up in search results.

After all, robots.txt doesn't restrict, and was never intended to restrict,
access to documents but only keep them from being indexed. (There is, however,
the risk of webmasters assuming robots.txt files keep visitors off certain
areas.)

Klaus Johannes Rusch
--
KlausRusch@atmedia.net
http://www.atmedia.net/KlausRusch/

From cheyney@mindspring.com Tue Jan 27 21:44:26 1998

Sender: csc@mindspring.com
Date: Wed, 28 Jan 1998 06:00:00 +0000
X-Mailer: Mozilla 2.01 (X11; I; Linux 2.0.30 i486)
To: bertrand@eiffel.com
Subject: Re: robots.txt: ``Here is what I am not telling you.''
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Length: 1042
Status: RO
X-Lines: 23

Bertrand -

Anyone stupid enough to leave a network open and count on the optional
robots.txt robot exclusion de-facto standard for security gets (and should
get) what he deserves.

Between firewalling, IP filtering, httpd access control lists, password-
based server authentication, and SSL, you should have a fairly good means
of controlling access to information.  Smart people at companies with a
clue implement internal firewalls so marketing groups can't see what
development's thinking about (so overzealous clueless marketers can't
sell what doesn't exist) and so forth.

The only thing robots.txt is good for is telling annoying Web indexers
that obey the de-facto standard to quit hogging the site's bandwidth .....
it is not nor was ever meant to constitute any form of security.

chris
---
Chris Cheyney                      | "Where there is a way, there
Internet:  cheyney@mindspring.com  |  is a will to subvert it."
ICBM:     33 54'40" N/84 11'25" W  |____________________________________
CIA FBI NSA EPA IRS IOU M-O-U-S-E  |

From oa@spray.fi Wed Jan 28 04:43:35 1998

To: bertrand@eiffel.com (Bertrand Meyer)
Subject: Re: robots.txt: ``Here is what I am not telling you.''
Content-Type: text/plain; charset=US-ASCII
Date: 28 Jan 1998 12:52:02 +0200
Lines: 54
X-Mailer: Gnus v5.4.66/Emacs 19.34
Content-Length: 2806
Status: RO
X-Lines: 54

Hi,

Saw your article in the RISKS digest. I must disagree with your
conclusions. My reasons for this are pretty simple - the basis being:
anything put on a world-readable web server is public information.

Robots.txt was not designed to deny search robots access to secret
data. Anyone putting secret data in a place that could be accessed by
a casual browser or a search robot should be fired for criminal
negligence - robots simply follow link trails found by accessing the
one known page on a web server, namely the root page. If a robot could
get to the information, so can a human, and humans have no reason to
look at robots.txt. What use would it be to protect sensitive
information from Alta Vista if a casual browser could get to the
information anyway? It would only make finding the information
slightly more work.

You're not interpreting the meaning of robots.txt correctly. It's not
"here's what I'm not telling you", but "here's what you should not try
to look at, since you'll go into a loop if you do". This much is
explained in the robots.txt specification, I believe; unfortunately,
I'm unable to access it right now, as it seems something's wrong with
the connection to our ISP :(

The methods for putting secret or sensitive data on a web server can
not rely on robots.txt. Instead, they have to be dealt with using
domain name, IP address space, or user authentication restrictions.

For example, one of the web servers I run is an intranet site for our
company. That site is only accessible to the IP address space in our
internal network. Another site contains snapshots of the web site
projects we work on for our clients. Often, our involvement with these
projects is sensitive information at least until published. Yet we
have to give the client access to the development version so we can
show them things and they can track our progress. Thus, the server in
question contains password-protected directories for the projects, and
the root directory itself is also password protected (with no valid
users). The latter is to hide the names of the directories themselves;
no matter what query you make to the server, you'll always get the 401
not authorized error back to you. Even if you know a user/pass for one
directory, you'll still get 401 not authorized for anything that's not
below that directory. This makes it impossible to successfully mine
the root dir for client or project names.

This is how you protect sensitive information on an otherwise
world-readable web site. Robots.txt is a way to protect your web
server from being overloaded by a dumb robot in a cgi loop, not a
security tool. This much should be obvious to anyone capable of being in
charge of web site administration.

--
Life is anything that dies when you stomp on it.
Osma Ahvenlampi 
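
The access rules described in this message can be modelled roughly as
follows. This is an illustration added to the archive, not the author's
actual server configuration: the directory names, the credentials and the
function name are hypothetical, and a real server would not keep passwords
in plain text. The status code used is 401, the HTTP "Unauthorized" status.

```python
# Hypothetical credentials: one (user, password) pair per project directory.
# The root directory deliberately has no valid users at all.
PROJECT_USERS = {
    "/clients/alpha/": ("alpha-user", "alpha-pass"),
    "/clients/beta/": ("beta-user", "beta-pass"),
}


def auth_status(path, credentials=None):
    """Return the HTTP status for `path` given an optional (user, password)."""
    for prefix, valid in PROJECT_USERS.items():
        if path.startswith(prefix) and credentials == valid:
            return 200  # authenticated for this project directory only
    return 401  # everything else, including the root, is unauthorized
```

Because a nonexistent directory and a protected one produce the same answer,
probing the server reveals nothing about client or project names.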


From Peter Neumann, 26 Jan

Status: RO
X-Lines: 857

I cannot run all this.  Is there a possibility that you can summarize
the responses and send me a note for inclusion in RISKS?  You have been
cc:ed on some of them.

(Answer: yes -- BM.)

From: Thomas Andrews

Date: Mon, 26 Jan 1998 15:09:47 -0800
From: Thomas Andrews 
Organization: Andromedia, Inc.
X-Mailer: Mozilla 4.04 [en] (X11; I; SunOS 5.5.1 sun4u)
MIME-Version: 1.0
To: bertrand@eiffel.com, risko@csl.sri.com
Subject: RE: robots.txt: ``Here is what I am not telling you.''

The "robot exclusion standard" is not meant to hide information.
It is meant to say, "this is a black hole and it will be a real
pain for me and a waste of time for you to try to index all of
this."

For example, you have an online dictionary, but instead of using
CGI normally, you have CGI that looks like:

	/dictionary.cgi/look-up/antidisestablishmentarianism

On the definition page, you might have other links to related
words or links on words in the definition, and the robot could
basically walk your entire dictionary.

If you want to protect your site from prying eyes, it is best
to use normal web security measures.

=thomas
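
For the dictionary example above, the exclusion could look like the
following. This is an illustration added to the archive, not part of the
message: the robots.txt content and the agent name "ExampleBot" are
assumptions, and the check uses Python's standard urllib.robotparser module.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for the dictionary site described above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /dictionary.cgi/
"""


def may_fetch(path: str, agent: str = "ExampleBot") -> bool:
    """Ask the standard-library parser whether `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, path)
```

The parser only gives advice to a robot that chooses to ask. Nothing stops a
client from requesting the dictionary pages anyway, which is why the file
offers no protection from prying eyes.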

Date: Mon, 26 Jan 98 15:49:21 PST
Message-Id: <9801262349.AA03352@vienna.eiffel.com>
To: bertrand@eiffel.com, risko@csl.sri.com, thomaso@andromedia.com
Subject: RE: robots.txt: ``Here is what I am not telling you.''
From: Bertrand.Meyer@eiffel.com
Reply-To: Bertrand.Meyer@eiffel.com

If every facility were always used exactly as it is meant
to be, computer risks and the volume of mishap reports
on comp.risks would be much lower. But that's not the
way things are. In the robots.txt case, the possibility
of misuse is quite serious.

Best regards,

-- BM

From: Adrian Howard

Date: Tue, 27 Jan 1998 09:17:35 +0000
To: risks@csl.sri.com
From: Adrian Howard 
Subject: robots.txt, potential mis-understanding
Cc: bertrand@eiffel.com

I think the worries that Bertrand has about robots.txt stem from an
incorrect interpretation of its purpose.... which is the real risk.

Bertrand's comments apply to robots.txt used as a "privacy" tool --- marking
information that should not be accessed by Jo Public.

However the reason the robot exclusion protocol was created was to prevent
web bots wandering into "dangerous" areas which are useless for them to
index. Anyone who is using robots.txt as a form of security is downright
foolish.... and hasn't read the original spec document...

----
In 1993 and 1994 there have been occasions where robots have visited WWW
servers where they weren't welcome for various reasons. Sometimes these
reasons were robot specific, e.g. certain robots swamped servers with
rapid-fire requests, or retrieved the same files repeatedly. In other
situations robots traversed parts of WWW servers that weren't suitable,
e.g. very deep virtual trees, duplicated information, temporary
information, or cgi-scripts with side-effects (such as voting).

These incidents indicated the need for established mechanisms for WWW
servers to indicate to robots which parts of their server should not be
accessed.
----

The real risk? The web has grown so quickly it is, in many cases, being run
by people who do not understand the technologies they are using. (I
obviously don't mean Bertrand here, but the people whose robots.txt files
he discovered.)

Cheers,

Adrian

PS Bertrand, love Eiffel :-)

----
Adrian Howard. adrianh@victoriareal.co.uk. Head Techie. Victoria Real Ltd
URL: http://www.victoriareal.co.uk/ v. +44 1273 774469 f. +44 1273 779960




From: Tim Senior

From: Tim Senior 
To: "'risks@csl.sri.com'" 
Subject: Re: robots.txt: ``Here is what I am not telling you.''
Date: Tue, 27 Jan 1998 09:23:31 -0000

> This is not merely a hypothetical possibility. Just for fun I looked up
> `robots.txt' for the Web sites of four or five well-known IT companies;
> although regrettably I didn't find out any major scoop, I could see quite
> clearly some of the topics those companies do not want others to know they
> are working on.
>
> The whole matter is very surprising, as the risk seems rather obvious and it
> is not hard to think of alternative techniques that would have avoided it.

The RISKs here seem to be of people mis-using, or mis-understanding the
system (and how many times have we heard that?)  There is nothing
fundamentally wrong with the 'robots.txt' idea, but of course it is no
substitute for proper security permissions.  In any case, it is trivially
simple to 'hide' your secret projects using the scheme - just put
them all in an excluded directory called 'Private'.  Of course, it
would be even better to disallow read access to the directory, as
one would presume that even web spiders that ignore robots.txt
would be unable to index pages that they cannot see.

Tim Senior, Phonelink PLC

From Tom Guptill

Date: Tue, 27 Jan 1998 10:01:25 -0500
To: risks@csl.sri.com
From: Tom Guptill 
Subject: Re: robots.txt: ``Here is what I am not telling you.''


As far as I know, the 'robots.txt' standard was never *intended* to shield
confidential data from prying (or indexing) eyes.  It was intended to do
two things:

- protect low-end or overburdened web servers from excess load from
web-crawling robots

- protect web-crawling robots from getting "lost" in a virtual HTML maze of
dynamically generated information that goes on forever, or from indexing
things like "/usr/dict/words" (the UNIX list of all words which are part of
a given language) which would result in meaningless false "hits" in search
engines

One site I know of has (to this day, I believe) a section of their site
excluded within the "robots.txt" file.  That section consists simply of a
hidden script which returns "dummy" pages which contain URLs pointing
"deeper" into that area, and which return *more* dummy pages and so on.
It's sort of a "trap" for robots which fail to follow the standard.

Of course this isn't saying that there aren't sites which *rely* on this
standard to protect the security of their site, but that's a risk of "using
the software to do something it wasn't intended to do and (AFAIK) doesn't
claim to do" rather than any inherent flaw in the standard itself.

- Tom

--
Tom Guptill
UNIX Systems Manager
Department of Physics and Astronomy
University of Rochester
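
The "trap" described in this message can be sketched as follows. This is an
illustration added to the archive, not the code of the site mentioned: the
/trap/ prefix, the fan-out of two links per page and all function names are
hypothetical.

```python
# Hypothetical robot trap: every page under TRAP_PREFIX links only to
# deeper dummy pages, so a crawler that ignores robots.txt never runs out.
TRAP_PREFIX = "/trap/"

# What the site would publish so that well-behaved robots stay away.
ROBOTS_TXT = "User-agent: *\nDisallow: /trap/\n"


def trap_links(path, fanout=2):
    """Return the dummy links that a trap page at `path` would contain."""
    base = path if path.endswith("/") else path + "/"
    return [f"{base}{i}/" for i in range(fanout)]


def crawl_ignoring_robots(start, max_pages):
    """Simulate a badly behaved crawler; it stops only at `max_pages`."""
    queue, visited = [start], []
    while queue and len(visited) < max_pages:
        page = queue.pop(0)
        visited.append(page)
        queue.extend(trap_links(page))
    return visited
```

However large max_pages is, the simulated crawler uses all of it without
ever leaving the trap, while a robot that honours the exclusion file never
enters in the first place.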




From: David Sklar

Date: Tue, 27 Jan 1998 10:09:03 -0500
From: David Sklar 
Organization: Student.Net Technical Wormhole
X-Mailer: Mozilla 4.03 [en] (WinNT; I)
MIME-Version: 1.0
To: risks@csl.sri.com
Subject: Re: RISKS DIGEST 19.57 (robots.txt)
References: <199801262248.OAA00524@chiron.csl.sri.com>
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

This is a misleading characterization of robots.txt and the robot
exclusion standard. As per the spec at
	
the point of the robots.txt is not to mark particular areas of content
as "secret" but as "inappropriate to index" or "if accessed rapid-fire
by a program, will harm my server."

The real risk is that people misunderstand and consequently misuse
robots.txt to incorrectly protect documents as "secret".

-dave


> ------------------------------
>
> Date: Sun, 25 Jan 98 15:04:03 PST
> From: bertrand@eiffel.com (Bertrand Meyer)
> Subject: robots.txt: ``Here is what I am not telling you.''
>
> Imagine the head of a large corporation who tells a TV interviewer: "Here is
> the list of topics that I don't want anyone to know we are discussing with
> potential business partners: ...". Or, for that matter, a President who
> tells us "these are the young women I would hate you to know I had dealings
> with in the past few years: ...". Pretty dumb, wouldn't it be?
>
> Now look at the "Robot Exclusion Standard" (I think that's how it's called)
> for Web sites. The need is clear: you may want to exclude some of the pages
> on your Web site from consideration by the indexing "robots" -- Yahoo,
> AltaVista and the like.  The solution is, how should I say, interesting: you
> put at the top level of your site a file conventionally called `robots.txt',
> which lists the directories that should not be indexed; well-behaved robots
> will check it, and dutifully oblige.
>
> Now whoever thought up that scheme must have been very smart, but the
> smartness somehow eludes me. The file must obviously be world-readable, so
> anyone can go to the top level of a site and look up `robots.txt' with a
> plain browser. This is a good way to find out what the site owner doesn't
> want you to know about. You don't see what's in the secret directories, of
> course (well, assuming the Webmaster has done his job and made them
> non-world-readable) [*], but you see what the secret directories are, and
> just that can be quite valuable information.
>
> The `robots.txt' scheme as it exists is acceptable if you simply want to
> avoid having some of your Web information indexed by the search engines, for
> example because it is in draft form or of time-limited value. But it is not
> appropriate if your goal is to put on your Web site some secret information
> that is only meant for some trusted partners. Yet there is a serious
> possibility that unsuspecting companies will misuse the scheme for the
> second of these applications.
>
> This is not merely a hypothetical possibility. Just for fun I looked up
> `robots.txt' for the Web sites of four or five well-known IT companies;
> although regrettably I didn't find out any major scoop, I could see quite
> clearly some of the topics those companies do not want others to know they
> are working on.
>
> The whole matter is very surprising, as the risk seems rather obvious and it
> is not hard to think of alternative techniques that would have avoided it.
>
> Bertrand Meyer, Interactive Software Engineering Inc., makers of ISE Eiffel
> , http://eiffel.com
>
>   [* ... at least not without exploiting various security flaws.  PGN]
>


From: Wayne Mesard

Date: Tue, 27 Jan 1998 10:34:10 -0500
Message-Id: <199801271534.KAA06125@tofu.ironbridgenetworks.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
From: Wayne Mesard 
To: risks@csl.sri.com
Subject: robots.txt [19.57]
X-Mailer: VM 6.38 under Emacs 20.2.1

Bertrand Meyer :
> The `robots.txt' scheme...is not appropriate if your goal is to put on
> your Web site some secret information that is only meant for some
> trusted partners.

Nor was it intended for that purpose.

The only RISK here is the old one that security by obscurity can be
worse than no security at all.  Restricted information must be protected
by a restricted access mechanism.

Wayne();


From: Richard Cook

Date: Tue, 27 Jan 1998 18:06:55 -0600
To: risks@csl.sri.com
From: Richard Cook 
Subject: Re: robots.txt:`Here is what I am not telling you.' (RISKS 19.57)
Cc: dfaber@dacc.bsd.uchicago.edu

   My colleague and our Departmental computer guru, David Faber, laughed
when I passed along the item about robots.txt and security. He maintains our
Department website at http://dacc.uchicago.edu and writes:

>The robots exclusion standards have nothing to do with security; they are
>merely an efficiency enhancer. I tell robots to ignore things like
>demo pages and instructions which were included with software packages,
>things which just aren't useful to anyone except me but which I leave on
>the server. I also provide keywords and page descriptions that I want robots
>to include. The robots standard is meant as a courtesy, in hopes of reducing
>wasted bandwidth and search times for people. It helps make it possible to
>craft an efficient and useful site.
>
>Security is another matter entirely.

   I think his point is a good one. It seems that the real risk of this feature
is the same one attached to so much of the software and hardware now
available, namely that its inherent complexity makes it hard to know what
it is doing. But this is much more likely to be the case when software has
the characteristics of technology centered automation, i.e. that it is
silent, powerful, and hard to direct. It is not that the software is
failing, in some narrow engineering sense, but rather that it is failing in
the larger sense of not making it possible to work effectively because its
functionality is hidden or obscure. It is this obscurity that makes it
likely to be misused or misunderstood.
   Thus the presence of a feature specifically designed to make smooth,
efficient operation possible has no inherent deficiency but rather
characteristics that, in combination with a variety of other features of
the systems, lead to a situation possessing the potential for misuse. To
use the example described in the original RISKS posting, the fact that the
robots.txt area points to areas not intended for web viewing is only a
significant feature in combination with other characteristics. These
include the ability to view directory structures remotely, even though
specific files within them are not available to be read. But it also
includes the consequences of our common practice of naming files in ways
that reflect their contents. These largely arbitrary features of systems
are what make robots.txt a potentially problematic feature. We can easily
imagine a system in which the names of files were arbitrary (a quick look
in your web viewer's cache directory should show plenty of examples) but we
choose not to and this, coupled with the largely arbitrary rules about
reading, opening, listing directories, etc. produces the situation that
prompted the original message.
  The risk of robots.txt is, it seems to me, an emergent characteristic of a
whole bunch of factors, some narrowly technical (like trying to make web
service faster by telling robots about wasteful places to index), some
quite operational (like keeping sensitive and public data on the same
physical systems), and some almost intrinsically human (like the way we
assign names to things). It seems to me that the real RISK is that we will
not draw this larger conclusion from the experience with robots.txt.

Richard I. Cook, MD                              | tel: 1+773-702-5306 |
Cognitive Technologies Laboratory                | fax: 1+773-702-4791 |
Department of Anesthesia and Critical Care       -----------------------
University of Chicago; 5841 S. Maryland Ave., MC 4028; Chicago, IL 60637




From: Marc Horowitz

To: risks@csl.sri.com
cc: bertrand@eiffel.com
Subject: Re: robots.txt: ``Here is what I am not telling you.''
From: Marc Horowitz 
Date: 27 Jan 1998 22:57:16 -0500
In-Reply-To: bertrand@eiffel.com's message of Sun, 25 Jan 98 15:04:03 PST
Message-ID: 
Lines: 10
X-Mailer: Gnus v5.3/Emacs 19.34

>> The whole matter is very surprising, as the risk seems rather
>> obvious and it is not hard to think of alternative techniques that
>> would have avoided it.

I can think of an alternative technique: Don't put sensitive
information where arbitrary anonymous people (or programs) can see it.
Passwords and firewalls aren't very good security tools (crypto is
better), but they sure beat leaving stuff out in the open.

		Marc


From dsouth@darwin.helios.nd.edu Wed Jan 28 20:07:10 1998
Subject: Re: robots.txt
To: risks@CSL.sri.com
Date: Wed, 28 Jan 1998 23:05:39 -0500 (EST)
Cc: bertrand@eiffel.com
X-Mailer: ELM [version 2.4 PL25]
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 3109
Status: RO
X-Lines: 65


bertrand@eiffel.com wrote in RISKS DIGEST 19.57:

>Now look at the "Robot Exclusion Standard"

For those who have not seen it, try this URL:
  http://info.webcrawler.com/mak/projects/robots/norobots.html

> Now whoever thought up that scheme must have been very smart, but the
> smartness somehow eludes me. The file must obviously be world-readable, so
> anyone can go to the top level of a site and look up `robots.txt' with a
> plain browser. This is a good way to find out what the site owner doesn't
> want you to know about.

Ignoring the ad hominem attack on those who thought up the standard for
the moment, the author seems to be confusing two issues:

 1)  Security through obscurity
 2)  Preventing unwelcome _automated_ retrieval.

The first issue should be familiar to RISKS readers as something that
doesn't work, isn't desirable, and should be avoided.  If the voluntary
Standard for Robot Exclusion was designed for this purpose, or used
for this purpose, I would agree that it was worthless junk.  But that
is not why the standard exists.

To quote the aforementioned web page:

  ``there have been occasions where robots have visited WWW servers
  where they weren't welcome for various reasons. Sometimes these
  reasons were robot specific, e.g. certain robots swamped servers
  with rapid-fire requests, or retrieved the same files repeatedly.
  In other situations robots traversed parts of WWW servers that weren't
  suitable, e.g. very deep virtual trees, duplicated information,
  temporary information, or cgi-scripts with side-effects (such as
  voting).''

In short, robots.txt is a way to mark pages ``people only''.

I use robots.txt to keep robots from indexing the ``dynaweb''
(dynamically translated SGML) manuals provided by vendors and
available on several of my servers.  If I had excluded robots by
restricting off-campus access to the dynaweb pages, it would
also have prevented humans from accessing them.  Since many of the
locally generated pages have links to the dynaweb manuals, those
pages would be of less utility to off-campus visitors.

Since many sites (including the vendor's) serve the same dynaweb
manuals, I doubt that the web is mourning the loss of my copies from
the search engines.  Because the locally-created pages are still indexed
by robots, visitors can still find and access the content that
is unique to my servers.  Keeping robots out of ``dynaweb'' spares me
the performance hit incurred when search engines were retrieving 1000's
of dynamically created pages every week.

By creating a standard to exclude only the robots, the robot authors
have made it easier to preserve universal information access for the
humans.  It fixes the RISK of creating ``people only'' content in a
medium (HTTP) that's accessed by both humans and machines.


/* Dale Southard Jr.   [http://www.nd.edu/~dsouth]   AFF/I  SL/I  T/I */
/* Science Computing Associate,  [pgp on www page]     S&TA  D-11216  */
/* University of Notre Dame, 202B NSH            Sr. Rigger  NCB#194  */
/* southard.1@nd.edu   219/631-7326         "I'd rather be skydiving" */

From harald.justen@eifel.com Thu Jan 29 06:43:30 1998

Date: Thu, 29 Jan 1998 15:40:34 -0800
Reply-To: harald.justen@eifel.com
X-Mailer: Mozilla 3.01Gold [de] (WinNT; I)
To: bertrand@eiffel.com
Subject: [Fwd: robots.txt]
Content-Type: message/rfc822
Content-Transfer-Encoding: 7bit
Content-Disposition: inline
Content-Length: 942
Status: RO
X-Lines: 25

Received: from maxwel.ph.kcl.ac.uk
	by mail.rapidsite.net (8.8.5/8.8.5) with SMTP id LAA20542
	for ; Tue, 27 Jan 1998 11:44:23 -0500 (EST)
Received: by maxwell.ph.kcl.ac.uk (MX V4.1 AXP) id 20; Tue, 27 Jan 1998
          16:43:31 +0000
Date: Tue, 27 Jan 1998 16:43:30 +0000
From: Nigel Arnot 
To: BERTRAND@EIFEL.COM
Message-ID: <009C0ECB.2943DF04.20@maxwell.ph.kcl.ac.uk>
Subject: robots.txt


You misunderstand it. The purpose isn't security. If you don't want
something accessed from outside, you protect it.

The purpose is to keep robots out of things like dynamically created
pages. It's quite possible that if they don't, they'll basically start
trying to catalogue infinity -- your server will be busy for ever
generating new pages, and the robot will be busy for ever trying to
analyze or remember them.

This is also why robot writers usually honour robots.txt.

		Yours,    Nigel Arnot.


From risko@chiron.csl.sri.com Thu Jan 29 13:40:31 1998

Date: Thu, 29 Jan 98 13:39:07 PST
To: bertrand@eiffel.com (Bertrand Meyer)
Subject: More robots
Content-Length: 10723
Status: RO
X-Lines: 232


29-Jan-98 15:50:35-GMT,2816;000000000000
Received: (from server@localhost)
	by csla.csl.sri.com (8.8.7/8.8.7) id HAA04501;
	Thu, 29 Jan 1998 07:50:34 -0800 (PST)
Date: Thu, 29 Jan 1998 07:50:34 -0800 (PST)
Message-Id: <199801291550.HAA04501@csla.csl.sri.com>
X-Authentication-Warning: csla.csl.sri.com: server set sender to owner-risks using -f
To: owner-risks@csl.sri.com
From: Simon Wilkinson 
Subject: Re: robots.txt: ``Here is what I am not telling you.''

>From risks-owner  Thu Jan 29 07:50:31 1998
Received: from tardis.ed.ac.uk (brigadier.tardis.ed.ac.uk [193.62.81.14])
	by csla.csl.sri.com (8.8.7/8.8.7) with ESMTP id HAA04467
	for ; Thu, 29 Jan 1998 07:44:03 -0800 (PST)
Received: from tardis.tardis.ed.ac.uk (sxw@tardis.tardis.ed.ac.uk [193.62.81.1])
	by tardis.ed.ac.uk (8.8.7/8.8.7/TardisMailhub) with ESMTP id OAA29354
	for ; Thu, 29 Jan 1998 14:58:42 GMT
Received: (from sxw@localhost)
	by tardis.tardis.ed.ac.uk (8.8.7/8.8.7/TardisClientv2) id OAA17491
	for risks@CSL.sri.com; Thu, 29 Jan 1998 14:58:41 GMT
Date: Thu, 29 Jan 1998 14:58:41 GMT
Message-Id: <199801291458.OAA17491@tardis.tardis.ed.ac.uk>
From: Simon Wilkinson 
Subject: Re: robots.txt: ``Here is what I am not telling you.''
To: risks@csl.sri.com

I fear that Bertrand Meyer is mistaking the purpose of the robots.txt
file.  It is not there to restrict access to files on a site, or to
protect secret or sensitive data. Other methods exist within the HTTP
protocol to fulfill this function. Instead, the robots.txt file's sole
purpose is to prevent web "robots" (that is, automated agents) from
fetching those pages from your server. The directories that should not
be indexed are unlikely to be those containing sensitive data, more
likely ones which contain data which changes rapidly or that is
automatically generated by CGIs. So, the robots.txt file is less
"Here is what I am not telling you" and more "Please keep off the
grass".

Yes, anyone can go to the top level of a site and read the robots.txt
file.  However - there is no point in listing in it directories which
are not world-readable, as the robot would be unable to fetch them
anyway. The robots.txt standard was never designed to provide security
for data, just to act as a suggestion to robots of those files which
they should not fetch.

I would suggest that the risk is more with web server admins using
inappropriate technology to limit the visibility of sensitive documents
- which are in any case world readable, than with any design flaw in
the robots.txt file.

I would be interested to hear of "alternative technologies that would
have avoided [the risk]" which require no reprogramming of web servers
or alteration of protocols, and which can be simply deployed on all
web servers currently active.

Simon Wilkinson



29-Jan-98 17:24:26-GMT,3551;000000000000
Received: (from server@localhost)
	by csla.csl.sri.com (8.8.7/8.8.7) id JAA05209;
	Thu, 29 Jan 1998 09:24:25 -0800 (PST)
Date: Thu, 29 Jan 1998 09:24:25 -0800 (PST)
Message-Id: <199801291724.JAA05209@csla.csl.sri.com>
X-Authentication-Warning: csla.csl.sri.com: server set sender to owner-risks using -f
To: owner-risks@csl.sri.com
From: Joshua Cope 
Subject: Re: robots.txt (Digest 19.57)

>From risks-owner  Thu Jan 29 09:24:23 1998
Received: from mail11.digital.com (mail11.digital.com [192.208.46.10])
	by csla.csl.sri.com (8.8.7/8.8.7) with ESMTP id JAA05204
	for ; Thu, 29 Jan 1998 09:24:20 -0800 (PST)
Received: from DELPHI (delphi.zko.dec.com [16.32.0.6])
	by mail11.digital.com (8.8.8/8.8.8/WV1.0c) with SMTP id MAA11685
	for ; Thu, 29 Jan 1998 12:29:01 -0500 (EST)
Received: by delphi.zko.dec.com (UCX V4.2-21, OpenVMS V7.1 VAX);
	Thu, 29 Jan 1998 12:26:00 -0500
Message-ID: <34D0BC30.3858@star.enet.dec.com>
Date: Thu, 29 Jan 1998 12:28:16 -0500
From: Joshua Cope 
Reply-To: cope@star.enet.dec.com
Organization: Digital Equipment Corporation
X-Mailer: Mozilla 3.02 (WinNT; U)
MIME-Version: 1.0
To: risks@csl.sri.com
Subject: Re: robots.txt (Digest 19.57)
References: 
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit

> Now look at the "Robot Exclusion Standard" (I think that's how it's called)
> for Web sites. The need is clear: you may want to exclude some of the pages
> on your Web site from consideration by the indexing "robots" -- Yahoo,
> AltaVista and the like.  The solution is, how should I say, interesting: you
> put at the top level of your site a file conventionally called `robots.txt',
> which lists the directories that should not be indexed; well-behaved robots
> will check it, and dutifully oblige.

The Robot Exclusion Protocol is not the solution to the problem Mr. Meyer
mentioned - that of posting data which the author wishes to keep secret
by obscurity, without actually applying access controls. The solution to
that problem is to *not* make any links to the page, and to notify the
other parties of its location via e-mail or some other method. (This
assumes, of course, that the other party knows not to make any links to it
from publicly-visible pages as well!) If there are no links to the page,
the robots will not find it in the first place - no ROBOTS.TXT needed.

ROBOTS.TXT is useful as a way to keep robots from indexing pages which
change quickly over time ("Today's news"), pages with dynamic content or
which take up a lot of resources (/cgi-bin/*), or have recursive paths
("C:\mydir\..\mydir\..\mydir\..\mydir\..\mydir\myfile.txt"). But as the
author notes, it's a lousy security device.

As a sidenote, the RISK mentioned is acknowledged in the robots.txt RFC;
see http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html
for a copy. Part 6 states (note the last sentence):

|   Web site administrators must realise this method is voluntary, and
|   is not sufficient to guarantee some robots will not visit restricted
|   parts of the URL space. Failure to use proper authentication or other
|   restriction may result in exposure of restricted information. It even
|   possible that the occurence of paths in the /robots.txt file may
|   expose the existence of resources not otherwise linked to on the
|   site, which may aid people guessing for URLs.

	Joshua Cope
	(not speaking for)
	Digital Equipment Corporation


From jbash@cisco.com Fri Jan 30 11:29:06 1998

To: bertrand@eiffel.com
Subject: Re: RISKS DIGEST 19.58
Date: Fri, 30 Jan 1998 11:26:33 -0800
Content-Length: 4630
Status: RO
X-Lines: 112

Your conclusion may or may not be right, but your examples don't support it.

> 2. For anyone who thinks this is just a hypothetical possibility, here is
> the robots.txt file of the site of a major communications company:
>
>  robots.txt
>
> 	User-agent: *
> 	Disallow: /bug-navigator # Bug Data
> 	Disallow: /warp/customer # Registered Users
> 	Disallow: /kobayashi # Navigation for registered
> 	Disallow: /cgi-bin # no programs
> 	Disallow: /pcgi-bin # no programs
> 	Disallow: /univ-src/ccden # will get content through /univercd
> 	Disallow: /cpropub/univercd # obsolete
>
> The first two lines at least suggest to me that this is stuff that the
> company doesn't want publicized -- for security reasons, not because it is
> of temporary value.  Were I a "hacker" in the bad sense of the term, I would
> revel in such information, as it would direct my efforts to the really juicy
> bits.

Well, good luck. The customers-only information is password-protected, and
always has been. Of course, most of it is information that we give out to
all our thousands of customers, press on thousands of CD-ROMs, and print in
thousands of paper manuals, anyway... our goal isn't to keep it secret, but
to reduce the overall load presented to our system by people who haven't
paid for support.

Actually, /bug-navigator doesn't exist any more, so you've caught us in a
minor file maintenance error. The bug navigator has moved to a new
location, down under /warp/customer somewhere. I imagine it was probably
initially listed because it's a big, weird database, and I wouldn't be
surprised if it contains an infinite virtual URL space.

/warp/customer really doesn't need to be in the list, either, since it's
all password protected. It may be that some of the navigational structure
used to be unprotected... which would mean we'd want to keep it from
being indexed, not for security reasons, but to keep people from
finding themselves on pages whose links all led to things they couldn't
read.

I've never quite figured out what /kobayashi does, but I think it's
similar... a navigational system that may point to things people
can't read.

> Here is an extract from another page -- I'll let you guess the URL:
>
> 	# o Created this file to prevent indexing of one
> 	#   SME directory.
>
> 	User-agent: *
>
> 	Disallow: /sparc/SPARCengineUltraAX/oem/
> 	Disallow: /microelectronics/SPARCengineUltraAX/oem/
> 	Disallow: /javachip/SPARCengineUltraAX/oem/
> 	Disallow: /javachips/SPARCengineUltraAX/oem/

These all return "not found"...

> 	Disallow: /sparc/SPARCengineUltraAX/download/
> 	Disallow: /microelectronics/SPARCengineUltraAX/download/
> 	Disallow: /javachip/SPARCengineUltraAX/download/
> 	Disallow: /javachips/SPARCengineUltraAX/download/

These kick me off.

Looks like another case of somebody doing "suspenders and belt", or
trying to keep a search engine from doing something stupid, like
diving infinitely deep into the "You got here by mistake" pages
that the latter set of URLs return.

> I can't say for sure, but doesn't some of this look a tad like
> proprietary information?

Yep. That's probably why their server won't give it to me.

> 4. Of course designers cannot always be blamed for misuses of their
> mechanisms. But they should minimize the possibility of misuses. In the
> robots.txt case it seems to me rather wrong to have a conspicuous
> world-readable file that draws attention to *excluded* information.

It's not perfect, but it works. Certainly I'd prefer something that let
me tell search engines exactly how aggressive they should be about
indexing various parts of my URL space.

Remember, however, that there's no need to list something in "robots.txt"
unless you've *already* called attention to it by having a link to it
somewhere on the Web. Search engines don't find things by magic; they find
them by following links.

Certainly I'd prefer a standard that at least let me

> I think that a more
> effective convention would have been to include a special marker (META tag?)
> in HTML files that shouldn't be indexed, and a special file (exclude.txt?)
> in the directories that should not be explored at all.

The risk there is that putting META tags in thousands of files is
an administrative nightmare, and wouldn't happen.

How about an "include.txt" file that, if it exists, explicitly lists
the things to be *included*? Of course, it's way too late to change
to that now.

> Then you would only
> be able to find that information if you already knew where to look.

That's already the case... nothing will index files unless there's a link
to them somewhere.

					-- John B.
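
As an illustration of how a well-behaved robot might apply the exclusion
list quoted in the message above, here is a minimal sketch using the
robot-exclusion parser from Python's standard library. The host name
www.example.com and the robot name "ExampleBot" are placeholders, not
taken from the discussion, and the sketch says nothing about how any
particular search engine is implemented.

```python
from urllib.robotparser import RobotFileParser

# Rules copied from the robots.txt quoted in the message above.
ROBOTS_TXT = """\
User-agent: *
Disallow: /bug-navigator # Bug Data
Disallow: /warp/customer # Registered Users
Disallow: /kobayashi # Navigation for registered
Disallow: /cgi-bin # no programs
Disallow: /pcgi-bin # no programs
Disallow: /univ-src/ccden # will get content through /univercd
Disallow: /cpropub/univercd # obsolete
"""


def build_parser(text):
    """Parse robots.txt text without touching the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def may_fetch(parser, path, agent="ExampleBot"):
    """Return True if a robot named `agent` (hypothetical) may fetch `path`."""
    # www.example.com is a placeholder host; only the path is matched.
    return parser.can_fetch(agent, "http://www.example.com" + path)


if __name__ == "__main__":
    rules = build_parser(ROBOTS_TXT)
    for path in ("/index.html", "/cgi-bin/search", "/warp/customer/docs"):
        print(path, "allowed" if may_fetch(rules, path) else "excluded")
```

The check is purely advisory: nothing stops a program from skipping it,
which is why the correspondents insist that real protection has to come
from the server.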

From gfischer@hub.org Fri Jan 30 11:31:03 1998

Date: Fri, 30 Jan 1998 14:29:29 -0500 (EST)
To: Bertrand.Meyer@eiffel.com
Subject: RISKS DIGEST 19.58 (fwd)
Content-Type: TEXT/PLAIN; charset=US-ASCII
Content-Transfer-Encoding: QUOTED-PRINTABLE
Content-Length: 30684
Status: RO
X-Lines: 758

Robots.txt is by definition used on publicly available data.

--

Grant Fischer                        (E-mail: gfischer at hub.org)


From dmckeon@swcp.com Fri Jan 30 10:26:46 1998

X-Copyright: Copyright 1998 by Denis McKeon
References:  
X-Original-Newsgroups: comp.risks
Date: Fri, 30 Jan 1998 11:25:15 -0700
X-Mailer: Mail User's Shell (7.2.5 10/14/92)
To: bertrand@eiffel.com
Subject: Re: robots.txt (Meyer, RISKS-19.57)
Content-Length: 1359
Status: RO
X-Lines: 41

In  (Bertrand Meyer) wrote:
>RISKS-LIST: Risks-Forum Digest  Friday 30 January 1998  Volume 19 : Issue 58
...
>Date: Fri, 30 Jan 98 00:23:26 PST
>From: bertrand@eiffel.com (Bertrand Meyer)
>Subject: Re: robots.txt (Meyer, RISKS-19.57)
...
>I think that a more
>effective convention would have been to include a special marker (META tag?)
>in HTML files that shouldn't be indexed, and a special file (exclude.txt?)

I believe that this is widely used, but I don't know if it has been
incorporated into any HTTP standard:

    

Password protection or restriction of access to certain ranges of
IP addresses by the server with a .htaccess file or similar is another
option, and I would expect companies to routinely favor such protection
or better over use of /robots.txt or NOINDEX.

The problem seems to parallel the patent/trade secret situation -
how can one make information available (to all, to a few?) and
still expect to protect their interests in it?

If you are interested, as of 12/96 there was a Web robots mailing list at:

    To: robots-request@webcrawler.com

	subscribe

run by Martijn Koster,

    Email: m.koster@webcrawler.com
    WWW: http://info.webcrawler.com/mak/mak.html

which had frequent discussions of /robots.txt issues.


--
Denis McKeon

From scottm@rain.kcls.org Fri Jan 30 14:33:26 1998

Date: Fri, 30 Jan 1998 14:31:54 -0800 (PST)
To: bertrand@eiffel.com
Subject: robots.txt
X-No-Archive: yes
Content-Type: TEXT/PLAIN; charset=US-ASCII
Content-Length: 709
X-Lines: 18
Status: RO

I feel compelled to point out that in your example robots.txt files, they
all seem to represent data that changes often (preventing outdated info
from being indexed), is password protected (why index a page giving login
instructions?), or just wouldn't be useful to go directly to (like cgi's).
These are both heavily hit sites, and the 404's alone would be reason
enough to generate robots.txt files.

Anyone who is really concerned about keeping something private is going to
password protect it.

However, pointing this out is not a Bad Thing(tm).

Just my 2 cents, though I'm sure you've got at least a buck or two.  ;)

- Scott McDermott
- Systems & Network Administrator
- King County Library System


From fritz@hsc.vcu.edu Fri Jan 30 14:42:03 1998

Date: Fri, 30 Jan 1998 17:43:35 -0500
X-Mailer: Mozilla 4.04 [en] (Win95; I)
To: risks@CSL.sri.com
Cc: Bertrand Meyer 
Subject: Re: robots.txt: ``Here is what I am not telling you.''
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Length: 1516
X-Lines: 26
Status: RO

Bertrand Meyer complains that the robots.txt scheme "is not appropriate
if your goal is to put on your Web site some secret information that is
only meant for some trusted partners."  True - but it was never intended
for that purpose.  Its primary usefulness is to list those areas of a
web site that are expensive or impossible to index.  Consider a program
that generates web pages on the fly, and keeps track of such pages by
inserting an identifier, unique to the page, in each link.  A naive
robot trying to index that page and its descendants will go into an
infinite loop.  Non-naive robots have various tricks to detect such
situations, and as a last resort will just stop trying - but it is still
in the interest of both the web server and the indexing robot to stop
the loop as early as possible.  A better slogan for the scheme might be:

    robots.txt: "Please tour my house - except for the west wing,
                which has infinitely many rooms."

Admittedly, there is now a risk that naive webmasters will attempt to
use robots.txt as a security measure.  I'd be interested to hear about
simple alternative techniques that would have protected the naive.  When
he invented the scheme, Martijn Koster needed to convince the writers of
(then mostly experimental) robots visiting his web site to adopt it.
Given the nature of the problem, it was the least skillful of the robot
writers who needed it most -- so simplicity was perhaps an even greater
virtue in this case than it usually is.

-Fritz
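
The infinite loop described above, and the "last resort" of giving up,
can be sketched as follows. The site is simulated: links_on is a
hypothetical stand-in for fetching a page and extracting its links, and
the depth limit and page budget are illustrative values, not figures
from any real robot.

```python
from collections import deque


def links_on(url):
    """Hypothetical site: every page links to one fresh, uniquely tagged page."""
    # /page?id=0 -> /page?id=1 -> /page?id=2 -> ... without end
    page_id = int(url.rsplit("=", 1)[1])
    return ["/page?id=%d" % (page_id + 1)]


def crawl(start, max_depth=5, max_pages=100):
    """Breadth-first crawl that gives up at a depth limit or a page budget."""
    seen = {start}
    queue = deque([(start, 0)])
    fetched = []
    while queue and len(fetched) < max_pages:
        url, depth = queue.popleft()
        fetched.append(url)
        if depth == max_depth:
            continue  # stop following links below this depth
        for link in links_on(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return fetched


if __name__ == "__main__":
    print(len(crawl("/page?id=0")), "pages fetched before giving up")
```

Without the two limits the loop would never end, because every page
yields a URL the robot has not seen before. An entry in robots.txt lets
the site owner stop the robot before the first such page is fetched.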

From jfritz@erols.com Fri Jan 30 23:14:53 1998

Date: Sat, 31 Jan 1998 02:18:35 -0500
X-Mailer: Mozilla 4.04 [en] (Win95; I)
To: risks@CSL.sri.com
Cc: bertrand@eiffel.com
Subject: Re: robots.txt
Content-Type: text/plain; charset=us-ascii
Content-Transfer-Encoding: 7bit
Content-Length: 799
Status: RO
X-Lines: 19

Bertrand Meyer said:

> I think that a more effective convention would have been to include a
> special marker (META tag?) in HTML files that shouldn't be indexed,
> and a special file (exclude.txt?) in the directories that should not
> be explored at all.

Unfortunately many CGI programs use the trailing portion of the URL as
input.  In http://www.yoyodyne.com/foo/bar/baz/, for instance, /foo/bar
may be a program, and "/baz/" a parameter passed to it by the web
server.  To guarantee that a request for /foo/bar/baz/exclude.txt was
handled correctly would require modifying /foo/bar -- and similarly for
every other CGI program.

A version of the META tag idea is supported by a few robots -- but its
use for pages generated by a CGI program would again require modifying
the program.

-Fritz
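
The difficulty with a per-directory exclude.txt can be sketched like
this. The table of CGI programs and the function resolve are
hypothetical; real servers differ in how they are configured, but the
splitting of a URL into a program name and a trailing parameter follows
the description above.

```python
# Hypothetical server configuration: URL prefixes that are CGI programs.
CGI_PROGRAMS = ("/foo/bar",)


def resolve(url_path):
    """Split a request path into (script, extra path) as a CGI-style server might.

    Returns (None, url_path) when no configured program matches, meaning the
    path would be looked up as an ordinary file.
    """
    for script in CGI_PROGRAMS:
        if url_path == script or url_path.startswith(script + "/"):
            return script, url_path[len(script):]
    return None, url_path


if __name__ == "__main__":
    print(resolve("/foo/bar/baz/exclude.txt"))
    print(resolve("/docs/exclude.txt"))
```

In the first case the request never reaches a file called exclude.txt:
the program /foo/bar receives "/baz/exclude.txt" as its parameter and
would have to be modified to answer sensibly.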

From bear@dcs.warwick.ac.uk Sat Jan 31 05:07:15 1998

Subject: META & robots
To: bertrand@eiffel.com
Date: Sat, 31 Jan 1998 13:05:24 +0000 (GMT)
X-Url: http://www.dcs.warwick.ac.uk/~bear/
X-Mailer: ELM [version 2.4ME+ PL35 (25)]
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Content-Length: 1431
Status: RO
X-Lines: 31

Hi.
I read the comp.risks newsgroup to get the digests - I'm not on the list
itself, and I just read:
-----------------------------< cut here >-----------------------------
Date: Fri, 30 Jan 98 00:23:26 PST
From: bertrand@eiffel.com (Bertrand Meyer)
Subject: Re: robots.txt (Meyer, RISKS-19.57)
[...]
3. So even if the respondents are right that it is "stupid" to use
robots.txt in that way, my posting at least draws attention to the risk.  If
it succeeds in making just one Webmaster a bit more careful, it will not
have been useless.
[...]
attention to what should not attract attention. I think that a more
effective convention would have been to include a special marker (META tag?)
in HTML files that shouldn't be indexed, and a special file (exclude.txt?)
-----------------------------< cut here >-----------------------------

I seem to recall reading something in the Apache docs addressing
inappropriate use of robots.txt.  And with regard to the META tag -
there is one.  I believe that this is also addressed in the Apache
docs.  Many of my pages have
  
in their head-section.

Hope this helps.
--
Phil Pennock ; GCS d- H+ s+:+ g-(+) p3 !au a22 w+++ v+@ C++(++++)
UL++++/S+++/H+ P++@ L+++ E-@ W(++) N++ o !K w--- O+@ M !V !PS PE Y+
PGP+ t-- 5++ X+ R tv- b++>+++ DI++ D+ G+ e+ u+ h* f !r n+(-)@ !y+
PGP info: send mail, subject "send pgp " and "fingerprint" or "pubkey"
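
For readers unfamiliar with the META convention mentioned above, here
is a minimal sketch of how a robot might look for it. The element shown
in the usage example is the form described in the notes to the HTML 4.0
specification; it is an assumption, not necessarily the exact tag the
correspondent uses on his pages. The class and function names are
hypothetical.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives of any robots META element in a page."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            for word in content.split(","):
                if word.strip():
                    self.directives.add(word.strip().lower())


def may_index(html_text):
    """Return False if the page asks robots not to index it."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noindex" not in parser.directives


if __name__ == "__main__":
    page = ('<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">'
            '</head><body>draft</body></html>')
    print(may_index(page))
```

Note that the robot has to retrieve the page before it can read the
tag, so this convention limits indexing but not the load on the server.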

From emergent@cape.com Sat Jan 31 09:28:41 1998

To: 
Cc: 
Subject: robots.txt and the nature of a risk
Date: Sat, 31 Jan 1998 12:26:54 -0500
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 7bit
X-Priority: 3
X-Msmail-Priority: Normal
X-Mailer: Microsoft Outlook Express 4.72.2106.4
X-Mimeole: Produced By Microsoft MimeOLE V4.72.2106.4
Content-Length: 1647
Status: RO
X-Lines: 36

I agree that many mechanisms are misused, and that it may
be a good idea to consider possible misuse when designing
something.  I even agree that robots.txt is being misused as
a security mechanism.  However, I think the responsibility lies
squarely on the shoulders of the webmaster and that there is
no design flaw in the mechanism.

To use an analogy.  Suppose the maids come every week to
tidy your house, but you don't want them to tidy your
workshop (it's cluttered, but it's *organized* clutter).  Perhaps
you put a note on the door that says "Don't clean this room."
It effectively keeps the maids out.

A self-styled "security expert" may suggest that a similar note
is all that is needed to keep burglars out of the jewelry store:
"Don't steal from this room."  Does the blame lie in the design
of the note?  No more than it lies in the "misuse" of pen and paper.

There is a continuum of misuse and a continuum of stupidity.  Yes,
it may be "stupid" to swerve a Mercedes A-Class Car, and it shouldn't
relieve Mercedes of the burden of producing a safe car.  But on
the other end of the continuum, it is stupid to ram toothpicks up
your nose, but I doubt that anyone would argue that toothpick
manufacturers are negligent by failing to put safety guards on their
product or placing clear instructions on the box.

The use of robots.txt for security is much closer to the latter than
the former.

There are clearly risks when a reasonable person does a reasonable
thing and is surprised by the result.  When
a stupid person does a careless thing (no matter how common that
might be) I don't call it a risk, I call it a consequence.




From j.pelan@am.qub.ac.uk Sat Jan 31 13:59:21 1998

Date: Sat, 31 Jan 1998 21:57:09 +0000 (GMT)
X-Sender: johnp@phantom
Reply-To: J.Pelan@am.qub.ac.uk
To: risks@CSL.sri.com
Cc: Bertrand Meyer 
Subject: Re: robots.txt (Meyer, RISKS-19.57 & 19.58)
Content-Type: TEXT/PLAIN; charset=US-ASCII
Content-Length: 1583
Status: RO
X-Lines: 36


I must admit that I too consider the use of the "robots.txt" mechanism
to prevent the indexing of sensitive information as 'stupid'.

However, I cite Appendix B of the HTML 4.0 specification
(http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.4), which says,
with my *emphasis*:

``People may be surprised to find that their site has been indexed by an
indexing robot and that the robot should not have been permitted to visit
a *sensitive* part of the site. Many Web robots offer facilities for Web
site administrators and content providers to limit what the robot does.
This is achieved through two mechanisms: a "robots.txt" file and the META
element in HTML documents, described below.''

In other words, a bona fide standards document seems to advocate
the use of "robots.txt" to prevent indexing of 'sensitive' WWW pages.

Some other points have not been explicitly mentioned:

o "robots.txt" may not be honoured by all robots, like 'home-grown' ones.

o A robot can only index unprotected pages to which links already exist.
  Robots generally begin with some top-level URL and then traverse all
  the links on the same site, so documents with no direct links will
  not be visited. One might question why supposedly sensitive information
  is accessible by any casual browser let alone by a robot indexer.

o As alluded to in the above excerpt, a META tag convention does exist for robots.
  This feature too can clearly be misused.

In summary, one can't write "TOP SECRET" on a document, file it in a
public library and then expect that it won't be read.

John P.
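
The second point above, that a robot only reaches pages to which links
already exist, can be sketched as a traversal of a link graph. The site
map below is entirely hypothetical.

```python
from collections import deque

# Hypothetical site: each page maps to the pages it links to.
SITE_LINKS = {
    "/": ["/products.html", "/about.html"],
    "/products.html": ["/", "/manual/index.html"],
    "/about.html": ["/"],
    "/manual/index.html": [],
    "/partners/secret.html": [],  # exists, but nothing links to it
}


def reachable(start, links):
    """Return the set of pages a link-following robot can discover from `start`."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


if __name__ == "__main__":
    found = reachable("/", SITE_LINKS)
    print(sorted(found))
    print("/partners/secret.html" in found)
```

The unlinked page is never discovered by traversal alone. Listing it in
robots.txt would be the only thing that advertises its existence, which
is the risk the original posting describes.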


From johnl@iecc.com Sat Jan 31 17:50:15 1998

Date: 1 Feb 1998 01:48:17 -0000
To: bertrand@eiffel.com
Subject: Re: robots.txt (Meyer, RISKS-19.57)
Newsgroups: local.risks
Cc: risks@iecc.com
Content-Length: 1395
Status: RO
X-Lines: 28

I certainly agree with your sentiment that wherever possible, you
should design your system to be idiot-resistant.  But you have to
balance that against efficiency and usability.

> I think that a more effective convention would have been to include
> a special marker (META tag?)  in HTML files that shouldn't be
> indexed, and a special file (exclude.txt?)  in the directories that
> should not be explored at all.

But robots.txt is intended to decrease, not increase, the number of
wasted retrievals that web spiders make.  Proper design of a spider is
tricky, and even a spider that obeys robots.txt can easily bring a
poor server to its knees by visiting every page as fast as possible.
The robots.txt file is one file per domain, requiring one retrieval
each time a spider visits a domain.  A META tag would require that the
spider retrieve all the pages that it's supposed to ignore, and a
per-directory file could double the number of retrievals that a spider
makes.

A sufficiently determined user can screw up any tool, computerized or
otherwise.  There are a lot of ways to foul up a web site, and I'd
rank robots.txt pretty low on the threat list.


--
John R. Levine, IECC, POB 727, Trumansburg NY 14886 +1 607 387 6869
johnl@iecc.com, Village Trustee and Sewer Commissioner, http://iecc.com/johnl,
Finger for PGP key, f'print = 3A 5B D0 3F D9 A0 6A A4  2D AC 1E 9E A6 36 A3 47

From dkarr@bbn.com Sun Feb 1 15:31:34 1998

Date: Sun, 1 Feb 1998 18:30:08 -0500
To: bertrand@eiffel.com (Bertrand Meyer)
Subject: Re: robots.txt (Meyer, RISKS-19.57)
Newsgroups: comp.risks
Content-Length: 2073
Status: RO
X-Lines: 38

I read both of your posts on this topic as well as your Web page.  I
found the whole discussion quite interesting, particularly as I myself
found the specification of robots.txt all but unusable.  If the design
had been up to me, I would have chosen something like your proposed
alternative; in fact, yours would have satisfied my requirements
perfectly.

Nevertheless, I think even a well-designed "robot exclusion" feature
could fall afoul of the risks you identified.  The fundamental problem
is, what if the robot doesn't honor whatever exclusion scheme we put
in place?  Someone who is motivated to dig up dirt from Web sites
(so that they would be spying on robots.txt in the first place)
_and_ who has access to a moderate amount of programming skill could
create such an ill-behaved robot.

Or to put it another way, if the robots can get to your sensitive page
in the first place (making some facility like robots.txt necessary)
then any Web surfer can get to the pages, and you simply should not
put your trade secrets or a list of your clandestine affairs there.

The one situation in which I can imagine robots.txt actually does
significantly compromise security (in this case by making questionable
security into unquestionably bad security) is where you have set up
some unadvertised directories with names that are not obvious and have
no hyperlinks into their files from anywhere.  ("Security by
obscurity.")  In this case there is no need to list these directories
in robots.txt, since robots would not reach them anyway.  But some
genius might decide to list the directories in robots.txt just to be
"extra sure."  This error would be less likely if the "exclude" files
had to be placed in the obscure directories themselves.

I have to admit I didn't "get" the analogy to swerving a Mercedes.
The situation you describe seems more analogous to swerving over the
edge of a cliff in order to avoid hitting the elk.  What are we to
blame the Mercedes engineers for in this case---for providing a
steering mechanism that enabled this maneuver?

David A. Karr

From adrianh@victoriareal.co.uk Sun Feb 1 21:38:12 1998

X-Sender: adrianh@seagulls.victoriareal.co.uk
Content-Type: text/plain; charset="us-ascii"
Date: Mon, 2 Feb 1998 05:29:07 +0000
To: bertrand@eiffel.com
Subject: Question on robots.txt
Cc: risks@CSL.sri.com
Content-Length: 2181
Status: RO
X-Lines: 56

Questions on robots.txt... only answer if you're not already too bored with
the subject :-)

I freely admit the fact that some foolish people use robots.txt as a "security"
shield. This is wrong. Dumb people are always a risk. Hopefully some of
them will be kicking themselves after reading the last couple of issues of
RISKS.

However, the robots.txt spec does perform a useful purpose (the one it was
designed for... stopping bots wandering where they'll get themselves in a
spin).

My question is: Is it possible to stop robots going where they shouldn't
without producing a scheme which dumb people will misinterpret as a
security protocol?

The problem with the two methods you suggest is that they significantly
increase the load on servers. A bot has to check for an exclude.txt in
every directory it accesses. And the META tags mean that a document must
have been read for it to be excluded! Not useful for pages with
side-effects (which are often a bad idea in themselves anyway... but that's
another discussion).

On the human side too, I think something called "exclude.txt" is much more
likely to be mis-interpreted as a privacy tool!

The obvious solution would be a global "allow.txt" which tells bots where
they can go. However with the structure of many web sites it is far easier
to express where robots cannot go so I imagine this method of robot control
would be less likely to be used.
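
As a minimal sketch of the "allow.txt" idea: no such file is part of any
convention I know of, so the function name and the prefixes below are purely
hypothetical. The check is default-deny, meaning a path is off limits unless
it falls under a listed prefix, and the list itself names nothing that the
site owner wants left alone.

```python
def may_visit(path, allowed_prefixes):
    """Default-deny check: a robot may visit `path` only if it falls
    under one of the explicitly allowed prefixes."""
    return any(path.startswith(prefix) for prefix in allowed_prefixes)

# Hypothetical contents of an allow list.
allowed = ["/products/", "/press/"]

print(may_visit("/products/list.html", allowed))   # True
print(may_visit("/internal/plans.html", allowed))  # False
```

The cost is the one noted above: every area that should be indexed has to be
listed, which is awkward for sites whose structure is easier to describe by
exception.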

BTW, as a point of interest, there are already META tags defined for robot
exclusion. See
 for more
details.

In fact, now that I think about it, the real risk in robots.txt is its use
in targeting denial of service type attacks. Since the URLs excluded by
robots.txt are often links to processor intensive CGIs a suitably vicious
individual could target these for an intensive set of HTTP GETs.

...Of course *everyone* does resource checks before they start running
their CGIs don't they :-)

Anyway, I really must get some work done.

Cheers,

Adrian


----
Adrian Howard. adrianh@victoriareal.co.uk. Head Techie. Victoria Real Ltd
URL: http://www.victoriareal.co.uk/ v. +44 1273 774469 f. +44 1273 779960



From bertrand@eiffel.com Sun Feb 1 15:39:24 1998

Date: Sun, 1 Feb 98 15:39:21 PST
To: bertrand@eiffel.com, dkarr@bbn.com
Subject: Re: robots.txt (Meyer, RISKS-19.57)
Reply-To: Bertrand.Meyer@eiffel.com
Content-Length: 784
Status: RO
X-Lines: 25

Dear David:

My Mercedes analogy was simply in answer to the person who
wrote "anyone who is stupid enough to..." [use a mechanism
outside of the cases planned by the engineers] "deserves
what he gets".

The problem with the robots.txt is that it appears in
a widely advertised place:

	http://www.YOUR_COMPANY.com

i.e. the very place where YOUR_COMPANY wants to draw
the millions. Among the millions there are bound to
be a few who will probe for a possible robots.txt there.
If you put a marker in some other place, there is no
particular reason why the great unwashed masses will
get the idea of looking up that particular place.

Unless I hear otherwise I will put your message (and this
one) on the Web page.

Thanks a lot for your comments and best regards,

-- Bertrand Meyer

From Camillo.Sars@DataFellows.com Mon Feb 2 03:41:55 1998

To: bertrand@eiffel.com (Bertrand Meyer)
Subject: Re: robots.txt (Meyer, RISKS-19.57)
Date: 02 Feb 1998 13:38:17 +0200
Lines: 38
X-Mailer: Gnus v5.4.64/Emacs 19.34
Content-Type: text/plain; charset=unknown-8bit
Content-Transfer-Encoding: quoted-printable
X-Mime-Autoconverted: from 8bit to quoted-printable by intra.datafellows.com id NAA20737
Content-Length: 2050
Status: RO
X-Lines: 48

bertrand@eiffel.com (Bertrand Meyer) writes:

> After all, the designers of the Mercedes A-Class car could also say
> "anyone stupid enough to swerve violently when an elk crosses the
> road gets (and should get) what he deserves". Unfortunately for
> them, and probably fortunately for most of us, that doesn't pass muster.

This comparison is inherently unfair.  The question you are asking is
"does mechanism X fulfill its purpose?"  For robots.txt, the purpose
of keeping certain robots out is fulfilled.  For Mercedes A-Class, the
purpose of safely driving under adverse conditions is not.

> 4. Of course designers cannot always be blamed for misuses of their
> mechanisms. But they should minimize the possibility of misuses. In the
> robots.txt case it seems to me rather wrong to have a conspicuous
> world-readable file that draws attention to *excluded* information. (Reminds
> me of programming languages which implement information hiding by making the
> author of each module list conspicuously, as the first thing you read in the
> module's text, those features which are *not* exported!) This draws
> attention to what should not attract attention. I think that a more
> effective convention would have been to include a special marker (META tag?)
> in HTML files that shouldn't be indexed, and a special file (exclude.txt?)
> in the directories that should not be explored at all. Then you would only
> be able to find that information if you already knew where to look.  The
> robots.txt mechanism is a godsend for Peeping Toms in search of possible
> secrets.

Here you are also making a fundamental mistake, although the first
sentence is promising.  If designed correctly, robots.txt should list
things to be *included*, according to the principle of least
privilege. Right?

Regards,
Camillo
--
Camillo Särs  Data Fellows Ltd.
http://www.Europe.DataFellows.com/          Aim for the impossible and you
http://www.iki.fi/ged                       will achieve the improbable


From redfield@cisco.com Mon Feb 2 13:39:18 1998

X-Sender: redfield@ce-nfs-1.cisco.com
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.2 (32)
Date: Mon, 02 Feb 1998 13:31:39 -0800
To: bertrand@eiffel.com
Subject: Robots.txt on www.cisco.com
Cc: cco-trolls@cisco.com
Content-Type: text/plain; charset="us-ascii"
Content-Length: 1671
Status: RO
X-Lines: 37

Bertrand,

As the author of the robots.txt file on CCO [a long time ago], I can assure
you we know what it's for. The 'Bug-navigator' and 'Registered User'
filters are in place solely to reduce hits that will result in an
authentication failure message (our system uses basic and SSL
authentication extensively).

We really don't care where a robot may go in terms of information content
as *by policy* we assume that any content stored on our web site is
publicly available (as opposed to transactions *through* the site which
have other protections). Robots that understand authentication can go
anywhere the corresponding user would (though again, we really didn't want
AltaVista indexing our authenticated sections, then broadcasting those
URL's and chewing up our authentication failure systems).

Does anyone really believe the robot exclusion _policy_ would be honored by
hackers? All websites require a multi-layered approach to security since
all are vulnerable to some form of attack.

You also may have violated our copyright protections by publishing that
information on the RISKS digest. Suggest you read
http://www.cisco.com/kobayashi/copyright.html . Just because you can
reproduce something doesn't mean you should.

I'll leave it to you to post what you will of this response to comp.risks.

-Keith


---------------------------------------------------------------------
   Keith Redfield				Cisco Systems, Inc
   Manager, Advanced Customer Systems	170 West Tasman Drive
   Cisco Systems, Inc.			San Jose, CA 95134
   redfield@cisco.com			http://www.cisco.com/
   408-526-8656
----------------------------------------------------------------------

From redfield@cisco.com Mon Feb 2 13:56:35 1998

X-Sender: redfield@ce-nfs-1.cisco.com
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.2 (32)
Date: Mon, 02 Feb 1998 13:47:54 -0800
To: bertrand@eiffel.com
Subject: Re: Robots.txt on www.cisco.com
Content-Type: text/plain; charset="us-ascii"
Content-Length: 2362
Status: RO
X-Lines: 58

Oops, I inadvertently sent you to the *protected* side of the site.

See below.

>Date: Mon, 2 Feb 1998 13:49:36 -0800 (PST)
>http://www.cisco.com/kobayashi/copyright.html
>
>Ironically, this URL is authenticated. :-)
>
>http://www.cisco.com/public/copyright.html
>
---------------------------------------------------------------------
   Keith Redfield				Cisco Systems, Inc
   Manager, Advanced Customer Systems	170 West Tasman Drive
   Cisco Systems, Inc.			San Jose, CA 95134
   redfield@cisco.com			http://www.cisco.com/
   408-526-8656
----------------------------------------------------------------------

From e8726057@student.tuwien.ac.at Mon Feb 2 15:37:55 1998

Date: Mon, 2 Feb 1998 23:52:44 CET
Reply-To: Klaus Johannes Rusch        
X-Url:     http://www.atmedia.net/KlausRusch/
To: bertrand@eiffel.com (Bertrand Meyer)
Cc: risks@csl.sri.com
Subject: Re: robots.txt (Meyer, RISKS-19.57)
References: <199801301734.JAA04664@chiron.csl.sri.com>
Content-Type: text/plain; charset="iso-8859-1"
Content-Transfer-Encoding: 8BIT
X-Mailer: OS/2 Warp/4.0 LaMail/2.3
Content-Length: 2961
Status: RO
X-Lines: 69

In <199801301734.JAA04664@chiron.csl.sri.com>, risks@csl.sri.com writes:
> RISKS-LIST: Risks-Forum Digest  Friday 30 January 1998  Volume 19 : Issue 58
>
> Date: Fri, 30 Jan 98 00:23:26 PST
> From: bertrand@eiffel.com (Bertrand Meyer)
> Subject: Re: robots.txt (Meyer, RISKS-19.57)
>
> I have received a flurry of responses to my article describing the risks
> associated with the `robots.txt' convention for excluding search engines
> from indexing parts of a Web site.
>
> 2. For anyone who thinks this is just a hypothetical possibility, here is
> the robots.txt file of the site of a major communications company:

While I don't know the first company and haven't been able to verify whether
or not the robots.txt was put in place in order to keep robots from indexing
irrelevant content, or for keeping content secret, I have looked at the
second example:

>         User-agent: *
>
>         Disallow: /sparc/SPARCengineUltraAX/oem/
>         Disallow: /microelectronics/SPARCengineUltraAX/oem/
>         Disallow: /javachip/SPARCengineUltraAX/oem/
>         Disallow: /javachips/SPARCengineUltraAX/oem/
>
>         Disallow: /sparc/SPARCengineUltraAX/download/
>         Disallow: /microelectronics/SPARCengineUltraAX/download/
>         Disallow: /javachip/SPARCengineUltraAX/download/
>         Disallow: /javachips/SPARCengineUltraAX/download/
>
> I can't say for sure, but doesn't some of this look a tad like
> proprietary information?

Have you tried any of the links? 404 File not found -- I guess that's a good
area to keep robots away from :-)
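
For reference, here is roughly how a well-behaved robot would apply rules
like those quoted above, using Python's standard urllib.robotparser module.
This is a sketch under assumptions: only two of the quoted Disallow lines
are used, they are written as one record with no blank line after the
User-agent line (a blank line conventionally separates records), and
"ExampleBot" and the second test path are made up.

```python
from urllib.robotparser import RobotFileParser

# Two of the Disallow rules quoted above, as a single record.
rules = [
    "User-agent: *",
    "Disallow: /sparc/SPARCengineUltraAX/oem/",
    "Disallow: /sparc/SPARCengineUltraAX/download/",
]

parser = RobotFileParser()
parser.parse(rules)

# "ExampleBot" is a made-up robot name; the paths are for illustration only.
oem_allowed = parser.can_fetch(
    "ExampleBot", "/sparc/SPARCengineUltraAX/oem/index.html"
)
other_allowed = parser.can_fetch("ExampleBot", "/sparc/index.html")
print(oem_allowed, other_allowed)
```

The check is purely voluntary: nothing stops a program, or a person with a
browser, from requesting the disallowed paths anyway.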

> 3. So even if the respondents are right that it is "stupid" to use
> robots.txt in that way, my posting at least draws attention to the risk. If
> it succeeds in making just one Webmaster a bit more careful, it will not
> have been useless.

I agree on that. (However, most webmasters don't even know about robots.txt
so drawing their attention to this file at all already serves a purpose.)

> ... I think that a more
> effective convention would have been to include a special marker (META tag?)
> in HTML files that shouldn't be indexed, and a special file (exclude.txt?)
> in the directories that should not be explored at all. Then you would only
> be able to find that information if you already knew where to look.  The
> robots.txt mechanism is a godsend for Peeping Toms in search of possible
> secrets.

The HTML approach is already used in the form of the ROBOTS meta tag, which is
not widely respected, for two reasons:

1. The document must be fetched in order to determine it should not have been
   fetched (bad for computationally intensive documents).

2. This only works for HTML documents, not for other media types.
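
The first reason is visible in even a minimal check for the tag. The sketch
below uses Python's standard html.parser module; the class name, function
name and sample page are invented for illustration. Note that may_index()
needs the full HTML text as input, so the document has already been fetched
by the time the robot learns it should not index it.

```python
from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(
                part.strip().lower()
                for part in content.split(",")
                if part.strip()
            )

def may_index(html_text):
    """Return False if the page carries a ROBOTS meta tag with NOINDEX."""
    parser = RobotsMetaParser()
    parser.feed(html_text)
    return "noindex" not in parser.directives

# Hypothetical page that asks robots not to index it.
page = (
    '<html><head><meta name="ROBOTS" content="NOINDEX, NOFOLLOW">'
    "<title>Internal</title></head><body>...</body></html>"
)
print(may_index(page))
```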

Spreading exclude.txt all over the server would create an enormous additional
load for the robot for keeping track of all exclude files, walking down the full
path etc.

Klaus Johannes Rusch
--
KlausRusch@atmedia.net
http://www.atmedia.net/KlausRusch/




From alun@texis.com Tue Feb 3 12:20:47 1998

X-Sender: alun@mail.io.com
X-Mailer: QUALCOMM Windows Eudora Pro Version 3.0.5 (32)
Date: Tue, 03 Feb 1998 08:51:45 -0600
To: Bertrand Meyer 
Subject: robots.txt
Content-Type: text/plain; charset="us-ascii"
Content-Length: 2012
Status: RO
X-Lines: 45

Just to let you know, there was a point to writing your RISKS article about
the use of robots.txt - not only with the examples you cite, but in this
exchange between me and an operator at a large NASA facility, after I had
mentioned that a "registered users only" page had become publicised on
search engines:

Him:>	I recall seeing something on your webpage somewhere
Him:>	about a jerk that registered your download directory
Him:>	with a bunch of search engines.  Just place a file
Him:>	called "ROBOTS.TXT" in the root directory with the
Him:>	following string:
Him:>
Him:>	disallow:/marketdagger.html
Him:>
Him:>	and that will keep away most of the robots.

Me: Unfortunately, then the page becomes a little easier for
Me: the hard-core hacker to get to - after all, if you want
Me: to know which pages a company doesn't want listed in Alta
Me: Vista et al, all you do is fetch the ROBOTS.TXT file,
Me: right?  What I really ought to do is place a password
Me: on the web page.

Him:>No.  Those search engine robots are looking for HTML files.
Him:>If the word ROBOTS.TXT is present in a webpage it *will* see
Him:>that, but not the actual file.  Password protection is
Him:>generally the best way to keep something secure in a webpage
Him:>like yours.

Apparently, this guy hadn't the first idea that perhaps someone that was
looking for my "registered users only" page might just connect to my web
site, and get the file ROBOTS.TXT to read what's in it.  So, there are
definitely people out there who are not only poorly educated about the
usefulness of a ROBOTS.TXT file, but they are also willing to spread their
lack of knowledge.
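
To make the point concrete, here is a small sketch of how little it takes to
read the excluded paths out of such a file once it has been fetched. The
function name is made up, and the sample text simply reuses the line
suggested in the exchange above with a User-agent line added.

```python
def disallowed_paths(robots_txt):
    """Return the paths named on Disallow lines of a robots.txt body."""
    paths = []
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "disallow" and value.strip():
            paths.append(value.strip())
    return paths

sample = """\
User-agent: *
disallow:/marketdagger.html
"""
print(disallowed_paths(sample))  # ['/marketdagger.html']
```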

Alun.
~~~~

---
Texas Imperial Software | Try WFTPD, the Windows FTP Server. Find it
1602 Harvest Moon Place | at web site http://www.wftpd.com or email
Cedar Park TX 78613     | us at alun@texis.com.  VISA / MC accepted.
Fax +1 (512) 378 3246   | NT users and ISPs, be sure to read details of
Phone +1 (512) 257 2578 | WFTPD Pro, the NT service version - cost $80.