Some websites should be indexed regardless of their robots.txt. My one example of this is LegInfo, California's official website for state law. Because their tech department is incompetent, their robots.txt reads:
User-agent: *
Disallow: /
Disallow: /billPdf.xhtml$
Disallow: /billAnalysisClient.xhtml$
Crawl-Delay: 10
Sitemap: https://leginfo.legislature.ca.gov/sitemap.xml
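Note that the leading `Disallow: /` already blocks every path for every crawler, which makes the two specific `Disallow` rules after it redundant. A minimal sketch with Python's standard-library `urllib.robotparser` (feeding it the quoted file directly, so no network access is assumed) confirms that nothing on the site is fetchable:

```python
from urllib.robotparser import RobotFileParser

# The robots.txt quoted above, embedded so the check is reproducible offline.
ROBOTS_TXT = """\
User-agent: *
Disallow: /
Disallow: /billPdf.xhtml$
Disallow: /billAnalysisClient.xhtml$
Crawl-Delay: 10
Sitemap: https://leginfo.legislature.ca.gov/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# "Disallow: /" covers every path, so every URL is blocked for every
# user agent; the two specific rules add nothing.
for path in ("/", "/faces/codes.xhtml", "/billPdf.xhtml"):
    url = "https://leginfo.legislature.ca.gov" + path
    print(path, parser.can_fetch("*", url))  # prints False for each path
```

(The `/faces/codes.xhtml` path is just an illustrative example; the point is that any path at all comes back disallowed.)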
I should be able to search a law (by standard citation!) and get the official source as the first result, not buried deep in the results or missing entirely. Too often I have to search within their website instead of through a search engine, which is absurd. I don't have a good general-purpose rule for when to respect or ignore robots.txt, but in this case at least, I see no reason to respect it.