Obama's whitehouse.gov nukes 2400-line Bush robots.txt file


46 Responses to “Obama's whitehouse.gov nukes 2400-line Bush robots.txt file”

  1. theawesomerobot says:

    Ouch, they really are using ASP – that’s painful.

  2. nanite2000 says:

    As an ASP.NET developer myself, I would like to say how good a framework it is (especially since ASP.NET 2.0). Yeah, people say it’s just a knock-off of Java, and maybe it is. But it is so much more polished, and the support community is infinitely more helpful. Also, because it is managed by just one company, there is a consistency to it that is not present in Java, and it is being continuously developed and improved upon (again, unlike Java).

    I used to develop in Java and found it almost impossible to get support from the global community (assuming I could even find a well populated support forum in the first place). Then I started developing in ASP.NET and it was like a breath of fresh air. Well featured, with a fantastic IDE, a friendly helpful support community, plenty of tutorials and demos on how to do the most basic (and advanced) things, etc… Quite simply, I can do more with ASP.NET in a fraction of the time it takes to do in Java. I have never looked back since.

    On the few occasions I am asked to do Java development, I’m filled with dread. It seems these days that Java is an elitist, overblown, oversized, sprawling, inconsistent mess with no standard framework or development environment. And that makes it difficult for programmers to actually develop in it if they wish to have a life outside of programming.

  3. Anonymous says:

    So why did they want to hide their earmarks from us???

  4. James Holden says:

    Stripping out the lines for text-only and print versions, the original robots.txt file looks like this:

    User-agent: *
    Disallow: /cgi-bin
    Disallow: /search
    Disallow: /query.html
    Disallow: /help
    Disallow: /slideshow/gallery/iraq

    User-agent: whsearch
    Disallow: /cgi-bin
    Disallow: /search
    Disallow: /query.html
    Disallow: /help
    Disallow: /sitemap.html
    Disallow: /privacy.html
    Disallow: /accessibility.html

    (From http://web.archive.org/web/20080325202806/www.whitehouse.gov/robots.txt)

    So really (because the rest only applies to their own crawler) all they excluded is:

    /cgi-bin, /search, /query.html, /help and /slideshow/gallery/iraq

    Seems reasonable to me, and I think holding this up as an example of “Change” doesn’t really hold water.

    The file would be a hell of a lot shorter if the robots.txt syntax allowed for /*/text and /*/print, but it doesn’t, so the file ends up verbose.
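
    For illustration, here is a rough sketch (not the actual file) of what it could have looked like with wildcard syntax. Wildcards are an extension supported by Google and some other large crawlers, not part of the original robots.txt standard, which is presumably why the real file spelled everything out:

    User-agent: *
    Disallow: /cgi-bin
    Disallow: /search
    Disallow: /query.html
    Disallow: /help
    Disallow: /slideshow/gallery/iraq
    Disallow: /*/text
    Disallow: /*/print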

  5. Evil Jim says:

    Thank you to those commenters who explained the robots.txt significance to those of us not in the know. Especially #10 Anonymous.

    In future, would BoingBoing kindly include some more background on what they’re linking? Especially technical articles.

  6. senormike says:

    Regarding ASP – think of the military. The M16 is certainly not the best gun out there, but we’ve bought a couple million and can’t switch now.

    ASP.NET (and the corresponding servers) is a lot like the M16…except easier to shoot yourself in the foot with.

  7. Beryllium says:

    To the person who said “they deleted all the files and uploaded new ones” … I don’t think that’s how they did it.

    (And frankly, even if they DID do it that way, it would have been Cheney et al hitting the delete button ;-))

    They probably just did a coordinated server swap of some kind. It’s probably all load-balanced, anyway, so it might have just been as simple as flicking a switch.

  8. toolbag says:

    ASP at .gov sites: the feds get significant discounts from MS on software and training. Many of the content producers for those sites are GS-## “lifers” who learned IIS/FrontPage back in the ’90s, so it’s going to be around for a while. At least that’s what I’ve gleaned from my off-and-on work with government contracts and subcontractors.

  9. Master Gracey says:

    On the ASP question, I wonder what security clearance level the web designers and webmasters need? I suspect there is a larger talent pool with appropriate security clearances that is trained in ASP as opposed to anything else. Does it have to be this way? Of course not, but short of a massive re-write of the site, I suspect it will stay ASP a while longer.

  10. Master Gracey says:

    So let me see if I’ve got this right: they put info on the website that they wanted to keep secret, so they crafted an extensive robots.txt file to “hide the info”, but the new administration (before they even took office) replaced the file with one just two lines long… Yeah.

    There is this concept of a “Steve Jobs Reality Distortion Field” and another called “Bush Derangement Syndrome”; I think this is the start of a new phenomenon, wherein every step the Gov’t takes is a direct result of a positive action by our new president, and we need to compare it to the previous administration to underscore the change.

    I fall in with the “good website management” group – Bush, Cheney, Rove, Rice, and Rumsfeld all were smart enough to know you don’t post secrets on a public gov’t website, robots.txt file or not…

  11. jathomas says:

    @#10 THANK you!

  12. Carlos says:

    CNet’s Declan McCullagh explains what this means and specifically calls Boing Boing out in this post:

    http://news.cnet.com/8301-13578_3-10146802-38.html?part=rss&subj=news&tag=2547-1_3-0-5

  13. bfarn says:

    Ok, so it’s a shame that the new White House has gotten rid of its “Disallow: /earmarks” policy, but I’m glad to see them allowing /help, /expectmore, and most importantly /results.

  14. Anonymous says:

    I for one welcome our new robot overlords… oh wait.

  15. Michael Leung says:

    First line of business: nuke the search, allow terrorists in!

  16. Anonymous says:

    I’m not quite sure if the robots.txt got cleared out on purpose, by accident, or through plain negligence.

    It’s ASP.NET because the contractor refuses to acknowledge any other language/technology. I used to work for them, and this was DEFINITELY a “lowest bidder” job. This was an utter clusterf— of a project and, remembering the work environment, I’m utterly amazed it turned out as well as it did.

    (/includes is just where we kept CSS and JavaScript; it was disallowed in the default builds we had.)

  17. Master Gracey says:

    From CNET:

    If anything, Obama’s robots.txt file is too short. It doesn’t currently block search pages, meaning they’ll show up on search engines–something that most site operators don’t want and which runs afoul of Google’s Webmaster guidelines. Those guidelines say: “Use robots.txt to prevent crawling of search results pages or other auto-generated pages that don’t add much value for users coming from search engines.”(Emphasis Added)
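
    For what it’s worth, the kind of rule the guideline asks for is only a couple of lines, something like:

    User-agent: *
    Disallow: /search/

    (As a later comment notes, whitehouse.gov did subsequently add a rule along these lines.)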

  18. Bart says:

    @#2 re: gov’t websites and ASP

    I’ve actually worked closely with the web staff of several federal agencies, building the software infrastructure, so I can answer that, to a certain extent.

    ASP and ASP.NET developers are inexpensive, plentiful and easily trained. Of the sites that I worked on, most of the work was done by contractors and subs (and even sub-subs). In a typical project, I’d talk with a GS a few times over the life of the project, but almost all technical details went to/through the contractors. Typically the GSs were in managerial roles and more concerned with what the final website looked like than with what the back-end technologies were.

    With the mantra of “no one ever got fired for buying Microsoft” alive and well in the US government, there’s no incentive to choose anything else. Plus, ASP and ASP.NET are (relatively) easy technologies to learn. If you’re planning, long term, to have a site that has to be maintained by unknown people (who knows who the contractor du jour will be next year), there is a certain logic to going with ASP. A lot of the folks I worked with were not the types of people I would expect to learn anything new if they considered it ‘hard’. I’m not advocating ASP over any other technology (I’m an Apache/Java type person myself), but it works well for them.

    As someone else said, a lot of these sites have a legacy of being originally written in FrontPage and having migrated over time to ASP and then ASP.NET, as it’s sometimes easier to slowly upgrade the system than it is to re-architect on a new platform (and to teach your lifers a totally new set of skills).

  19. Jamie Brown says:

    The BBC has picked up on this and got it completely and utterly wrong: http://www.jamiedigi.com/2009/01/bbc-gets-it-completely-wrong-about-whitehousegov-robotstxt/

  20. ian_b says:

    Well, this proves it.

    Obama is in the pocket of Big Spider.

  21. Webnauts says:

    The robots.txt itself has a PR. I think PageRank Sculpting should be the next step.

    I would suggest implementing the X-Robots-Tag header in the WhiteHouse.gov site’s .htaccess file, like this:

    Header set X-Robots-Tag "noindex"
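
    As written, that header would be set on every response and would noindex the entire site. A rough sketch of scoping it (assuming, purely as an example, that only PDFs should be kept out of the index, and that mod_headers is enabled) might be:

    <FilesMatch "\.pdf$">
        Header set X-Robots-Tag "noindex"
    </FilesMatch>

    The file pattern is only illustrative.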

  22. mjcmjc says:

    The file now has more lines… and viewing one of the disallowed paths lets you see a site still being built: http://www.whitehouse.gov/omb/

    User-agent: *
    Disallow: /includes/
    Disallow: /search/
    Disallow: /omb/search/

  23. freshyill says:

    If I remember correctly, weren’t a huge percentage of the entries in the old robots.txt file just printer versions of other pages?

  24. joe says:

    I really don’t think that the official website for the White House needs to worry about rank in organic searches.

  25. xeoron says:

    Am I the only one that now wants to make a shirt that says something along the lines of this:

    Robots
    User-agent: *
    Disallow: /includes/

    Humans
    User-agent: *
    Disallow: /encrypted/

  26. Ryan says:

    It looks better, for sure, but it looks like they’re still using ASP and ASP.NET. Maybe someone else knows better than me, but why does the US Government rely so much on ASP? Practically every government website I’ve been to, city, state or federal, appears to be written in it. Is there some sort of agreement between Microsoft and the government?

    While part of me feels that it’s probably because they want a guaranteed level of support on products like that, I realize that they could get the same level of support while paying more in people time as opposed to license fees. If you hire a bunch of people to work on open source solutions, you have people who have a connection to your project, rather than being forced to deal with some vendor’s busy and distracted support staff.

    It’s just an interesting choice for the ‘Change’ we’re supposed to be getting.

  27. abushaw says:

    I want to be excited about the new president too, and I am, but really they would never put anything “secret” or otherwise confidential on a .gov site. Doesn’t happen.

  28. dainel says:

    The robots.txt is not used to hide anything. The things you really want to hide, you don’t list them in robots.txt. You just don’t show them to visitors who are not logged in.

    Think of it like a no-entry sign stuck on a door. If you really don’t want strangers going in, you’d lock the door. It’s even less than that, because robots.txt only applies to robots; humans *are* supposed to ignore it.
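
    To see how little enforcement is involved, here is a rough Python sketch of what a *compliant* crawler does with the file; nothing stops an impolite client from fetching the page anyway (the CSS path is just a made-up example):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.set_url("http://www.whitehouse.gov/robots.txt")
    rp.read()  # fetch and parse the live robots.txt

    # A polite crawler asks first; the answer is purely advisory.
    print(rp.can_fetch("*", "http://www.whitehouse.gov/includes/style.css"))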

  29. Anonymous says:

    Please explain the significance of this — other than banning google from whitehouse.gov anyway… :)

  30. hms says:

    As pointed out by other websites, the entries in the robots.txt of the Bush White House were there to help search robots and to prevent pages that don’t need to be indexed (like pages with forms on them) from being indexed. It also pointed the robots to the graphical pages rather than the text-only versions.

    The Bush version is actually the preferred method… see Google’s webmaster guidelines.

    The post shows the partisan and technical ignorance of the person posting it.

  31. imipak says:

    This post is kinda pointless. There’s no significance to the robots.txt change beyond the fact that the content on the server changed.

  32. TheChickenAndTheRice says:

    What does this mean?

  33. jerwin says:

    Archive.org and its Wayback Machine obey robots.txt directives.

  34. almostwitty says:

    Given that (as far as I can tell) nothing from the Bush administration exists on the current White House website, isn’t it far more likely that they just wiped all the files off the .gov server, and uploaded a pre-made set of new files?

    The basic significance is that the White House website during Bush basically asked search engines not to index anything in those listed directories.

    The new robots.txt file says “Index it all! I don’t mind! Come and visit!”. The fact that there’s probably not that much there to begin with also helps…

  35. christ says:

    An explanation of what this post means:

    A robots.txt file gives instructions to ‘robots’ such as search-indexing spiders.

    The previous Bush administration’s robots.txt file instructed such robots not to index over 2,500 files/directories.

    The current administration’s robots.txt file only instructs such robots/spiders to ignore the /includes/ directory.
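
    Concretely, feeding the new two-line file to a parser shows the change: everything except /includes/ is now fair game (a quick sketch using Python’s standard library; the example URLs are just illustrations):

    from urllib import robotparser

    rp = robotparser.RobotFileParser()
    rp.parse([
        "User-agent: *",
        "Disallow: /includes/",
    ])

    print(rp.can_fetch("*", "http://www.whitehouse.gov/includes/menu.js"))  # False
    print(rp.can_fetch("*", "http://www.whitehouse.gov/earmarks/"))         # True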

  36. bene says:

    The blocked files were the text-only versions of those pages. The full pages were fully available for indexing.

    Instead of a “civil liberties” tag, how about “good web coding” ?

  37. Anonymous says:

    In lay, lay, layman’s terms:

    The Bush administration didn’t give the outside world access to many of the pages on Whitehouse.gov via search engines like Google.

    Under the Obama administration, they’re opening up pretty much everything to be indexed by Google, so normal people can find information that was restricted before.

    It seems like a good example of the subtle and not-so-subtle changes we can expect for the next four years. :)

  38. spazzm says:

    The idea of a robots.txt file stinks of segregation and apartheid. Why should robots not be allowed in certain areas?

    Robots are people too!

    ♫Let my people go…♫

    I’m only half joking.

  39. spazzm says:

    “Under the Obama administration, they’re opening up pretty much everything to be indexed by Google, so normal people can find information that was restricted before.”

    Either that, or they’ve put all the Sooper Seekrit stuff under /includes/.

  40. Anonymous says:

    Why did they want to hide the stuff about the easter egg hunts so badly? What was in those eggs?

  41. schr0559 says:

    Sadly, I’m with #30. It’s entirely possible that after 4 years (or 8, if all goes well), the Obama White House page will have an even bigger robots.txt file, for perfectly non-sneaky reasons.

    Then again, having to go that far out of your way just to make Google happy probably means you need a new IT crew… not the case so far from what it seems :)

  42. IamInnocent says:

    “An explanation of what this post means.”

    Thank God!

  43. Namdnal Siroj says:

    Almost all of the disallowed directories are called “text” or “search”.
    It looks like someone took the time to try to avoid duplicate entries in search engines by disallowing plain-text resources (used to build pages) and internal search pages.

    The implied conclusion is tendentious and not substantiated.

  44. Beanolini says:

    wget http://www.whitehouse.gov -r -erobots=off

  45. Namdnal Siroj says:

    In layman’s terms:

    It looks like the webmaster tried to exclude unformatted copies of pages from search results.
    This would help Google create more accurate search results, instead of returning multiple, unformatted copies of the same pages.
