Scraping the Senate, turning US govt into structured data

Cory Doctorow

4:08 am Fri, Sep 3, 2004

Paul Ford has written an article for XML.com about his plan to scrape all the information he can about the Senate and convert it into searchable, structured data (much like the UK's brilliant They Work For You project, which does the same for Parliament). He's planning to document how he converts the Senate's sloppy HTML into clean XML, and to turn that write-up into a tutorial on how to make the Semantic Web come alive.

Of course screen-scraping is itself a dubious process. When the Senate decides to change its page design, moves the page, or alters the suffix, I'm out of luck. At the same time, it's hard to argue against the fact that the Senate's own web site is a definitive source for up-to-date, reliable information about the current composition of the Senate. This is a situation that we're likely to encounter again: the best, most reliable site to get some information is the worst place to get useful data. Hopefully, as we go forward, we'll have multiple sources of information on various members of the government, and can use them all together.
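For a sense of what this kind of scrape involves, here is a minimal Python sketch using only the standard library. It is not Ford's code: the sample markup, the two-column layout, and the names RowScraper, rows_to_xml, and scrape are all made up for illustration, and the real Senate pages will look different. It pulls rows out of a sloppy table with unclosed tags, emits clean XML, and fails loudly if the layout no longer matches, which is the fragility described above.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Hypothetical sloppy markup (unclosed <td> and <tr>); not the real Senate page.
SAMPLE_HTML = """
<table>
<tr><td><a href="/senator/a">Doe, Jane</a><td>D-XX
<tr><td><a href="/senator/b">Roe, Richard</a><td>R-YY
</table>
"""


class RowScraper(HTMLParser):
    """Collects the text of each table cell, row by row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            # A new row implicitly closes whatever cell was open.
            self.rows.append([])
            self._in_cell = False
        elif tag == "td" and self.rows:
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "tr", "table"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def rows_to_xml(rows):
    """Builds a <senate> element; assumes each row is (name, affiliation)."""
    root = ET.Element("senate")
    for cells in rows:
        cells = [c.strip() for c in cells]
        if len(cells) != 2:
            raise ValueError("unexpected row shape: %r" % (cells,))
        member = ET.SubElement(root, "senator")
        ET.SubElement(member, "name").text = cells[0]
        ET.SubElement(member, "affiliation").text = cells[1]
    return root


def scrape(html):
    parser = RowScraper()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        # The page was redesigned or moved: refuse to emit empty data.
        raise ValueError("no table rows found; has the page layout changed?")
    return rows_to_xml(parser.rows)


if __name__ == "__main__":
    print(ET.tostring(scrape(SAMPLE_HTML), encoding="unicode"))
```

Raising an error when the rows don't look right is a deliberate choice here: a scraper that quietly produces empty or misaligned XML after a redesign is worse than one that stops.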

(via Kottke)