NYT: Internet activist accused of data theft

Internet activist Aaron Swartz, formerly of Reddit and Wired Digital, was indicted Tuesday on charges of data theft. The U.S. attorney in Boston claims that he "stole" millions of JSTOR documents while at M.I.T., crimes that could put him in jail for 35 years. Here's Nick Bilton in the New York Times:
In a press release, Ms. Ortiz's office said that Mr. Swartz broke into a restricted area of M.I.T. and entered a computer wiring closet. Mr. Swartz apparently then accessed the M.I.T. computer network and stole millions of documents from JSTOR.
In a press release, Demand Progress, the political action group founded by Swartz, denies the prosecutor's claims outright: "As best as we can tell, he is being charged with allegedly downloading too many scholarly journal articles from the Web" and compares it to "checking too many books out of the library." JSTOR is an online archive of print journals, containing millions of articles. The prosecutor's language here is unequivocal: that he "broke in" to a "restricted area" to gain access to a "wiring closet" that would enable a mass data theft. The criminal complaint [via Jason Levine and Anil Dash] suggests, however, that most of the theft was accomplished using scraper software to download material en masse over the web, from a website he already had access to. "Swartz used the Acer laptop to systematically access and rapidly download an extraordinary volume of articles from JSTOR. He used a software program to automate the downloading process so that a human being would not need to keep typing in the archive requests." The trip to the wiring closet happened after JSTOR finally blocked that technique:
On January 4, 2011, Aaron Swartz was observed entering the restricted basement network wiring closet to replace an external hard drive attached to his computer. On January 6, 2011, Swartz returned to the wiring closet to remove his computer equipment. This time he attempted to evade identification at the entrance to the restricted area. As Swartz entered the wiring closet, he held his bicycle helmet like a mask to shield his face, looking through ventilation holes in the helmet. Swartz then removed his computer equipment from the closet, put it in his backpack, and left, again masking his face with the bicycle helmet before peering through a crack in the double doors and cautiously stepping out.
Needs a theme tune by Henry Mancini. Note: the NYT originally reported that Swartz was a co-founder of Reddit, referenced here in an earlier headline. I've updated this post to reflect its update: Swartz joined Reddit early but not as a founder.


  1. JSTOR content is easily (and often freely) accessible through libraries via interlibrary loan and other means. Just doesn’t make sense…

  2. I can’t even get the documents I’M SUPPOSED TO BE ABLE TO GET from JSTOR, and this guy “stole” millions?

    i r confuzed

  3. So, how much did JSTOR and/or publishers pay the Justice Department to put on this circus act?

  4. Some interesting context provided by DaringFireball (original site unreachable at present):

    “Demand Progress, the political activism group Swartz founded, has a response:
    “This makes no sense,” said Demand Progress Executive Director David Segal; “it’s like trying to put someone in jail for allegedly checking too many books out of the library.”
    “It’s even more strange because the alleged victim has settled any claims against Aaron, explained they’ve suffered no loss or damage, and asked the government not to prosecute,” Segal added.”

  5. “theft” “stole”

    I don’t think that word means what you think it means.

    “Stealing is stealing whether you use a computer command or a crowbar, and whether you take documents, data or dollars,” said Ms. Ortiz in the press release.

    No, it really isn’t. You cannot steal what is not exclusive or scarce. He didn’t delete the originals. They’re still available.

  6. Aaron Swartz is a talented young software developer who contributed lots of interesting stuff to the public domain; it is obvious to me that he does what he does to make the Internet accessible to everybody, safely and with ease. Consequently, I can only guess that people acting against him are looking for the opposite, a controlled Internet, with walls to hide things; so sad.

  7. “JSTOR collaborates with hundreds of publishers and content providers to preserve and broaden access to their scholarly content.”
    – about.jstor.org

    This and more they do for us at the low, low price of $8-$32 an article. Aaron Swartz should have known that to steal from them is to steal knowledge from the university professors who are the only ones who deserve access to it. I vote we chain Mr. Swartz to a rock and have an eagle eat out his liver every day, only to regrow at night, for the rest of eternity.

    1. bwahahahaa! I forgot how cheap some of those eggheads can be. ONLY 8-32 bucks a hit? Dang.

  8. Why on Earth would someone steal archived journals? Isn’t Swartz rich? Why didn’t he just buy the recent issues if he wanted the embargoed papers?

  9. It’s been a little while since I’ve been over there; but last time I was, MIT’s libraries were pretty open. You could just walk in, plug in your laptop, and get your MIT-affiliated IP. I’m not sure why you’d have to go playing Mission Impossible in the wiring closet…

    Or if, y’know, you’re the cofounder of reddit, you probably have a fairly large number of MIT students who would be delighted to give you access to whatever areas students use to get to JSTOR.

    It is conceivable that this guy did do something stupid to get what he could have gotten otherwise; but it seems very strange to break in when you can take your choice of walking in or being invited in.

    1. I would assume that the typical “MIT-affiliated IP” interface had safeguards against data dumps, while the wiring closet allowed him to download “millions of documents” with significantly greater speed. I doubt JSTOR wanted him to do this.

      From the demand progress website,
      “About Aaron,

      Aaron Swartz… is the author of numerous articles on a variety of topics, especially the corrupting influence of big money on institutions including nonprofits, the media, politics, and public opinion. In conjunction with Shireen Barday, he downloaded and analyzed 441,170 law review articles to determine the source of their funding; the results were published in the Stanford Law Review. From 2010-11, he researched these topics as a Fellow at the Harvard Ethics Center Lab on Institutional Corruption…”

      1. Can we get a cite for the article which he worked on with Ms. Barday? Searching for “Swartz” at the Stanford Law Review’s website doesn’t return any results.

        Barday’s note in the Stanford Law Review from 2008 on the same topic only mentions the need to download 51 articles, and from Westlaw, not JSTOR.

  10. Back away from the scholarly articles! Hands up, get down on the ground NOW!

    A few hours later, in the holding cell: “So, what’re ya in for?”

    “Uh.. I robbed JSTOR.”

    “The J Store, what?”

    “No… just JSTOR.”

  11. It’s worth noting that quite a bit of what was involved included the work of not for profit scholarly publishers like the one I work for. I think all of us involved in the non-profit scholarly publishing sector would prefer to give away all of our content for free, but our salaries and the technology we use to create and distribute that scholarship would still need to be paid for. So far the legislators and administrators who decide nfp publishing policies require us to both engage in markets and to publish in an economically sustainable way. It does not surprise me that ITHAKA, who developed the JSTOR platform, isn’t interested in legal action. I should add that neither am I. But I work with very talented and dedicated editors, designers, and programmers and I’m not sure how we will pay for their talents and the tools they need unless we sell the content.

    Perhaps another approach to making sure this content is available to anyone with connectivity is to encourage university administrators and state and federal legislators to fully fund the journals and books we publish as Open Access publications. If they were willing to cover the full costs of the editing, design, and distribution of this content, it wouldn’t be restricted content. Open Access isn’t as sexy an issue as pirating, but it’s a solution to at least this particular problem, the problem that Aaron was trying to address, and it needs public support.

    It’s also worth noting that it seems Aaron did do some collateral damage with this effort. The indictment seems to indicate he brought down some servers and caused MIT users to lose access to the same journals he was copying.

    1. “It’s also worth noting that it seems Aaron did do some collateral damage with this effort.”

      Perhaps. But he didn’t do anything worth 35 years in prison. You can get less on a murder charge.

    2. I used to work in scholarly publishing too, and I used to talk that way, until getting harshly spoken to around here and elsewhere, and a fair amount of reading and thinking, put me right. Really, the comparative pennies they throw at an editor or programmer can easily be found in other ways than walled-garden, rent-seeking publishing: stopping print publication, slashing honoraria, cost control, subventions, etc., would help so much, just for starters. Like any top-heavy, unsustainable corporate boondoggle, most of the $$$ isn’t going to the actual workers, but to middle management types and bosses, and into the coffers of faceless entities like Thomson Reuters et al. Most of the problems finding the money seem to be ones of imagination, inertia, and political will.

      1. Thomson Reuters isn’t an nfp. NFP publishing employees on average earn 3/4 to 1/2 of what their commercial counterparts earn. I do it because I love it. We don’t typically pay honoraria and our content output per employee is roughly 20% higher than commercial presses. We also ask for subventions on every book and some open access journal experiments do require it from their authors, but should publication be dependent on the author finding money or should the only factor be the quality of the scholarship? I really don’t think there’s the cost control available that you are implying.

        1. Replace Thomson Reuters, then, with one of the university nonprofit conglomerations: the point’s the same, big corporate-style controls, redundancies, middle managers, kowtowing to copyright laws, etc. Your situation sounds different than mine was, so apologies for mixing my experiences with yours. Authorial subventions weren’t what I was thinking, so much as institutional and other forms of that, although scholars do seem remarkably adept at scaring up grants when in need, bless their thrifty souls.

          And are you typing this at work, perhaps? There’s some cost-cutting right there, get off the Internet! Seriously, though, I have no idea what hierarchy you work under, but with my former journals there was a large amount of ossified honoraria, managers aplenty, and a fair amount of cash walking out the door in the pockets of those who didn’t exactly need it. The print costs alone are wasteful, in an age when print is unnecessary (and unavailable to many libraries in the developing world, for example). Again, I think it’s often an issue of will and imagination.

          I hear you about loving the work, though: I did too, even after I realized that most of the authors’ interest was in resume-building and tenure securing, and that management often had little or no clue about the editorial and content issues, and were incredibly resistant to thinking about new ways to do things. Those would cost $$$ to think about, see. Sigh.

      2. There’s still a lot of inertia and vestigial bureaucracy left over from the days when print was the only way to get journals out (you know, in the prehistoric eighties). As fast as the internet has transformed academic publishing, the institutions themselves haven’t caught up. As it stands, the extant system feeds a lot of jobs that only in the last decade have become completely superfluous. And, unlike free-agent IP publishers such as artists, it’s not as if most academics and researchers can just go set up their own peer-review system without drawing the ire of those they work for.

        Some revenue does need to happen to fund a reliable jury system complete with administration and server/bandwidth costs, but no way should it take $8-$32 a paper.

        1. That largely describes my experience. I still remember the glee with which I’d read typed letters from senior scholars bemoaning the lack of “proper” typesetting in the digital age, and threatening to cancel subscriptions over the “horrible” digital type. Warehousing, mailing, print costs: a sclerotic, moribund process, there.

          I was in humanities publishing, and there I rather think that if enough younger, hipper academics moved over to newer, better ways of peer-review, the industry would follow suit in time. Most humanities work doesn’t need the kind of infrastructure that scientific work requires: if I write a fun paper on Matthew Arnold, that can be published the bad old way, or by someone sitting in an apartment office with a Mac and the Adobe Suite. (More or less.) I also think that these imperatives will combine with the bottom having fallen out in the job market across the academy and especially in the humanities, the scarcity of tenured jobs and the conscious outmoding of tenure by business-degreed deans, etc.: we’ll all simply start doing things the way we want, and hopefully be able to tweak what’s left of the academy to what we need. Or perhaps we’ll revert to the itineracy of medieval scholars like Abelard, or the early humanists: scary. If the alleged hack was in any way politically motivated, that’s one sign of the vulnerability of the walled garden, which may in time simply fall due to determined, culture-wide copying: one can only hope!

          1. I sometimes forget that humanities authors may be less eager to embrace the digitization of trade literature than the sciences. My own narrow little fledgling field, evolutionary computation, had the somewhat unusual distinction of not really taking off until after ARPANET. Programmers in general are perhaps a bit less attached to dead trees :)

            To make matters worse, I spent most of the last decade in industry. Now that I’ve gone back to school for my doctorate I’m becoming reacquainted with, egads, the library! *shudders*

            I’m pretty sure something will crystallize in coming years, probably with digital peer-review and streamlined refereeing and perhaps libraries and other repositories turning to print-on-demand for their carbon needs. Right now we’re just going through a transition phase, with the timely convergence of online archives and the realigning jobs market sending everyone in publishing into full panic mode. And while overhaul is necessary and inevitable, it’s hard to blame anyone in academia jealously looking out for their jobs. In the end academics want to share research so they can get noticed and advance their fields, and the world wants them to succeed so they can keep producing intellectual goodies. Some management types doubtless are focused first and foremost on lining their pockets, but they’re lunching on borrowed time. There’s not enough money to bilk academia for to make it worth it to the people and institutions who ultimately fund the research and have their eyes on the bigger picture.

          2. “…I’m becoming reacquainted with, egads, the library! *shudders*”

            Ouch! Hey, hey! Lighten up on us weenies in the academic libraries of today–we’re doing our bestest in rowing against the tidal onslaught of skyrocketing database and e-journal aggregator prices. And don’t let’s get started on if you want multi-user access to that pretty e-journal. Academic libraries are taking body blows right along with our public brothers and sisters–budget cuts left and right–and yet we’re implementing new and interesting tools that allow quicker access to the things we hold (and if we don’t, we’ll interlibrary-loan that sucker for you). New tools in the field (like Serials Solutions’ “Summon” product, or EBSCO’s “Discovery”) make the old database/journal searching seem…well, like very old school, and not in an OG type of way.

            In any case, I’m finding this discourse quite interesting because it impacts our daily work in continuing to provide free access to information that should be free in the first place. As numerous posters have mentioned above, while the old publishing model is dying, those vendors still hold the power in the system…but not for long. Scientists and researchers–let’s get the Open Access journal system running strong so we don’t have to worry about someone copying a bunch of articles.

          3. Much as libraries private and public collectively stopped recording what books were being checked out in response to Bush Administration requests for said information–the Patriot Act, as I recall, required records to be turned over, so they simply stopped recording said data–you could all collectively act out, and do something crazy transgressive like just giving away the databases, and see what happens. Maybe just slip up and publish everyone’s library barcode numbers, thus enabling mass remote access of databases? If you think that information “should” be free, you’re in a critical position to really hit the rentier economy of the corporate walled-garden middlepersons where they live: in their “non-profit” checkbooks.

            At the very least, you could push back on horribly overreaching conglomerates like UMI, which require dissertations to observe corporate standards of copyright, and are incredibly draconian on issues regarding fair use. “Make the publishing corporations happy or we’ll remove your dissertation from our network!” Ugh. Or maybe get together to fund a CC-dissertation database that doesn’t rent-seek? Anyhoo, good to hear from a reasonable actor working in an unreasonable system: good luck to you!

    3. I am a professional mathematician. The rates scholarly journals charge libraries for access are ridiculous. The papers are written, refereed, and edited by the community at no cost to the journal; and yet the community often has to pay to view the results. And by pay, we are not talking about a few dollars- it can be in the thousands for online access for a year. I have plenty of sympathy for this guy…

      1. I’m in full agreement except that I don’t understand what he planned to do with all of these papers. He was going to get in trouble eventually for the copyright violations if he tried to publicly distribute them. Even if that worked out somehow, it wouldn’t stop people from publishing in for-profit journals.

      2. Anon 25, I would encourage you to talk to your professional societies and ask them why they are working with commercial publishers. Few math journals are published by NFPs but are instead published by commercial entities. If you want to reduce the price of those journals, become politically active in your professional society and advocate for a change in publisher.

        I would also encourage you to consider the cost of the journal you submit your own research to. If you think that journal overcharges, tell them so. And consider choosing where you submit based on that criterion. Why not submit to Project Euclid instead?

  12. The fact that an MIT server was impacted suggests a cause for the excitement.

    About a decade ago, I was tracking down a series of hacking attacks against members of a Palestinian Solidarity mailing list. An unintended consequence of the attacks was a performance impact to MIT’s mail servers, at which point the FBI took an immediate interest.

  13. This guy seems like a total schmuck. His company was bought by Reddit (the gizmodo article doesn’t say what company it was), so he certainly must have the funds to buy whatever research articles he needs. Just another guy with no respect for property and copyright, unless it’s his own.

    1. If he wants to read 10 articles he could buy those 10 articles. If he wants to have 1 000 000 articles for data mining journal articles the price would be quite high.

      It’s insane to have copyright restrictions that effectively prevent people from writing algorithms that use all existing journal articles.

  14. Breaking into a secure computer facility to steal JSTOR articles is like rappelling through the skylight of the Natural History Museum to steal the Windex they use to clean the display cases they put the priceless gems in.

      1. Now now, just because MIT is full of clever people who carry around implements of disassembly everywhere they go doesn’t mean you can’t secure something there. For instance, if I wanted to keep MIT students out of a room, I’d leave the door slightly ajar and tape a sign on it reading



          ALL WELCOME

          Is that like singles’ night?

  15. Any resident of Massachusetts can get access to JSTOR articles for free through a public library.

    Thanks for fighting the good fight, dude! It’s my right to download every article all at once, in bulk, without the horrible inconvenience of visiting a public library web site.

    Any other point of view is heresy and if you have to illicitly access the hardware of a not-for-profit institution and soak up the bandwidth of another not-for-profit institution in order to prove your point, I’m with you!

  16. The prosecutors have pretty strong circumstantial evidence of hacking. They searched his apartment and found pie.

  17. This policy change looks like fallout from Aaron’s work.

    Also, from the indictment: “MIT offers campus guests short-term service on its computer network. Campus guests must register on the MIT network and are limited to a total of fourteen days per year of network service.” That sounds wrong. There’s no process of “registering” on the MIT network. Here’s the relevant update from MIT’s IT office.

  18. I don’t understand how JSTOR has any standing here. Many of my publications are available through JSTOR, but I hold the copyright on most of them and the journals hold the copyright on the rest. What exactly did he steal that was JSTOR’s property?

  19. Anon (and a bunch of similar comments): “Any resident of Massachusetts can get access to JSTOR articles for free through a public library.”

    Y’all assume he went to great trouble, and knowingly risked serious legal trouble, just for the convenience of not having to go online to read some JSTOR articles himself? C’mon!

    It is more likely (given his pattern of internet activism) that he planned to make the JSTOR bulk accessible for others. You know, the tiny portion of the earth’s population that are NOT residents of Massachusetts?

    High quality research publishing is costly. But the current costs for access for everyone outside major universities are just ridiculous. The paywalls are also an effective competition barrier against researchers in poorer countries and/or at universities that can’t afford access to all the top journals.

    JSTOR and their ilk are repugnant information rentiers.

    1. It is more likely (given his pattern of internet activism) that he planned to make the JSTOR bulk accessible for others. You know, the tiny portion of the earth’s population that are NOT residents of Massachusetts?
      I don’t really like (to put it mildly) all the outfits that want me to pay $30-40 for a paper but at least JSTOR access is pretty universal in university libraries all over the world. I’m about 10,000 km away from Massachusetts and I have full access to JSTOR.

  20. Also, it should be mentioned in the relish of mighty irony that Mr. Swartz was a researcher at Harvard Ethics Center Lab on Institutional Corruption, where he downloaded hundreds of thousands of law documents using similar scripts. Yo dawg, how about some ethics in your ethics…

    Swartz claims to be a cofounder of Reddit on his web site: http://www.aaronsw.com/ In truth, Reddit bought his startup Infogami, which no one has heard of.

    Is JSTOR like JDate for researchers? Just askin’.

  21. This sounds familiar. Let’s see…from Slashdot in October, 2009:

    “Federal court documents aren’t free to the public, they cost $0.08/page through a system called PACER. During a period when the US Government Printing Office was trying out free access at a number of courthouses around the US, a 22-year-old programmer named Aaron Swartz installed a small PERL script at the 7th US Circuit Court of Appeals library in Chicago — a script that uploaded a public document every three seconds to Amazon’s EC2 cloud computing service. Swartz then donated over 19 million documents to public.resource.org. That’s when the FBI took interest in the programmer responsible for this effort and ran his name through government databases. How did he discover this? His FOIA was approved, of course, and he received the FBI’s partially redacted report on himself. The public.resource.org database was later merged with that of the RECAP Firefox extension, which we discussed a couple of months back.”

    I, for one, sleep better at night knowing the US Attorney for the State of Massachusetts is taking such an interest in enforcing JSTOR’s position as gatekeeper, copyright protector, and money collector for electronic publishers. Because, in America, information doesn’t want to be free; it wants to be paid for. Anybody who says different probably steals crowbars.

    Forbes magazine calls itself a “capitalist tool,” but Ms. Ortiz is the real capitalist tool here.

  22. I have no idea what he wanted to do with those articles, but releasing them to the public isn’t the only option.

    If he were interested in investigating them as a corpus of scholarly output, some interesting things could be done. Perhaps analyzing how many articles were written with the support of Organization X or Company Y. That would seem in line with the previous Stanford Law Review article. Just speculating, though.

    @Gulliver, @phisrow, etc.: I’m pretty confident, BTW, that he didn’t download them just to read them, because that would be silly.

  23. As @kenahoo notes, there isn’t any evidence that he intended to redistribute these articles. Had he done so, he could be facing serious copyright violations. As it is, they are charging him for violating JSTOR’s Terms of Service. 35 years for ToS violations is, in any sane world, disproportionate.

    JSTOR claims that there are ways of getting at the corpus. If that’s true, it’s the first I’ve heard of it. And if those ways were easier than scraping, I’m sure he would have pursued them.

    As for impacting others’ access (@toekneesan’s “collateral damage” above), it sounds like JSTOR cut off MIT’s access when it became clear someone was massively downloading files. I’m not sure that it’s fair to blame Swartz for JSTOR pulling the plug.

    There is something rotten here.

  24. I know…many people…let’s say, Akbar & Jeff…who did virtually the same sort of thing during their last semester in college (when they still had University-based access to damn near everything.) IP-based login-less access combined with wget set to a setting non-aggressive enough to not raise any flags can get you damn near everything you’d ever want to read out of…something like JSTOR which isn’t at all JSTOR and is far less litigious.

  25. I can see the temptation. There’s a lot of good information locked up safe from eyes that might see and use it differently by “non-profit” JSTOR.

    College libraries used to have lots of technical books (especially books on how to actually do, not just think) on the shelves that have been removed since the 80s. Some got moved from general to specialist collections. Some of them just plain disappeared without a trace. To my knowledge, that disappearance was neither local nor coincidental.

  26. I worked as a sysadmin at MIT for four years, at a building close to the one this indictment is talking about. If a student had done this, I would consider demanding disciplinary action. If it were my network, and this were not a student, I’d consider hauling the guy into small claims court for the hours of my time he diverted me from other tasks.

    But 35 years in prison? For charges that stretch the definition of “theft”? What a profound waste of my tax money.

  27. There goes my massive paper collection >.>

    As someone who reflexively downloads every academic paper he comes across (regardless of network, excepting obvious paywalls), I very much dislike the idea that someone can sue me for it and get enough support to show up on NYT.

  28. What JSTOR does is take research produced by students and faculty at taxpayer-funded schools and then charge taxpayers to read the results.

    They’ve also prevented large scale scans for plagiarism.
