Hard data on ebook piracy versus sales -- slides from O'Reilly Tools of Change for Publishing panel

One of the absolute highlights of the O'Reilly Tools of Change for Publishing conference in New York this week was a panel called Challenging Notions of "Free", in which Brian O’Leary (Magellan Media), Mac Slocum (O'Reilly), and Chelsea Vaughn (Random House) presented a long-term, quantitative study of the effects of ebook piracy on book sales. There's a lot of hot air bandied about by people who argue that free ebooks either generate or cannibalize sales, and it's a hard problem to study, but here at last are some good, crunchy stats and analysis to add to the argument.

The authors have generously given me permission to upload their slide-deck to the Internet Archive under a Creative Commons Attribution-NonCommercial-ShareAlike license, and they've set up a form for anyone who wants to sign up to get the full report for free when they publish it in a few weeks.

Challenging Notions of "Free"


  1. Sometime this week, the video to this presentation should become available at the blip.tv TOC 2009 video archive.

    Hope you’ll post notifications of when that (and other videos of interest, such as your own blistering indictment of Amazon and Audible for their DRM) becomes available.

  2. Wow, I have no idea how to interpret those sales graphs. I’m sure they are intended for a very specific audience. But come on, axis labels, please!

  3. I’m glad I’m not the only one who can’t read these. Could somebody please provide a layperson translation?

  4. I’ll gleefully pirate just about anything, but I’m not at all keen on Ebooks. Can’t read ’em in the bath, you see. Unless you hack together a digital projector to project them on the bathroom wall, of course.

    /goes off to find screwdriver/

  5. I think the blue dots are science fiction titles, and the red dot* is the ebook “How to make any woman sleep with you, win at WoW & make money from home”.

    (*Judging by what I see on torrent sites)

  6. I assume the x-axis is ‘Prior sales’ and the y-axis is ‘% change in post-seed sales’. As a bona fide scientist, my opinion is that that graph shows no relationship between the two. I’m not sure why you would posit a relationship between prior sales and change post-seed anyway. I guess the underlying assumption is that sales of any book slow after seeding, but I would also guess that there is a positive relationship between the popularity of a book (at least to a certain audience) and the speed with which it gets seeded. You would therefore expect popular books to get seeded while their legitimate sales were still growing rapidly, which would produce the kind of muddy graph you get above.
    That said, we don’t have any units on the x-axis. If those are actual prior sales figures of less than 200, we’re either talking seriously unpopular books, or a seeding interval so short you can’t get any decent data out of it.
    In all, that graph is seriously information deficient.

  7. I’d be willing to bet that most people who download pirated content, especially e-books, don’t really use them that much. They’re just hoarders.

    We are conditioned to fetishize consumer goods. Along comes this new technology and many of us can’t help ourselves, we hoard.

  8. Forget the graph, I couldn’t get past the graph title. Funk the what? Whatever happened to straight-talk?

  9. Gilbert Wham, DragonFrog,

    Did you see the post on BBG the other day on the miniature projectors? Perhaps we can form a small corporation, and send in a request for, say, 3 of the mini projectors. Perhaps my rubber ducky will see new life as a cyborg.

  10. The correlation coefficient measures how closely the data points follow a straight line. Basically it gives you a number from -1 to 1 that helps determine if there is any real correlation between your numbers, with 0 meaning none at all. Anything much closer to 0 than about .50 means that the correlation is weak and may well be random, so if you leave out that one outlier then you get .30 which is terrible. .67 isn’t much better though.

  11. Following up on #13: what you’re looking for is whether the points cluster around a straight line, either positive (as X goes up so does Y) or negative (as X goes up Y goes down). With the outlier you’ve got a weak positive correlation: imagine a diagonal leading from the bottom left up towards that outlier, and you’ll see that the points are kinda sorta in the vicinity, but it’s not all that strong. If you take out the outlier, probably the best fit is close to a straight up and down line, which is quite inconclusive (and would be reflected in that .37 correlation).

    I still don’t completely get what it’s showing, other than that post seed sales and prior sales have a very weak (and probably not significant) positive correlation. I’d personally want more data. Doesn’t seem like a lot to go on.

  12. Basically, the blue dots are individual books, and their position along the x-axis is their “prior sales.” So most books there are around 50 “somethings”; whether that’s 50 books, 50,000 books, or 50 million I’m not sure.

    The y-axis then represents how much that book’s sales changed after someone seeded it, I assume by uploading a torrent. So some books increased by 10-20%, some decreased by 10-20% and some remained unchanged. In general, there is no correlation between how much a book was selling vs. how much it changed after seeding.

    Like others, I’m not really sure why this correlation would be relevant. A much more interesting correlation would be TOTAL DOWNLOADS vs. sales change.

    This would actually tell you more about the effects of file sharing. As it is, we don’t even know if these books were traded.

    Of course, there could be more slides showing more interesting graphs. I wasn’t able to get any out of the “slide deck” though — I seemed to be able to get a bunch of audio files.

  13. Cory, is your keynote speech available to watch anywhere? I heard good things about it.

    I’d like to see other presentations too, if there are any available online.

  14. Some of them are available (see the link in comment #1) but Cory’s doesn’t seem to be yet, or this one. Hopefully they’ll go up next week.

  15. Robotech Master, if people are going to bury useful links in plain sight in the very first comment…. how do you think I am going to find them? ;)

  16. Sorry about the slide – the fonts substituted and the text reflowed in the title. We wrote the presentation for the conference, not really as a standalone. This is one of about 30 in the full deck, which Cory has posted.

    The y axis is the change in average weekly sales seen in the four weeks after a seed first appears for a title. The baseline is the average weekly sales for the four weeks before the seed was seen. The x axis is the average number of units sold in the four weeks prior, for a total of eight titles tracked.

    This slide shows a test for “bigger book” bias (there wasn’t much of one). The next slide in the presentation demonstrates the positive correlation of seeds to growth in sales.

  17. Also, a (maybe not so small) edit to Cory’s comment on the research paper: it will be published shortly, but it will be available for purchase, not free. The research was unfunded, so …

    Of course, if it is pirated, I will probably add my own work to the report and track the change in sales.

  18. Cory – do you have any idea what this hard data says or indicates?

    Do pirated editions cannibalize existing sales? Do they contribute to existing sales? Do they have no impact? Do they impact sales in unpredictable ways?

    I would love the value-added to this data that a bit of your inimitable contextualizing might provide.

  19. As a librarian this is no surprise to me!

    Downloadable free books — pirated or no — and libraries serve the public by offering free “loss leader” samples of authors’ (and by extension publishers’) works (or in business parlance, “product”), and many readers will then migrate to the hard stuff, from reading online to reading actual printed books, or from borrowing books to buying them.

    (And don’t forget that not everyone downloads books. In addition to people who just find books easier to handle, there are also many people who are not technologically savvy or affluent enough to download books or use book-readers. Many people don’t even have their own computers or are dropping their home internet access in these hard times. They use the library’s computers instead.)

    Libraries also fill a gap in the system by offering access to out-of-print and back-catalog books that may be hard to come by either in print or online, making it easier for readers of series or completists who like an author or genre and want to read *everything*. This actually builds demand for future books of that author or genre, and if the readers don’t buy them, they will often urge their libraries to. Which may lead to more readers and eventually more demand.
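
For readers who want to try the arithmetic discussed in the thread, here is a minimal Python sketch. It assumes the metric described in comment 16 (percentage change in average weekly sales, four weeks after a seed versus four weeks before) and a plain Pearson correlation coefficient of the kind discussed in comments 10 and 11. The function names and all of the sales figures are hypothetical, invented purely for illustration; none of them come from the study.

```python
from math import sqrt


def weekly_average(weekly_sales):
    """Average units sold per week over the given weeks."""
    return sum(weekly_sales) / len(weekly_sales)


def percent_change(before, after):
    """Percentage change in average weekly sales, relative to the
    pre-seed baseline (the 'before' weeks)."""
    baseline = weekly_average(before)
    return 100.0 * (weekly_average(after) - baseline) / baseline


def pearson_r(xs, ys):
    """Pearson correlation coefficient; ranges from -1 to 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# HYPOTHETICAL data: (four weeks before seed, four weeks after seed).
# These numbers are made up and are NOT taken from the presentation.
titles = {
    "Title A": ([40, 42, 38, 40], [44, 46, 42, 44]),
    "Title B": ([55, 50, 52, 51], [48, 47, 50, 49]),
    "Title C": ([30, 33, 31, 30], [31, 30, 32, 31]),
    "Title D": ([60, 62, 58, 60], [66, 64, 63, 67]),
    "Title E": ([180, 175, 185, 180], [240, 250, 245, 245]),  # outlier
}

if __name__ == "__main__":
    prior = [weekly_average(b) for b, _ in titles.values()]
    change = [percent_change(b, a) for b, a in titles.values()]

    print("r with outlier:   ", round(pearson_r(prior, change), 2))
    # Drop the last (outlier) title and recompute.
    print("r without outlier:", round(pearson_r(prior[:-1], change[:-1]), 2))
```

Running it with and without the last title shows how much a single outlier can move the coefficient when there are only a handful of points, which is the concern several commenters raise about the slide.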
