Data mining the intellectual history of the human race with Google Book Search

Harvard's Jean-Baptiste Michel, Erez Lieberman Aiden and colleagues have been analyzing the huge corpus of literature that Google digitized in its Book Search program, and they're uncovering absolutely fascinating information about our cultural lives, the evolution of language, the secret history of the world, censorship and even public health. It's all written up in a (regwalled) paper in Science, "Quantitative Analysis of Culture Using Millions of Digitized Books":

When the team looked at the frequency of individual years, they found a consistent pattern. In their own words: "'1951' was rarely discussed until the years immediately preceding 1951. Its frequency soared in 1951, remained high for three years, and then underwent a rapid decay, dropping by half over the next fifteen years." But the shape of these graphs is changing. The peak gets higher with every year and we are forgetting our past with greater speed. The half-life of '1880' was 32 years, but that of '1973' was a mere 10 years.

The future, however, takes hold ever more quickly. The team found that new technology permeates our culture with growing speed. By scanning the corpus for 154 inventions created between 1800 and 1960, from microwave ovens to electroencephalographs, they found that more recent ones took far less time to become widely discussed.

The cultural genome: Google Books reveals traces of fame, censorship and changing languages (via Beyond the Beyond)


  1. I suspect that the number of interesting things that happened in a particular year has much to do with it. For example, 1945 is likely to have a higher spike than 1950, with a significantly longer half-life. Books still regularly discuss the time between 1939 and 1945, whereas few books have an in-depth discussion of 1957. I’d be curious what the results would look like if they normalized for “major news events.”

    1. This is a good point. The claim being made is that as you move forward in time the gradient of the decay curve gets steeper. What you’d want to do, to smooth out the “WW2 effect”, is take a series of rolling averages of this gradient over the preceding hundred years; what’s the average of the gradients from 1850 to 1950, from 1851 to 1951, and so on?, and then plot these averages. If the claim is good, then the rolling average should decline, and you’d only see a slight hiccup caused by the years ’39-’45.
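The rolling-average idea above can be sketched as follows. The gradient values here are invented for illustration; the real ones would have to be measured from the n-gram decay curves in the released data:

```python
from statistics import mean

# Hypothetical decay gradients: fraction of peak frequency lost per
# year after each "spike" year. More negative = faster forgetting.
# (Made-up values; the real ones would be measured from the corpus.)
gradients = {year: -0.01 - 0.0001 * (year - 1850) for year in range(1850, 1974)}

def rolling_gradient(gradients, window=100):
    """Average the decay gradient over each preceding `window` years."""
    years = sorted(gradients)
    out = {}
    for end in years:
        span = [gradients[y] for y in years if end - window < y <= end]
        if len(span) == window:  # keep only full windows
            out[end] = mean(span)
    return out

averages = rolling_gradient(gradients)
# If the claim holds, these averages decline steadily over time, with
# the WW2 years showing up as only a slight hiccup rather than driving
# the whole trend.
```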

  2. Every year there are more years in the past to talk about, so less attention is paid to each individual year.

  3. The speed with which modern stuff gets well-known is mind-boggling. The iPad went from a secret to a multi-million seller in weeks, whereas the Xerox machine took a decade to take off.

    But that’s because all of us have nothing better to do than to talk about stuff on the interwebs, since we don’t spend all day on the farm working.

  4. Surely they noticed that 1951 is a prime number, impacting the frequency of the string “ 1951 ” in general… meaning they’re really looking at hits of the years, not just the numbers?

    1. The graph disagrees with the text and shows 1950. (Which is 2*3*5^2*13.) Also, I’m not sure that there are that many mentions of prime numbers under say 50 or so – at least not enough to skew the result.

  5. I would also suspect that decade-defining years (ending in 0) would get significantly more hits than years with any other ending. So the years 2000, 2010 and 2020 would register more frequently than 2006, 2017, etc. When looking into the past, it is more common to generalize about the 1960’s or 1970’s than any specific year. Alternately, when talking about the future of technology or politics, exact dates are unknown, so it’s easier to refer to dates such as 2015, 2020, etc.
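One way to test this “round year” hypothesis, assuming you had per-year mention counts from the released data (the numbers below are invented stand-ins):

```python
def round_year_ratio(counts):
    """Compare average mentions of decade years (ending in 0) to others."""
    decade = [c for y, c in counts.items() if y % 10 == 0]
    other = [c for y, c in counts.items() if y % 10 != 0]
    return (sum(decade) / len(decade)) / (sum(other) / len(other))

# Invented counts in which decade years get twice the mentions.
counts = {y: (1000 if y % 10 == 0 else 500) for y in range(1900, 1980)}
ratio = round_year_ratio(counts)  # 2.0 for these made-up numbers
```

A ratio well above 1 on the real data would support the hypothesis.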

  6. The thing I find most daunting/discouraging?

    from the paper: “The corpus cannot be read by a human. If you tried to read only the entries from the year 2000 alone, at the reasonable pace of 200 words/minute, without interruptions for food or sleep, it would take eighty years. The sequence of letters is one thousand times longer than the human genome: if you wrote it out in a straight line, it would reach to the moon and back 10 times over”

    1. The majority of those would be false fucks, as the letter “s” in English was spelled like a lower-case “f” for a really long time. Any reader of Renaissance English will surely remember the first time they read Ariel’s lovely “Where the bee fucks, there fuck I,” or a similar schoolboyishly funny incident. So not only is the dataset skewed, as posters have noted above, but they haven’t really done the work of refining said data: they’re going by the form of the word, not parsing them into individual lexical-semantic units, the way a lexicographer would. (Thus, you can’t find true “fuck” from among the welter of false positives in, say, the eighteenth century, unless you knew what authors or texts to look for already: Lord Rochester, say.)

      Louis Menand noted in the NY Times that there wasn’t a single humanist attached to the project, and it shows in simple boners such as these. A lot of promise here, though, and I really hope they’ll craft tools to perform this kind of function, or find some way to introduce human cognition to the process: imagine the human-power used to produce the OED harnessed to this kind of data and computing power. “Holee shit,” as we say in the humanities….

  7. I’m having trouble with this. Need to think about it more. At first blush, it looks like the frequency is just a raw count. So naturally as publishing capacity increases, there will be more mentions (numerator) because the capacity (the denominator) has increased.

    I’d like to see the years mentioned as a % of the total published to that point.

    THEN, the rates of extinction for each spike could be compared.

    And adjusted for context, such as if the number appeared in a sentence or in a list of numbers. Or what genre. Or in what intra-sentence context – subject, object or what-have-you.

    Basically, I’m not satisfied with a frequency chart to explain this. A lot of things could explain these distributions, the way they are presented here in raw form.
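The normalization proposed above might look something like this; both tables of numbers are hypothetical stand-ins for the real per-year counts:

```python
def normalized_frequency(mentions, corpus_size):
    """Mentions per million words published in each year, so growth in
    publishing volume doesn't masquerade as growth in attention."""
    return {y: 1e6 * mentions[y] / corpus_size[y] for y in mentions}

# Hypothetical counts for the string "1951" and total words per year.
mentions = {1950: 2000, 1951: 9000, 1952: 8500}
corpus = {1950: 2_000_000_000, 1951: 2_100_000_000, 1952: 2_200_000_000}
freq = normalized_frequency(mentions, corpus)
```

With the numerator expressed as a share of that year’s output, the decay rates of the spikes could then be compared on an equal footing.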

  8. Does anyone find their use of the word “culture” problematic?

    Also from the paper: “Reading small collections of carefully chosen works enables scholars to make powerful inferences about trends in human thought.”
    Their analysis is limited to the English language and cannot be generalized to “human thought” as a whole.

    Yes, sorry for being nitpicky.

    1. It’s more than problematic. They are trying to replace anthropology – which admittedly seems adrift in the past few decades – as the study of culture (a means, in addition to biology, which humans use to adapt to changing environments) with something called “culturomics”. [“We introduce culturomics: the application of high-throughput data collection and analysis to the study of human culture.”]

      The neologism is silly and hardly up to the task of understanding the richness of humanity.

      “Culture” has been reduced to correlation of a limited metric, by folks sponsored by a commercial monopoly.

      By the way, anthropologists, as scientists of culture, have been developing techniques like those depicted here for over half a century – but the old-fashioned way: manually, with Abney cards or on isolated computers, and on pittances. [Santa, are you tweeting this?]

      The dataset could be quite interesting within a larger context of study.

  9. Cory, this is by far the stupidest heading I’ve seen on boingboing in the 5+ years I’ve been visiting.

    Human race? Too bad for Turkic languages (stretching from occupied Turkestan through to Turkey), Sanskrit & its descendants (Hindi, Urdu & others), Arabic, the vibrant Dravidian languages such as Telugu or Tamil that have survived the Aryan invasion and 5000 (can’t recall exact date) years of Northern Sanskrit-centric cultural pressure, not to mention the other smaller languages with their own early modern and modern literatures. Way to go with the sweeping generalizations.

    Twelve anglophones, a Russian, a German, a Gaul and a Chinese – since this is essentially the sample used – doesn’t represent anything resembling humanity, nor even the West of the last 500 years (with a sprinkle of the Kingdom of Heaven).

    What they are doing is interesting, but the sample is seriously flawed and unrepresentative of anything at this point.

    So, yeah, kudos for accuracy and truthiness.

    On a side tangent, we’ll never have a true intellectual history of the human race, unless that history excludes pre-Columbian history

    1. Are you sure you meant to say this? If so, please explain.

      On a side tangent, we’ll never have a true intellectual history of the human race, unless that history excludes pre-Columbian history

      1. please explain? take a chill pill…

        we’ll never have a true intellectual history of the human race using data from the 16th century to present as the literature of pre-Columbian America was destroyed by Europeans. Sure there are bits and pieces around, but no significant body that would meaningfully represent the intellectual traditions of those extinct urban civilisations.

      2. mmmm, ok, i take your point now.

        rewrite as:

        On a side tangent, this project will never be capable of a true intellectual history of the human race as it will have to exclude the literature of pre-Columbian Americas. Sadly.

  10. As they released the data for public consumption, I wonder what would happen if you did the following:

    1) take the most popular 5-gram.
    2) find the most popular 5-gram beginning with the last 4 words of the previous 5-gram.
    3) repeat step 2.

    Would the result make any sort of sense?
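The greedy walk described in those three steps could be sketched like this, over a toy 5-gram table (the real one would come from the released dataset; keys here are 5-grams as word tuples, values are counts):

```python
def greedy_chain(fivegram_counts, start, steps=3):
    """Repeatedly extend the chain with the most frequent 5-gram whose
    first four words match the last four words of the chain so far."""
    chain = list(start)
    for _ in range(steps):
        tail = tuple(chain[-4:])
        candidates = {g: c for g, c in fivegram_counts.items() if g[:4] == tail}
        if not candidates:
            break  # dead end: no 5-gram continues this tail
        best = max(candidates, key=candidates.get)
        chain.append(best[4])
    return " ".join(chain)

# Toy data: one fragment that chains for two steps, then dead-ends.
counts = {
    ("at", "the", "end", "of", "the"): 90,
    ("the", "end", "of", "the", "day"): 80,
    ("end", "of", "the", "day", "the"): 10,
}
result = greedy_chain(counts, ("at", "the", "end", "of", "the"))
```

On real data the walk would likely fall into a loop fairly quickly, since the most popular continuation of a common tail tends to lead back to another common tail.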
