Data mining the intellectual history of the human race with Google Book Search

28 Responses to “Data mining the intellectual history of the human race with Google Book Search”

  1. ibbers says:

    Cory, this is by far the stupidest heading I’ve seen on boingboing in the 5+ years I’ve been visiting.

    Human race? Too bad for Turkic languages (stretching from occupied Turkestan through to Turkey), Sanskrit & descendants (Hindi, Urdu & others), Arabic, the vibrant Dravidian languages such as Telugu or Tamil that have survived the Aryan invasion and 5000 (can’t recall exact date) years of Northern Sanskrit-centric cultural pressure, not to mention the other smaller languages with their own early modern and modern literatures. Way to go with the sweeping generalizations.

    Twelve anglophones, a Russian, a German, a Gaul and a Chinese – since this is essentially the sample used, it doesn’t represent anything resembling humanity, nor even the West in the last 500 years (with a sprinkle of the Kingdom of Heaven).

    What they are doing is interesting, but the sample is seriously flawed and unrepresentative of anything at this point.

    So, yeah, kudos for accuracy and truthiness.

    On a side tangent, we’ll never have a true intellectual history of the human race, unless that history excludes pre-Columbian history

    • mpb says:

      Are you sure you meant to say this? If so, please explain.

      On a side tangent, we’ll never have a true intellectual history of the human race, unless that history excludes pre-Columbian history

      • ibbers says:

        please explain? take a chill pill…

        we’ll never have a true intellectual history of the human race using data from the 16th century to present as the literature of pre-Columbian America was destroyed by Europeans. Sure there are bits and pieces around, but no significant body that would meaningfully represent the intellectual traditions of those extinct urban civilisations.

      • ibbers says:

        mmmm, ok, i take your point now.

        rewrite as:

        On a side tangent, this project will never be capable of producing a true intellectual history of the human race, as it will have to exclude the literature of the pre-Columbian Americas. Sadly.

  2. Thalia says:

    I suspect that the number of interesting things that happened in a particular year has much to do with it. For example, 1945 is likely to have a higher spike than 1950, with a significantly longer half-life. Books still regularly discuss the time between 1939 and 1945, whereas few books have an in-depth discussion of 1957. I’d be curious what the results would look like if they normalized for “major news events.”

    • Rothbarth says:

      This is a good point. The claim being made is that as you move forward in time, the gradient of the decay curve gets steeper. What you’d want to do, to smooth out the “WW2 effect”, is take a series of rolling averages of this gradient over the preceding hundred years: what’s the average of the gradients from 1850 to 1950, from 1851 to 1951, and so on? Then plot these averages. If the claim is good, the rolling average should decline, and you’d only see a slight hiccup caused by the years ’39-’45.
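
      In code, a minimal sketch of that rolling average, assuming we’ve already estimated a decay gradient for each mentioned year (say, the slope of log-frequency against years since the spike); the gradient dict and its values below are made up:

          def rolling_average_gradient(gradient, window=100):
              """Average the decay gradients over a sliding window of mentioned years,
              e.g. the mean gradient for 1850-1950, then 1851-1951, and so on."""
              years = sorted(gradient)
              averages = []
              for end in years:
                  in_window = [gradient[y] for y in years if end - window <= y <= end]
                  if len(in_window) > window // 2:      # skip sparsely covered windows
                      averages.append((end, sum(in_window) / len(in_window)))
              return averages

          # Toy data: a steadily steepening gradient plus an exaggerated wartime hiccup.
          fake = {y: -0.010 - 0.0001 * (y - 1800) for y in range(1800, 1961)}
          for y in range(1939, 1946):
              fake[y] -= 0.02
          for end_year, avg in rolling_average_gradient(fake)[-5:]:
              print(end_year, round(avg, 4))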

    • rastronomicals says:

      Hey man, I.G.Y.!

  3. PrettyBoyTim says:

    Every year there are more years in the past to talk about, so less attention is paid to each individual year.

  4. nixiebunny says:

    The speed with which modern stuff gets well-known is mind-boggling. The iPad went from a secret to a multi-million seller in weeks, whereas the Xerox machine took a decade to take off.

    But that’s because all of us have nothing better to do than to talk about stuff on the interwebs, since we don’t spend all day on the farm working.

  5. jjsaul says:

    Surely they noticed that 1951 is a prime number, which affects how often the string “1951” appears outside of year contexts… meaning they’re really looking at hits for the year, not just the number?

    • querent says:

      Neat point.

      And this is a very neat idea.

    • knappa says:

      The graph disagrees with the text and shows 1950. (Which is 2*3*5^2*13.) Also, I’m not sure that there are that many mentions of prime numbers under say 50 or so – at least not enough to skew the result.

  6. Rincewind says:

    Great news! You can now do this sort of analysis for yourself, using Google’s Ngrams Labs site:

    http://ngrams.googlelabs.com/

    e.g.:

    http://ngrams.googlelabs.com/graph?content=freedom,+liberty&year_start=1920&year_end=2000&corpus=0&smoothing=3
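
    If you’d rather build those links in code, a tiny helper does it; the parameter names are copied from the example URL above, not from any documented API:

        from urllib.parse import urlencode

        def ngram_url(phrases, year_start=1920, year_end=2000, corpus=0, smoothing=3):
            # Builds a graph URL in the same form as the example above.
            params = {
                "content": ",".join(phrases),
                "year_start": year_start,
                "year_end": year_end,
                "corpus": corpus,
                "smoothing": smoothing,
            }
            return "http://ngrams.googlelabs.com/graph?" + urlencode(params)

        print(ngram_url(["freedom", "liberty"]))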

  7. Anonymous says:

    This isn’t news, but I’m glad to see data like this to back it up.

    “I miss nostalgia.”

    • Tdawwg says:

      The majority of those would be false fucks, as the long “s” (ſ) used in English printing for a really long time looks just like a lower-case “f”, and the scanning reads it that way. Any reader of Renaissance English will surely remember the first time they read Ariel’s lovely “Where the bee fucks, there fuck I,” or a similar schoolboyishly funny incident. So not only is the dataset skewed, as posters have noted above, but they haven’t really done the work of refining said data: they’re going by the surface form of the word, not parsing hits into individual lexical-semantic units, the way a lexicographer would. (Thus, you can’t find true “fuck” among the welter of false positives in, say, the eighteenth century, unless you already know which authors or texts to look for: Lord Rochester, say.)

      Louis Menand noted in the NY Times that there wasn’t a single humanist attached to the project, and it shows in simple boners such as these. A lot of promise here, though, and I really hope they’ll craft tools to perform this kind of function, or find some way to introduce human cognition to the process: imagine the human-power used to produce the OED harnessed to this kind of data and computing power. “Holee shit,” as we say in the humanities….

  8. awjtawjt says:

    I’m having trouble with this. Need to think about it more. At first blush, it looks like the frequency is just a raw count. So naturally, as publishing output increases, there will be more raw mentions (the numerator) simply because the total volume published (the denominator) has increased.

    I’d like to see the years mentioned as a % of the total published to that point.

    THEN, the rates of extinction for each spike could be compared.

    And adjusted for context, such as if the number appeared in a sentence or in a list of numbers. Or what genre. Or in what intra-sentence context – subject, object or what-have-you.

    Basically, I’m not satisfied with a frequency chart to explain this. A lot of things could explain these distributions, the way they are presented here in raw form.
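
    Roughly what I have in mind, with made-up inputs: mentions[y][p] would be the raw count of the year-string y in books published in year p, and total_words[p] the total words published in year p (neither structure comes from the paper; you’d have to build them from the released n-gram files):

        def normalized_frequency(mentions, total_words):
            """Convert raw counts into mentions per million words published."""
            freq = {}
            for year, by_pub_year in mentions.items():
                freq[year] = {p: 1e6 * c / total_words[p]
                              for p, c in by_pub_year.items() if total_words.get(p)}
            return freq

        def half_life(series, peak_year):
            """Years until the normalized frequency falls below half its peak value."""
            peak = series[peak_year]
            for p in sorted(y for y in series if y > peak_year):
                if series[p] < peak / 2:
                    return p - peak_year
            return None   # never decayed below half within the available data

        # Comparing, say, half_life(freq["1883"], 1883) with half_life(freq["1950"], 1950)
        # would then put numbers on the rates of extinction for each spike.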

  9. Salviati says:

    I would also suspect that decade-defining years (ending in 0) would get significantly more hits than years with any other ending. So the years 2000, 2010, 2020 would register more frequently than 2006, 2017, etc. When looking into the past, it is more common to generalize about the 1960s or 1970s than about any specific year. Alternatively, when talking about the future of technology or politics, exact dates are unknown, so it’s easier to refer to dates such as 2015, 2020, etc.
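
    A quick way to check this hunch, assuming you’ve summed the released counts into a hypothetical dict total_mentions that maps a year string (e.g. "1965") to its total count across the corpus:

        from collections import defaultdict
        from statistics import mean

        def mentions_by_last_digit(total_mentions, start=1900, end=1999):
            """Average mention count for years, grouped by their final digit."""
            buckets = defaultdict(list)
            for y in range(start, end + 1):
                count = total_mentions.get(str(y))
                if count is not None:
                    buckets[y % 10].append(count)
            return {digit: mean(counts) for digit, counts in sorted(buckets.items())}

        # If the hunch is right, the averages for digits 0 (and maybe 5) should
        # stand well above the rest.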

  10. khanhniferous says:

    Does anyone find their use of the word “culture” problematic?

    Also from the paper:
    “Reading small collections of carefully chosen works enables scholars to make powerful inferences about trends in human thought.”
    Their analysis is limited to the English language and cannot be generalized to, say, “human thought”.

    Yes, sorry for being nitpicky.

    • mpb says:

      It’s more than problematic. They are trying to replace anthropology, which admittedly seems adrift in the past few decades, as the study of culture (a means, in addition to biology, which humans use to adapt to changing environments), with something called “culturomics”. ["We introduce culturomics: the application of high-throughput data collection and analysis to the study of human culture."]

      The neologism is silly and hardly up to the task of understanding the richness of humanity.

      “Culture” has been reduced to correlation of a limited metric, by folks sponsored by a commercial monopoly.

      By the way, anthropologists, as scientists of culture, have been developing techniques like the ones depicted here for over half a century, but the old-fashioned way: manually, with Abney cards or on isolated computers, and on pittances. [Santa, are you tweeting this?]

      The dataset could be quite interesting within a larger context of study.

  11. BasilGanglia says:

    The thing I find most daunting/discouraging?

    from the paper: “The corpus cannot be read by a human. If you tried to read only the entries from the year 2000 alone, at the reasonable pace of 200 words/minute, without interruptions for food or sleep, it would take eighty years. The sequence of letters is one thousand times longer than the human genome: if you wrote it out in a straight line, it would reach to the moon and back 10 times over”
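
    For what it’s worth, the quoted figure is self-consistent; at 200 words/minute, eighty years of non-stop reading covers a bit over eight billion words:

        words_per_minute = 200
        minutes_in_80_years = 60 * 24 * 365 * 80          # no breaks for food or sleep
        print(f"{words_per_minute * minutes_in_80_years:,} words")   # 8,409,600,000 words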

  12. Anonymous says:

    As they released the data for public consumption, I wonder what would happen if you did the following:

    1) take the most popular 5-gram.
    2) find the most popular 5-gram beginning with the last 4 words of the previous 5-gram.
    3) repeat step 2.

    Would the result make any sort of sense?
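
    Roughly, assuming a hypothetical lookup best_continuation built from the released 5-gram counts, mapping each 4-word prefix (as a tuple) to the most frequent 5-gram that starts with it:

        def ngram_chain(seed_5gram, best_continuation, max_words=50):
            """Greedily extend a 5-gram by repeatedly looking up its last four words."""
            words = list(seed_5gram)
            while len(words) < max_words:
                next_5gram = best_continuation.get(tuple(words[-4:]))
                if next_5gram is None:          # no 5-gram starts with this prefix
                    break
                words.append(next_5gram[-1])    # only the fifth word is new
            return " ".join(words)

        # This is a maximum-likelihood Markov walk over the corpus, so my guess is
        # the output would read as grammatical boilerplate that quickly falls into
        # loops of stock phrases.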

  13. Anonymous says:

    I can’t search for “fuck”. It’s just a word, did they censor the database!?!?

    http://ngrams.googlelabs.com/graph?content=fuck&year_start=1700&year_end=2008&corpus=0&smoothing=3

  14. Anonymous says:

    Timewave Zero!
