GDELT, a digital news monitoring service backed by Google Jigsaw, has released a massive, open set of linking data, containing 1.78 billion links in CSV, with four fields for each link: "FromSite,ToSite,NumDays,NumLinks."
The dataset has been purged of boilerplate links from headers and footers and is intended to help researchers analyze trends in linking behavior, in service of GDELT's mission to "support new theories and descriptive understandings of the behaviors and driving forces of global-scale social systems from the micro-level of the individual through the macro-level of the entire planet."
It's 396MB compressed, or 986MB uncompressed.
One of the most useful ways to use this dataset is to sort by the "NumDays" field to rank the top outlets linking to a given site, or the top sites that a given outlet links to. Using the NumDays field allows you to rank connections based on their longevity and filter out momentary bursts (such as a major story leading an outlet to run dozens of articles linking to an outside website for several days and then never linking to that website again).
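As a rough illustration of that ranking, here is a minimal Python sketch. It assumes the CSV starts with a header row naming the four fields (FromSite, ToSite, NumDays, NumLinks); if the real file has no header, the field names would need to be passed to the reader explicitly. The function name `top_linkers` and the sample rows are made up for the example and are not taken from the GDELT data.

```python
import csv
import io


def top_linkers(csv_file, to_site, limit=10):
    """Rank the outlets linking to `to_site` by NumDays, highest first.

    Assumes a header row with FromSite, ToSite, NumDays and NumLinks columns.
    Ties on NumDays are broken by NumLinks.
    """
    reader = csv.DictReader(csv_file)
    rows = [
        (row["FromSite"], int(row["NumDays"]), int(row["NumLinks"]))
        for row in reader
        if row["ToSite"] == to_site
    ]
    rows.sort(key=lambda r: (r[1], r[2]), reverse=True)
    return rows[:limit]


# Invented sample rows, for illustration only.
SAMPLE = """FromSite,ToSite,NumDays,NumLinks
a.example,target.example,2,60
b.example,target.example,300,410
c.example,target.example,45,50
b.example,other.example,10,12
"""

ranking = top_linkers(io.StringIO(SAMPLE), "target.example")
for site, days, links in ranking:
    print(site, days, links)
```

In the sample, a.example has more links than c.example but ranks below it, because its 60 links fell on just two days. That is the "momentary burst" case that sorting by NumDays is meant to push down. For the full 986MB file, the same function can be given an open file handle instead of the in-memory sample.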
The entire dataset was created with a single line of SQL in Google BigQuery, taking just 64.9 seconds and processing 199GB.
Who Links To Whom? The 30M Edge GKG Outlink Domain Graph April 2016 To Jan 2019 – The GDELT Project [GDELT Project]
(via Naked Capitalism)