Web Trackers Paint a Fresh Picture of You

The canvas HTML element, designed to provide a space on a web page in which JavaScript may draw graphics and paint type, has been hijacked to track visits to web sites, according to a newly released draft of an academic paper. "The Web never forgets" describes the discovery in the wild of a technique, previously described only in theory, that identifies an individual browser with some certainty by its eccentricities in rendering graphics and type.

The paper also explains an updated approach in testing for certain kinds of nearly unkillable, persistent browser tracking ids known as "evercookies," and presents a fresh and large survey of the use of a variety of user- and browser-tracking techniques across popular Web sites.

The upshot? Not only are the techniques found in use years ago still employed widely on some of the Internet's highest-trafficked sites, but new methods and improvements have appeared, and there's still neither an effective way to block most of these tracking tricks nor effective regulation and enforcement to dissuade companies from pursuing them.

We can be tracked from page to page, session to session, and often site to site even when we tell companies not to and take every available measure to halt it. And it's not getting better. Sorry.

With great power, comes great responsive ability

Many of the features added to HTML5 were designed to allow better browser-side support for web apps, such as local, persistent database storage to hold documents or other data being manipulated on a page. In the past, the heavy lifting happened at a server. This was necessary when JavaScript was clunky, devices were slow, and tons of browser-specific kludges were required for basic interactivity.

But the forward march of HTML elements and requirements came in lockstep with improved compatibility and sophistication of browsers and spurred dramatic speed improvements in JavaScript even as the coding language's list of commands grew. Browsers, whether on mobile or desktop, can carry out so much more with so much less effort than just a matter of years ago, and with much less latency than continuous communication with servers. Smart web sites have shifted the computational burden to browsers, which lets them do cooler or more useful things with less server power, even as the price of data center hardware and virtual machines has plummeted. (Apple, Google, and Microsoft all have wanted and still want to break some of the hegemony of each other's locked-in desktop and mobile markets; providing consistent and fast web apps powered by better web browsers was part of that, while native apps for other companies' platforms are another.)

More sophistication brings with it a long-predicted price in privacy, about which this latest paper reveals more. The greater the power and flexibility of any given option for data to be stored or created in a browser, the more likely it can be used to uniquely identify a browser, if not an individual. Browser makers typically remain neutral or minimize the privacy issues around new features that have the potential to push information into browsers or identify them uniquely. Even in cases where they are not neutral, the technology may be too powerful to avoid being subverted for tracking.

In the case of canvas, the element defines via HTML a region in which JavaScript can draw graphics primitives (shapes and lines) and type to produce unique, dynamically created local graphics for customizing a site's appearance, playing a game, or building data visualizations (charts, graphs, heatmaps, and the like). It also allows rich image and text composition, collage, and production without requiring locally installed image-manipulation software and without server involvement.

It's a partial replacement for features found in Flash, and as implemented a significantly more powerful and standardized method than anything previously widely available without a plug-in or on more than one or two browser platforms. About 85% of users worldwide have browsers capable of rendering text onto a canvas.

Canvas joins Flash cookies, HTML5 Session Storage, ETags, and many other tools as mechanisms to follow us around. Some were built to provide state to a medium designed originally to be stateless and storage for a medium designed to rely on servers; others were innocent bystander components that were just minding their business.

Follow the cookie monster's money

Tracking a browser or a user across browsers is most obviously done with logins and regular browser cookies. Visit a site or log in to it, and various identifying information is typically stashed in cookie storage in a browser. Whenever the browser requests a file of any kind from a server at a domain that matches the cookie's domain or domain wildcard, the browser sends the cookie back.
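
That matching rule can be sketched in a few lines of Python. This is a simplified illustration, not any browser's actual code: the function name and the sample hosts are invented here, and a real browser also checks the cookie's path, expiry, and security flags before sending it.

```python
def domain_matches(request_host: str, cookie_domain: str) -> bool:
    """Return True if a cookie set for cookie_domain would go to request_host.

    Simplified: ignores path, expiry, Secure/HttpOnly flags, and IP addresses.
    """
    host = request_host.lower().rstrip(".")
    domain = cookie_domain.lower().lstrip(".").rstrip(".")
    # Either an exact match, or the host is a subdomain of the cookie's domain.
    return host == domain or host.endswith("." + domain)


# Hypothetical hosts, for illustration only.
print(domain_matches("ads.tracker.example", ".tracker.example"))  # True
print(domain_matches("tracker.example", ".tracker.example"))      # True
print(domain_matches("nottracker.example", ".tracker.example"))   # False
```

The last case is the one that matters: a lookalike host doesn't receive the cookie, because the match has to fall on a dot boundary.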

But cookies can be easily blocked or, if required to work with a site, deleted: browsers often include primitive controls and privacy modes in which cookies are deleted when an incognito session ends; third-party software can selectively block known advertising sites and cookies or, on demand, crush cookies left behind.

This makes marketers terribly unhappy, because allegedly the more information they know about you over time, the more carefully they can target advertising, which allows their customers paying for ads to produce a better conversion rate into purchases of goods and services. Further, maintaining identity over time allows better understanding of the lifecycle of someone's decision making from seeing information about a thing to consummating a sale or action (such as signing up for and reading an email list) to cancelling or switching to a competitor.

That's all very well and good with our consent. A battle raging now for several years is whether or not browsers, tied in with the Do Not Track HTTP header extension, should assume a user wants to be tracked, does not want to be tracked, or has expressed no opinion at all, and whether and how the advertising and tracking industries should honor that preference when expressed as "no." Many ad networks and sites offer some way to opt out a browser or an identity from specific tracking data being used, but most still collect the data and claim to anonymize it for aggregated metrics; this paper provides some statistics on this topic as well.

Persistent, hidden cookies subvert the entire area of discussion, however. Ashkan Soltani, a security and privacy researcher with a deep history of exposing tracking methods (and a one-time FTC employee), says that with regard to a regulatory or technical approach for controlling tracking of users, "the pushback was that consumers had choice and could always opt-out." But in practice that hasn't been the case. "There's been a number of studies (including a few I've done) demonstrating the inability for consumers to opt-out of tracking."

Soltani notes that the Internet's economics are heavily driven by specific metrics, such as publishers being paid based on unique visitors and impressions. "There's a huge incentive to make sure you're identifying (i.e. cookie-ing/fingerprinting) each individual user in order to have an accurate count (and subsequently accurate dollar amount)," he says.

This incentive has led to a combination of testing and deployment of ways to track users even when they use every single tool at their disposal to prevent such snooping. Sites and ad networks spread a tracking id and replicate it across every nook and cranny that they can find in a browser and plug-ins like Flash. In 2010, Samy Kamkar wrote demonstration code he dubbed "evercookie" that would stash values in every possible location and automatically "respawn" the cookie (retrieve it and push it back into the browser cookie stash) when it was deleted.
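
The respawning idea is simple enough to simulate. What follows is a toy model, not Kamkar's code: each Python dict stands in for one browser-side storage location, and the store names, key, and ID values are all made up for illustration.

```python
from typing import Dict, Optional

# Each inner dict is a hypothetical stand-in for one storage location;
# a real evercookie writes to actual browser and plug-in storage.
Stores = Dict[str, Dict[str, str]]

KEY = "tracking_id"


def find_surviving_id(stores: Stores) -> Optional[str]:
    """Return the first copy of the ID still present in any store."""
    for store in stores.values():
        if KEY in store:
            return store[KEY]
    return None


def respawn(stores: Stores, fresh_id: str) -> str:
    """Copy a surviving ID (or, failing that, a fresh one) into every store."""
    tracking_id = find_surviving_id(stores) or fresh_id
    for store in stores.values():
        store[KEY] = tracking_id
    return tracking_id


stores: Stores = {"http_cookie": {}, "flash_lso": {}, "local_storage": {}, "indexed_db": {}}
respawn(stores, fresh_id="abc123")   # first visit: ID planted everywhere
del stores["http_cookie"][KEY]       # user clears browser cookies
del stores["local_storage"][KEY]     # ...and localStorage too
respawn(stores, fresh_id="zzz999")   # next visit: the old ID comes back
print(stores["http_cookie"][KEY])    # abc123
```

As long as a single copy survives anywhere, the old identifier wins over the fresh one, which is why deleting cookies alone accomplishes so little.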

The evercookie has entered common parlance as a term because it showed us just how vulnerable browsers are to such tracking, and how nearly impossible it is, even four years later, to block persistent identity. And that's where canvas now comes in.

Draw me like one of your French URLs

Canvas drawing isn't pixel identical across every instance of every version of every browser on every platform. It can be close, but any two browsers seemingly produce a slightly different outcome. As the paper's authors put it, "The same text can be rendered in different ways on different computers depending on the operating system, font library, graphics card, graphics driver and the browser."

The researchers from KU Leuven in Belgium and Princeton University found multiple versions of code relying on this browser variation in use on 5% of the top 100,000 sites worldwide as ranked by Alexa Internet. Most of the usage was from a single service, AddThis, which adds social-media and other sharing buttons to a site through an external JavaScript library reference and tiny bits of code. (I use AddThis on web pages for my publication, The Magazine.)

The code is based on a common, MIT-licensed open-source library called fingerprintjs. It uses JavaScript to render a box, a bit of color, and, with AddThis's variant, a Unicode character and other tests that help add variation or entropy. The resulting rendered image's binary data is base64-encoded (a representation of binary data as a subset of ASCII text characters) and then run through a hashing algorithm to produce a 32-bit value that has a high probability of being unique even for nearly pixel-identical images. That hash is a fingerprint, and is sent by the script in combination with other data, such as which fonts are installed or the browser ID string, that helps create a more distinct identity.
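
The encode-then-hash step can be imitated outside a browser. The Python sketch below is an illustration under loud assumptions: the pixel buffers are invented rather than read from a real canvas, and zlib.crc32 is swapped in as a convenient 32-bit function from the standard library; it is not the hash fingerprintjs uses.

```python
import base64
import zlib


def fingerprint(pixels: bytes) -> int:
    """Base64-encode rendered pixel bytes, then reduce them to a 32-bit value.

    zlib.crc32 is a stand-in; it is not the hash fingerprintjs uses.
    """
    encoded = base64.b64encode(pixels)        # binary -> subset of ASCII
    return zlib.crc32(encoded) & 0xFFFFFFFF   # 32-bit unsigned integer


# Two made-up RGBA buffers standing in for the "same" drawing on two machines.
machine_a = bytes([255, 102, 0, 255] * 64)
machine_b = bytearray(machine_a)
machine_b[10] ^= 1  # a single bit differs, as slightly different anti-aliasing might cause

fp_a = fingerprint(machine_a)
fp_b = fingerprint(bytes(machine_b))
print(f"{fp_a:08x} {fp_b:08x}")
print(fp_a != fp_b)  # True
```

One flipped bit in one color channel is invisible to the eye but yields a different 32-bit value, and the same machine yields the same value on every visit.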

AddThis did not respond to a request for comment, but told ProPublica that its script was part of ongoing research, used on a subset of sites on which it's deployed, and didn't provide results that were "uniquely identifying enough" to rely on. AddThis also said that it doesn't ask permission from web sites to deploy such tests and that it doesn't use the data collected at government sites for "ad targeted or personalization," but didn't disclaim such use on other sites.

Less widely mentioned in the coverage of this paper is the extent to which the researchers surveyed and tested sites' use of not just canvas but evercookies, including automated testing of respawning. They also found the use of a new vector (the IndexedDB storage option for browsers) and some methods of respawning that they haven't yet been able to determine.

Typically, respawning tests involving Flash have been limited because hand checking of results was required. However, the authors built tools to monitor Flash cookies used to respawn browser cookies (tested against 10,000 sites) and vice versa (3,000 sites). Flash cookies are particularly insidious because they can be spread to any browser that has access to Flash on the same device. Some of the top-ranked Alexa sites make use of Flash respawning; of the top ten such by their rank, nine are registered in China (one in Hong Kong) and one in Russia.

The researchers also looked at cookie synchronization, in which the same identifier is used in a tracking ID across multiple sites. They examined Alexa's top 3,000 sites and estimate based on various options and tracking that 40% (with third-party cookie use disabled) or 50% (with it enabled) of a user's browsing history across those sites has the potential to be reconstructed by backend database analysis. Sites that engage in cookie sync include DoubleClick and Amazon's CloudFront, according to the report.
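
To see why a shared identifier makes that backend reconstruction possible, consider a toy version in Python. Everything here is hypothetical: the tracker logs, the IDs, and the site names are invented, and the paper's actual analysis is far more involved.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical backend logs: (the tracker's own ID for a browser, page visited).
tracker_a_log: List[Tuple[str, str]] = [
    ("A-17", "news.example/politics"),
    ("A-17", "shop.example/shoes"),
]
tracker_b_log: List[Tuple[str, str]] = [
    ("B-90", "health.example/symptoms"),
    ("B-42", "travel.example/paris"),
]

# A sync event: tracker B learns that its B-90 is the browser tracker A calls A-17.
sync_table: Dict[str, str] = {"B-90": "A-17"}


def merged_history(*logs, sync):
    """Group page visits under one ID after applying the sync mapping."""
    history = defaultdict(list)
    for log in logs:
        for tracker_id, page in log:
            history[sync.get(tracker_id, tracker_id)].append(page)
    return dict(history)


history = merged_history(tracker_a_log, tracker_b_log, sync=sync_table)
print(history["A-17"])
# ['news.example/politics', 'shop.example/shoes', 'health.example/symptoms']
```

A single mapping entry is enough to fold one tracker's records into another's, while the unsynced browser (B-42) stays separate.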

Panopticookie

The worst news from the report is how widely these insidious techniques have been deployed and how they have been extended. There is little legitimate purpose for almost all of these respawning and tracking methods beyond subverting the intent of users. If the intent were legitimate, they would be less persistent and less aggressive, with more disclosure when they respawned.

Canvas fingerprinting can be defeated by not allowing JavaScript to read image data that it has created, an option that's part of the Tor Browser. Allowing case-by-case access on trusted sites in which one is using some form of graphics-based interaction might make sense, and browser makers and plug-in designers could add these options for those concerned.

The paper notes, however, that "tracking vectors such as localStorage, IndexedDB and canvas cannot be disabled, often due to the fact that doing so would break core functionality."

Thus one has to choose between the full-featured web of today and a more limited version. And, even when a user makes every possible choice and engages every mechanism, the tools may be inadequate to prevent all tracking, and new exploits could pop up tomorrow that current knowledge and third-party software can't address.

Soltani is convinced after his time at the FTC (alongside Chris Soghoian, now at the ACLU, one of the Do Not Track concept's instigators) that "consumers will ultimately lose an arms race that's technology based." He worked to create a policy approach that would override any technological innovation, but "this has been mired in DC lobbying and not really progressed to be an effective mechanism."

Yet Soltani hasn't given up. He says, "A strong public debate is necessary and each subsequent news story or academic article on this topic will help inform that debate." Keeping the pressure up on companies that are found to engage in subterfuge for tracking purposes can help as well, but there is no consistent surveying done by any privacy-advocacy parties.

Soltani notes, "For every one or two academic publications, there are dozens of new methods/techniques to monetize users' Internet activity." Perhaps that's an area in which to place hope and support new research: automated and regular examination of tracking techniques would cast more light on the practices as they develop and expand.