Collecting user data is a competitive disadvantage

Warren Buffett is famous for identifying the need for businesses to have "moats" and "walls" around their profit centers to keep competitors out. Data-centric companies often cite their massive collections of user data as "moats" that benefit from "network effects" and so make their businesses good investments.

In a smart, eye-opening essay, Martin Casado and Peter Lauten from the VC firm Andreessen Horowitz dismantle the idea that data benefits from "network effects" and that it presents any kind of "moat" to protect businesses. Instead, the VCs demonstrate how collecting data gets more expensive, and less useful, over time.

To understand why, think of Netflix's data collection, performed in service of its famous recommendation engine, which suggests programs you might enjoy based on the preferences of people who are similar to you. When Netflix is starting out, it needs to develop a "minimum viable corpus" in order to produce recommendations, but once that data is in place, new data produces diminishing returns in recommendations. Going from 100 to 1,000,000 users allows Netflix to dramatically improve its recommendations, but going from 1,000,000 to 1,000,100 (or even 2,000,000) produces very little new benefit.
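To put rough numbers on that intuition, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the saturating curve, the function names `recommendation_quality` and `marginal_gain`, and the `scale` parameter are hypothetical, and none of it describes how Netflix actually measures its recommendations.

```python
import math


def recommendation_quality(users, scale=100_000):
    """Toy saturating curve: quality climbs toward 1.0 as the user count grows.

    Hypothetical model chosen only because it flattens out; `scale` sets
    roughly how many users it takes to reach a "minimum viable corpus".
    """
    return 1.0 - math.exp(-users / scale)


def marginal_gain(users_before, users_after, scale=100_000):
    """Improvement in the toy quality score from growing the user base."""
    return (recommendation_quality(users_after, scale)
            - recommendation_quality(users_before, scale))


# The three jumps mentioned in the text.
early_gain = marginal_gain(100, 1_000_000)
late_small_gain = marginal_gain(1_000_000, 1_000_100)
late_double_gain = marginal_gain(1_000_000, 2_000_000)

if __name__ == "__main__":
    print(f"100 -> 1,000,000 users:       {early_gain:.6f}")
    print(f"1,000,000 -> 1,000,100 users: {late_small_gain:.10f}")
    print(f"1,000,000 -> 2,000,000 users: {late_double_gain:.6f}")
```

Under these assumptions, nearly all of the improvement arrives in the first jump, and even doubling the user base afterwards barely moves the score. A different curve would change the numbers, but any curve that flattens out tells the same story.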

Meanwhile, adding in all that data is expensive. First, once everyone who already understands why they might subscribe to Netflix is a customer, Netflix has a much harder job of convincing the remaining population that it's worth their while to join up (its "cost of user acquisition" goes up). Second, the computational costs of incorporating new data into a prediction model don't necessarily go down with volume, so the cost of recomputing the model when you add your 1,000,001st user isn't necessarily lower than when you add your 101st user (it might even be higher). Since each new user adds less value to the model than previous users did, the real costs of new users' data (relative to the benefits) are constantly going up. Add to that the other costs associated with new data: accurate labeling, more noise to lose the signal in, higher security costs…

And if that weren't bad enough, the advantage your data brings you declines over time as the data gets stale (video tastes, traffic patterns, and other sources of business advantage change with time). Meanwhile, your competitors are able to assemble their own "minimum viable corpuses," further eroding that advantage.

As you gather data, the data also tends to become less valuable to add to the corpus. Why? Even if the new arbitrary batch of data has the same cost to collect as the last batch acquired, it yields less value given some of the new data you acquire already overlaps with your existing corpus. And this only gets worse over time: Benefits of new data go down.

In most of the startups we've seen, new data early on applies to the entire customer base. But beyond a certain point — such as the asymptote in the example graph above — new data collected will only apply to the small subsets that lie in the "long tail" of special use cases. As such, any data scale effect moats also become less valuable as the data set gets expanded.
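The overlap point above can be made concrete with a back-of-the-envelope model. This is only a sketch under loud assumptions: it pretends every record is drawn uniformly at random, with replacement, from a fixed pool of distinct items, and the function name `expected_new_items` and the pool and batch sizes are hypothetical. Real data is far messier, and nothing here comes from the essay itself.

```python
def expected_new_items(already_collected, batch_size, pool_size):
    """Expected count of never-before-seen items in the next batch.

    Assumes uniform random draws with replacement from `pool_size`
    distinct items (a deliberate simplification).
    """
    # Probability that one particular item is missed by a single draw.
    miss = 1.0 - 1.0 / pool_size
    # Expected number of items still unseen after the earlier draws.
    still_unseen = pool_size * miss ** already_collected
    # Each unseen item shows up in the new batch with probability 1 - miss**batch_size.
    return still_unseen * (1.0 - miss ** batch_size)


# Same batch size (so, roughly, same collection cost) at three stages of growth.
POOL_SIZE = 10_000
BATCH_SIZE = 1_000
new_items_by_stage = [
    expected_new_items(collected, BATCH_SIZE, POOL_SIZE)
    for collected in (0, 10_000, 50_000)
]

if __name__ == "__main__":
    for collected, new in zip((0, 10_000, 50_000), new_items_by_stage):
        print(f"after {collected:>6} records, a batch of {BATCH_SIZE} "
              f"adds about {new:.1f} new items")
```

In this toy model each batch costs the same to gather, yet the number of genuinely new items it contributes shrinks as the corpus grows, which is the "benefits of new data go down" effect in miniature.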

The Empty Promise of Data Moats [Martin Casado and Peter Lauten/Andreessen Horowitz]

(via Four Short Links)

(Image: Rose and Trev Clough, CC-BY-SA)