Wiki-inspired "transparent" search-engine

Wikia Search is a new, wiki-inspired search-engine project that attempts to create a transparent set of ranking algorithms that fight spam and promote the good stuff to the top. This is in contrast to Google, Yahoo, and other search engines, where the ranking algorithms are treated as trade secrets and high-security risks that have to be guarded from spammers.

The idea of a ranking algorithm is that it produces "good results" -- returns the best, most relevant results based on the user's search terms. We have a notion that the traditional search engine algorithm is "neutral" -- that it lacks an editorial bias and simply works to fulfill some mathematical destiny, embodying some Platonic ideal of "relevance." Compare this to an "inorganic" paid search result of the sort that Altavista used to sell.

But ranking algorithms are editorial: they embody the biases, hopes, beliefs and hypotheses of the programmers who write and design them. What's more, a tiny handful of search engines effectively control the prominence and viability of the majority of the information in the world.

And those search engines use secret ranking systems to systematically and secretly block enormous swaths of information on the grounds that it is spam, malware, or using deceptive "optimization" techniques. The list of block-ees is never published, nor are the criteria for blocking. This is done in the name of security, on the grounds that spammers and malware hackers are slowed down by the secrecy.

But "security through obscurity" is widely discredited in information security circles. Obscurity stops dumb attackers from getting through, but it lets the smart attackers clobber you because the smart defenders can't see how your system works and point out its flaws.

Seen in this light, it's positively bizarre: a few companies' secret editorial criteria are used to control what information we see, and those companies defend their secrecy in the name of security-through-obscurity? Yikes!

The Wikia Search project has assembled the basic technologies for a search engine, including a search application, search algorithm and Web crawler. The project will allow technology enthusiasts to help filter sites and rank search results, using a community model akin to that of Wikipedia.

The idea is to challenge the established players by offering a search service that is more transparent to end users, meaning they can see how search results are arrived at. Wales has described Yahoo and Google as opaque services that don't explain how results are arrived at.

Link (via /.)

(Disclosure: Jimmy Wales and I are writing a book together about a related subject)



  1. There’s a problem with making the page ranking algorithm transparent. The problem is that there are vast numbers of people who want to make money by getting you to visit particular sites, even though those sites are definitely not what you intend to visit. If they can do it, they will fool the search engines into ranking their clients’ sites highly for queries that have nothing to do, really, with anything you care about.

    So we’re talking about a game theory problem, in essence. The search engine confronts a hostile party.

    Keeping at least some aspects of the algorithm hidden is one defense, but it gives too much power to people like Google. Other defenses include changing the algorithm rapidly to keep defeating attacks on it. But if the Wikipedia model is used, then some of the volunteer developers may in fact be “traitors”, all set to cash in when the algorithm is updated.

    This doesn’t mean it can’t be done. It just means it’s going to be damned hard to get right.

  2. But “security through obscurity” is widely discredited in information security circles. Obscurity stops dumb attackers from getting through, but it lets the smart attackers clobber you because the smart defenders can’t see how your system works and point out its flaws.

    Sure. But it’s not obvious that this is relevant here.

    Cryptography algorithms are designed to turn meaningful data into something that’s indistinguishable from noise unless you have the necessary data (e.g., a private key) to interpret it.

    Search engines are designed to take a set of meaningful criteria (including the text of your query) and return a set of results; most of them also associate an ordering with this set. This set, and its ordering, should be ones that the vast majority of the users will find relevant and reasonable. (Personalized search is another can of worms that I won’t open here.)

    That is, the output of a search engine is designed to be as _transparently obvious_ as possible.

    To the extent that a page’s relevance and ranking depend on properties that are easily manipulable by third parties, you’re kind of screwed.

    Now, Google, Yahoo, Microsoft, and the rest are almost certainly doing their best to arrange matters so that the inputs to their algorithms are not something that can be manipulated (in a bad way) easily if at all, or at least in such a way that manipulation is obvious and possible to circumvent. (I am familiar with some of the research in this area.) But the means that criminals have to screw with these algorithms are the same ones that genuine users and contributors of data (i.e., creators of links) have to improve things in the first place, so you have to be very careful about locking things down.

    To put it another way: if your system has inherent flaws that are a function of the problem you’re trying to solve, then sometimes security through obscurity may be the best you can do.

    As a practical matter, I’d guess that the relevance and ranking methods are undergoing constant and rapid metamorphosis to both promote good results and combat (perceived) manipulation…so I could easily imagine that keeping up with the changes (to examine them for problems) would be tricky at best.

    Now, it’s possible that search engines could publish some parts of their algorithms for external review. But…

    …getting back to that can of worms that we mentioned earlier: the “correctness” of relevance and ranking algorithms is subjective by definition. You need a broad spectrum of users (and usage data) in order to be able to measure how well the algorithms are doing. It’s not clear that third-party basement hackers would be able to help much…but third-party criminals might be given a major bonanza.

    Finally, the relevance/ranking algorithms are a large part of the IP upon which companies like Google and Yahoo (and to a lesser extent MS) are based. Granted, knowing Google’s algorithms wouldn’t give you access to their server farms (or their collected data)…but releasing them would basically hand Google’s competitors a gun with which to shoot them.

  3. I thought the main reason for the lack of transparency among search engines was to do with IP protection. Google does not publish its algorithms so Yahoo cannot use them, and vice versa. The argument of security might be used, but the real important one is competition.

  4. Every few years there’s a new search engine, but what really matters is coverage. Even if it starts out indexing half the pages Google does, that means it’s going to have on average half the hits, which is a huge difference for queries that return just a few hits.

  5. Cory said, “But ranking algorithms are editorial…”

    I don’t know how that can ever be avoided. As long as information is being filtered (and we know it has to be filtered, just like water), we have to depend on someone’s sensibilities. Even if that someone is an emergent AI, it’s still going to use a program that reflects some value system that knows how to keep out most of the crap. I wouldn’t want my emergent AI to be open source for just anyone to look at. These things take too much human effort to just be given away. Data needs to be worth something. Information Wants to be Free. Let’s be glad it’s not.

  6. I hope I’m wrong, but it looks like we can’t yet try it out? BB.n only linked to the article, and all that Google turns up is a page which talks about the project, but doesn’t give a search box.

  7. @2: haha, you had the exact same insight I did (which doesn’t surprise me — hi, btw!)

    To put it another way: an encryption algorithm is considered cryptographically secure when it is equally hard to figure out the plaintext whether you know how the algorithm works or not. Applied to search, this would mean that it’s equally hard for an attacker to inject irrelevant data (pages) into the search results whether the attacker knows how the ranking algorithm works or not.

    These are two very, very different problems that can naively be described as the inverse of one another. Encryption is about turning a meaningful signal into a signal that’s indistinguishable from noise; search is about filtering the most meaningful signals out of a universe of noise. We’ve been rather successful at systematically turning signal into noise for the last few decades now; extracting signal from noise is a much harder problem. Spammy search results are another type of noise, but they’re noise that’s carefully crafted to look like signal. So it turns into an arms race: signal-extractors develop methods to find signals in noise, fake-signal-injectors develop methods to make their noise look like signal, signal-extractors develop ways to identify fake signals as noise, goto step 2.

    And it really doesn’t help the situation that extracting semantic content from text is a problem that’s almost definitely well past the Turing boundary; AI-hard problems are, well, hard. Mechanical Turks are useful for this kind of thing, but do they scale? I guess we’ll see.

  8. …HAH! If it works anything like the way Wikipedia currently works, it’ll be a clusterfrack just as soon as the little Wikipedos and their ilk discover it and start adapting their little powertrip games to it.

  9. The problem with “anti-spam through obscurity” is that search algorithms can’t be entirely counter-intuitive. If they were, they couldn’t generate sane results. Ergo, they have some predictable features, and a relatively simple black-box structure that must be reasonably susceptible to empirical analysis.

    The problem is not dissimilar to that of DRM via encryption: the attacker has complete access to what the algorithm does, even though they are ignorant of (some parts of) how the algorithm does it. From a spammer’s point of view, this is sufficient information.

    Legitimate webmasters need to know how to set up sites that get decent responses in search results. For most modern search engines this is very simple: have large amounts of raw text that is relevant to the topic your site is concerned with. I once ran a small company selling gene expression analysis software, and our site was number one on Google within a fairly short time of going live if you searched for “gene expression analysis software”, simply because we had lots and lots of words on the topic.

    For a spammer, the problem is: create a site (with minimum effort) that looks as much as possible like a legitimate site while at the same time containing lots of ads or links to whatever is the actual revenue-generating part of the operation.

    This is a simple problem to solve in almost all cases, and search engines have the problem that on the one hand the site itself is the best source of information about its relevance to a given search, and on the other hand none of the information the site supplies about itself can be trusted.

    Externally based ranking algorithms, which is where Google got its initial value, only solve the problem for a while, because websites are cheap and spammers resourceful.

    It seems to me that this problem is ultimately going to result in relatively closed communities of trusted websites that will be used as an anchor for judging other sites’ quality. Gaining legitimacy relative to one of those communities will be necessary for a good search ranking.

    Back in the day there were search engines and there were directory services. If what I’m suggesting in the above paragraph comes to pass, the communities will be like amorphous, searchable directories, giving some of the reliability of the rigid directories of the days of yore with some of the depth that search engines would give us if they could deal with the problem of co-evolving parasites.
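The "trusted anchor" idea in the comment above can be sketched concretely. The following is a minimal, illustrative trust-propagation pass in the spirit of that proposal: a hand-picked seed set of trusted sites gets initial trust, which then flows along outbound links, so pages far from any trusted community accumulate very little. All names, the damping constant, and the iteration count here are invented for illustration, not anyone's actual algorithm.

```python
def propagate_trust(links, seeds, damping=0.85, rounds=20):
    """Spread trust from a seed set of trusted pages along outbound links.

    links: dict mapping each page to the list of pages it links to.
    seeds: set of hand-vetted trusted pages.
    Returns a dict of trust scores (higher = closer to the trusted community).
    """
    # Seeds start with all the trust, split evenly; everyone else starts at zero.
    trust = {page: (1.0 / len(seeds) if page in seeds else 0.0)
             for page in links}
    for _ in range(rounds):
        incoming = {page: 0.0 for page in links}
        for page, outlinks in links.items():
            if not outlinks:
                continue
            # A page passes its trust to its link targets in equal shares.
            share = trust[page] / len(outlinks)
            for target in outlinks:
                if target in incoming:
                    incoming[target] += share
        # A fraction of trust always resets to the seed set (not uniformly),
        # so trust decays with distance from the trusted anchors.
        trust = {page: (1 - damping) * (1.0 / len(seeds) if page in seeds else 0.0)
                       + damping * incoming[page]
                 for page in links}
    return trust
```

On a tiny graph where a spam page links *to* a trusted page but nothing trusted links back, the spam page ends up with essentially no trust, which is exactly the asymmetry the commenter is counting on: inbound links from the community matter, outbound flattery doesn't.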

  10. To supplement my comments (@2) above a bit…

    First, I think that an open-source search service is a nice idea. (Lucene, as an open-source search _engine_, has been around for some time.)

    Second, I like the notion of using user feedback on relevance and ranking results as an additional data source to improve the algorithms. This is actually a completely separate notion from the idea of making the algorithms themselves open to review, and should be judged separately.

    It could be that the characterization of what Wikia’s going to do as providing a more “transparent” search service is misleading. But if we assume that they are actually going to make the algorithms themselves available (that is, world-readable if not world-writeable), I offer here an excerpt from some comments that I made on this proposal elsewhere:

    Making the algorithms open doesn’t mean that any flaws will be fixed quickly; it just means that they’ll be _found_ (more) quickly. Part of the problem here is that it’s a lot harder to make a good fix to an algorithm [especially in this space!] than it is to correct an historical article. For one thing, as I mentioned on BB, random Wikia users won’t have access to the data that informs the algorithm design; all they’ll be able to see is the algorithm and the resultant rankings. Nor, I suspect, will Wikia be letting just anyone _edit_ their algorithms, unless they’re complete idiots.

    So, yeah. I guess I’ll reserve judgement until I see what they’re actually proposing. Wikia will have one advantage: at least initially, they might not have to worry about third parties gaming their system because it’s not worth the effort (at least in terms of monetary gain) to corrupt a system that no one’s using yet.
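The "user feedback as an additional data source" idea in comment 10 can be made concrete with a toy reranker: blend each result's base relevance score with a smoothed click-through rate. This is only a sketch; the function name, the blending weight, and the Laplace smoothing constants are invented for illustration and are not how any named engine actually works.

```python
def rerank_with_feedback(results, clicks, impressions, weight=0.5):
    """Blend base relevance scores with observed user click feedback.

    results: list of (url, base_score) pairs from the ranking algorithm.
    clicks, impressions: dicts of per-url counts from usage logs.
    """
    def feedback_score(url):
        # Laplace-smoothed click-through rate, so pages with no
        # impressions yet aren't zeroed out entirely.
        return (clicks.get(url, 0) + 1) / (impressions.get(url, 0) + 2)

    adjusted = [(url, (1 - weight) * base + weight * feedback_score(url))
                for url, base in results]
    return sorted(adjusted, key=lambda pair: pair[1], reverse=True)
```

For example, a page the algorithm scores slightly lower but that users consistently click can overtake a page users consistently skip, which is the separate, judge-it-on-its-own-merits notion the comment describes.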

  11. First, I think that an open-source search service is a nice idea. (Lucene, as an open-source search _engine_, has been around for some time.)

    or YaCy.

  12. In terms of the algorithms having their own bias, I’ve written about that for years. And those in the search space understand it well, too. General public? Well, considering Google had to do a special box explaining why they didn’t hate George W. Bush despite ranking him tops for miserable failure — since so many people thought they did it on purpose — I suspect plenty of people think there are humans behind the scenes at Google pushing buttons. But mainly, when you say “we have a notion,” I’m not “we” nor are many people I know — so please qualify and say perhaps “there is a notion.”

    As for those secret penalties: um, not quite. Lots of the criteria for why you might be blocked are published in search engine help files. And give Google – which is Jimmy Wales’s big PR scary bear in all this – some credit for actually notifying many webmasters about issues that might be wrong with their sites. They set up a formal program for this last year.

    But agreed totally that the security-through-obscurity critique could apply to search as well – plus there are real issues with not everyone knowing whether they are blocked. But let’s talk to Jimmy when (1) he gets enough traffic to become a target and (2) he discovers what it’s like to have someone throw a million pages of spam into your index with no effort. If it’s really not so much a problem, why does Wikipedia need to nofollow outbound links? Surely there should be security through all the people just watching stuff, right? Or is it that, yep, Wikipedia gets slammed hard? Also, Cory — um, it’s not like the search engineers at the major players don’t talk to the smart people who break in.

    Finally, when the first thing listed on Wikia Search is this pledge of transparency, why is it that only the selected few Jimmy picks get in there right now? Why isn’t it completely open? Why is there no news posted in the News & Notes section that it is even happening in a few days, much less that you can request an invite now?

  13. I think this approach is complementary to the automatic algorithms used by the current big search engines.

    Reputation of people is one reason this hybrid (people + computers) method can produce good-quality search.

    Wikipedia items are already quite often near the top of automatic search results. This is because somebody has put significant effort into preparing such pages, so they are better.

    For the same reason, the content of books is usually better than that of newspaper articles. The name of the author is on the book (transparent), and so is her/his reputation. More effort = better results.

    In fact, Google is already preparing a similar human-edited system (Knol).

    I think this is a first wave of Web 3.0, or “Smart Web”.

    Dragan Sretenovic

  14. However the engine works, contributors have already decided that ‘mini article’ = ‘advertising opportunity’. I clicked the ‘random mini article’ link a couple of dozen times and around a third of the articles were pushing a product or service. When Wikia goes final, some real editorial input will be needed to maintain their ‘no spam’ rules…
