<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Wiki-inspired &quot;transparent&quot;&#160;search-engine</title>
	<atom:link href="http://boingboing.net/2008/01/01/wikiinspired-transpa.html/feed" rel="self" type="application/rss+xml" />
	<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html</link>
	<description>Brain candy for Happy Mutants</description>
	<lastBuildDate>Tue, 21 May 2013 08:34:00 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.4.1</generator>
	<item>
		<title>By: Antonio Silva</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99082</link>
		<dc:creator>Antonio Silva</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99082</guid>
		<description>I thought the main reason for the lack of transparency among search engines was to do with IP protection. Google does not publish its algorithms so Yahoo cannot use them, and vice versa. The argument of security might be used, but the really important one is competition.</description>
		<content:encoded><![CDATA[<p>I thought the main reason for the lack of transparency among search engines was to do with IP protection. Google does not publish its algorithms so Yahoo cannot use them, and vice versa. The argument of security might be used, but the really important one is competition.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: zuzu</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99342</link>
		<dc:creator>zuzu</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99342</guid>
		<description>&lt;blockquote&gt;First, I think that an open-source search service is a nice idea. (Lucene, as an open-source search _engine_, has been around for some time.)&lt;/blockquote&gt;
or &lt;a href=&quot;http://en.wikipedia.org/wiki/YaCy&quot;&gt;YaCy&lt;/a&gt;.</description>
		<content:encoded><![CDATA[<blockquote><p>First, I think that an open-source search service is a nice idea. (Lucene, as an open-source search _engine_, has been around for some time.)</p></blockquote>
<p>or <a href="http://en.wikipedia.org/wiki/YaCy">YaCy</a>.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: js7a</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99102</link>
		<dc:creator>js7a</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99102</guid>
		<description>Every few years there&#039;s a new search engine, but what really matters is coverage.  Even if it starts out indexing half the pages Google does, that means  it&#039;s going to have on average half the hits, which is a huge difference in the queries returning just a few hits.</description>
		<content:encoded><![CDATA[<p>Every few years there&#8217;s a new search engine, but what really matters is coverage.  Even if it starts out indexing half the pages Google does, that means  it&#8217;s going to have on average half the hits, which is a huge difference in the queries returning just a few hits.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Jeff</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99111</link>
		<dc:creator>Jeff</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99111</guid>
		<description>Cory said, &quot;But ranking algorithms are editorial...&quot;

I don&#039;t know how that can ever be avoided. As long as information is being filtered (and we know it has to be filtered, just like water), we have to depend on someone&#039;s sensibilities. Even if that someone is an emergent AI, it&#039;s still going to use a program that reflects some value system that knows how to keep out most of the crap. I wouldn&#039;t want my emergent AI to be open source for just anyone to look at. These things take too much human effort to just be given away. Data needs to be worth something. Information Wants to be Free. Let&#039;s be glad it&#039;s not.</description>
		<content:encoded><![CDATA[<p>Cory said, &#8220;But ranking algorithms are editorial&#8230;&#8221;</p>
<p>I don&#8217;t know how that can ever be avoided. As long as information is being filtered (and we know it has to be filtered, just like water), we have to depend on someone&#8217;s sensibilities. Even if that someone is an emergent AI, it&#8217;s still going to use a program that reflects some value system that knows how to keep out most of the crap. I wouldn&#8217;t want my emergent AI to be open source for just anyone to look at. These things take too much human effort to just be given away. Data needs to be worth something. Information Wants to be Free. Let&#8217;s be glad it&#8217;s not.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: charliekkendo</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99134</link>
		<dc:creator>charliekkendo</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99134</guid>
		<description>Fascinating discussion. </description>
		<content:encoded><![CDATA[<p>Fascinating discussion. </p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Zachariah</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99158</link>
		<dc:creator>Zachariah</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99158</guid>
		<description>I hope I&#039;m wrong, but it looks like we can&#039;t yet try it out?  BB.n only linked to the article, and all that Google turns up is: http://search.wikia.com/ which talks about the project, but doesn&#039;t give a search box.</description>
		<content:encoded><![CDATA[<p>I hope I&#8217;m wrong, but it looks like we can&#8217;t yet try it out?  BB.n only linked to the article, and all that Google turns up is: <a href="http://search.wikia.com/" rel="nofollow">http://search.wikia.com/</a> which talks about the project, but doesn&#8217;t give a search box.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jrtom</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99165</link>
		<dc:creator>jrtom</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99165</guid>
		<description>@7: The linked article states that it&#039;s opening 7 January, I believe.</description>
		<content:encoded><![CDATA[<p>@7: The linked article states that it&#8217;s opening 7 January, I believe.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: dannysullivan</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99457</link>
		<dc:creator>dannysullivan</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99457</guid>
		<description>In terms of the algorithms having their own bias, I&#039;ve written about that for years. And those in the search space understand it well, too. General public? Well, considering Google had to do a special box explaining why they didn&#039;t hate George W. Bush despite ranking him tops for miserable failure -- since so many people thought they did it on purpose -- I suspect plenty of people think there are humans behind the scenes at Google pushing buttons. But mainly, when you say &quot;we have a notion,&quot; I&#039;m not &quot;we&quot; nor are many people I know -- so please qualify and say perhaps &quot;there is a notion.&quot;

As for those secret penalties. Um, not quite. Lots of the criteria about why you might be blocked are published in search engine help files. And give Google &#8211; which is Jimmy Wales&#039;s big PR scary bear in all this &#8211; some credit for actually notifying many webmasters about issues that might be wrong with their sites. They set up a formal program for this last year.

But agreed totally that the security through obscurity thing should potentially apply to search, as well &#8211; plus there are real issues that not everyone knows if they are blocked. But let&#039;s talk to Jimmy if (1) he gets enough traffic to become a target and (2) discovers what it&#039;s like to have someone throw a million pages of spam into your index with no effort. But if it&#039;s really not so much a problem, why does Wikipedia need to nofollow outbound links? Surely there should be security through all the people just watching stuff, right? Or is it that yep, Wikipedia gets slammed hard. Also, Cory -- um, it&#039;s not like the search engineers at the major players don&#039;t talk to the smart people who break in.

Finally, when the first thing listed on Wikia Search is this pledge of transparency, why is it only the selected few that Jimmy picks get to get in there right now? Why isn&#039;t it completely open? Why is there no news posted in the News &amp; Notes section that it is even happening in a few days, much less that you can request an invite now?</description>
		<content:encoded><![CDATA[<p>In terms of the algorithms having their own bias, I&#8217;ve written about that for years. And those in the search space understand it well, too. General public? Well, considering Google had to do a special box explaining why they didn&#8217;t hate George W. Bush despite ranking him tops for miserable failure &#8212; since so many people thought they did it on purpose &#8212; I suspect plenty of people think there are humans behind the scenes at Google pushing buttons. But mainly, when you say &#8220;we have a notion,&#8221; I&#8217;m not &#8220;we&#8221; nor are many people I know &#8212; so please qualify and say perhaps &#8220;there is a notion.&#8221;</p>
<p>As for those secret penalties. Um, not quite. Lots of the criteria about why you might be blocked are published in search engine help files. And give Google &#8211; which is Jimmy Wales&#8217;s big PR scary bear in all this &#8211; some credit for actually notifying many webmasters about issues that might be wrong with their sites. They set up a formal program for this last year.</p>
<p>But agreed totally that the security through obscurity thing should potentially apply to search, as well &#8211; plus there are real issues that not everyone knows if they are blocked. But let&#8217;s talk to Jimmy if (1) he gets enough traffic to become a target and (2) discovers what it&#8217;s like to have someone throw a million pages of spam into your index with no effort. But if it&#8217;s really not so much a problem, why does Wikipedia need to nofollow outbound links? Surely there should be security through all the people just watching stuff, right? Or is it that yep, Wikipedia gets slammed hard. Also, Cory &#8212; um, it&#8217;s not like the search engineers at the major players don&#8217;t talk to the smart people who break in.</p>
<p>Finally, when the first thing listed on Wikia Search is this pledge of transparency, why is it only the selected few that Jimmy picks get to get in there right now? Why isn&#8217;t it completely open? Why is there no news posted in the News &#038; Notes section that it is even happening in a few days, much less that you can request an invite now?</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Teresa Nielsen Hayden / Moderator</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99207</link>
		<dc:creator>Teresa Nielsen Hayden / Moderator</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99207</guid>
		<description>Great thread.</description>
		<content:encoded><![CDATA[<p>Great thread.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: mlp</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99216</link>
		<dc:creator>mlp</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99216</guid>
		<description>@2: haha, you had the exact same insight I did (which doesn&#039;t surprise me -- hi, btw!)

To put it another way: an encryption algorithm is considered cryptographically secure when it is equally hard to figure out the plaintext when you know how the algorithm works or when you don&#039;t. Applying this notion to search, this would mean that it&#039;s equally hard for an attacker to inject irrelevant data (pages) into the search results whether the attacker knows how the ranking algorithm works or not.

These are two very, very different problems that can naively be described as the inverse of one another. Encryption is about turning a meaningful signal into a signal that&#039;s indistinguishable from noise; search is about filtering the most meaningful signals out of a universe of noise. We&#039;ve been rather successful at systematically turning signal into noise for the last few decades now; extracting signal from noise is a much harder problem. Spammy search results are another type of noise, but they&#039;re noise that&#039;s carefully crafted to look like signal. So it turns into an arms race: signal-extractors develop methods to find signals in noise, fake-signal-injectors develop methods to make their noise look like signal, signal-extractors develop ways to identify fake signals as noise, goto step 2.

And it really doesn&#039;t help the situation that extracting semantic content from text is a problem that&#039;s almost definitely well past the Turing boundary; AI-hard problems are, well, hard. Mechanical Turks are useful for this kind of thing, but do they scale? I guess we&#039;ll see.</description>
		<content:encoded><![CDATA[<p>@2: haha, you had the exact same insight I did (which doesn&#8217;t surprise me &#8212; hi, btw!)</p>
<p>To put it another way: an encryption algorithm is considered cryptographically secure when it is equally hard to figure out the plaintext when you know how the algorithm works or when you don&#8217;t. Applying this notion to search, this would mean that it&#8217;s equally hard for an attacker to inject irrelevant data (pages) into the search results whether the attacker knows how the ranking algorithm works or not.</p>
<p>These are two very, very different problems that can naively be described as the inverse of one another. Encryption is about turning a meaningful signal into a signal that&#8217;s indistinguishable from noise; search is about filtering the most meaningful signals out of a universe of noise. We&#8217;ve been rather successful at systematically turning signal into noise for the last few decades now; extracting signal from noise is a much harder problem. Spammy search results are another type of noise, but they&#8217;re noise that&#8217;s carefully crafted to look like signal. So it turns into an arms race: signal-extractors develop methods to find signals in noise, fake-signal-injectors develop methods to make their noise look like signal, signal-extractors develop ways to identify fake signals as noise, goto step 2.</p>
<p>And it really doesn&#8217;t help the situation that extracting semantic content from text is a problem that&#8217;s almost definitely well past the Turing boundary; AI-hard problems are, well, hard. Mechanical Turks are useful for this kind of thing, but do they scale? I guess we&#8217;ll see.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: OM</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99219</link>
		<dc:creator>OM</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99219</guid>
		<description>...HAH! If it works anything like the way Wikipedia currently works, it&#039;ll be a clusterfrack just as soon as the little Wikipedos and their ilk discover it and start adapting their little powertrip games to it.
</description>
		<content:encoded><![CDATA[<p>&#8230;HAH! If it works anything like the way Wikipedia currently works, it&#8217;ll be a clusterfrack just as soon as the little Wikipedos and their ilk discover it and start adapting their little powertrip games to it.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: orangos</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-103829</link>
		<dc:creator>orangos</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-103829</guid>
		<description>However the engine works, contributors have already decided that &#039;mini article&#039; = &#039;advertising opportunity&#039;. I clicked the &#039;random mini article&#039; link a couple of dozen times and around a third of articles were pushing a product or service. When wikia goes final some real editorial input will be needed to maintain their &#039;no spam&#039; rules...</description>
		<content:encoded><![CDATA[<p>However the engine works, contributors have already decided that &#8216;mini article&#8217; = &#8216;advertising opportunity&#8217;. I clicked the &#8216;random mini article&#8217; link a couple of dozen times and around a third of articles were pushing a product or service. When wikia goes final some real editorial input will be needed to maintain their &#8216;no spam&#8217; rules&#8230;</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: dragansr</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-101017</link>
		<dc:creator>dragansr</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-101017</guid>
		<description>
I think this approach is complementary to automatic algorithms used by current big search engines.

Reputation of people is a reason that can help 
this hybrid (people+computers) method 
produce a good quality of search.

Wikipedia items are already
quite often near top of automatic search results.
This is because somebody has put significant
effort to prepare such pages, so they are better.

For the same reason, the content of books
is usually better than that of newspaper articles.
The name of author is on the book (transparent)
so is her/his reputation.
More effort = better results. 

In fact Google is already preparing
a similar human-edited system (knol).

I think this is a first wave of Web 3.0,
or &quot;Smart Web&quot;

Dragan Sretenovic
</description>
		<content:encoded><![CDATA[<p>I think this approach is complementary to automatic algorithms used by current big search engines.</p>
<p>Reputation of people is a reason that can help<br />
this hybrid (people+computers) method<br />
produce a good quality of search.</p>
<p>Wikipedia items are already<br />
quite often near top of automatic search results.<br />
This is because somebody has put significant<br />
effort to prepare such pages, so they are better.</p>
<p>For the same reason, the content of books<br />
is usually better than that of newspaper articles.<br />
The name of author is on the book (transparent)<br />
so is her/his reputation.<br />
More effort = better results. </p>
<p>In fact Google is already preparing<br />
a similar human-edited system (knol).</p>
<p>I think this is a first wave of Web 3.0,<br />
or &#8220;Smart Web&#8221;</p>
<p>Dragan Sretenovic</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Tom</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99247</link>
		<dc:creator>Tom</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99247</guid>
		<description>The problem with &quot;anti-spam through obscurity&quot; is that search algorithms can&#039;t be entirely counter-intuitive.  If they were, they couldn&#039;t generate sane results.  Ergo, they have some predictable features, and a relatively simple black-box structure that must be reasonably susceptible to empirical analysis.

The problem is not dissimilar to that of DRM via encryption:  the attacker has complete access to &lt;i&gt;what&lt;/i&gt; the algorithm does, even though they are ignorant of (some parts of) how the algorithm does it.  From a spammer&#039;s point of view, this is sufficient information.

Legitimate webmasters need to know how to set up sites that get decent responses in search results.  For most modern search engines this is very simple:  have large amounts of raw text that is relevant to the topic your site is concerned with.  I once ran a small company selling gene expression analysis software, and our site was number one on Google within a fairly short time of going live if you searched for &quot;gene expression analysis software&quot;, simply because we had lots and lots of words on the topic.

For a spammer, the problem is:  create a site (with minimum effort) that looks as much as possible like a legitimate site while at the same time containing lots of ads or links to whatever is the actual revenue-generating part of the operation.

This is a simple problem to solve in almost all cases, and search engines have the problem that on the one hand the site itself is the best source of information about its relevance to a given search, and on the other hand none of the information the site supplies about itself can be trusted.

Externally based ranking algorithms, which is where Google got its initial value, only solve the problem for a while, because websites are cheap and spammers resourceful.

It seems to me that this problem is ultimately going to result in relatively closed communities of trusted websites that will be used as an anchor for judging other sites&#039; quality.  Gaining legitimacy relative to one of those communities will be necessary for a good search ranking.

Back in the day there were search engines and there were directory services.  If what I&#039;m suggesting in the above paragraph comes to pass, the communities will be like amorphous, searchable directories, giving some of the reliability of the rigid directories of the days of yore with some of the depth that search engines would give us if they could deal with the problem of co-evolving parasites.
</description>
		<content:encoded><![CDATA[<p>The problem with &#8220;anti-spam through obscurity&#8221; is that search algorithms can&#8217;t be entirely counter-intuitive.  If they were, they couldn&#8217;t generate sane results.  Ergo, they have some predictable features, and a relatively simple black-box structure that must be reasonably susceptible to empirical analysis.</p>
<p>The problem is not dissimilar to that of DRM via encryption:  the attacker has complete access to <i>what</i> the algorithm does, even though they are ignorant of (some parts of) how the algorithm does it.  From a spammer&#8217;s point of view, this is sufficient information.</p>
<p>Legitimate webmasters need to know how to set up sites that get decent responses in search results.  For most modern search engines this is very simple:  have large amounts of raw text that is relevant to the topic your site is concerned with.  I once ran a small company selling gene expression analysis software, and our site was number one on Google within a fairly short time of going live if you searched for &#8220;gene expression analysis software&#8221;, simply because we had lots and lots of words on the topic.</p>
<p>For a spammer, the problem is:  create a site (with minimum effort) that looks as much as possible like a legitimate site while at the same time containing lots of ads or links to whatever is the actual revenue-generating part of the operation.</p>
<p>This is a simple problem to solve in almost all cases, and search engines have the problem that on the one hand the site itself is the best source of information about its relevance to a given search, and on the other hand none of the information the site supplies about itself can be trusted.</p>
<p>Externally based ranking algorithms, which is where Google got its initial value, only solve the problem for a while, because websites are cheap and spammers resourceful.</p>
<p>It seems to me that this problem is ultimately going to result in relatively closed communities of trusted websites that will be used as an anchor for judging other sites&#8217; quality.  Gaining legitimacy relative to one of those communities will be necessary for a good search ranking.</p>
<p>Back in the day there were search engines and there were directory services.  If what I&#8217;m suggesting in the above paragraph comes to pass, the communities will be like amorphous, searchable directories, giving some of the reliability of the rigid directories of the days of yore with some of the depth that search engines would give us if they could deal with the problem of co-evolving parasites.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: yotta</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99257</link>
		<dc:creator>yotta</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99257</guid>
		<description>Fine, but all I really want is some decent searching within Wikipedia. Their search box is terrible.</description>
		<content:encoded><![CDATA[<p>Fine, but all I really want is some decent searching within Wikipedia. Their search box is terrible.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Joe</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99034</link>
		<dc:creator>Joe</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99034</guid>
		<description>There&#039;s a problem with making the page ranking algorithm transparent.  The problem is that there are vast numbers of people who want to make money by getting you to visit particular sites, even though those sites are definitely not what you intend to visit.  If they can do it, they will fool the search engines into ranking their clients&#039; sites highly for queries that have nothing to do, really, with anything you care about.

So we&#039;re talking about a game theory problem, in essence.  The search engine confronts a hostile party.

Keeping at least some aspects of the algorithm hidden is one defense, but it gives too much power to people like Google.  Other defenses include changing the algorithm rapidly to keep defeating attacks on it.  But if the Wikipedia model is used, then some of the volunteer developers may in fact be &quot;traitors&quot;, all set to cash in when the algorithm is updated.

This doesn&#039;t mean it can&#039;t be done.  It just means it&#039;s going to be damned hard to get right.

</description>
		<content:encoded><![CDATA[<p>There&#8217;s a problem with making the page ranking algorithm transparent.  The problem is that there are vast numbers of people who want to make money by getting you to visit particular sites, even though those sites are definitely not what you intend to visit.  If they can do it, they will fool the search engines into ranking their clients&#8217; sites highly for queries that have nothing to do, really, with anything you care about.</p>
<p>So we&#8217;re talking about a game theory problem, in essence.  The search engine confronts a hostile party.</p>
<p>Keeping at least some aspects of the algorithm hidden is one defense, but it gives too much power to people like Google.  Other defenses include changing the algorithm rapidly to keep defeating attacks on it.  But if the Wikipedia model is used, then some of the volunteer developers may in fact be &#8220;traitors&#8221;, all set to cash in when the algorithm is updated.</p>
<p>This doesn&#8217;t mean it can&#8217;t be done.  It just means it&#8217;s going to be damned hard to get right.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jrtom</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99044</link>
		<dc:creator>jrtom</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99044</guid>
		<description>&lt;i&gt;But &quot;security through obscurity&quot; is widely discredited in information security circles. Obscurity stops dumb attackers from getting through, but it lets the smart attackers clobber you because the smart defenders can&#039;t see how your system works and point out its flaws.&lt;/i&gt;

Sure.  But it&#039;s not obvious that this is relevant here.

Cryptography algorithms are designed to turn meaningful data into something that&#039;s indistinguishable from noise unless you have the necessary data (e.g., a private key) to interpret it.

Search engines are designed to take a set of meaningful criteria (including the text of your query) and return a set of results; most of them also associate an ordering with this set.  This set, and its ordering, should be ones that the vast majority of the users will find relevant and reasonable.  (Personalized search is another can of worms that I won&#039;t open here.)

That is, the output of a search engine is designed to be as _transparently obvious_ as possible.

To the extent that a page&#039;s relevance and ranking depend on properties that are easily manipulable by third parties, you&#039;re kind of screwed.

Now, Google, Yahoo, Microsoft, and the rest are almost certainly doing their best to arrange matters so that the inputs to their algorithms are not something that can be manipulated (in a bad way) easily if at all, or at least in such a way that manipulation is obvious and possible to circumvent.  (I am familiar with some of the research in this area.)  But the means that criminals have to screw with these algorithms are the same ones that genuine users and contributors of data (i.e., creators of links) have to improve things in the first place, so you have to be very careful about locking things down.

To put it another way: if your system has inherent flaws that are a function of the problem you&#039;re trying to solve, then sometimes security through obscurity may be the best you can do.

As a practical matter, I&#039;d guess that the relevance and ranking methods are undergoing constant and rapid metamorphosis to both promote good results and combat (perceived) manipulation...so I could easily imagine that keeping up with the changes (to examine them for problems) would be tricky at best.

Now, it&#039;s possible that search engines could publish some parts of their algorithms for external review.  But...

...getting back to that can of worms that we mentioned earlier: the &quot;correctness&quot; of relevance and ranking algorithms is subjective by definition.  You need a broad spectrum of users (and usage data) in order to be able to measure how well the algorithms are doing.  It&#039;s not clear that third-party basement hackers would be able to help much...but third-party criminals might be given a major bonanza.

Finally, the relevance/ranking algorithms are a large part of the IP upon which companies like Google and Yahoo (and to a lesser extent MS) are based.  Granted, knowing Google&#039;s algorithms wouldn&#039;t give you access to their server farms (or their collected data)...but releasing them would basically hand Google&#039;s competitors a gun with which to shoot them.

</description>
		<content:encoded><![CDATA[<p><i>But &#8220;security through obscurity&#8221; is widely discredited in information security circles. Obscurity stops dumb attackers from getting through, but it lets the smart attackers clobber you because the smart defenders can&#8217;t see how your system works and point out its flaws.</i></p>
<p>Sure.  But it&#8217;s not obvious that this is relevant here.</p>
<p>Cryptography algorithms are designed to turn meaningful data into something that&#8217;s indistinguishable from noise unless you have the necessary data (e.g., a private key) to interpret it.</p>
<p>Search engines are designed to take a set of meaningful criteria (including the text of your query) and return a set of results; most of them also associate an ordering with this set.  This set, and its ordering, should be ones that the vast majority of the users will find relevant and reasonable.  (Personalized search is another can of worms that I won&#8217;t open here.)</p>
<p>That is, the output of a search engine is designed to be as _transparently obvious_ as possible.</p>
<p>To the extent that a page&#8217;s relevance and ranking depend on properties that are easily manipulable by third parties, you&#8217;re kind of screwed.</p>
<p>Now, Google, Yahoo, Microsoft, and the rest are almost certainly doing their best to arrange matters so that the inputs to their algorithms cannot be manipulated (in a bad way) easily, if at all, or at least so that any manipulation is obvious and can be countered.  (I am familiar with some of the research in this area.)  But the means that criminals have to screw with these algorithms are the same ones that genuine users and contributors of data (i.e., creators of links) have to improve things in the first place, so you have to be very careful about locking things down.</p>
<p>To put it another way: if your system has inherent flaws that are a function of the problem you&#8217;re trying to solve, then sometimes security through obscurity may be the best you can do.</p>
<p>As a practical matter, I&#8217;d guess that the relevance and ranking methods are undergoing constant and rapid change, both to promote good results and to combat (perceived) manipulation&#8230;so I could easily imagine that keeping up with the changes (to examine them for problems) would be tricky at best.</p>
<p>Now, it&#8217;s possible that search engines could publish some parts of their algorithms for external review.  But&#8230;</p>
<p>&#8230;getting back to that can of worms that we mentioned earlier: the &#8220;correctness&#8221; of relevance and ranking algorithms is subjective by definition.  You need a broad spectrum of users (and usage data) in order to be able to measure how well the algorithms are doing.  It&#8217;s not clear that third-party basement hackers would be able to help much&#8230;but third-party criminals might be given a major bonanza.</p>
<p>Finally, the relevance/ranking algorithms are a large part of the IP upon which companies like Google and Yahoo (and to a lesser extent MS) are based.  Granted, knowing Google&#8217;s algorithms wouldn&#8217;t give you access to their server farms (or their collected data)&#8230;but releasing them would basically hand Google&#8217;s competitors a gun with which to shoot them.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: jrtom</title>
		<link>http://boingboing.net/2008/01/01/wikiinspired-transpa.html#comment-99327</link>
		<dc:creator>jrtom</dc:creator>
		<pubDate>Wed, 30 Nov -0001 00:00:00 +0000</pubDate>
		<guid isPermaLink="false">#comment-99327</guid>
		<description>To supplement my comments (@2) above a bit...

First, I think that an open-source search service is a nice idea.  (&lt;a href=&quot;http://lucene.apache.org/&quot;&gt;Lucene&lt;/a&gt;, as an open-source search _engine_, has been around for some time.)  

Second, I like the notion of using user feedback on relevance and ranking results as an additional data source to improve the algorithms.  This is actually a completely separate notion from the idea of making the algorithms themselves open to review, and should be judged separately.

It could be that describing what Wikia&#039;s going to do as a more &quot;transparent&quot; search service is misleading.  But if we assume that they are actually going to make the algorithms themselves available (that is, world-readable if not world-writeable), I offer here an excerpt from some comments that I made on this proposal &lt;a href=&quot;http://maradydd.livejournal.com/360558.html&quot;&gt;elsewhere&lt;/a&gt;:

&lt;i&gt;Making the algorithms open doesn&#039;t mean that any flaws will be fixed quickly; it just means that they&#039;ll be _found_ (more) quickly. Part of the problem here is that it&#039;s a lot harder to make a good fix to an algorithm&lt;/i&gt; [especially in this space!] &lt;i&gt;than it is to correct an historical article. For one thing, as I mentioned on BB, random Wikia users won&#039;t have access to the data that informs the algorithm design; all they&#039;ll be able to see is the algorithm and the resultant rankings. Nor, I suspect, will Wikia be letting just anyone _edit_ their algorithms, unless they&#039;re complete idiots.&lt;/i&gt;

So, yeah.  I guess I&#039;ll reserve judgement until I see what they&#039;re actually proposing.  Wikia will have one advantage: at least initially, they might not have to worry about third parties gaming their system because it&#039;s not worth the effort (at least in terms of monetary gain) to corrupt a system that no one&#039;s using yet.</description>
		<content:encoded><![CDATA[<p>To supplement my comments (@2) above a bit&#8230;</p>
<p>First, I think that an open-source search service is a nice idea.  (<a href="http://lucene.apache.org/">Lucene</a>, as an open-source search _engine_, has been around for some time.)  </p>
<p>Second, I like the notion of using user feedback on relevance and ranking results as an additional data source to improve the algorithms.  This is actually a completely separate notion from the idea of making the algorithms themselves open to review, and should be judged separately.</p>
<p>It could be that describing what Wikia&#8217;s going to do as a more &#8220;transparent&#8221; search service is misleading.  But if we assume that they are actually going to make the algorithms themselves available (that is, world-readable if not world-writeable), I offer here an excerpt from some comments that I made on this proposal <a href="http://maradydd.livejournal.com/360558.html">elsewhere</a>: </p>
<p><i>Making the algorithms open doesn&#8217;t mean that any flaws will be fixed quickly; it just means that they&#8217;ll be _found_ (more) quickly. Part of the problem here is that it&#8217;s a lot harder to make a good fix to an algorithm</i> [especially in this space!] <i>than it is to correct an historical article. For one thing, as I mentioned on BB, random Wikia users won&#8217;t have access to the data that informs the algorithm design; all they&#8217;ll be able to see is the algorithm and the resultant rankings. Nor, I suspect, will Wikia be letting just anyone _edit_ their algorithms, unless they&#8217;re complete idiots.</i></p>
<p>So, yeah.  I guess I&#8217;ll reserve judgement until I see what they&#8217;re actually proposing.  Wikia will have one advantage: at least initially, they might not have to worry about third parties gaming their system because it&#8217;s not worth the effort (at least in terms of monetary gain) to corrupt a system that no one&#8217;s using yet.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
