Machine-learning algorithm develops heuristics for trustworthy tweets in time of emergency

In "Credibility ranking of tweets during high impact events," a paper published in the ACM's Proceedings of the 1st Workshop on Privacy and Security in Online Social Media, two Indraprastha Institute of Information Technology researchers describe the outcome of a machine-learning experiment designed to discover factors correlated with reliability in tweets during disasters and emergencies:

The number of unique characters present in tweet was positively correlated to credibility, this may be due to the fact that tweets with hashtags, @mentions and URLs contain more unique characters. Such tweets are also more informative and linked, and hence credible. Presence of swear words in tweets indicates that it contains the opinion / reaction of the user and would have less chances of providing information about the event. Tweets that contain information or are reporting facts about the event, are impersonal in nature, as a result we get a negative correlation of presence of pronouns in credible tweets. Low number of happy emoticons [:-), :)] and high number of sad emoticons [:-(, :(] act as strong predictors of credibility. Some of the other important features (p-value < 0.01) were inclusion of a URL in the tweet, number of followers of the user who tweeted and presence of negative emotion words. Inclusion of URL in a tweet showed a strong positive correlation with credibility, as most URLs refer to pictures, videos, resources related to the event or news articles about the event.
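To make the features in that passage concrete, here is a rough Python sketch of how they might be pulled out of a single tweet. It is not the researchers' code: the function name `extract_features`, the tiny word lists, and the URL pattern are all assumptions made up for illustration, and the paper's actual lexicons and ranking model are not reproduced here.

```python
import re

# Illustrative word lists only (assumed); the paper's real lexicons are not given here.
SWEAR_WORDS = {"damn", "hell", "crap"}
PRONOUNS = {"i", "me", "my", "we", "our", "you", "your", "he", "she", "they", "them"}

# The emoticons named in the quoted passage.
HAPPY_EMOTICONS = (":-)", ":)")
SAD_EMOTICONS = (":-(", ":(")

# A deliberately simple URL matcher (assumed), not a full URL grammar.
URL_PATTERN = re.compile(r"https?://\S+")


def extract_features(text, follower_count):
    """Return the tweet and user features mentioned in the quoted passage."""
    # Drop URLs before tokenizing so path fragments are not counted as words.
    text_without_urls = URL_PATTERN.sub(" ", text)
    words = re.findall(r"[a-z']+", text_without_urls.lower())
    return {
        "unique_chars": len(set(text)),
        "has_swear_word": any(w in SWEAR_WORDS for w in words),
        "pronoun_count": sum(w in PRONOUNS for w in words),
        "happy_emoticons": sum(text.count(e) for e in HAPPY_EMOTICONS),
        "sad_emoticons": sum(text.count(e) for e in SAD_EMOTICONS),
        "has_url": bool(URL_PATTERN.search(text)),
        "followers": follower_count,
    }


if __name__ == "__main__":
    print(extract_features("Bridge closed, see http://example.com/map :(", 1200))
```

A dictionary like this would only be the input to a ranking model; how the features are weighted is what the experiment set out to learn, and nothing here attempts that step.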

Of course, this is all non-adversarial: no one is trying to trick a filter into mis-assessing a false account as a true one. It's easy to imagine an adversarial tweet-generator that suggests rewrites to deliberately misleading tweets to make them more credible to a filter designed on these lines. This is actually the substance of one of the cleverest science fiction subplots I've read: in Peter Watts's Behemoth, a self-modifying computer virus randomly hits on the strategy of impersonating communications from patient zero in a world-killing pandemic, because all the filters allow these through. It's a premise that's never stopped haunting me: the co-evolution of a human virus and a computer virus.

Credibility Ranking of Tweets during High Impact Events [PDF] (via /.)