Train your AI with the world's largest data-set of sarcasm, courtesy of redditors' self-tagging

Cory Doctorow 12:23 pm Mon May 1, 2017

Redditors' convention of tagging their sarcastic remarks is a dream come true for machine learning researchers hoping to teach computers to recognize and/or generate sarcasm.

The Self-Annotated Reddit Corpus (SARC) is a corpus with 1.3 million sarcastic remarks ("10 times more than any previous dataset") that were tagged by redditors and stored in the database along with "user, topic, and conversation context."

Reddit comments from December 2005 have been made available due to web-scraping; we construct our dataset as a subset of comments from 2009-2016, comprising the vast majority of comments and excluding noisy data from earlier years. For each comment we provide a sarcasm label, author, the subreddit it appeared in, the comment score as voted on by users, the date of the comment, and the parent comment or submission.
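
To make the per-comment fields in that excerpt concrete, here is a minimal Python sketch of what one record might look like, plus a toy version of self-tag labeling. The field names, the `SarcComment` class, and the `label_from_self_tag` helper are all hypothetical, and the "/s" marker is an assumption about the tagging convention; none of this is taken from the paper's actual code or file format.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class SarcComment:
    """One record with the fields listed in the excerpt (names are hypothetical)."""
    text: str
    sarcastic: bool          # the sarcasm label
    author: str
    subreddit: str
    score: int               # comment score as voted on by users
    date: str                # date of the comment, kept as a plain string here
    parent: Optional[str]    # parent comment or submission text, if any


def label_from_self_tag(raw_text: str, marker: str = "/s") -> Tuple[str, bool]:
    """Strip a trailing marker (assumed to be "/s") and return (clean_text, is_sarcastic)."""
    stripped = raw_text.rstrip()
    if stripped.endswith(marker):
        return stripped[: -len(marker)].rstrip(), True
    return stripped, False


# Example values are made up for illustration.
text, flag = label_from_self_tag("Great, another Monday. /s")
example = SarcComment(text, flag, "some_user", "AskReddit", 12, "2015-06-01", None)
```

A real filter would need more care than this: a comment that merely ends in a URL path such as "/s" would be mislabeled by the suffix check above.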

A Large Self-Annotated Corpus for Sarcasm

[Mikhail Khodak, Nikunj Saunshi and Kiran Vodrahalli/Princeton University]

(via Marginal Revolution)
