Microsoft takes down MS Celeb facial recognition database, 10 million+ pics of ~100,000 faces, maybe yours, scraped under Creative Commons

Military researchers and Chinese firms had access to the data, which Microsoft scraped under Creative Commons licenses.

It's very bad that this existed, and still does, away from public view. Microsoft took it down quietly, and only after the Financial Times shamed the company over it. But it's still good news.

Microsoft has taken down its online database of 10 million or more human faces.

Maybe yours.

The 'MS Celeb' database was first published on the internet in 2016, and Microsoft claimed it was the world's largest publicly available facial recognition data set, containing over 10 million images of nearly 100,000 individual people.

Microsoft used the facial data to train facial recognition systems, including those used by U.S. military researchers, and by various firms in China — SenseTime and Megvii among them.

China uses facial recognition to commit mass human rights abuses against minority populations including the predominantly Muslim Uighur people, and the ethnically Tibetan people who live in the region China calls its Tibet Autonomous Region, and the rest of us call China-occupied Tibet.

Stanford and Duke universities also removed facial recognition data after the publication of work by Berlin-based security researcher Adam Harvey. His Megapixels project documents many large data sets, how they are used, and what's at stake for your privacy.

Here's Madhumita Murgia, writing for The Financial Times:

The people whose photos were used were not asked for their consent, their images were scraped off the web from search engines and videos under the terms of the Creative Commons license that allows academic reuse of photos.

Microsoft, which took down the database days after the FT reported on its use by companies, said: "The site was intended for academic purposes. It was run by an employee that is no longer with Microsoft and has since been removed."

Two other data sets have also been taken down since the FT report was published in April, including the Duke MTMC surveillance data set built by Duke University researchers, and a Stanford University data set called Brainwash.

Brainwash used footage of customers in a café called Brainwash in San Francisco's Lower Haight district, taken through a livestreaming camera. Duke did not respond to requests for comment. Stanford said it had removed the data set after a request by one of the authors of a study it was used for. A spokesperson said the university is "committed to protecting the privacy of individuals at Stanford and in the larger community".

All three data sets were uncovered by Berlin-based researcher Adam Harvey, whose project Megapixels documented the details of dozens of data sets and how they are being used.

Microsoft's MS Celeb data set has been used by several commercial organisations, according to citations in AI papers, including IBM, Panasonic, Alibaba, Nvidia, Hitachi, Sensetime and Megvii. Both Sensetime and Megvii are Chinese suppliers of equipment to officials in Xinjiang, where minorities of mostly Uighurs and other Muslims are being tracked and held in internment camps.

Microsoft quietly deletes largest public face recognition data set