Improper court record redaction: a study

Timothy Lee has conducted an initial study of improper redaction in PACER, the US court records system. Sensitive information like social security numbers is supposed to be redacted in these records, but sometimes the redaction is accomplished by drawing a black box over the text in the PDF. The text is still present in the PDF file; it's just not displayed, and it's easy to recover.

So how many PACER documents have this problem? We're in a good position to study this question because we have a large collection of PACER documents: 1.8 million of them when I started my research last year. I wrote software to detect redaction rectangles; it turns out these are relatively easy to recognize based on their color, shape, and the specific commands used to draw them. Out of 1.8 million PACER documents, there were approximately 2000 documents with redaction rectangles. (There were also about 3500 documents that were redacted by replacing text with strings of Xes. I also excluded documents that were redacted by Carl Malamud before he donated them to our archive.)
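
To make "color, shape, and the specific commands used to draw them" a little more concrete, here is a minimal sketch of what such a detector might look for. It is not the software used in the study, whose details aren't given here. It assumes the page's content stream has already been decompressed into text, and it relies only on standard PDF drawing operators: `g` and `rg` set the fill color, `re` adds a rectangle to the current path, and `f` fills it. The function name `find_black_filled_rects` is made up for illustration, and the tokenizer is deliberately naive: it ignores string literals and the `q`/`Q` graphics-state stack.

```python
from typing import List, Optional, Tuple

Rect = Tuple[float, float, float, float]  # x, y, width, height

FILL_OPS = {"f", "F", "f*", "B", "B*", "b", "b*"}  # operators that fill the path
DISCARD_OPS = {"S", "s", "n"}                      # stroke-only or no-op path endings


def _last_numbers(operands: List[str], count: int) -> Optional[List[float]]:
    """Return the last `count` operands as floats, or None if that fails."""
    if len(operands) < count:
        return None
    try:
        return [float(v) for v in operands[-count:]]
    except ValueError:
        return None


def find_black_filled_rects(content: str) -> List[Rect]:
    """Scan a decoded content stream for rectangles filled with black."""
    operands: List[str] = []
    fill_is_black = True       # the default PDF fill color is black
    pending: List[Rect] = []   # rectangles in the path being built
    found: List[Rect] = []
    for token in content.split():
        if token == "g":       # set gray fill level
            nums = _last_numbers(operands, 1)
            if nums is not None:
                fill_is_black = nums[0] == 0.0
            operands.clear()
        elif token == "rg":    # set RGB fill color
            nums = _last_numbers(operands, 3)
            if nums is not None:
                fill_is_black = all(v == 0.0 for v in nums)
            operands.clear()
        elif token == "re":    # append a rectangle to the path
            nums = _last_numbers(operands, 4)
            if nums is not None:
                pending.append((nums[0], nums[1], nums[2], nums[3]))
            operands.clear()
        elif token in FILL_OPS:
            if fill_is_black:
                found.extend(pending)
            pending.clear()
            operands.clear()
        elif token in DISCARD_OPS:
            pending.clear()
            operands.clear()
        else:
            operands.append(token)
    return found


# A red box, then a black box, then a rectangle that is only stroked.
stream = "1 0 0 rg 10 10 50 20 re f 0 0 0 rg 100 700 80 12 re f 0 g 5 5 1 1 re S"
print(find_black_filled_rects(stream))  # [(100.0, 700.0, 80.0, 12.0)]
```

A real detector would also need to filter on shape (for example, discarding hairline rules and full-page backgrounds), which is left out of this sketch.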

Next, my software checked to see if these redaction rectangles overlapped with text. My software identified a few hundred documents that appeared to have text under redaction rectangles, and examining them by hand revealed 194 documents with failed redactions. The majority of the documents (about 130) appear to be from commercial litigation, in which parties have unsuccessfully attempted to redact trade secrets such as sales figures and confidential product information. Other improperly redacted documents contain sensitive medical information, addresses, and dates of birth. Still others contain the names of witnesses, jurors, plaintiffs, and one minor.
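
The overlap check described above comes down to rectangle geometry. The sketch below is an illustration rather than the study's actual code. It assumes some PDF parser has already produced bounding boxes for each run of text and for each redaction rectangle, both as `(x0, y0, x1, y1)` corners in the same page coordinates. The names `TextSpan`, `overlap_fraction`, and `hidden_text`, and the 0.5 threshold, are assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # x0, y0, x1, y1 (lower-left, upper-right)


@dataclass
class TextSpan:
    text: str
    box: Box


def overlap_fraction(text_box: Box, rect: Box) -> float:
    """Fraction of the text box's area that lies inside the rectangle."""
    ax0, ay0, ax1, ay1 = text_box
    bx0, by0, bx1, by1 = rect
    width = min(ax1, bx1) - max(ax0, bx0)
    height = min(ay1, by1) - max(ay0, by0)
    if width <= 0 or height <= 0:
        return 0.0  # the boxes do not intersect
    area = (ax1 - ax0) * (ay1 - ay0)
    return (width * height) / area if area > 0 else 0.0


def hidden_text(spans: List[TextSpan], rects: List[Box],
                threshold: float = 0.5) -> List[TextSpan]:
    """Return the spans that are mostly covered by some redaction rectangle."""
    return [
        span for span in spans
        if any(overlap_fraction(span.box, rect) >= threshold for rect in rects)
    ]


spans = [
    TextSpan("123-45-6789", (100, 700, 170, 712)),  # sits under the black box
    TextSpan("Plaintiff", (100, 650, 150, 662)),    # visible text lower on the page
]
rects = [(95, 698, 180, 714)]
print([span.text for span in hidden_text(spans, rects)])  # ['123-45-6789']
```

A check like this only flags candidates. As the study notes, the flagged documents still had to be examined by hand to confirm which redactions had actually failed.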

Studying the Frequency of Redaction Failures in PACER

NSA primer on secure redaction (PDF)

(Image: CIMG4941, a Creative Commons Attribution (2.0) image from kaiban's photostream)