Friday, February 8, 2013

The inside story of Aaron Swartz’s campaign to liberate court filings

And how his allies are trying to finish the job by tearing down a big paywall.

Years before the JSTOR scraping project that led to Aaron Swartz's indictment on federal hacking charges—and perhaps to his suicide—the open-data activist scraped documents from PACER, the federal judiciary's paywalled website for public access to court records. (The acronym PACER stands for Public Access to Court Electronic Records, which may sound like it's straight out of 1988 because it is.) Swartz got 2.7 million documents before the courts detected his downloads and blocked access. The case was referred to the FBI, which investigated Swartz's actions but declined to prosecute him.
A key figure in Swartz's PACER effort was Steve Schultze, now a researcher at Princeton's Center for Information Technology Policy. Schultze recruited Swartz to the PACER fight and wrote the Perl script Swartz modified and then used to scrape the site.
Until recently, Schultze has been quiet about his role in Swartz's PACER scraping caper. But Swartz's death inspired Schultze to speak out. In a recent phone interview, Schultze described how Swartz downloaded gigabytes of PACER data and how that data has been put to use throughout the last four years. Schultze told us he hopes the outrage over Swartz's death will provide momentum for legislation to finish the job Swartz and Schultze started almost five years ago: tearing down PACER's paywall.
In the interest of full disclosure: Schultze and I were colleagues at Princeton while I was in grad school there from 2009 to 2011. With another Princeton graduate student, Harlan Yu, we created RECAP, a Firefox extension that helps PACER users share the documents they purchase with each other and with the public. And Carl Malamud, who played a key role in our story, provided financial support for some of my PACER-related research during this period.

The thumb drive corps

The documents in PACER—motions, legal briefs, scheduling orders, and the like—are public records. Most of these documents are free of copyright restrictions, yet the courts charge hefty fees for access. Even as the costs of storage and bandwidth have declined over the last decade, PACER fees have risen from seven to 10 cents per page.
Facing criticism that high fees limit public access, the US courts announced a pilot project in 2007 to provide free PACER access to users at 17 libraries around the country. Schultze and other open government activists saw the announcement as an opportunity to liberate documents from the PACER system.
Schultze began working on a Perl script to automate the process of downloading documents from PACER. He envisioned a "thumb drive corps" of volunteers going into libraries, plugging in thumb drives containing his script (packaged as a Windows executable), and using the library's free access to download millions of PACER documents.
Schultze developed and tested the script using a personal PACER account, paying for every document he downloaded. The nearest library participating in the PACER program was more than a hundred miles from his home in the Boston area, so he would need help from volunteers around the country to put the plan into action.
In the summer of 2008, Schultze told Swartz, also in the Boston area at the time, about the PACER scraping scheme. "He said what Aaron would always say: 'show me the code,'" Schultze told Ars. "So I showed him the code. He said, 'Oh, I don't really like Perl. I'm not a Perl programmer.' Then he took my Perl code and made a whole bunch of great improvements."

“This is not how we do things”

Schultze and Swartz conferred with open government advocate Carl Malamud, who offered to provide server space to store the gigabytes of data they hoped to liberate. For the documents to be useful, they needed to capture not only the PDFs themselves but also docket files that contain key metadata such as filing dates and document descriptions.
In early September, Swartz e-mailed Malamud to discuss an alternative approach: instead of sending volunteers to libraries, they could crawl PACER directly from Malamud's server. Malamud was skeptical. "The thumb drive corps is based on going to the library and using their access," he noted. "Do you have some kind of magic account or something?"
Swartz asked a friend to go to a Sacramento library that was participating in the program. After the librarian logged the friend into the library's PACER account, the friend extracted an authentication cookie set by the PACER site. Because this cookie wasn't tied to any specific IP address, it allowed access to the library's PACER account from anywhere on the Internet. But Swartz admitted to Malamud that he didn't have the library's permission to use this cookie for off-site scraping.
"This is not how we do things," Malamud scolded in a September 4 e-mail. "We don't cut corners, we belly up to the bar and get permission."
"Fair enough," Swartz replied. "Stephen is building a team to go to the library."
But without telling Malamud or Schultze, Swartz pushed forward with his offsite scraping plan. Rather than using Malamud's server, he began crawling PACER from Amazon cloud servers.
"I thought at the time he was actually in the libraries" downloading the documents that were accumulating on his server, Malamud told Ars in a phone interview. In reality, Swartz merely had to dispatch a volunteer to the library once a week to get a fresh authentication cookie. Swartz could do the rest of the work from the comfort of his apartment.

Access denied

It took a while for the courts to figure out what was happening. "The way the library trial was set up was that the courts would continue to track usage but would simply never bill the libraries for the usage that occurred," Schultze told us.
Swartz started his downloading in early September. On September 29, court administrators noticed the Sacramento library had racked up a $1.5 million bill. The feds shut down the library's account.
"Apparently PACER access at the main library I was crawling from has been shut down, presumably because of the crawl," Swartz told Schultze and Malamud in an e-mail that day.
The courts issued a vague statement about suspending the program "pending an evaluation." A few weeks later, a court official revealed law enforcement had been called to investigate the suspected security breach. Malamud told us that after Swartz fessed up, Malamud grilled him to understand whether any laws had been broken. Malamud believes the fact that neither PACER nor the library had terms of service prohibiting offsite downloading made it likely Swartz's actions were within the law.
Malamud thought they would be in an even stronger position if they could demonstrate the value of the data Swartz extracted, so he began an intensive privacy audit. For most of October, Malamud worked around the clock searching for documents containing Social Security numbers and other sensitive information. Out of the 2.7 million documents Swartz downloaded—about 700GB of data in all—Malamud discovered about 1,600 with privacy issues. He then sent a report to court administrators disclosing the poorly redacted documents he had found and encouraging the courts to examine the rest of the documents in PACER to ferret out similar privacy problems.
Malamud and Swartz wanted to tell their side of the story to the public, so they began talking to a reporter at the New York Times. The result was an article in February 2009 explaining the issue and Swartz's actions.
"This was part of how Aaron approached things," Schultze told us. His PACER activities were "a project to liberate the documents but also an effort to make public the problems that existed to hopefully solve the larger policy problem."
Both the FBI and the Department of Justice investigated the case. They identified Swartz via his ownership of the Amazon servers used to crawl PACER. Both agencies dropped the case by April 2009. Later that year, Swartz made an open records request for his own FBI file and gleefully posted it online, calling it "truly delightful."

Tear down this wall

In a back-of-the-envelope calculation a few days before the offsite crawl was shut down, Swartz guessed he got around 25 percent of the documents in PACER. The New York Times similarly reported Swartz had downloaded "an estimated 20 percent of the entire database." Other media outlets have repeated the figure ever since. Unfortunately, neither estimate is accurate. PACER has more than 500 million documents, so the 2.7 million documents Swartz downloaded account for less than one percent of the database.
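The arithmetic behind that correction is easy to check. The short Python sketch below uses only the two figures quoted above; it treats 500 million as a lower bound on PACER's size (the article says "more than"), and the variable and function names are ours, chosen for illustration.

```python
# Figures quoted in the article.
DOCUMENTS_DOWNLOADED = 2_700_000
# "More than 500 million" documents: treated here as a lower bound (assumption).
PACER_TOTAL_LOWER_BOUND = 500_000_000


def share_percent(part: int, whole: int) -> float:
    """Return part as a percentage of whole."""
    return 100.0 * part / whole


# Because the true total is at least this large, the real share can only be smaller.
share = share_percent(DOCUMENTS_DOWNLOADED, PACER_TOTAL_LOWER_BOUND)
print(f"Swartz's share of PACER: at most {share:.2f} percent")

# Database sizes that the two widely repeated estimates would have implied.
for claimed_percent in (20, 25):
    implied_total = DOCUMENTS_DOWNLOADED * 100 / claimed_percent
    print(f"A {claimed_percent} percent share implies {implied_total:,.0f} documents in total")
```

At that lower bound the share works out to about 0.54 percent. For the 20 and 25 percent figures to hold, PACER would have had to contain only 13.5 million or 10.8 million documents, respectively.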
Nevertheless, the Swartz corpus proved valuable. Malamud's privacy audit helped to publicize the need for more rigorous privacy protections in the e-filing system. When Schultze, Harlan Yu, and I began work on RECAP, we pre-loaded it with Swartz's documents so at least some cases would be pre-populated with documents. Swartz's documents also served as the basis for some of my own privacy research.
Swartz, Malamud, and Schultze always saw the PACER scraping project primarily as a way to pressure the judiciary to provide free public access to the full PACER database. Ever since 2008, Schultze has made PACER a major focus of his work, writing extensively about the case for tearing down PACER's paywall.
Schultze believes the courts are breaking the law by charging 10 cents a page for public documents. As then-senator Joe Lieberman (I-CT) pointed out in a 2009 letter, the 2002 E-Government Act, which authorizes PACER fees, permits them to be charged only "to the extent necessary" to cover the costs of providing the service. In its legislative report, the Senate committee behind the bill stated it "intends to encourage the Judicial Conference to move from a fee structure in which electronic docketing systems are supported primarily by user fees to a fee structure in which this information is freely available to the greatest extent possible."
Yet PACER fee collections appear to have dramatically outstripped the cost of running the PACER system. PACER users paid about $120 million in 2012, thanks in part to a 25 percent fee hike announced in 2011. But Schultze says the judiciary's own figures show running PACER only costs around $20 million. Schultze believes this massive disparity is inconsistent with the court's mandate to charge PACER fees only "to the extent necessary" to run the PACER system.
And even the $20 million figure may overstate the cost of running PACER. "We don't know what is included in these line items because the courts have never told us," Schultze said. "But the PACER system is run extremely inefficiently. It has individual servers in each district, individual staff for each district, and privately leased network connections."
Schultze believes costs could be slashed if the courts moved to a modern cloud-based hosting platform. Indeed, he notes, the Government Accountability Office, the auditing wing of the federal government, has already developed a streamlined process for government agencies to lease cloud computing resources.
The GAO has even granted some hosting providers "FISMA level 2 security certification," Schultze points out, which allows the Department of Homeland Security to use them for its applications. "If it's good enough for DHS, it's good enough for the courts," Schultze argued.
Schultze believes the courts could shift their servers to the cloud with minimal technical changes. "They would just start up a new virtual machine for every court. Each court could continue to administer their own PACER instance. There's no complicated engineering required."
Schultze believes the judiciary's Amazon bill could be as little as $1 million per year, or less than one percent of what the courts are currently charging. Malamud is less optimistic, given the inherent inefficiencies of government bureaucracies. But he believes an efficient PACER system shouldn't cost more than $10 million.
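To put those numbers side by side, here is a rough Python sketch built only from the approximate figures quoted in this article: about $120 million in 2012 fee collections, about $20 million in reported PACER costs, Schultze's $1 million cloud estimate, and Malamud's $10 million ceiling. The names are ours, and the output is only as good as those round numbers.

```python
# Approximate annual figures quoted in the article, in US dollars.
FEE_REVENUE_2012 = 120_000_000
REPORTED_PACER_COST = 20_000_000       # per Schultze's reading of the judiciary's figures
SCHULTZE_CLOUD_ESTIMATE = 1_000_000    # his low-end estimate for cloud hosting
MALAMUD_CEILING = 10_000_000           # Malamud's upper bound for an efficient system


def percent_of_revenue(amount: int) -> float:
    """Return amount as a percentage of 2012 PACER fee revenue."""
    return 100.0 * amount / FEE_REVENUE_2012


# Revenue left over after paying the reported cost of running PACER.
surplus = FEE_REVENUE_2012 - REPORTED_PACER_COST
print(f"Revenue beyond reported cost: ${surplus:,}")

for label, amount in [
    ("Reported PACER cost", REPORTED_PACER_COST),
    ("Schultze's cloud estimate", SCHULTZE_CLOUD_ESTIMATE),
    ("Malamud's ceiling", MALAMUD_CEILING),
]:
    print(f"{label}: {percent_of_revenue(amount):.1f} percent of fee revenue")
```

On these figures, fees exceed the reported cost by roughly $100 million a year. The reported cost is about 16.7 percent of revenue, Malamud's ceiling about 8.3 percent, and Schultze's cloud estimate about 0.8 percent, which is consistent with his "less than one percent" claim.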
Interestingly, the executive branch pays the courts millions of dollars every year in PACER fees. The Department of Justice alone pays the courts about $4 million per year for access to public court documents. Schultze believes the money Congress currently allocates for executive branch agencies to pay PACER fees would be sufficient to fund the entire PACER system. That would allow the judiciary to eliminate PACER fees to private users.

Open PACER

When we contacted the Administrative Office of the courts for comment, they stressed that "fully 95 percent of all PACER fees come from just five percent of all users. Court opinions are free, and 65 to 75 percent of active PACER users don't exceed $15 of use in a quarter, and therefore are not charged. In addition, academic researchers, pro bono lawyers, and indigent users can apply for exemptions." But Schultze doesn't believe waivers address the problems with PACER's fee system. "Obtaining a waiver requires filing a separate request with each court, which may grant and revoke the waiver at its discretion," Schultze noted in an e-mail. "Many classes of individuals are not even eligible to apply, including the media."
As a practical matter, the major obstacle to opening PACER is likely financial. The judiciary tells Ars that in addition to financing PACER itself, PACER fees go to pay for "electronic case filing and about a half-dozen other information technology categories" in what it calls its "public access program." In other words, PACER has become a cash cow for the judicial branch, generating $100 million in profits the court has plowed into non-PACER IT projects.
It's understandable the courts wouldn't want to give up that revenue in an era of austerity. But for Schultze, that revenue stream isn't a good enough reason to restrict public access to public documents. He drafted the Open PACER Act to mandate the paywall's elimination.
"My bill is one page," Schultze told us. "It does two things. First, it repeals the court's ability to charge for access to electronic public records. Second it mandates that they provide electronic public records to the public for free."
In recent weeks, Schultze made multiple trips to DC to lobby for the proposal. He hasn't found a sponsor yet, but he's optimistic he'll find one soon. "I've been talking to potential sponsors in both the House and the Senate," Schultze said. "There are many members of Congress that see government transparency as a high priority. I expect that those are the members that will sponsor the bill."
Several members of Congress stopped to pay their respects at a memorial service for Swartz held in DC on February 4. Among the speakers was Rep. Darrell Issa (R-CA), an influential Republican who has championed open government. Sen. Ron Wyden (D-OR), a reform-minded Democrat, also spoke. At times, the event took on the tone of a political rally.
So far, most of the legislative attention in the wake of Swartz's death has focused on "Aaron's Law" to reform the Computer Fraud and Abuse Act. But Schultze believes tearing down the PACER paywall should also be a priority. After all, public access to information was a central theme of Swartz's life. Opening PACER would be another fitting tribute to his memory.
