Friday, February 8, 2013
The inside story of Aaron Swartz’s campaign to liberate court filings
And how his allies are trying to finish the job by tearing down a big paywall.
Years before the JSTOR scraping project that led to Aaron Swartz's indictment on federal hacking charges—and perhaps to his suicide—the
open-data activist scraped documents from PACER, the federal
judiciary's paywalled website for public access to court records. (The
acronym PACER stands for Public Access to Court Electronic Records,
which may sound like it's straight out of 1988 because it is.)
Swartz got 2.7 million documents before the courts detected his
downloads and blocked access. The case was referred to the FBI, which
investigated Swartz's actions but declined to prosecute him.
A key figure in Swartz's PACER effort was Steve Schultze, now a
researcher at Princeton's Center for Information Technology Policy.
Schultze recruited Swartz to the PACER fight and wrote the Perl script
Swartz modified and then used to scrape the site.
Until recently, Schultze had been quiet about his role in Swartz's
PACER scraping caper. But Swartz's death inspired Schultze to speak out.
In a recent phone interview, Schultze described how Swartz downloaded
gigabytes of PACER data and how that data has been put to use over
the last four years. Schultze told us he hopes the outrage over
Swartz's death will provide momentum for legislation to finish the job
Swartz and Schultze started almost five years ago: tearing down PACER's
paywall.
In the interest of full disclosure: Schultze and I were colleagues at
Princeton while I was in grad school there from 2009 to 2011. With
another Princeton graduate student, Harlan Yu, we created RECAP,
a Firefox extension that helps PACER users share the documents they
purchase with each other and with the public. And Carl Malamud, who
played a key role in our story, provided financial support for some of
my PACER-related research during this period.
The documents in PACER—motions, legal briefs, scheduling orders, and the like—are
public records. Most of these documents are free of copyright
restrictions, yet the courts charge hefty fees for access. Even as the
costs of storage and bandwidth have declined over the last decade, PACER
fees have risen from seven to 10 cents per page.
Facing criticism that high fees limit public access, the US courts
announced a pilot project in 2007 to provide free PACER access to users
at 17 libraries around the country. Schultze and other open government
activists saw the announcement as an opportunity to liberate documents
from the PACER system.
Schultze began working on a Perl script to automate the process of
downloading documents from PACER. He envisioned a "thumb drive corps" of
volunteers going into libraries, plugging in thumb drives containing
his script (packaged as a Windows executable), and using the library's
free access to download millions of PACER documents.
Schultze developed and tested the script using a personal PACER
account, paying for every document he downloaded. The nearest library
participating in the PACER program was more than a hundred miles from
his home in the Boston area, so he would need help from volunteers
around the country to put the plan into action.
In the summer of 2008, Schultze told Swartz, also in the Boston area
at the time, about the PACER scraping scheme. "He said what Aaron would
always say: 'show me the code,'" Schultze told Ars. "So I showed him the
code. He said, 'Oh, I don't really like Perl. I'm not a Perl
programmer.' Then he took my Perl code and made a whole bunch of great
improvements."
“This is not how we do things”
Schultze and Swartz conferred with open government advocate Carl
Malamud, who offered to provide server space to store the gigabytes of
data they hoped to liberate. For the documents to be useful, they needed
to capture not only the PDFs themselves but also docket files that
contain key metadata such as filing dates and document descriptions.
[Image: Steve Schultze's version of the Perl script Swartz used to liberate 2.7 million documents from PACER. Credit: Steve Schultze]
In early September, Swartz e-mailed Malamud to discuss an alternative
approach: instead of sending volunteers to libraries, they could crawl
PACER directly from Malamud's server. Malamud was skeptical. "The thumb
drive corps is based on going to the library and using their access," he
noted. "Do you have some kind of magic account or something?"
Swartz asked a friend to go to a Sacramento library that was
participating in the program. After the librarian logged the friend into
the library's PACER account, the friend extracted an authentication
cookie set by the PACER site. Because this cookie wasn't tied to any
specific IP address, it allowed access to the library's PACER account
from anywhere on the Internet. But Swartz admitted to Malamud that he
didn't have the library's permission to use this cookie for off-site
scraping.
"This is not how we do things," Malamud scolded in a September 4 e-mail. "We don't cut corners, we belly up to the bar and get permission."
"Fair enough," Swartz replied. "Stephen is building a team to go to the library."
But without telling Malamud or Schultze, Swartz pushed forward with
his offsite scraping plan. Rather than using Malamud's server, he began
crawling PACER from Amazon cloud servers.
"I thought at the time he was actually in the libraries" downloading
the documents that were accumulating on his server, Malamud told Ars in a
phone interview. In reality, Swartz merely had to dispatch a volunteer
to the library once a week to get a fresh authentication cookie. Swartz
could do the rest of the work from the comfort of his apartment.
Access denied
It took a while for the courts to figure out what was happening. "The
way the library trial was set up was that the courts would continue to
track usage but would simply never bill the libraries for the usage that
occurred," Schultze told us.
Swartz started his downloading in early September. On September 29,
court administrators noticed the Sacramento library racked up a $1.5
million bill. The feds shut down the library's account.
"Apparently PACER access at the main library I was crawling from has been shut down, presumably because of the crawl," Swartz told Schultze and Malamud in an e-mail that day.
The courts issued a vague statement
about suspending the program "pending an evaluation." A few weeks
later, a court official revealed law enforcement had been called to
investigate the suspected security breach. Malamud told us that after
Swartz fessed up, Malamud grilled him to understand whether any laws had
been broken. Malamud believes the fact that neither PACER nor the
library had terms of service prohibiting offsite downloading made it
likely Swartz's actions were within the law.
Malamud thought they would be in an even stronger position if they
could demonstrate the value of the data Swartz extracted, so he began an
intensive privacy audit. For most of October, Malamud worked around the
clock searching for documents containing Social Security numbers and
other sensitive information. Out of the 2.7 million documents Swartz
downloaded—about 700GB of data in all—Malamud discovered about 1,600
with privacy issues. He then sent a report to court administrators
disclosing the poorly redacted documents he had found and encouraging
the courts to examine the rest of the documents in PACER to ferret out
similar privacy problems.
Malamud and Swartz wanted to tell their side of the story to the public, so they began talking to a reporter at the New York Times. The result was an article in February 2009 explaining the issue and Swartz's actions.
"This was part of how Aaron approached things," Schultze told us. His
PACER activities were "a project to liberate the documents but also an
effort to make public the problems that existed to hopefully solve the
larger policy problem."
Both the FBI and the Department of Justice investigated the case.
They identified Swartz via his ownership of the Amazon servers used to
crawl PACER. Both agencies dropped the case by April 2009. Later that
year, Swartz made an open records request for his own FBI file and gleefully posted it online, calling it "truly delightful."
In a back-of-the-envelope calculation a few days before the offsite crawl was shut down, Swartz guessed he got around 25 percent of the documents in PACER. The New York Times
similarly reported Swartz had downloaded "an estimated 20 percent of
the entire database." Other media outlets have repeated the figure ever
since. Unfortunately, neither figure is accurate. PACER has more than 500
million documents, so the 2.7 million documents Swartz downloaded
account for less than one percent of the database.
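For readers who want to check that arithmetic, here is a minimal sketch in Python. It uses only the two figures quoted above; the variable names are ours, and because PACER holds "more than" 500 million documents, the result should be read as an upper bound on Swartz's share rather than an exact value.

```python
# Figures quoted in the article; both are approximate.
downloaded_docs = 2_700_000   # documents Swartz downloaded
total_docs = 500_000_000      # PACER holds "more than" this many, so treat it as a floor

# Using the floor for the total makes this an upper bound on the share.
share = downloaded_docs / total_docs
print(f"Share of PACER downloaded: at most {share:.2%}")
```

Even on that generous assumption, the share comes to roughly half of one percent, far below the 20 and 25 percent estimates.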
Nevertheless, the Swartz corpus proved valuable. Malamud's privacy
audit helped to publicize the need for more rigorous privacy protections
in the e-filing system. When Schultze, Harlan Yu, and I began work on RECAP,
we pre-loaded it with Swartz's documents so at least some cases would
be pre-populated with documents. Swartz's documents also served as the
basis for some of my own privacy research.
Swartz, Malamud, and Schultze always saw the PACER scraping project
primarily as a way to pressure the judiciary to provide free public
access to the full PACER database. Ever since 2008, Schultze has made
PACER a major focus of his work, writing extensively about the case for tearing down PACER's paywall.
Schultze believes the courts are breaking the law by charging 10
cents a page for public documents. As then-senator Joe Lieberman (I-CT)
pointed out in a 2009 letter,
the 2002 E-Government Act, which authorizes PACER fees, permits them to
be charged only "to the extent necessary" to cover the costs of
providing the service. In its legislative report, the Senate committee
behind the bill stated it "intends to encourage the Judicial Conference
to move from a fee structure in which electronic docketing systems are
supported primarily by user fees to a fee structure in which this
information is freely available to the greatest extent possible."
Yet PACER fee collections appear to have dramatically outstripped the
cost of running the PACER system. PACER users paid about $120 million
in 2012, thanks in part to a 25 percent fee hike announced in 2011. But Schultze says the judiciary's own figures
show running PACER only costs around $20 million. Schultze believes
this massive disparity is inconsistent with the courts' mandate to
charge PACER fees only "to the extent necessary" to run the PACER
system.
And even the $20 million figure may overstate the cost of running
PACER. "We don't know what is included in these line items because the
courts have never told us," Schultze said. "But the PACER system is run
extremely inefficiently. It has individual servers in each district,
individual staff for each district, and privately leased network
connections."
Schultze believes costs could be slashed if the courts moved to a
modern cloud-based hosting platform. Indeed, he notes, the Government
Accountability Office, the auditing wing of the federal government, has
already developed a streamlined process for government agencies to lease
cloud computing resources.
The GAO has even granted some hosting providers "FISMA level 2
security certification," Schultze points out, which allows the
Department of Homeland Security to use them for its applications. "If
it's good enough for DHS, it's good enough for the courts," Schultze
argued.
Schultze believes the courts could shift their servers to the cloud
with minimal technical changes. "They would just start up a new virtual
machine for every court. Each court could continue to administer their
own PACER instance. There's no complicated engineering required."
Schultze believes the judiciary's Amazon bill could be as little as
$1 million per year, or less than one percent of what the courts are
currently charging. Malamud is less optimistic, given the inherent
inefficiencies of government bureaucracies. But he believes an efficient
PACER system shouldn't cost more than $10 million.
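To put the competing cost figures side by side, the sketch below compares each one with the roughly $120 million in fees collected in 2012. Every number comes from the estimates quoted in this article and is approximate; the dictionary labels and variable names are our own, chosen for illustration.

```python
# Annual figures quoted in the article, in US dollars; all are approximate.
fees_collected = 120_000_000  # PACER fees paid by users in 2012

cost_estimates = {
    "Judiciary's own figures (per Schultze)": 20_000_000,
    "Schultze's cloud-hosting estimate": 1_000_000,
    "Malamud's ceiling for an efficient system": 10_000_000,
}

for label, cost in cost_estimates.items():
    share = cost / fees_collected     # cost as a fraction of fee revenue
    surplus = fees_collected - cost   # fees left over after running PACER
    print(f"{label}: {share:.1%} of fees, ${surplus / 1e6:.0f} million left over")
```

Under Schultze's estimate, hosting would consume less than one percent of current fee revenue. Even under the judiciary's own figures, about $100 million a year would be left over.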
Interestingly, the executive branch pays the courts millions of
dollars every year in PACER fees. The Department of Justice alone pays
the courts about $4 million per year for access to public court
documents. Schultze believes the money Congress currently allocates for
executive branch agencies to pay PACER fees would be sufficient to fund
the entire PACER system. That would allow the judiciary to eliminate
PACER fees to private users.
Open PACER
When we contacted the Administrative Office of the courts for comment, they
stressed that "fully 95 percent of all PACER fees come from just five
percent of all users. Court opinions are free, and 65 to 75 percent of
active PACER users don't exceed $15 of use in a quarter, and therefore
are not charged. In addition, academic researchers, pro bono lawyers,
and indigent users can apply for exemptions."
But Schultze doesn't believe waivers address the problems with
PACER's fee system. "Obtaining a waiver requires filing a separate
request with each court, which may grant and revoke the waiver at its
discretion," Schultze noted in an e-mail. "Many classes of individuals
are not even eligible to apply, including the media."
As a practical matter, the major obstacle to opening PACER is likely
financial. The judiciary tells Ars that in addition to
financing PACER itself, PACER fees go to pay for "electronic case filing
and about a half-dozen other information technology categories" in what
it calls its "public access program." In other words, PACER has become a
cash cow for the judicial branch, generating $100 million in profits
the courts have plowed into non-PACER IT projects.
It's understandable the courts wouldn't want to give up that revenue
in an era of austerity. But for Schultze, that revenue stream isn't a
good enough reason to restrict public access to public documents. He
drafted the Open PACER Act to mandate the paywall's elimination.
"My bill is one page," Schultze told us. "It does two things. First,
it repeals the court's ability to charge for access to electronic public
records. Second, it mandates that they provide electronic public records
to the public for free."
In recent weeks, Schultze made multiple trips to DC to lobby for the
proposal. He hasn't found a sponsor yet, but he's optimistic he'll find
one soon. "I've been talking to potential sponsors in both the House and
the Senate," Schultze said. "There are many members of Congress that
see government transparency as a high priority. I expect that those are
the members that will sponsor the bill."
Several members of Congress stopped to pay their respects at a memorial service
for Swartz held in DC on February 4. Among the speakers was Rep.
Darrell Issa (R-CA), an influential Republican who has championed open
government. Sen. Ron Wyden (D-OR), a reform-minded Democrat, also spoke.
At times, the event took on the tone of a political rally.
So far, most of the legislative attention in the wake of Swartz's death has focused on "Aaron's Law"
to reform the Computer Fraud and Abuse Act. But Schultze believes
tearing down the PACER paywall should also be a priority. After all,
public access to information was a central theme of Swartz's life.
Opening PACER would be another fitting tribute to his memory.