Wednesday, June 12, 2013

What the NSA can do with “big data”

The NSA can't capture everything that crosses the Internet—but doesn't need to.

NSA Headquarters in Fort Meade, MD.
mjb
One organization's data centers hold the contents of much of the visible Internet—and much of it that isn't visible just by clicking your way around. It has satellite imagery of much of the world and ground-level photography of homes and businesses and government installations tied into a geospatial database that is cross-indexed to petabytes of information about individuals and organizations. And its analytics systems process the Web search requests, e-mail messages, and other electronic activities of hundreds of millions of people.
No one at this organization actually "knows" everything about what individuals are doing on the Web, though there is certainly the potential for abuse. By policy, all of the "knowing" happens in software, while the organization's analysts generally handle exceptions (like violations of the law) picked from the flotsam of the seas of data that their systems process.
I'm talking, of course, about Google. Most of us are okay with what Google does with its vast supply of "big data," because we largely benefit from it—though Google does manage to make a good deal of money off of us in the process. But if I were to backspace over Google's name and replace it with "National Security Agency," that would leave a bit of a different taste in many people's mouths.
Yet the NSA's PRISM program and the capture of phone carriers' call metadata are essentially about the same sort of business: taking massive volumes of data and finding relationships within it without having to manually sort through it, and surfacing "exceptions" that analysts are specifically looking for. The main difference is that with the NSA, finding these exceptions can result in Foreign Intelligence Surveillance Act (FISA) warrants to dig deeper—and FBI agents knocking at your door.
So what is it, exactly, that the NSA has in its pile of "big data," and what can they do with it?

Drinking from the fire hose

Let's set aside what US law allows the NSA to do for a moment, and focus on some other laws that constrain the intelligence agency: the laws of physics and Moore's Law, to start with. The NSA has the capability to collect massive amounts of data on traffic over switched phone networks and the Internet and has had that capability for some time, thanks to cooperation from the phone companies themselves, deep packet inspection and packet capture hardware, and other signals monitoring capabilities. But they haven't had the ability to truly capture and store that data en masse and retain it indefinitely until relatively recently, due in part to work started at Google and Yahoo.
We know some of this thanks to an earlier whistleblower—former AT&T employee Mark Klein, who revealed in 2006 that AT&T had helped NSA install a tap into the fiber backbone for AT&T's WorldNet, "splitting" the traffic to run into a Narus Insight Semantic Traffic Analyzer. (The gear has since been rebranded as "Intelligence Traffic Analyzer," or ITA.)
The "secret room" in AT&T's Folsom Street office in San Francisco is believed to be one of several Internet wiretapping facilities at AT&T offices around the country feeding data to the NSA.
Mark Klein
Narus' gear was also used by the FBI as a replacement for its homegrown "Carnivore" system. It scans packets for "tag pairs"—sets of packet attributes and values that are being monitored for—and then grabs the data for packets that match the criteria. In a September 2012 interview, Narus' director of product management for cyber analytics, Neil Harrington, told me the company's Insight systems can analyze and sort gigabits of data each second. "Typically with a 10 gigabit Ethernet interface, we would see a throughput rate of up to 12 gigabits per second with everything turned on. So out of the possible 20 gigabits, we see about 12. If we turn off tag pairs that we're not interested in, we can make it more efficient."
A single Narus ITA is capable of processing the full contents of 1.5 gigabytes' worth of packet data per second. That's 5,400 gigabytes per hour, or 129.6 terabytes per day, for each 10-gigabit network tap. All that data gets shoveled off to a set of logic servers using a proprietary messaging protocol, which process and reassemble the contents of the packets, turning petabytes per day into gigabytes of tabular data about traffic—the metadata of the packets passing through the box—and captured application data.
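To make those two ideas concrete, here's a minimal Python sketch of tag-pair matching and of the volume arithmetic for a single 10-gigabit tap. The attribute names, the sample packet, and the watch list are invented for illustration; only the 1.5 gigabyte-per-second figure comes from the numbers above, and none of this reflects Narus' actual software.

```python
# A sketch of the two ideas above, not Narus' actual software: match packets
# against monitored "tag pairs" (attribute/value criteria), and check the
# back-of-the-envelope volume math for a single 10-gigabit tap.

# Hypothetical attribute/value pairs an analyst might monitor for.
MONITORED_TAG_PAIRS = {
    ("protocol", "smtp"),
    ("dst_port", "25"),
}

def matches(packet_attrs):
    """Return True if any monitored attribute/value pair appears in the packet metadata."""
    return any((key, str(value)) in MONITORED_TAG_PAIRS
               for key, value in packet_attrs.items())

# Invented packet metadata; a real tap would see millions of these per second.
packet = {"protocol": "smtp", "src_ip": "203.0.113.7", "dst_port": 25}
print(matches(packet))  # True, so this packet's contents would be captured

# The article's throughput figures for one ITA on a 10-gigabit tap.
gb_per_second = 1.5                     # gigabytes of packet data per second
gb_per_hour = gb_per_second * 3600      # 5,400 GB per hour
tb_per_day = gb_per_hour * 24 / 1000    # 129.6 TB per day
print(f"{gb_per_hour:,.0f} GB/hour, {tb_per_day:.1f} TB/day")
```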
The NSA operates many of these network taps, both in the US and around the world. But that's a massive fire hose of data to digest in any meaningful way, and in the early days of packet capture the NSA faced a few major problems with that vast stream of data. Storing it, indexing it, and analyzing it in volume required technology beyond what was generally available commercially. And considering that, according to Cisco, total world Internet traffic for 2012 was 1.1 exabytes per day, it is physically impossible, let alone practical, for the NSA to capture and retain more than a fraction of the world's Internet traffic on a daily basis.
There's also the issue of intercepting packets protected by Secure Sockets Layer (SSL) encryption. Breaking the encryption of SSL-protected traffic is, under the best of circumstances, computationally costly and can't be applied across the whole of Internet traffic (despite the apparent certificate-cracking success demonstrated by the Flame malware attack on Iran). So while the NSA can probably do it, they probably can't do it in real time.

The original social network

Internet monitoring wasn't the only NSA data collection exposed in 2006. In May of that same year, details emerged about the NSA's phone call database, obtained from phone carriers. The database is built from call data records—data on the time and length of calls, the phone numbers involved, and location data for mobile devices, among other things—and collection started shortly after the terrorist attacks of September 11, 2001, with the cooperation of AT&T, Verizon, and BellSouth. Long-distance provider Qwest declined to participate in the program without the issuance of a FISA warrant.

According to reporting by USA Today, the NSA used the database for "social network analysis." While the target of the analysis was intended to be calls connecting to individuals overseas, the NSA scooped up the entire database from these companies, including domestic calls.
That database, or at least its successor, is called MARINA, according to reporting by The Week's Marc Ambinder. And according to documents revealed by the Guardian last week, the NSA is still collecting call data records for all domestic calls and calls between US and foreign numbers—except now the agency is armed with FISA warrants. That includes (according to the FISA order) "comprehensive communications routing information, including but not limited to session identifying information (e.g., originating and terminating telephone number, International Mobile Subscriber Identity (IMSI) number, etc.), trunk identifier, telephone calling card numbers, and time and duration of call."
In 2006, USA Today called the call database "the largest database in the world." That transactional record data for billions upon billions of phone calls presents a physical-space problem on a scale similar to the one the NSA encountered in its Internet monitoring—or perhaps, initially, an even larger one. Finding the relationships implied by phone calls between people requires massive amounts of columnar data to be indexed and analyzed.
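As a rough illustration of what that indexing step looks like, here's a Python sketch that collapses simplified call data records into a who-called-whom graph. The record layout is loosely modeled on the fields in the FISA order quoted above; the field names, sample numbers, and code are all invented and do not describe the NSA's actual systems.

```python
# Invented call data records, loosely modeled on the fields in the FISA order,
# and the first indexing step of "social network analysis": collapsing
# columnar records into a graph of who calls whom.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class CallRecord:
    originating_number: str
    terminating_number: str
    start_time: str          # e.g. "2013-06-01T14:03:22Z"
    duration_seconds: int
    imsi: str = ""           # subscriber identity, for mobile calls
    trunk_id: str = ""       # routing information

def build_call_graph(records):
    """Index records into an adjacency map: number -> set of numbers it exchanged calls with."""
    graph = defaultdict(set)
    for r in records:
        graph[r.originating_number].add(r.terminating_number)
        graph[r.terminating_number].add(r.originating_number)
    return graph

# Invented sample data.
records = [
    CallRecord("+15550100", "+15550111", "2013-06-01T14:03:22Z", 240),
    CallRecord("+15550111", "+442070000000", "2013-06-02T09:15:00Z", 65),
]
call_graph = build_call_graph(records)
print(sorted(call_graph["+15550111"]))  # ['+15550100', '+442070000000']
```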

The secret social graph

Ironically, at about the same time these two programs were being exposed, Internet companies such as Google and Yahoo were solving the big data storage and analysis problem. In November of 2006, Google published a paper on BigTable, a database with petabytes of capacity capable of indexing the Web and supporting Google Earth and other applications. And the work at Yahoo to catch up with Google's GFS file system—the basis for BigTable—resulted in Hadoop.
BigTable and Hadoop-based databases offered a way to handle huge amounts of data being captured by the NSA's operations, but they lacked something critical to intelligence operations: compartmentalized security (or any security at all, for that matter). So in 2008, NSA set out to create a better version of BigTable, called Accumulo—now an Apache Foundation project.
Accumulo is a "NoSQL" database, based on key-value pairs. It's a design similar to Google's BigTable or Amazon's DynamoDB, but Accumulo has special security features designed for the NSA, like multiple levels of security access. The program is built on the open-source Hadoop platform and other Apache products.
One of those security features is column visibility—a capability that allows individual items within a row of data to carry different classifications. Users and applications with different levels of authorization can access the same data but see more or less information, depending on each column's "visibility." Users with lower levels of clearance wouldn't even be aware that the columns of data they're prohibited from viewing exist.
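Accumulo's real API is Java, and its visibility expressions support boolean combinations of labels; the Python sketch below is only a conceptual model of the idea, with invented labels and data. Each stored cell carries a set of required authorizations, and a scan silently drops anything the reader isn't cleared for.

```python
# Conceptual sketch of cell-level visibility, with invented labels and data.
# Each stored cell requires a set of authorizations; a scan silently drops
# cells the reader isn't cleared for, so withheld columns never appear.

# (row, column) -> (required authorizations, value)
table = {
    ("person:1234", "name"):   (set(),            "J. Doe"),
    ("person:1234", "phone"):  ({"SECRET"},       "+15550100"),
    ("person:1234", "source"): ({"SECRET", "SI"}, "intercept-7"),
}

def scan(row, user_auths):
    """Return only the columns this user is authorized to see."""
    return {
        column: value
        for (r, column), (required, value) in table.items()
        if r == row and required <= user_auths
    }

print(scan("person:1234", set()))             # {'name': 'J. Doe'}
print(scan("person:1234", {"SECRET"}))        # name and phone
print(scan("person:1234", {"SECRET", "SI"}))  # all three columns
```

Pushing the check into the scan itself is what keeps lower-cleared users from even learning that the withheld columns exist.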
Accumulo also can generate near real-time reports from specific patterns in data. So, for instance, the system could look for specific words or addressees in e-mail messages that come from a range of IP addresses, or it could look for phone numbers that are two degrees of separation from a target's phone number. Then it can spit those chosen e-mails or phone numbers into another database, where NSA workers can peruse them at their leisure.
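The second example, finding everything within two degrees of separation of a target number, is essentially a two-hop walk over a call graph like the one sketched earlier. Here's a toy version in Python, with an invented graph standing in for billions of indexed call records.

```python
# Toy version of a two-degrees-of-separation query over a call graph.
# The graph below is invented; a real query would run over billions of edges.

def within_two_degrees(graph, target):
    """Numbers reachable from the target in one or two hops, excluding the target itself."""
    first_hop = set(graph.get(target, ()))
    second_hop = set()
    for number in first_hop:
        second_hop.update(graph.get(number, ()))
    return (first_hop | second_hop) - {target}

call_graph = {
    "+15550100": {"+15550111"},
    "+15550111": {"+15550100", "+442070000000"},
    "+442070000000": {"+15550111", "+15550199"},
}
print(sorted(within_two_degrees(call_graph, "+15550100")))
# ['+15550111', '+442070000000'] -- one hop out and two hops out
```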
In other words, Accumulo allows the NSA to do what Google does with your e-mails and Web searches—only with everything that flows across the Internet, or with every phone call you make.
It works because of a type of server-side process called an "iterator." These pieces of code constantly process the information sent to them and send back reports on emerging patterns in the data. Querying a multi-petabyte database and waiting for a response would be deadly slow, especially because new data is always being added. The iterators are like the NSA's tireless data elves.
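Accumulo's actual iterators are Java classes that run inside its tablet servers, so the following Python generator is only a loose analogy: a piece of code sitting in the data path that checks each new record against watch criteria and emits matches for a downstream report. The watch words, the IP range, and the records are invented.

```python
# Loose analogy for a server-side "iterator": a generator in the data path
# that checks each incoming record against watch criteria and yields matches
# for a downstream report. Watch words, IP range, and records are invented.

WATCH_WORDS = {"detonator", "safehouse"}
WATCH_NETWORK = "198.51.100."   # an address range of interest

def pattern_iterator(record_stream):
    """Yield only records from the watched range that mention a watched word."""
    for record in record_stream:
        words = set(record["body"].lower().split())
        if record["src_ip"].startswith(WATCH_NETWORK) and (WATCH_WORDS & words):
            yield record

incoming = [
    {"src_ip": "198.51.100.23", "body": "meet at the safehouse at nine"},
    {"src_ip": "203.0.113.50",  "body": "dinner on friday?"},
]
for hit in pattern_iterator(incoming):
    print(hit["src_ip"])   # only the first record is reported
```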
Accumulo is just one weapon in the NSA's armory. The aggregated data pumped out of Accumulo can be pulled into other tools for analysis, such as Palantir's analytic databases and its Graph application. Graph builds a visualization of the links between "entities" based on their attributes and relationships, and it searches based on those relationships—conceptually similar to Facebook's Unicorn search and social graph, Google's Knowledge Graph, and Microsoft Research's Satori.
A demonstration of the visualization of relationships and search capabilities of Palantir's Graph application.
Tools like Palantir can only work with smaller subsets of big databases like the MARINA phone database. But the back-end work done by Accumulo can generate data sets from massive data stores that are much more manageable for analysis tools. And thanks to the NSA's connection to other social networks, there's another source of relationship data that's always on tap: PRISM.

PRISM's backdoor

One of the obstacles to NSA monitoring of Internet communications is SSL. On the surface, "cloud" services such as Gmail, Facebook, and the service formerly known as Hotmail have made that problem harder to overcome as they've pulled more interactions in behind SSL-protected sessions. But ironically, those communications services actually started to make it easier for the NSA to collect that protected data through the PRISM program.
According to the slides leaked by NSA contractor Edward Snowden and published by The Washington Post and The Guardian, Microsoft began to provide data to the NSA in 2007. Under this program, the NSA obtained access to the servers behind cloud services and the user data within them, sparing it the need to crack SSL and letting it pull in the stored data directly.
PRISM gives the NSA an online connection to cloud providers, though there is some dispute over how that connection works. The slides published by The Guardian and The Washington Post call it a "direct connection" to the services' servers. However, The Guardian and The New York Times reported that company sources describe a more Dropbox-like setup, with "secure online rooms" that the services can use to hand over data to the NSA, synchronized from their servers. That data includes information about where users have connected from and who they're communicating with, as well as the raw data from e-mails and shared documents themselves. Similar information obtained by an FBI probe, ironically, uncovered the affair of former CIA Director David Petraeus and his biographer Paula Broadwell.
The NSA could theoretically export much of the metadata from these services—without having a specific target—in order to preserve data in the event that the NSA has cause to perform a search. But it's unlikely, simply for storage capacity reasons, that they copy the application data itself—e-mails, attachments, etc.—on a large scale. PRISM also allows the NSA to implement surveillance of chosen subjects' live interactions through the service, including presence data (notification of when a subject is logged in), instant messaging, video and voice chat, and Voice over IP phone calls using the services.
The NSA's data center under construction in Bluffdale, Utah, will have a storage capacity measured in zettabytes.

Ways and means

With these vast amounts of data being collected, it's easy to understand why NSA would need a data center in Utah with a capacity measured in zettabytes. And it's also easy to see why privacy advocates would be concerned about the potential for abuse of the data.
Setting policy aside and focusing purely on capabilities, the NSA has the technology in hand to create detailed maps of the relationships between hundreds of millions of people inside and outside the US, and the means to dip into the communications that make up those relationships. It also has the ability to safeguard that data from casual access by people not cleared to use it. And it likely has the ability, when the need arises, to defeat the most basic means used to protect communications from surveillance.
Even with the giant data center the agency is constructing, it's still not going to be able to capture the entirety of Internet traffic. But the NSA doesn't have to capture all of it to have a view of what nearly anyone is up to—the metadata collected from traffic alone is enough to gather significant information about individuals' online activities.
The question is not, then, whether the NSA can or can't uncover nearly every aspect of an individual's digital life and go all "Enemy of the State" on someone. The question is whether the safeguards in place that govern their use of that capability are sufficient to protect against abuse. There are certainly layers of compartmentalization within the NSA's internal databases, but just how strict the safeguards are isn't known outside the NSA.
Director of National Intelligence James Clapper and other US officials say that the law guarantees that data "cannot be used to intentionally target any US citizen, or any other US person, or to intentionally target any person known to be in the United States." But Edward Snowden's statements would suggest that the law isn't enough to prevent government contractors from using the NSA as their personal search engine.
Sean Gallagher / Sean is Ars Technica's IT Editor. A former Navy officer, systems administrator, and network systems integrator with 20 years of IT journalism experience, he lives and works in Baltimore, Maryland.                              
