DARPA's Building a New Search Engine to Crawl the Deep Web
The massive internet brain is some 500 times bigger
than what we web users can actually see. Search engines only index a
fraction of the web pages online, and the rest of the internet remains
hidden from view—thousands of terabytes of invisible
information. Unsurprisingly, the Defense Department wants to gain access
to the internet's hidden data, and it has a plan to create an entirely
new search paradigm for the military, law enforcement, and intelligence
agencies to use to shine a light on the deep web.
Yesterday DARPA called for proposals to create a next-gen search engine to "revolutionize the discovery, organization and presentation of search results." The
project's name, Memex, a portmanteau of "memory" and "index," comes
from a way-ahead-of-its-time concept for indexing the world's
information that was floated in 1945 by scientist Vannevar Bush and that later helped inspire hypertext, the World Wide Web, and personal computing.
More
on that later; first, here’s how DARPA plans to access the invisible
web. The agency laid out what it sees as the shortcomings
of search today: It ignores shared content across web pages, doesn't
save browsing sessions or allow results to be shared with collaborators.
It doesn't crawl sites that aren't indexed, only organizes results in a
list of links, and requires entering the exact right text to get the
results you’re looking for.
Most
importantly, it's centralized: search today is a one-size-fits-all
product. Instead, DARPA wants a system that can tailor searches to focus
on a specific topic, or realm of the internet. It would automate the
process, continuously crawling the web for a mission-specific subject,
and would leverage image recognition and natural language technology to
find content beyond plugging in certain keywords.
It
would also drastically expand the scope of what is indexed, to include
"link discovery and inference of obfuscated links, discovery of deep
content such as source code and comments, discovery of dark web content,
hidden services, etc," according to the project report.
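To make the topic-focused crawling described above a little more concrete, here is a minimal sketch of a "best-first" crawl in Python. It is an illustration only, not DARPA's design: the tiny in-memory web (`PAGES`), the keyword-overlap scoring, and every function name are assumptions invented for this example. The idea it shows is simply that a crawler can rank the links it has yet to visit by how relevant the linking page was to a chosen subject.

```python
import heapq

# Hypothetical in-memory "web" so the sketch runs without network access:
# URL -> (page text, outgoing links). All entries are made up.
PAGES = {
    "a.example/start": ("general news and weather",
                        ["a.example/ads", "a.example/sports"]),
    "a.example/ads": ("job postings and classified advertisements",
                      ["a.example/forum"]),
    "a.example/sports": ("football scores", []),
    "a.example/forum": ("forum thread about suspicious job postings", []),
}


def fetch_from_memory(url):
    """Stand-in for a real fetcher; returns (text, links) for a URL."""
    return PAGES[url]


def relevance(text, topic_terms):
    """Toy score: fraction of topic terms that appear in the page text."""
    words = set(text.lower().split())
    return sum(term in words for term in topic_terms) / len(topic_terms)


def focused_crawl(seed, topic_terms, fetch, max_pages=10):
    """Visit pages best-first, preferring links found on relevant pages."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so priorities are negated
    seen = {seed}
    results = []
    while frontier and len(results) < max_pages:
        _, url = heapq.heappop(frontier)
        text, links = fetch(url)
        score = relevance(text, topic_terms)
        results.append((url, score))
        for link in links:
            if link not in seen:
                seen.add(link)
                # A link inherits the score of the page that points to it.
                heapq.heappush(frontier, (-score, link))
    return results


if __name__ == "__main__":
    topic = ["job", "postings", "advertisements"]
    for url, score in focused_crawl("a.example/start", topic, fetch_from_memory):
        print(f"{score:.2f}  {url}")
```

A real system would swap the keyword overlap for the image recognition and natural language techniques the agency mentions, and would keep re-crawling rather than stopping after one pass.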
The
idea is to eventually use the personalized indexing to comb through the
hoards of information that are in the public domain but currently not
indexed. But first, the military would focus on hunting down human
traffickers, and the modern-day slave trade that lives largely on the
web in forums, chats, advertisements, job postings, and hidden services.
It’s also eyeing other domains: counterfeit goods, missing persons, and
found data.
Naturally,
the government trying to pry into every nook and cranny of the internet
is a loaded topic right now. But the defense agency claimed, for what
it's worth, that while it's sniffing around the deep web it's not trying
to out any anonymous users or spy on anyone. It states it's
"specifically not interested in proposals for the following: attributing
anonymous services, deanonymizing or attributing identity to servers or
IP addresses, or gaining access to information which is not intended to
be publicly available." But exactly
how the DoD plans to bust sex traffickers in the hidden web without
deanonymizing users or identifying IP addresses, you’ve got me.
That
mystery aside, the mid-century memex contraption that's inspired
DARPA's latest project is fascinating in retrospect. The agency is
drawing on an idea first conceived during World War II, and described by
Bush in an Atlantic article called "As We May Think."
Bush
wrote that when the war is over, scientists should get to work on the
"massive task of making more accessible our bewildering store of
knowledge." Decades before the personal computer came along, Bush
imagined a "device," which he named memex, that would be used as a mechanism
for finding and organizing the world's information, basically acting as a
mechanical backup for the human brain.
[Video: Memex animation - Vannevar Bush's diagrams made real]
He
imagined a desk with a keyboard, buttons, levers, and two slanted
translucent screens for reading. It could store troves of
information—books, articles, scientific work all stored as microfilm.
Users would consult the record by inputting a code to pull up a certain
book, and pulling the lever to scan through the pages backward and
forward. They could also use a stylus to take notes on the second
screen.
But
where Bush's proto-hypertext vision deviates from modern-day search is
that he envisioned being able to save and build on "trails" of
information gathering—like going down a series of Wikipedia rabbit holes
and then being able to save that adventure, recall it later, and share
it with other researchers.
Per "As We May Think":
Wholly new forms of encyclopedias will appear, ready made with a mesh of associative trails running through them, ready to be dropped into the memex and there amplified. The lawyer has at his touch the associated opinions and decisions of his whole experience, and of the experience of friends and authorities. The patent attorney has on call the millions of issued patents, with familiar trails to every point of his client's interest. The physician, puzzled by a patient's reactions, strikes the trail established in studying an earlier similar case, and runs rapidly through analogous case histories, with side references to the classics for the pertinent anatomy and histology.
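For readers who think in code, the "trail" idea can be pictured as a very small data structure: an ordered list of visited pages with notes, which can be saved, reloaded, and handed to someone else. The Python sketch below is only one way to picture it. The class names, fields, and JSON format are assumptions made for illustration; they come neither from Bush's essay nor from DARPA's proposal.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class TrailStep:
    """One stop on a trail: a page plus an optional researcher's note."""
    url: str
    note: str = ""


@dataclass
class Trail:
    """An ordered, named record of pages visited during one investigation."""
    name: str
    steps: list = field(default_factory=list)

    def visit(self, url, note=""):
        """Append a page to the trail; returns self so calls can be chained."""
        self.steps.append(TrailStep(url, note))
        return self

    def to_json(self):
        """Serialize the trail so it can be stored or shared."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload):
        """Rebuild a trail from the output of to_json()."""
        data = json.loads(payload)
        return cls(data["name"], [TrailStep(**step) for step in data["steps"]])


if __name__ == "__main__":
    trail = (
        Trail("memex background")
        .visit("https://example.org/bush-1945", "original essay")
        .visit("https://example.org/hypertext", "later influence")
    )
    shared = trail.to_json()           # what a collaborator would receive
    restored = Trail.from_json(shared)
    print(restored.name, [step.url for step in restored.steps])
```

The point of the sketch is the round trip: a saved trail comes back with its order and notes intact, which is the part of Bush's vision that ordinary search history does not offer.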
In
a nutshell, Bush wanted to mimic how the human brain thinks, learns,
and remembers information. Which is exactly what artificial intelligence
researchers at the DoD and in Silicon Valley
are trying to do now, to glean better insights from the unruly army of
big data being collected by web giants and the military alike.
Now
DARPA plans to extend that next-gen capability to the deep web, or at
least try to—a rather unsettling prospect regardless of the agency’s
no-spying disclaimer. While I’m all for improving search and unveiling
the internet’s untapped information, what are the implications for people
with good reason to stay in the digital dark—users trying to evade
censorship, whistleblowers, journalists, and activists? Exactly how much
light does the military want to shine on the hidden web?