Data has no inherent value. To be useful, data must flow to agents who will process, analyze, and synthesize it to produce information that drives decisions. The recent conversation in the DoD has focused on what is referred to as the “big data problem”: since we don’t know in advance what’s important in the data being collected, everything must be saved. But this is much harder than it sounds.
The problem can be summed up best with an example. One ARGUS wide-area EO sensor collects approximately 6 PB of data in a 24-hr period. The following graphic gives you an idea of how much data we are talking about.

How much data is a petabyte? (courtesy of Mozy.com)
Data is ubiquitous, storage is commoditized, comms are precious
As daunting as the data explosion might seem, it has been accompanied by a dramatic increase in the supply of mass data storage solutions that leverage “cloud” architectures. The “big data problem” actually has less to do with data storage than it does with transporting the data, that is, moving data from the edge (where it’s collected) to the core (where it’s stored).
The key constraint in the data storage equation is comms. And comms don’t scale. To my knowledge, there are no near-term, brute-force solutions that alleviate this constraint. Wireless data links provide nowhere near the throughput required by today’s military data sources (e.g., ARGUS), and expeditionary operating environments don’t lend themselves to the installation of physical pipes.
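To put the ARGUS figure in perspective, here is a back-of-the-envelope calculation in Python. It takes the 6 PB per 24 hours quoted above (read as decimal petabytes, which is an assumption) and works out the sustained link rate needed to keep up. The 100 Mbit/s link rate is a purely hypothetical round number chosen for illustration, not the specification of any real data link.

```python
SECONDS_PER_DAY = 24 * 60 * 60
PETABYTE = 10**15  # assumption: decimal petabytes (10^15 bytes)


def required_throughput_bps(bytes_per_day):
    """Sustained bit rate needed to move one day's collection in one day."""
    return bytes_per_day * 8 / SECONDS_PER_DAY


def days_to_transfer(num_bytes, link_bps):
    """Days needed to push num_bytes through a link running at link_bps."""
    return num_bytes * 8 / link_bps / SECONDS_PER_DAY


daily_bytes = 6 * PETABYTE          # one sensor, one day (figure from the text)
ASSUMED_LINK_BPS = 100e6            # hypothetical 100 Mbit/s link, illustration only

required_bps = required_throughput_bps(daily_bytes)
backlog_days = days_to_transfer(daily_bytes, ASSUMED_LINK_BPS)

print(f"Required sustained rate: {required_bps / 1e9:.1f} Gbit/s")
print(f"Days to move one day of data at the assumed rate: {backlog_days:.0f}")
```

Under these assumptions, one sensor needs roughly 556 Gbit/s of sustained throughput, and a single day of collection would take more than 15 years to move over the assumed link. The exact numbers matter less than the gap of several orders of magnitude.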
Save all the data?
The current state of the art allows us to save all of the data (more or less), but it doesn’t allow us to move all of the data from the edge to the cloud. It’s clear that we need to start thinking about this problem in a different way…
Data is not important. It’s the information that can be gleaned from the data that matters. The old data paradigm emphasizes precision: save only what you consider to be relevant at the time the data is collected. This approach works only so long as you are dealing with a more or less static context where “relevance” can be readily established.
In the contemporary threat environment, operational context is constantly changing. Since relevance can’t be established a priori, the natural inclination is to save all of the data based on some indeterminate future value. But “data hoarding” only works so long as the means to aggregate massively distributed data actually exists.
Analytics at the edge
It may be within the realm of the possible to save all of the data, but it’s not possible to move all of that data around. This realization has led the community to consider approaches that aggregate metadata (i.e. data that describes the underlying data sets). Such approaches provide a valuable window into the distributed data inventory but fail to address the problem of leveraging the aggregate data to produce information.
A smarter solution is to process data at the edge to derive feature vectors that describe the information contained in the data. More processing (and not just static data storage) at the edge supports rapid indexing, correlation, and fusion of data to establish the rich contextual relationships between data sets, along with the spatial, temporal, and phenomenological derivatives that capture the underlying dynamics of the data. Rather than “storing everything,” such an approach enables the community to “exploit everything” and store only what is needed.
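As a toy illustration of the idea, the Python sketch below reduces a raw image frame to a small feature vector before anything is transmitted. Everything in it is an assumption made for illustration: the synthetic 64x64 frame, the function names `tile_features` and `encode`, and the choice of per-tile mean and variance as features. Real edge analytics would use far richer features; this is only meant to show how much smaller a derived description can be than the data it describes.

```python
import struct
from statistics import mean, pvariance


def tile_features(frame, tile):
    """Summarize each tile x tile block of the frame by its mean and variance.

    frame is a list of rows of pixel intensities (ints in 0..255).
    These two statistics are a hypothetical stand-in for real features.
    """
    rows, cols = len(frame), len(frame[0])
    features = []
    for r in range(0, rows, tile):
        for c in range(0, cols, tile):
            pixels = [
                frame[y][x]
                for y in range(r, min(r + tile, rows))
                for x in range(c, min(c + tile, cols))
            ]
            features.append(float(mean(pixels)))
            features.append(float(pvariance(pixels)))
    return features


def encode(features):
    """Pack the feature vector as little-endian 32-bit floats for transmission."""
    return struct.pack(f"<{len(features)}f", *features)


# Synthetic 64x64 frame at one byte per pixel (illustrative data only).
frame = [[(x * y) % 256 for x in range(64)] for y in range(64)]
raw_bytes = len(frame) * len(frame[0])

payload = encode(tile_features(frame, tile=16))

print(f"Raw frame: {raw_bytes} bytes")
print(f"Feature payload: {len(payload)} bytes")
```

Here the 4,096-byte frame becomes a 128-byte payload (16 tiles, two 4-byte floats each). The raw frame can stay in storage at the edge while only the payload crosses the link.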
Processing at the point of collection is the key idea underwriting Mav6’s Service Oriented Horizontal Information Exchange (SOHIX), which is the computational backbone of the Blue Devil Block II Payload Integration Infrastructure (PII). Leveraging SOHIX and the parallelized SOHIX data processing architecture, we are turning raw sensor data into actionable information that can be disseminated and accessed over conventional air-to-ground data links.