Yoga for Data

August 16, 2016 | September 1st, 2018 | Army AL&T Magazine

The path to enlightenment is not a straightforward one, for people or for data. That’s why flexibility is key when reaching for answers, and why it’s necessary to stretch data so that it can lead to more, better knowledge.

by Mr. Thom Hawkins and Mr. Matt Choinski

As acquisition professionals, with the hindsight of five, 10 or 20 years’ experience, we can move from blindly populating templates to an intuitive understanding of the connections between schedules and risk management, between our strategic plan and our daily operations. But even with experience, none of us has reached the pinnacle of perfect execution. There’s always more to learn, and the worst thing we can do is to close ourselves off to adaptation.

The Army’s No. 1 priority, readiness, must also be our top priority, and our readiness must be the ability to adjust to a rapidly changing world. We must be ready with the ability to provide new weapon system capabilities or information systems that can accommodate new categories of data, new ways to understand the complex world the warfighter must face.

THE DHARMA OF INFORMATION-SEEKING BEHAVIOR
Our information systems are not as dynamic as our information-seeking behavior. As T.D. Wilson notes in his paper “On user studies and information needs,” “It may be advisable to remove the term ‘information needs’ from our professional vocabulary and to speak instead of ‘information seeking towards the satisfaction of needs.’ ” This is our dharma, our path to truth, cosmic order. Wilson’s point is that information needs aren’t static—they change over time. “Now that I know that, I want to know this.” Now that I know we’re obligating the funds too early, why don’t we have better insight into the contractor burn rate? Each one of these questions would require a change to the structure of a database. A slightly different question may require changing how data are collected, stored or queried.

The Army’s ability to sustain its information systems depends on the flexibility of those systems. If those systems cannot adapt to changing information needs, we will see a quick transition to obsolescence followed by another expensive investment in the next generation, or even another overlapping system, maintained alongside the first one. Information-seeking behavior on its own isn’t expensive, but it becomes expensive once we have spent thousands of dollars building an infrastructure to collect the data behind a single answer. In other words, we can’t afford to change our minds about what we want to know.

YOGA FOR DATA
Our traditional data warehouses are highly structured and so rigid that they have become brittle. We need yoga for our data structures to increase their flexibility, to adapt to information-seeking behavior. The body of a data warehouse is its schema, a set of constraints that tells what the data must look like. Data must fit the schema to be entered into a database. If we want to add data that doesn’t fit the schema (for example, if we want to add a contractor burn rate not previously captured), then we must change the schema. While modifying the schema is marginally easier than forcing a human body into a new and difficult yoga position for which it has not prepared, it is still a costly and time-consuming exercise.
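To make the cost of a rigid schema concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (contracts, obligated_funds, burn_rate) are hypothetical, invented only for illustration. The sketch shows the database rejecting a burn rate until the schema itself has been altered.

```python
import sqlite3

# Hypothetical table and column names, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE contracts (contract_id TEXT, obligated_funds REAL)")
conn.execute("INSERT INTO contracts VALUES ('C-001', 250000.0)")


def try_insert_burn_rate(connection):
    """Attempt to store a burn rate; return True if the schema accepted it."""
    try:
        connection.execute(
            "INSERT INTO contracts (contract_id, obligated_funds, burn_rate) "
            "VALUES ('C-002', 100000.0, 8000.0)"
        )
        return True
    except sqlite3.OperationalError:
        # The table has no burn_rate column, so the database rejects the row.
        return False


accepted_before = try_insert_burn_rate(conn)

# The schema itself must change before the new data can be stored.
conn.execute("ALTER TABLE contracts ADD COLUMN burn_rate REAL")
accepted_after = try_insert_burn_rate(conn)

rows = conn.execute(
    "SELECT contract_id, burn_rate FROM contracts ORDER BY contract_id"
).fetchall()
```

In a one-table toy like this, the change is a single statement. In a real warehouse, the same change can ripple through load scripts, views and reports, which is where the cost and time accumulate.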

One of the underlying assumptions of a modern data warehouse is that the data must follow a common schema—if data is not consistent in description, in how it is measured, then we can’t relate the data to allow us to make that leap from data to information. This is a good assumption, but we’re applying it too early. We’re applying it to data collection rather than data analysis.

Forcing data into a common format complicates the process of pulling in data from other information systems. Imagine if we took the water piped into our houses and immediately separated it based on need. We’d have one tank of hot water with soap for showers, one tank for water with toothpaste for brushing our teeth, one for washing dishes, one for drinking, and so on. If we run out of drinking water, we can’t use the dish water, because it isn’t suitable. This is what we’re doing with our data when we force it into a schema—we’re assuming a particular use, but if we have a different question, it may not be suitable.

NAMASTE, DATA LAKE
A more efficient method is what we already do: Transform the water at the point of need, and add toothpaste when we’re ready to brush our teeth, or add soap when we’re ready to wash the dishes. With information systems, a pool of unstructured data is called a “data lake.” The key distinction between a data warehouse (a traditional relational database) and a data lake is when a structure is applied to the data. In a data warehouse, the schema is applied at the time the data are added to the warehouse; in a data lake, the schema is applied when data are called upon to answer an information need.
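The idea of applying the schema at the point of need can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of any particular product: the record fields and the read_with_schema helper are hypothetical. Raw records are stored exactly as they arrive, and each question brings its own schema at read time.

```python
import json

# Raw records land in the "lake" as-is; the field names are hypothetical.
raw_lake = [
    '{"contract_id": "C-001", "obligated_funds": 250000}',
    '{"contract_id": "C-002", "obligated_funds": 100000, "burn_rate": 8000}',
    '{"contract_id": "C-003", "monthly_spend": [7000, 9000]}',
]


def read_with_schema(lines, schema):
    """Apply a schema at read time: keep the requested fields, cast their types."""
    results = []
    for line in lines:
        record = json.loads(line)
        row = {}
        for field, cast in schema.items():
            value = record.get(field)
            # Records that lack a field are kept, with the gap made explicit.
            row[field] = cast(value) if value is not None else None
        results.append(row)
    return results


# Yesterday's question: how much has been obligated?
obligations = read_with_schema(
    raw_lake, {"contract_id": str, "obligated_funds": float}
)

# Today's question: what is the burn rate? Same lake, different schema.
burn_rates = read_with_schema(raw_lake, {"contract_id": str, "burn_rate": float})
```

Nothing about the stored records changed between the two questions; only the schema supplied by the person asking did.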

The data lake, therefore, is a better model for changing information needs. In the data lake model, information workers who understand what data are available and what the customers’ needs are at that time find the appropriate data and package it for each new information requirement. Users closer to the question are better positioned to answer it using the data at hand.

Recurring information needs can be answered just as quickly with a data lake as with a data warehouse, through a standard query and applied schema. As needs change, though, the data lake is the more responsive model—the data to answer the information need may already reside in the lake, or if not, can be extracted from other sources without any changes to the underlying infrastructure.

One application of the data lake concept is MIRARS, the Manpower Information Retrieval and Reporting System. MIRARS was designed by the Program Executive Office for Command, Control and Communications – Tactical’s Product Lead for Military Technology Solutions to provide personnel accountability (for example, through a daily roll call of employee locations). Several Army acquisition organizations rely on MIRARS for location awareness of their personnel in case of emergencies or other events. For example, during the January 2016 active shooter event at the Naval Medical Center in San Diego, these organizations were able to use MIRARS to determine almost instantly that no personnel were in the affected area.

Because of its flexible design, MIRARS can be modified quickly to accommodate new requirements from leadership without the difficult and cumbersome data migrations typical of relational databases. The ability to quickly adapt to new requirements is important because of the ever-increasing constraints on resources and budgets. Using a flexible schema allows teams to develop faster and in a more agile fashion, resulting in lower development and maintenance costs and higher-quality products.

A database structured by the relationships between its data elements is not flexible enough to withstand the stress of managing requirements from multiple stakeholders. With MIRARS, by contrast, adding a new field is as simple as adding the element to the resulting report; no direct changes are applied to the database or its schema. For example, when there was a new requirement to track mandatory training for personnel, that information was added to the data lake by changing the source code, with no need to change other database objects such as views or stored procedures. This capability also helps to resolve seemingly incompatible requirements from various stakeholders, such as associating matrixed personnel with either their home organization or their matrix organization, because the data does not need to be changed, only the way each user sees it.
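A rough sketch of this pattern follows. It is not MIRARS code; the personnel records, organization names, field names and the build_report helper are all hypothetical. It shows a new field appearing on one record without touching the others, and two stakeholders grouping the same unchanged data by different organizations.

```python
# Hypothetical personnel documents; names and fields are for illustration only.
personnel = [
    {"name": "A. Smith", "home_org": "Home Org A", "matrix_org": "Matrix Org B"},
    {"name": "B. Jones", "home_org": "Home Org A"},
]

# A new requirement arrives: track mandatory training. Only new or updated
# documents carry the field; existing documents are left untouched.
personnel[0]["training_complete"] = True


def build_report(documents, org_field, columns):
    """Group documents by the chosen organization field and project columns."""
    report = {}
    for doc in documents:
        # Fall back to the home organization when a person is not matrixed.
        org = doc.get(org_field, doc["home_org"])
        row = {col: doc.get(col) for col in columns}
        report.setdefault(org, []).append(row)
    return report


# One stakeholder sees people under their home organization...
home_view = build_report(personnel, "home_org", ["name"])

# ...another sees them under their matrix organization, with the new field.
matrix_view = build_report(personnel, "matrix_org", ["name", "training_complete"])
```

Both views are computed from the same stored documents, so the two stakeholders' requirements never have to be reconciled in the data itself.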

PEO C3T built MIRARS using MongoDB’s nonrelational database software, taking advantage of this structureless revolution. MongoDB’s other organizational users range from Fortune 100 companies to local governments, and include the City of Chicago and Craigslist. The City of Chicago used MongoDB to build a predictive data management platform called WindyGrid that pairs analytics with maps to provide real-time insights on city operations. WindyGrid’s SmartData project allows Chicago city managers to predict trends and potential situations such as traffic congestion, resident migration and the depth of floods.

With 1.5 million new classified ads posted daily, Craigslist has built an archive of records numbering in the billions. Using a traditional relational database, Craigslist would need to apply schema changes to that entire archive to maintain the integrity of its data. By converting to a data lake concept, Craigslist can change the format for new ads or diversify the format across different types of ads without compromising access to its valuable historical data.

These applications by the City of Chicago and Craigslist have a clear relevance to today’s Army, extending forward to access and use mountains of data to inform decisions, and bending backwards to maintain access to historical records that could be mined for information if only we could afford to convert them to accessible formats.

THE PATH TO ENLIGHTENMENT
We may never achieve the wisdom of the yogi, but we can only learn through seeking, and as we seek, changing. As demonstrated by MIRARS, the endurance of a tool is based on its ability to change with the perspective and needs of its users. The information systems we’re building now, with their emphasis on responding to yesterday’s questions with today’s answers through a rigorously structured framework, will become legacy systems before we field them.

Because both our tactical and enterprise information needs change so rapidly in contrast with our requirements development and system procurement, rarely will we field a system that answers the needs of today’s Army, and never will we field one that will answer the needs of tomorrow’s Army. Our continued readiness is dependent on the versatility of our information systems to respond to our information-seeking behavior. Only by building flexibility into our systems through adaptive information techniques like the data lake will we maintain relevance without continuous unsustainable investment.

Unless we stretch, the peak will forever be out of reach.

For more information, go to http://peoc3t.army.mil/c3t. Information about the data lake concept can be found at http://martinfowler.com/bliki/DataLake.html, and information about MongoDB is at https://www.mongodb.org/.


MR. THOM HAWKINS is the continuous performance improvement program director and chief of program analysis for the Program Executive Office for Command, Control and Communications – Tactical (PEO C3T). He holds a B.A. in English from Washington College and an M.L.I.S. from Drexel University. Hawkins is Level III certified in program management and Level I certified in financial management, and is a member of the Army Acquisition Corps. He is an Army-certified Lean Six Sigma Black Belt and holds the Project Management Professional and Risk Management Professional credentials from the Project Management Institute.

MR. MATT CHOINSKI is a senior software developer at Data Systems Analysts Inc., providing contract support to PEO C3T, and lead software developer of MIRARS. He holds an MBA from Loyola College and a B.A. in business administration from Towson University.

This article will be printed in the October – December issue of Army AL&T magazine.



Related Links

T.D. Wilson, “On user studies and information needs,” Journal of Documentation