Open Source Big Data

July 14, 2016 | September 3rd, 2018 | Army ALT Magazine

PEO EIS and ARCYBER found something interesting in analyzing the government’s off-the-shelf big data systems: Nearly all are built using high-quality, open-source software. With a similar, government-owned platform, DOD would no longer pay high licensing fees. It could increase the competitive playing field and make all of its big data analytics work together.

by Maj. Isaac J. Faber and Ms. Elissa Zadrozny

Big data analytics—the process of examining massive data sets containing a variety of data types to uncover hidden patterns, correlations and other strategic business and operational information—is among the hottest trends in information technology and one of the Army’s highest priorities. The Army chief information officer/G-6 (CIO/G-6), in releasing the Army Data Strategy in February 2016, stated, “The Army will utilize a two-pronged approach for managing big data. First, the Army will redouble its efforts to implement effective data management methodologies to ensure that data are authoritative, timely, secure and of the highest quality. Second, the Army will develop a process for the identification, development and implementation of efficient decision support and analytical tools to best maximize the use of information derived from big data extrapolation.”

Toward this end, the Program Executive Office for Enterprise Information Systems (PEO EIS) and the U.S. Army Cyber Command (ARCYBER) have been piloting a government off-the-shelf (GOTS) platform built on open-source software and open standards. The pilot is intended to inform the way ahead.

The Army CIO/G-6 understands that Army data scientists, technologists and acquisition professionals need to work together and focus on identifying the best and most efficient ways to partner with industry to help the Army realize the promise of big data.

That’s because a big data system lets you sift through large volumes of data from a variety of sources faster than a traditional database can. It does this by breaking the data into smaller pieces, spreading the processing of those pieces across many machines working in parallel, and returning the partial results to a consolidation point. This is known as parallel computation, and it’s what is needed to tackle the data management challenges faced by our cyber network defenders. Google is the most recognized pioneer in tackling the big data challenge of indexing and searching the unceasing volume, variety and velocity (known as the 3Vs of big data) of structured and unstructured data.
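The split, process and consolidate pattern described above can be sketched in a few lines of Python. This is a minimal illustration, not how any fielded system is built: worker threads on one computer stand in for the many machines of a real cluster, and the log records, field names and function names are all hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Break the data set into roughly equal pieces."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_events(chunk):
    """Work done independently on one piece: tally event types."""
    return Counter(record["event"] for record in chunk)


def parallel_event_counts(records, n_workers=4):
    """Spread the pieces across workers, then consolidate the partial results."""
    chunks = split_into_chunks(records, n_workers)
    # Threads stand in for separate machines (an assumption for illustration).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_events, chunks))
    total = Counter()
    for partial in partial_counts:
        total.update(partial)  # the consolidation point
    return total


# Hypothetical network log records, invented for this example.
LOGS = [
    {"host": "10.0.0.1", "event": "login_failed"},
    {"host": "10.0.0.2", "event": "port_scan"},
    {"host": "10.0.0.1", "event": "login_failed"},
    {"host": "10.0.0.3", "event": "login_ok"},
    {"host": "10.0.0.2", "event": "port_scan"},
    {"host": "10.0.0.4", "event": "login_failed"},
]

event_totals = parallel_event_counts(LOGS)
```

Each worker sees only its own piece of the data, so adding workers (or machines) lets the same job cover more data in about the same time. The final merge is the only step that needs every partial result in one place.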

Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment. It is also an important tool to consider when implementing a big data strategy. Hadoop is sponsored by the Apache Software Foundation, which is dedicated to supporting open-source software projects for the public good. At its simplest, Hadoop provides a parallel-processing computing framework for data storage and processing. This is important for enterprise-level analysis because of physical limitations on how quickly a single machine can process information.

For example, when deploying a basic Hadoop system you first build all indexing strategies. These indexes are what allow you to organize data in a way that makes it quickly searchable, like a table of contents. For organizations looking to develop products to support big data, this first step has become a point of product differentiation, as performance is based on how well data is indexed. Product differentiation is key for companies looking to distinguish their product or service in the marketplace. Other differences (or divergences) become more evident as applications are built on top of the data store. Differences in visualizations, data science libraries, cloud architecture and access management are a few examples. While many of the same open-source distributions are used as a starting point, the end result is a product that is intended to work, on its own, from infrastructure to the user.
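To make the table-of-contents comparison concrete, here is a minimal sketch of one common indexing structure, an inverted index, which maps each term to the records that contain it. The article does not say which structures any particular product uses, so this choice is an assumption, and the sample records and function names are hypothetical.

```python
from collections import defaultdict


def build_index(documents):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the IDs of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    results = set(index.get(terms[0], set()))
    for term in terms[1:]:
        results &= index.get(term, set())
    return results


# Hypothetical alert messages, invented for this example.
DOCS = {
    1: "failed login from external host",
    2: "port scan detected from external host",
    3: "successful login from internal host",
}

index = build_index(DOCS)
external_hits = search(index, "external host")
login_hits = search(index, "login")
```

A search consults only the index entries for the query terms instead of scanning every record, which is why the quality of the indexing strategy has such a large effect on performance.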


A vendor- and product-neutral government off-the-shelf (GOTS) platform provides an environment for developing complex, cyber-hardened systems that lend themselves to frequent technology refreshes and rapid insertion of cutting-edge technology. (SOURCE: USAASC/Exdez/iStock)

The government is developing a strategy to enable communities with big data needs to have access to this technology. There are special considerations that need to be taken into account to ensure that this is done in a sustainable manner. A strategy of an open government platform with vendor-provided applications and infrastructure is an approach derived, in part, from the National Institute of Standards and Technology’s (NIST) cloud computing reference architecture. Big data systems leveraged for cyber analytics are typically built using cloud standards and technology. For the end user, this means access to all of the services through a modern Web browser. For engineers, it means building access through a modular framework of infrastructure, platform and software applications (or “apps”).

Consider the following: This morning, you probably awoke to an alarm that you set on your mobile phone. In addition, you probably reviewed email messages or read today’s headlines over coffee. Perhaps you checked the weather or traffic before leaving home for the day. All on the same device.

You probably rely on several apps on your phone to improve productivity and quality of life. What you probably do not think about is how different organizations develop each of these apps across a very diverse and competitive industry. Most modern software development efforts are based on a NIST-type modular framework, where applications are built to operate on a common, shared platform. For example, Apple iOS, Android, Xbox and PlayStation are platforms that provide an environment in which innovation can flourish. The environment in which an app is created and deployed is completely separate from the app itself. This environment includes not only the platform, but an entire development system that encourages seamless integration. The user doesn’t see this technical nuance, but it’s enormously important when considering life cycle costs and quality.

With software sustainment, the choice of platforms is the linchpin that allows for versioning, expansion, adaptability and flexibility. A robust platform enables independent apps to have limited deployments that can scale to a large user base when ready. In the same way, applications can be added or removed without impact to related services. Using a common platform is a distinct tradeoff for end users. Applications will be limited to platform services; however, more individuals can participate in development. This creates more diversity and competition. The personal choice of your mobile phone platform is an excellent example where you might choose a device based on the variety of applications that can be built and used on it.
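The separation between a shared platform and independently developed apps can be sketched as follows. This is a simplified illustration under stated assumptions: the `Platform` class, its methods and the two "vendor" analytics are hypothetical names invented here, not part of the Army pilot or any real product.

```python
from typing import Callable, Dict, List

Record = Dict[str, str]
Analytic = Callable[[List[Record]], object]


class Platform:
    """Hypothetical shared platform: it owns the data and hosts independent apps."""

    def __init__(self, records: List[Record]) -> None:
        self._records = list(records)
        self._apps: Dict[str, Analytic] = {}

    def register(self, name: str, app: Analytic) -> None:
        """Add an app; any function that accepts a list of records qualifies."""
        self._apps[name] = app

    def remove(self, name: str) -> None:
        """Remove an app without touching the data or the other apps."""
        self._apps.pop(name, None)

    def run(self, name: str) -> object:
        # Each app receives its own copy, so it cannot alter what others see.
        return self._apps[name]([dict(r) for r in self._records])

    def app_names(self) -> List[str]:
        return sorted(self._apps)


def count_failed_logins(records: List[Record]) -> int:
    """Analytic from hypothetical vendor A."""
    return sum(1 for r in records if r["event"] == "login_failed")


def list_hosts(records: List[Record]) -> List[str]:
    """Analytic from hypothetical vendor B."""
    return sorted({r["host"] for r in records})


# Hypothetical records, invented for this example.
platform = Platform([
    {"host": "10.0.0.1", "event": "login_failed"},
    {"host": "10.0.0.2", "event": "port_scan"},
    {"host": "10.0.0.1", "event": "login_failed"},
])
platform.register("failed_logins", count_failed_logins)
platform.register("hosts", list_hosts)

failed = platform.run("failed_logins")
platform.remove("failed_logins")  # the other app keeps working
hosts = platform.run("hosts")
```

Both analytics depend only on the record format the platform exposes, which illustrates the tradeoff described above: apps are limited to what the platform provides, but anyone who follows that interface can contribute one.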

One of the major challenges with the government procurement approach to acquiring technical solutions is “vendor lock-in.” Vendor lock-in occurs when a customer using a specific product or service cannot easily transition to a competitor. It is usually the result of proprietary technologies that are incompatible with those of competitors.

Historically, large technical system contracts have been awarded for total solutions that create dependencies on a particular vendor or provider. These dependencies make a single contractor the sole provider for an extended time because the startup investment for a new solution is cost-prohibitive.

Consider weapon system software developed using commercial off-the-shelf (COTS) products that are relevant to today’s standards and technology. If the initial award is given to a firm using a proprietary platform, the government may be forced to continue working with that firm for decades, even if the firm sells the technology or operates under a different company name. This type of lock-in is created because of government reliance on existing solutions and long development and procurement cycles for replacements.


How can the Army reduce the risk of vendor lock-in when it comes to big data? The answer is simple: Partner with industry to develop standards for interoperability and place a premium on adaptive and iterated innovation control. (SOURCE: 4X-image/iStock)

Operating systems, databases and office productivity suites are other examples of capabilities that, once purchased, are nearly impossible to re-compete without massive organizational effects. Throughout the enterprise, proprietary solutions can become the center of policy and workflow, making product changes difficult and cost prohibitive. So, how can the Army reduce the risk of vendor lock-in when it comes to big data?

The answer is simple: Partner with industry to develop standards for interoperability and place a premium on adaptive and iterated innovation control. The Army should build a core, standards-based platform and encourage vendors to develop applications that are adaptable and responsive to new requirements on that platform.

The cybersecurity domain offers an excellent test bed to explore this approach. Within the cyber domain, an enormous amount of data has to be collected and analyzed to find the most advanced threats. With this come significant requirements that cross technical and policy considerations. The capability required by the cyber community comes from the service (an “analytic”) or services that sit on top of a platform.

With product differentiation, nearly every analytic vendor uses a proprietary platform when building an analytic. This creates a potential vendor lock-in trap. There is a legitimate fear that when committing to a vendor-specific analytic, a proprietary platform will come along with it, excluding participation from other vendors. Lack of portability and interoperability of this type of solution lessens big data’s potential for the Army to store and share data in one place for use with different analytics from a wide variety of sources.

Because the level of effort to migrate data to a platform is so high, funding would most likely not be available for investment in multiple platforms. To this end, over the past few years, PEO EIS and ARCYBER have been experimenting with a big data cyber-analytics pilot.

Reviewing the technical requirements in the big data community uncovered something interesting: Nearly all vendor products are now based, largely, on high-quality open source distributions from the Apache Software Foundation. In addition, there are existing capabilities within DOD built for specific cyber use cases.

The pilot leverages these two resources to build a platform with no licensing costs that enables multiple participants to provide software. The platform uses open standards, to which most big data vendors’ products can easily be adapted. More importantly, the cyber community can develop its own small-scale capabilities without any additional contracting actions. This enables a competitive environment in which vendors of all sizes can participate and the government has a low risk of vendor lock-in.

The undersecretary of defense for acquisition, technology and logistics directed 22 years ago, in 1994, that all DOD components and agencies use open systems specifications and standards for acquisition of weapon systems, implemented through what is called open systems architecture (OSA). OSA is a key tenet of Better Buying Power (BBP) 3.0 for promoting competition. OSA principles are also supportive of and consistent with the use of open source software (OSS), which is considered commercial computer software, in systems.

The big data cyber analytics pilot looks to OSS as a way to encourage industry partnerships. It also seeks to obtain maximum use of limited resources while avoiding vendor lock-in and licensing fees. Cloud-based access and the use of OSS development tools that allow participatory community feedback have created a force multiplier, bringing together multiple vendors under partner DOD organizations to create a GOTS big data platform. Other Army and DOD components can also be made aware of the platform’s availability and are then able to deploy COTS or other apps to further their organizations’ missions.

The Army can help meet its missions by reducing barriers to sharing software through the use of OSS. The advantages include increased transparency and openness with industry. Writing contracts that favor maximum sharing, collaboration and adequate data rights to the government allows release of software as OSS by default. The technical core of openness is supporting competition and the ability to rapidly deploy capabilities to the force with the ability to add components and build larger systems. Development of competing components is motivated by larger marketplaces for those components.

Within the Army’s elite cyber units, including protection teams and regional defensive cyber operations divisions, capabilities are poorly interconnected single-vendor solutions, each only meeting one or two requirements. In an odd paradox, the security for the DOD Information Network is, in some way, dependent on how well our defenders navigate the capabilities they are provided. This increasingly complex web of disparate solutions is a call to reconsider future materiel developments and change the paradigm of vendor-bundled COTS solutions as a cure-all for competitive sourcing, rapid deployment and cost control. The common big data platform is just one example of how it’s possible to have openness with industry that still promotes competition and innovation at a low cost.

A vendor- and product-neutral GOTS platform provides an environment for developing complex, cyber-hardened systems that lend themselves to frequent technology refreshes and rapid insertion of cutting-edge technology. Sharing that platform with industry through the open source communities or common application programming interfaces inserts key capabilities as needed at the lowest possible cost through competitive sourcing rather than closed proprietary solutions. The adaptability and innovation needed to address legitimate national security concerns about maintaining a defended cyberspace domain can be achieved by supporting the Army’s efforts around big data cyber analytics and BBP 3.0 goals of achieving dominant capabilities while controlling life cycle cost.

For more information, contact Maj. Faber at

MAJ. ISAAC J. FABER is the lead data scientist at ARCYBER, Fort Belvoir, Virginia. He holds an M.S. in industrial and systems engineering from the University of Washington and a B.S. in computer information systems from Arizona State University. He holds an academic rank of assistant professor with the United States Military Academy at West Point.

MS. ELISSA ZADROZNY is the Technical Management Division chief for the project director for enterprise services within PEO EIS, Fort Belvoir. She holds an M.A. in computer resources and information management from Webster University and a B.A. in international relations and American studies from the University of Richmond. She is Level III certified in information technology and program management, Level II in engineering and a member of the Army Acquisition Corps.

This article was originally published in the July – September 2016 issue of Army AL&T magazine.
