Ask the right questions

For the Army to successfully develop artificial intelligence, it needs to collect the right data before investing.

By Lt. Col. Jenny Stacy

Ongoing advances in artificial intelligence (AI) “will change society and ultimately, the character of war,” according to the 2018 National Defense Strategy. DOD has prioritized AI investments to increase lethality and retain multidomain dominance over peer and near-peer adversaries.

As part of this technology pivot, the Army is laying the foundation to integrate AI into future tactical network modernization efforts. AI technology has matured since the mid-1950s, when development first began, but acquisition professionals need to temper unrealistic expectations, be cautious of buying into industry hype, and gain enough understanding of AI to ask the right questions before making an investment.

AI IN THE ARMY: WHERE ARE WE NOW?

“A.I. refers to the ability of machines to perform tasks that normally require human intelligence—for example, recognizing patterns, learning from experience, drawing conclusions, making predictions, or taking action—whether digitally or as the smart software behind autonomous physical systems,” according to the 2018 DOD AI Strategy, released in February.

AI applications can quickly analyze vast amounts of data to produce actionable information. They can predict terrorist attacks, identify targets from imagery or audio surveillance, or enable faster and more informed decisions.

The DOD AI strategy calls for accelerating delivery and adoption of AI; establishing a common foundation to scale AI’s impact across the department and enable decentralized development and experimentation; evolving partnerships with industry, academia, allies and partners; cultivating a leading AI workforce; and leading in military AI ethics and safety.

In October 2018, the Army established a scalable Army-AI Task Force under U.S. Army Futures Command to narrow existing AI capability gaps by leveraging current technological applications. The AI task force will work closely with the cross-functional teams at work on the Army’s modernization priorities to integrate AI into those efforts. The Army’s Rapid Capabilities and Critical Technologies Office (RCCTO) is already applying AI technology to address signal detection on the battlefield, by inserting AI and machine-learning prototypes into electronic warfare systems. These prototypes will be fielded to select operational units as early as August 2019.

RECENT AI FAILURES

AI technology has existed since the 1950s. In 1970, cognitive scientist Marvin Minsky predicted “a machine with the general intelligence of an average human being” would manifest within 10 years. Since then, the field has cycled through similar peaks of optimism that gave way to failure, and it has yet to produce a machine that reaches the heights Minsky predicted. Though recent advances in computer processors and sensors have enabled a leap in capability, the technology is not fully mature. Computers still have difficulty classifying objects that deviate from the norm, and inadvertent errors in a program or its data can cause failures as well. It is not possible to predict all corner cases (situations outside of normal operating parameters), and misclassification of data can lead to fatal errors.

In March 2018, an Uber experimental autonomous vehicle operating in Tempe, Arizona, struck and killed a woman who was walking her bicycle outside of a crosswalk in a poorly illuminated area. The vehicle’s sensors detected an object six seconds before the crash and determined an emergency braking maneuver was necessary; it did not engage the brakes. The National Transportation Safety Board report on the incident, published in May 2018, noted: “According to Uber, emergency braking maneuvers are not enabled while the vehicle is under computer control, to reduce the potential for erratic vehicle behavior. The vehicle operator is relied on to intervene and take action. The system is not designed to alert the operator.”

In 2017, National Science Foundation researchers built an algorithm to determine what changes to an object would confuse an AI classification program (like a driverless car program of the kind Uber was testing in Arizona). The algorithm generated two different attacks: a stop sign with graffiti on it and a stop sign with stickers strategically placed on it. In both cases, the AI program misclassified the stop sign as a 45-mile-per-hour speed limit sign. Such “adversarial attacks,” which use subtly altered images, sounds or objects that normally would not fool humans, are able to fool AI programs.

MACHINE LEARNING 101

There are many different applications of AI, including machine learning, a subspecialty of AI that uses probability and statistics to train computer programs. The computer “learning” is usually performed offline, using a training dataset to build a mathematical model that reflects the real world. The more closely the model reflects reality, the more accurate the computer’s predictions. Once the program is fielded, it can continue to “learn” to improve its effectiveness.

EXAMPLE: SPAM VS. HAM

Early spam email filters were not very effective. The programs used “if-then” rules to identify spam: if a word like “Viagra” appeared in an email, then the email was automatically labeled as spam. Employees at the companies that maintained these filters had to continually update their word lists to adapt, while spammers needed only to slightly modify words in an email to create new scams and get around the filters.
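The weakness of the keyword rule described above can be sketched in a few lines of Python. This is an illustration only, not code from any real filter: the blocklist, the function name rule_based_is_spam and the sample emails are all hypothetical.

```python
# Hypothetical blocklist; real filters relied on much longer, hand-maintained lists.
BLOCKED_WORDS = {"viagra", "lottery"}


def rule_based_is_spam(email_text: str) -> bool:
    """Label an email as spam if any word exactly matches the blocklist."""
    words = email_text.lower().split()
    # Strip trailing punctuation so "viagra!" still matches "viagra".
    return any(word.strip(".,!?") in BLOCKED_WORDS for word in words)


original = "Cheap Viagra available now!"
modified = "Cheap V1agra available now!"  # the spammer changes one character

print(rule_based_is_spam(original))  # True: the rule fires
print(rule_based_is_spam(modified))  # False: the altered word slips through
```

Each evasion of this kind forces a person to add another entry to the list, which is the maintenance burden that statistical approaches aim to reduce.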

Machine learning automates that process by building a statistical model of spam email to classify emails as spam versus “ham” (good email). Companies gathered large datasets of spam and ham emails. By applying probabilistic and statistical algorithms to those emails, the computer “learned” the probability of an email being either spam or ham. The machine could then automatically classify new emails based on the probability of being spam or ham, given the words in the email.
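One common way to compute “the probability of being spam, given the words in the email” is a naive Bayes classifier. The article does not say which algorithm any particular filter used, so the sketch below is only one plausible instance; the six training emails, the function names and the add-one smoothing are assumptions made for illustration.

```python
import math
from collections import Counter

# Tiny hypothetical training set of (text, label) pairs; real filters use far more data.
TRAINING = [
    ("win cash prize now", "spam"),
    ("cheap pills win big", "spam"),
    ("claim your free prize", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
    ("project status and agenda", "ham"),
]


def train(examples):
    """Count how often each word appears under each label, and how many emails carry each label."""
    word_counts = {"spam": Counter(), "ham": Counter()}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.split())
    return word_counts, label_counts


def spam_probability(text, word_counts, label_counts):
    """Estimate P(spam | words) with Bayes' rule, treating words as independent."""
    vocab = set(word_counts["spam"]) | set(word_counts["ham"])
    total_emails = sum(label_counts.values())
    log_scores = {}
    for label in ("spam", "ham"):
        total_words = sum(word_counts[label].values())
        # Start from the prior: the share of training emails with this label.
        score = math.log(label_counts[label] / total_emails)
        for word in text.split():
            # Add-one smoothing keeps unseen words from zeroing out the product.
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        log_scores[label] = score
    # Convert the two log scores into a normalized probability of spam.
    top = max(log_scores.values())
    exp_scores = {label: math.exp(score - top) for label, score in log_scores.items()}
    return exp_scores["spam"] / sum(exp_scores.values())


word_counts, label_counts = train(TRAINING)
print(spam_probability("win free prize", word_counts, label_counts))  # well above 0.5
print(spam_probability("monday agenda", word_counts, label_counts))   # well below 0.5
```

Unlike a fixed word list, the model is rebuilt from data: adding newly labeled emails to the training set changes the word probabilities without anyone editing rules by hand.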

FACTORS FOR EFFECTIVE MACHINE LEARNING

It’s all about the model and the data used to build it. The more data used to train the model, the better it can reflect the world that is being modeled. The data, however, must be good data. Erroneous input, whether accidental or deliberate, will skew the model. Data also must be tagged or labeled with descriptions to train and test algorithms (e.g., emails classified as spam or ham, or pictures tagged as “helicopter”). Without tags, the data is less useful and informative than it could be: a computer learns more from a picture of a helicopter tagged with the word “helicopter” than it does from the same picture without a tag. Depending on the type of data, tagging or classifying data can be a time-intensive manual process.

Rigorous testing measures how a model performs on a test dataset that contains none of the data used to train the model, which gives a true representation of the model’s performance. Models tested against their own training data will have inflated performance scores, because the model has seen the data before and knows how to classify it. Precision, recall and f-scores judge an algorithm’s performance better than the traditional accuracy metric, which can look impressive when one class dominates the data: a filter that labels every email as ham is 95 percent accurate on a dataset that is 95 percent ham, yet it catches no spam. This is known as the accuracy paradox. Precision measures how many of the items the program flagged were flagged correctly (e.g., how many of the emails labeled as spam were really spam). Recall measures how many of the items that should have been flagged actually were (e.g., did the program find all the spam?). High recall is not meaningful if precision is low, and, conversely, high precision does not necessarily entail high recall. F-score, a weighted harmonic mean of precision and recall, overcomes the accuracy paradox because it takes into account both false positives and false negatives and balances recall against precision.
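The metrics above can be worked through on a small example. The sketch below assumes a hypothetical test set of 95 ham and 5 spam emails and computes the balanced F-score (often called F1), which weights precision and recall equally; the function names and the two sets of predictions are illustrative, not taken from any fielded system.

```python
def confusion_counts(actual, predicted, positive="spam"):
    """Return (true positives, false positives, false negatives, true negatives)."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    return tp, fp, fn, tn


def scores(actual, predicted, positive="spam"):
    """Compute accuracy, precision, recall and the balanced F-score (F1)."""
    tp, fp, fn, tn = confusion_counts(actual, predicted, positive)
    accuracy = (tp + tn) / len(actual)
    precision = tp / (tp + fp) if tp + fp else 0.0  # of emails flagged as spam, how many were spam
    recall = tp / (tp + fn) if tp + fn else 0.0     # of actual spam, how much was flagged
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical test set: 95 ham emails followed by 5 spam emails.
actual = ["ham"] * 95 + ["spam"] * 5

# A useless filter that labels everything as ham still scores 95 percent accuracy.
always_ham = ["ham"] * 100
print(scores(actual, always_ham))
# accuracy 0.95, precision 0.0, recall 0.0, f1 0.0

# A filter that flags 2 ham emails by mistake and catches 4 of the 5 spam emails.
working_filter = ["spam"] * 2 + ["ham"] * 93 + ["spam"] * 4 + ["ham"] * 1
print(scores(actual, working_filter))
# accuracy 0.97, precision about 0.67, recall 0.8, f1 about 0.73
```

Accuracy separates the two filters by only two percentage points, while the F-score separates them by 0.73, which is why the latter is the more informative number to request from a vendor.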

Computational power also affects performance quality. The more parameters and the greater the complexity of an algorithm, the more computing power needed. Insufficient processing power prevents a timely and, therefore, useful result.

Programmers use heuristics, or “rules of thumb,” to reduce complexity, the number of parameters or the processing power required, or to fill knowledge gaps during algorithm development. These heuristics may trade off optimality, completeness, accuracy or precision. A heuristic could keep the program from finding the optimal or most correct solution, particularly when multiple solutions exist. Heuristics may also decrease computing time only nominally. Poor heuristic choices and flawed underlying assumptions degrade the validity of an algorithm’s output.
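A classic textbook illustration of a heuristic trading away optimality is making change with the fewest coins. The example below is not drawn from any Army system; the coin values and function names are hypothetical and chosen only to show a fast rule of thumb returning a valid but worse answer than an exhaustive method.

```python
def greedy_coin_count(amount, coins):
    """Heuristic: always take the largest coin that still fits. Fast, but not always optimal."""
    count = 0
    for coin in sorted(coins, reverse=True):
        count += amount // coin
        amount %= coin
    return count


def optimal_coin_count(amount, coins):
    """Exhaustive approach (dynamic programming): checks every sub-amount, so it costs more work."""
    best = [0] + [float("inf")] * amount
    for value in range(1, amount + 1):
        for coin in coins:
            if coin <= value:
                best[value] = min(best[value], best[value - coin] + 1)
    return best[amount]


coins = [1, 3, 4]  # hypothetical coin values
print(greedy_coin_count(6, coins))   # 3 coins: 4 + 1 + 1
print(optimal_coin_count(6, coins))  # 2 coins: 3 + 3
```

Both answers are valid ways to make 6, but only one is the best. A buyer who never asks which shortcuts a program takes has no way to know which kind of answer is being delivered.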

In the end, humans determine the underlying assumptions used to design artificial intelligence programs. The result presented to consumers is often a black box containing a mix of clever programming and smartly analyzed data. But if created poorly, models can be too sensitive or not sensitive enough, resulting in either too many false positives or too many false negatives. Corner cases, human insertion of errors and inaccurate models built from bad or limited datasets will also lead to errors. These concerns about data requirements, accurate modeling, processing power and fallibility also apply to other AI specialties, such as facial and voice recognition.
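The sensitivity trade-off can be made concrete with a decision threshold. The sketch below assumes a model that assigns each email a spam score between 0 and 1 and flags anything at or above a chosen threshold; the scores, labels and thresholds are hypothetical values picked for illustration.

```python
# Hypothetical (spam_score, actual_label) pairs produced by some model on a test set.
SCORED = [
    (0.95, "spam"), (0.80, "spam"), (0.65, "ham"), (0.55, "spam"),
    (0.40, "ham"), (0.30, "spam"), (0.20, "ham"), (0.05, "ham"),
]


def errors_at(threshold, scored=SCORED):
    """Return (false positives, false negatives) when flagging every score >= threshold as spam."""
    false_positives = sum(score >= threshold and label == "ham" for score, label in scored)
    false_negatives = sum(score < threshold and label == "spam" for score, label in scored)
    return false_positives, false_negatives


for threshold in (0.1, 0.5, 0.9):
    print(threshold, errors_at(threshold))
# 0.1 (3, 0)  too sensitive: good email is flagged
# 0.5 (1, 1)
# 0.9 (0, 3)  not sensitive enough: spam gets through
```

The same model produces very different error profiles depending on where a human sets the threshold, so reported false positive and false negative rates are only meaningful alongside the setting that produced them.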

ASK THE RIGHT QUESTIONS, GET THE RIGHT TECHNOLOGY

DOD is investing heavily in AI to gain military advantages and reduce workload. A working knowledge of AI will help product managers better understand industry presentations, and will help assess technical maturity and determine viability and scalability of a solution during the market research phase.

Preliminary market research questions include:

How is the model built? What are underlying assumptions?
How is the model tested? Was training data used in the test set?
How well does the model reflect the real world? What are the performance results of testing? How much better than random chance? What are the precision, recall and f-scores (closer to 100 percent is better) and confidence level in the results? What is the rate of false positives and negatives?
How was the data set gathered? If data was gathered from people, did the people know it was being gathered? How big are the training and test data sets? If the data set isn’t built yet, how long will it take and how much will it cost to build it?
How well does the program stand against deception and adversarial inputs (e.g., a subject wearing sunglasses or a hat)? What happens when the program is presented with corner cases?
How much computing power is required? Where does the processing occur? How long does it take for results to be computed?
Can the algorithm be updated easily? How are improvements inserted? How is real-time performance measured? Can operators determine when the algorithm is performing poorly in real time?
How well does the program work with existing programs to input and export insights?
Is the system autonomous or human-assisted? How much human assistance?
Where are decisions made? Are they made by humans, or does the program make them automatically? This is a critical question for decisions about the use of force.
What rights does the government have to the dataset and the trained model?

CONCLUSION

Increases in processing power have enabled greater advances in AI to solve complex problems on and off the battlefield. There are still, however, limits to what AI can do. We can be cautiously optimistic but must exercise prudence and rigor to ensure we can tell the difference between a viable solution and a black box filled with empty promises. Asking the right questions up front will help reveal technology readiness and help DOD steer clear of vendor oversell, enabling the right acquisition decisions and the efficient spending of Army resources.

For more information, go to the PEO C3T website at http://peoc3t.army.mil/c3t/ or contact the PEO C3T Public Affairs Office at 443-395-6489 or usarmy.APG.peo-c3t.mbx.pao-peoc3t@mail.mil.

LT. COL. JENNY STACY is the product manager for Satellite Communications, Project Manager Tactical Network. She has an M.S. in computer science from the Naval Postgraduate School, and her thesis, “Detecting Age in Online Chat,” received the Gary Kildall Award for computing innovation. She also holds a B.S. in computer science from the U.S. Military Academy at West Point. She is a member of the Army Acquisition Corps and is certified Level III in program management and Level II in information technology.

National Defense Strategy: https://dod.defense.gov/Portals/1/Documents/pubs/2018-National-Defense-Strategy-Summary.pdf

“Army integrates artificial intelligence and machine learning for electronic warfare,” Army Rapid Capabilities Office, Dec. 17, 2018: https://www.army.mil/article/215226/army_integrates_artificial_intelligence_and_machine_learning_for_electronic_warfare

May 2018 National Transportation Safety Board report on autonomous Uber crash: https://www.ntsb.gov/investigations/AccidentReports/Pages/HWY18MH010-prelim.aspx

“Who was really at fault in fatal Uber crash? Here’s the whole story,” Arizona Republic, March 17, 2019: https://www.azcentral.com/story/news/local/tempe/2019/03/17/one-year-after-self-driving-uber-rafaela-vasquez-behind-wheel-crash-death-elaine-herzberg-tempe/1296676002/

This article is published in the Summer 2019 issue of Army AL&T magazine.
