But these mammoth high-performance computing systems, some of which require football-field-size floor space and tens if not hundreds of miles of cabling to store and operate, are prone to numerous kinds of system failures, glitches, and bugs. These problems are notoriously hard to predict and can be costly, draining both money and productivity, says Devesh Tiwari, an assistant professor of electrical and computer engineering at Northeastern.
Tiwari has been working on how to best identify these large-system vulnerabilities and recently earned a Rising Star in Dependability Award at the 51st annual International Conference on Dependable Systems and Networks for his work on improving the reliability and cost-effectiveness of supercomputers.
Drawing on his experience as a staff scientist at the Oak Ridge National Laboratory in Tennessee, home of Summit, the world's second most powerful supercomputer and the most powerful in the nation, Tiwari developed methods for rooting out hardware failures, predicting future ones, and optimizing data storage.
Ever-larger computer systems that perform increasingly complex tasks, and that consume enormous amounts of power, need to be reliable, Tiwari says. In computing parlance, reliability is a measure of, among other things, how well a system can withstand threats and be repaired after a hardware failure.
Improving reliability while reducing costs has been a “famous problem” over the last couple of decades, and one that many in the field are working to solve with federal funding, Tiwari says. These improvements have implications across a range of sectors that rely on sophisticated computer systems, from weather modeling and medical research to national security and military operations.
“You have all of these really large supercomputers that are trying to solve really important problems,” Tiwari says. “This is why reliability is so important.”