Recent research at Intel has shed light on the elusive phenomenon of silent data errors, which have long puzzled engineers. Manu Shamsa is a young leader in this important, emerging field. For many years he has devoted his time to finding these errors, which cause wrong decisions and often go undetected by normal quality-control checks.
Silent data errors can appear even in a chip that seems to be operating perfectly, with every transistor apparently working as intended. Subtle inconsistencies from one transistor to another can produce a wrong result that nothing flags or corrects. Everything appears to be fine, but a serious problem lies underneath. Shamsa explains the gravity of the issue:
“You’re thinking everything is fine, but underneath, an error is causing a wrong decision.” – Manu Shamsa, Intel
Intel’s engineers jokingly refer to these uncorrected data errors as “spooky action at a distance,” the phrase Albert Einstein famously used to describe quantum entanglement. The joke reflects how many in the industry have grown frustrated trying to understand these elusive errors.
Shamsa and his team have put together an exhaustive catalogue of the many different sources of silent data corruption. One of the most intriguing findings is a connection between these errors and electrical resistance in chips. Shamsa emphasizes the challenges engineers face:
“Finding flaws is not easy.” – Manu Shamsa
Intel is also working to improve detection techniques. Its engineers are focusing their experimental work on the chips’ floating-point multiply-add (FMA) region. The FMA region is especially susceptible to silent errors because of its large size, which makes it an ideal candidate for deeper investigation.
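To make the idea of testing a multiply-add unit concrete, here is a minimal sketch of a cross-check against a trusted reference. It is not Intel's test method. The function names, the simulated defect, and its trigger condition are all hypothetical, and the reference assumes the unit computes a*b + c with a single rounding.

```python
import struct
from fractions import Fraction


def reference_multiply_add(a, b, c):
    """Trusted reference: a*b + c computed exactly, then rounded once to a double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))


def flip_bit(x, bit):
    """Return x with one bit of its 64-bit IEEE 754 encoding inverted."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return flipped


def faulty_multiply_add(a, b, c):
    """Hypothetical defective unit: correct except for a narrow class of operands."""
    result = reference_multiply_add(a, b, c)
    if a == 3.0 and b > 1000.0:  # made-up trigger, for illustration only
        return flip_bit(result, 20)  # corrupt one mantissa bit; no exception is raised
    return result


def find_mismatches(unit, operands):
    """Return the operand triples for which `unit` disagrees with the reference."""
    return [
        (a, b, c)
        for (a, b, c) in operands
        if unit(a, b, c) != reference_multiply_add(a, b, c)
    ]


operands = [(1.5, 2.0, 0.25), (3.0, 2048.0, 1.0), (3.0, 2.0, 1.0)]
print(find_mismatches(faulty_multiply_add, operands))
```

Only the second triple matches the made-up trigger, so it is the only one reported. The simulated unit never signals a fault, which is what makes an error of this kind silent: it shows up only when the result is compared with an independent computation.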
To speed up the detection of silent data errors, Intel engineers created a new method based on reinforcement learning. The approach is designed to root out more silent data errors, faster. In early laboratory trials, Shamsa’s algorithm proved highly accurate, and after 500 test cycles it markedly improved error detection in the FMA region.
The results have been striking: the technique is five times as effective at detecting defects as traditional randomized Eigen testing methods. That improvement gives providers a new level of assurance in data center performance, which matters all the more as dense nodes come into wider use and raise the chances of silent errors.
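The intuition behind learning-guided testing versus purely random testing can be sketched with a toy simulation. This is not Shamsa's algorithm: it swaps in a much simpler technique, the UCB1 multi-armed bandit, and the 20 operand regions, the single defective region, and the test budget are all invented for illustration.

```python
import math
import random

NUM_REGIONS = 20    # hypothetical partition of the operand space
DEFECT_REGION = 7   # hypothetical: only tests drawn from this region fail


def run_test(region):
    """Simulated test: returns True when the test exposes the defect."""
    return region == DEFECT_REGION


def random_testing(budget, rng):
    """Baseline: choose a region uniformly at random for every test."""
    return sum(run_test(rng.randrange(NUM_REGIONS)) for _ in range(budget))


def adaptive_testing(budget):
    """UCB1 bandit: favor regions with a high observed failure rate, plus an exploration bonus."""
    pulls = [0] * NUM_REGIONS
    hits = [0] * NUM_REGIONS
    for t in range(1, budget + 1):
        untried = [r for r in range(NUM_REGIONS) if pulls[r] == 0]
        if untried:
            region = untried[0]  # try every region once before scoring
        else:
            region = max(
                range(NUM_REGIONS),
                key=lambda r: hits[r] / pulls[r]
                + math.sqrt(2 * math.log(t) / pulls[r]),
            )
        pulls[region] += 1
        hits[region] += run_test(region)
    return sum(hits)


budget = 500
print("random failures found:  ", random_testing(budget, random.Random(0)))
print("adaptive failures found:", adaptive_testing(budget))
```

In this toy setup the adaptive strategy spends most of its budget on the one region that fails, while the random baseline reaches that region in only about one test in twenty. The size of the gap depends entirely on the invented parameters and says nothing about the fivefold figure reported for Intel's method.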
“In a laptop you won’t notice any errors,” Shamsa notes. “In data centers, with really dense nodes, there are high chances the stars will align and an error will occur.” The observation underscores the risk for environments that depend on precise, real-time data handling.
Although the technical hurdles posed by silent data errors are daunting, Shamsa is encouraged by Intel’s direction. Further study is needed to better understand these errors. The hope is to develop proactive measures that will keep them from impairing essential functions in data centers.