When customers have a fully supported engineered system, they have a single point of contact when things go wrong. So when an Oracle banking customer experienced motherboard failures in engineered systems located in business-critical data centers, they knew who to turn to.
Conventional telemetry monitoring for IT systems yielded no clues.
Frustratingly, when these multi-CPU motherboards, costing $100,000 each, were swapped out and sent to service centers for extensive testing, none of the failures could be reproduced.
How Machine Learning Discovered the Problem
We talked to Oracle Architect Kenny Gross to find out how the team at Oracle Labs discovered the answer. Oracle applied prognostic machine learning with pattern recognition to all systems in the data center, using an algorithm called the Multivariate State Estimation Technique (MSET).
It almost immediately discovered the root cause of these motherboard trips.
In this case, there was an issue with the third-party bulk power supplies, where a tiny component called a voltage regulator would issue a short sequence of voltage wiggles. These wiggles weren't large enough to trigger any warning alerts for the power supplies, but they would cause the "downstream" motherboards to experience random (and unreproducible) fatal errors.
System-wide big-data pattern recognition identified the root problem, and the fix turned out to be very inexpensive: more machine learning was used to find the small number of power supplies with defective voltage regulators. Global pattern-recognition surveillance could then proactively catch the incipience of these tiny voltage irregularities, and the affected power supplies were cheap and easy to swap with no interruption in service.
This was especially beneficial because only the small number of power supplies exhibiting elevated risk had to be swapped proactively, as opposed to swapping all power supplies in all the data centers, which is what the customer would have done without machine learning surveillance.
Use Case Outcome After Applying Machine Learning
After applying MSET-based machine learning, the customer achieved "five-nines" (99.999%) availability for their business-critical IT assets.
The bank's CIO and his staff of IT experts were so impressed by the evidence and the solution derived from machine learning that they immediately ordered three more engineered systems and asked to leave the machine learning prognostics running on all of their IT systems.
In addition, the evidence from the spurious fault mode was shared with the power-supply manufacturer, enabling it to build a more robust voltage-regulator component into its power supplies going forward.
That's the benefit of an engineered system, where the customer is supported at every step.
What Is MSET?
We talked a little about what MSET can do. But what is it?
MSET is an advanced prognostic pattern-recognition method originally developed at Argonne National Laboratory for high-sensitivity fault monitoring in commercial nuclear power plants.
It has since been spun off and met with commercial success in a broad range of prognostic machine-learning applications, including NASA space shuttles, Lufthansa air fleets, and Disney theme-park structural and safety instrumentation, to name a few examples.
In the last few years, Oracle has pioneered the use of real-time MSET prognostics for sensitive early detection of anomalies in business-critical enterprise computing servers (called Electronic Prognostics), in software systems (where MSET detects performance anomalies caused by resource-contention issues and complex memory leaks), and in storage systems and networks.
The MSET advantages (versus conventional machine learning approaches such as neural networks and support vector machines) include:
- Higher prognostic accuracy
- Lower false-alarm probabilities
- Lower missed-alarm probabilities
- Lower overhead compute cost
These advantages are crucial for real-time, dense-sensor streaming prognostics. They will also be crucial for the Internet of Things (IoT), where sensors are ubiquitous and systems must discriminate between a real problem and the failure of an inexpensive sensor.
How Does MSET Work?
The MSET framework consists of a training phase and a monitoring phase. Training characterizes the monitored equipment using historical, error-free operating data covering the envelope of possible operating regimes for the system variables under surveillance. The procedure evaluates the available training data and selects an optimal subset of the observations (memory vectors) that best characterize the monitored asset's normal operation.
This yields a stored model of the equipment that is used in the monitoring phase. There, new incoming observations for all the asset signals are combined with the trained MSET model to estimate the expected values of the signals; the differences between the observed and estimated values (the residuals) are then examined for signs of anomalies.
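The two phases above can be sketched in code. The following is a minimal, illustrative NumPy sketch, not Oracle's production algorithm: the memory vectors are chosen with a simple min-max-plus-even-spacing heuristic, and the nonlinear similarity operator is stood in for by a Gaussian kernel with a small ridge term for numerical stability (real MSET implementations use more sophisticated similarity operators and vector-selection procedures). All class and function names here are invented for the example.

```python
import numpy as np

def select_memory_vectors(X, n_mem=50):
    """Training step 1: pick a representative subset of historical
    observations ("memory vectors"). Heuristic: keep the observations
    containing each signal's min and max, then fill out the set with
    evenly spaced samples so the operating envelope is covered."""
    idx = set()
    for j in range(X.shape[1]):
        idx.add(int(np.argmin(X[:, j])))
        idx.add(int(np.argmax(X[:, j])))
    idx.update(np.linspace(0, len(X) - 1, n_mem, dtype=int).tolist())
    return X[sorted(idx)]

def similarity(A, B, h):
    """Nonlinear similarity operator between two sets of observations.
    Here: a Gaussian kernel on Euclidean distance (a stand-in for the
    operators used in real MSET implementations)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2.0 * h * h))

class MSETSketch:
    """Auto-associative estimator: predicts the expected value of every
    signal from the current values of all signals."""

    def fit(self, X_train, n_mem=50, h=0.5, ridge=1e-3):
        self.h = h
        self.D = select_memory_vectors(X_train, n_mem)   # memory matrix, m x p
        G = similarity(self.D, self.D, h)                # similarity of D with itself
        self.G_reg = G + ridge * np.eye(len(self.D))     # ridge term for stability
        return self

    def estimate(self, X_obs):
        """Monitoring: estimate the expected signal values for new
        observations as a similarity-weighted combination of memory vectors."""
        K = similarity(self.D, np.atleast_2d(X_obs), self.h)
        w = np.linalg.solve(self.G_reg, K)               # per-observation weights
        return w.T @ self.D
```

Residuals (observed minus estimated) near zero indicate the asset is behaving as it did during training; a signal whose residual grows is drifting away from its learned correlation pattern with the other signals.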
How Is MSET Used?
We mentioned one use case in this article already. But MSET has been gaining in popularity because its advantages scale to truly big-data streaming analytics, a vital capability for IoT use cases. In MSET's earlier days, some of the challenges involved massive sensor fleets of critical assets. Today, one large (refrigerator-sized) enterprise server has 3,400 internal sensors (about the same number as a commercial nuclear reactor), and one medium-sized data center contains a million sensors.
We're now finding beneficial spinoff prognostic applications across multiple industries where the number of sensors has been growing with IoT digital transformation initiatives. For example, one jumbo jet now contains 75,000 sensors. One transmission grid for a medium-sized US utility now comprises over 50,000 critical assets, all of which carry multiple sensors. And one modern oil refinery now contains a million sensors.
Most MSET use cases are real-time. But some customers extract just as much prognostic value by storing all of the telemetry in a "data historian" file and running MSET once or twice per day. MSET is flexible and equally valuable for real-time surveillance or batch-mode prognostics on data-historian signals.
Benefits of MSET
Some of the advantages of MSET for big data prognostic applications include:
- Extremely accurate estimates, with uncertainty bounds that are usually only one to two percent of the standard deviation of the raw sensor input signals
- Extremely high sensitivity for detecting subtle disturbances in noisy process variables, combined with extremely low false-positive and false-negative rates
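That combination of high sensitivity with low false-alarm and missed-alarm probabilities typically comes from feeding the residuals (observed minus estimated signal values) into a sequential detection test; in the MSET literature this is commonly the Sequential Probability Ratio Test (SPRT). Below is a minimal sketch of a one-sided, Gaussian mean-shift SPRT on a residual stream; the function name, parameters, and restart policy are illustrative choices, not a specific product's API.

```python
import math

def sprt_alarms(residuals, sigma, mean_shift, alpha=0.01, beta=0.01):
    """One-sided Sequential Probability Ratio Test on a residual stream.

    Tests H0: residuals ~ N(0, sigma^2)           (healthy)
    vs    H1: residuals ~ N(mean_shift, sigma^2)  (degraded).
    alpha and beta bound the false-alarm and missed-alarm probabilities.
    Returns the indices at which the degradation alarm fires."""
    accept_h1 = math.log((1.0 - beta) / alpha)   # alarm threshold
    accept_h0 = math.log(beta / (1.0 - alpha))   # "healthy" decision threshold
    llr, alarms = 0.0, []
    for i, r in enumerate(residuals):
        # Log-likelihood-ratio increment for a Gaussian mean-shift test
        llr += (mean_shift / sigma**2) * (r - mean_shift / 2.0)
        if llr <= accept_h0:
            llr = 0.0            # decide "healthy" and restart the test
        elif llr >= accept_h1:
            alarms.append(i)     # decide "degraded": raise an alarm
            llr = 0.0            # restart so a persistent fault keeps alarming
    return alarms
```

Because the test accumulates evidence over successive observations, it can flag a sustained shift far smaller than any single-observation threshold could catch, while the alpha and beta parameters keep the false-alarm and missed-alarm rates explicitly bounded.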
Oracle’s MSET-based prognostic innovations help increase component reliability margins and system availability goals. At the same time, these innovations reduce (through improved root-cause analysis) costly sources of “no trouble found” events that have become a significant issue across enterprise computing and other industries.
The benefits of Oracle's MSET approach to big-data prognostics carry over to other fields, including oil and gas and utilities, where proactive maintenance of business-critical assets is essential. It reduces operations and maintenance costs, improves the up-time availability of revenue-generating assets, and widens safety margins for life-critical systems.
Source: Oracle Big Data Blog posts