January 16, 2023
If there is more than one name for a certain phenomenon, this could mean it is widespread, important, or confusing. “Trouble not Identified” (TNI) has many names—sometimes it is called “No Fault Found” and on occasion, sounding somewhat more defiant, “Retest OK”. (We have taken the trouble to include a table of common terms around the phenomenon.)
Whatever the term, TNI unfortunately is not only widespread and important, but confusing as well. Moreover, the lack of standardized terminology probably obscures the full extent of the fault-not-found rate, which might be three times higher than previously thought. The problem is pervasive, varied, and impactful; it creates bewilderment and, by its nature, resists neat solutions. Once TNI is designated or diagnosed, trouble is in fact guaranteed.
TNI affects many industries with yearly costs in the billions. Not only is money lost, but workers’ hours are wasted, reputations damaged, schedules delayed, supply chains stretched, and customers left unsatisfied. Engineers scratch their heads, customer reps search for explanations, and jaded technicians recycle units back into service because they passed a retest. (This can create a feedback loop of sorts: service technicians on site may resort to zapping electronic units with undue voltage before sending them to the manufacturer, so as not to get them back with the dreaded non-diagnosis.)
The problem was historically most pressing in the airline industry. While it is imperative to get a plane back into service as soon as possible, safety considerations make replacement the default solution for any unit that has not been properly diagnosed. But this default comes at a steep price.
However, with more and more electronics in cars, and in particular the further development of ADAS and the self-driving car, pressure is mounting on the car sector as well. The issue has technical, organizational, procedural, and behavioral aspects and affects design, documentation, training, testing and communications. In this article we will mostly cover the technical and testing aspects and then discuss the cost in more detail.
TNI means that a Unit Under Test (UUT) passes bench tests and a cause for the supposed failure has not been identified. Proper investigation of root causes is often not trivial, and causes can range from the decidedly mundane to the very complex, with many layers of causal structure to be investigated. To begin with the mundane, simple user or operator error is quite common and ranges from drivers who can’t comprehend their cruise control to trained operators of sophisticated systems who commit complex misuse, some of it even intentional, for example to cut short a laborious process. What these cases have in common is that the UUT is not malfunctioning, a bench test will not turn up an error, and there is little hope that replacement will fix the problem.
Another cause of failure and a possible case of TNI is a flaw in the design of the UUT. Such a flaw may only manifest itself under very special operating conditions—conditions that are not recreated in a bench test. It is important to note that in such a case the unit does what it was designed to do, but not what it was desired to do. The proper solution here is a redesign of the unit or replacement with a correctly designed unit. However, without an accurate diagnosis this may never happen.
Two further aspects of more technical intricacy are diagnostic ambiguity, in which a fault cannot be isolated to a certain unit because of limitations of testing or equipment, and false rejects, where tolerances or calibration lead to a test failure and rejection of a unit.
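A false reject can be illustrated with a minimal sketch. It assumes a single measured voltage, a symmetric tolerance band, and a fixed calibration offset in the test equipment; the function name and all numbers are hypothetical and chosen only for illustration.

```python
def classify(true_mv: int, offset_mv: int, nominal_mv: int, tol_mv: int) -> str:
    """Compare the unit's real state with what a miscalibrated bench reports.

    All values are in millivolts. offset_mv is an assumed, constant
    calibration error of the test equipment (hypothetical).
    """
    unit_ok = abs(true_mv - nominal_mv) <= tol_mv
    test_ok = abs(true_mv + offset_mv - nominal_mv) <= tol_mv
    if unit_ok and not test_ok:
        return "false reject"   # good unit fails the test
    if not unit_ok and test_ok:
        return "false accept"   # faulty unit passes the test
    return "correct pass" if unit_ok else "correct reject"


# A unit at 5040 mV is inside the 5000 +/- 50 mV band, but a +20 mV
# calibration offset makes the bench read 5060 mV and reject it.
result = classify(true_mv=5040, offset_mv=20, nominal_mv=5000, tol_mv=50)
print(result)  # false reject
```

The same unit passes once the offset is removed, which is why calibration and tolerance settings belong on the list of suspects before a unit is declared faulty.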
All these problem categories are potentially false alarms (FA). Sometimes they do require changes, but not necessarily repair or replacement. Training of users and redesign of units can be the solution. We will now turn to the two core categories of TNI, core because the UUT has actually failed and because they are exceedingly difficult to diagnose.
First, there are failure phenomena that are intermittent. While design flaws may produce intermittent errors that are difficult to reproduce, there is an abundance of other reasons for electronics to fail intermittently: electromagnetic interference, contamination of interfaces and mating surfaces, moisture, vibration, quality of plating, copper imbalance, dielectric breakdown, delamination, insufficient solder, thermal cycling, and electro-chemical migration, to name just a few. To exacerbate the difficulties, such mechanical and material problems and changes will not always create maloperation, even though a single component might not function within its specifications. Failures may only occur under very special operating conditions, like vibration, temperature, or movement, that are difficult to reproduce in tests.
To add to this, most units these days are stress tested in some way before they go into production, but tests can neither account for the long-term changes a unit undergoes nor completely simulate actual operating environments. (Problems of simulation are covered in our article about Reductive Design.) Which brings us to the question of tests and the second category, the failure of tests.
Tests may, for example, not recreate the actual conditions. A parameter as simple as the positioning of a unit may trigger a fault that will not be seen in an otherwise well-devised test. Another possibility is that the test method does not log all the failure modes or simply was not designed to identify a certain failure. Manuals for fault isolation and troubleshooting might be incomplete. And there is a people factor in testing as well. Technicians who conduct tests make mistakes, try to save time, or may suffer from insufficient training. When the wrong process is applied, finding the right error becomes unlikely.
A case from the early days of the connected vehicle may serve as an example. The problem at hand was that some telematics control units (TCUs) were not properly activated on the cellular network. The technicians, who did not have the ability to verify activation with the wireless service provider, replaced perfectly good TCUs instead of updating the network registration requirements, which would have been the proper remedy for the problem.
The chain that begins with factors and causes ends with consequences and risks, which are just as manifold. The most immediate risk is that a failure whose root cause has not been properly determined may lead to a repetition of a critical situation, even if a unit was replaced. If units are repaired instead of replaced, the repair quality is an unknown quantity. This is particularly perilous in the airline industry, but obviously risky in all areas where malfunctions may lead to personal injury, a category the car industry certainly falls under. The general principle is that if problems are perpetuated, then costs mount and damage spreads.
Another problem is the burden on the relationship between the parties involved. The communication effort alone can be considerable, and depending on the contractual situation, discord between manufacturer, tier 1 supplier, and operator can arise about who pays for the testing and the replacements. And in the end, there may still be a dissatisfied customer, a damaged reputation, lessened confidence in the product, and possibly a loss of future business.
As already mentioned, costs are mounting in many industries. Cost estimates, for example for the Department of Defense and for the airline and mobile phone industries, run into billions of dollars per year. In terms of cost, two main cases can be distinguished: either the replaced unit was in fact not defective, or a defective unit was replaced without its root cause being determined.
Even in the first case, many more cost points are to be considered beyond the cost of replacement units and labor for replacement. There is the packing, shipping, and tracking of parts thought to be faulty. There is testing and sometimes retesting, and even the possibility of attempted repairs at the next level. There is equipment downtime—particularly detrimental in industries like trucking. On the level of the unit itself, there are the stresses put on supposedly faulty units by shipping and testing, which may lead to failure down the road. There is an impact on the inventory of spare parts that may drive secondary costs like obsolescence cost. Damage to reputation and relationships, already mentioned, is not easy to quantify, but a very real consequence. In general, product liability and availability contracts that do not properly acknowledge TNI issues are prone to generate problems between customer and supplier.
In the second case, with a defective unit replaced, there is at least some assurance that the primary replacement costs are not wasted. But uncertainty remains, as does the plethora of further costs, with some of these incurred to little avail, given that no root cause has been determined. And the question remains what to do next: train operating and testing personnel, test further, redesign the unit, change testing equipment and procedures, calibrate and reconsider tolerances? It may well happen that all these measures are taken without success and the whole cost cycle repeats at any time. More problems may be in store because, with a failure not properly remedied, a pattern may emerge that could lead to lawsuits or a recall.
To give an overview of the costs and allow our readers to make their own computations, we have included the calculator below.
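For readers who prefer to compute by hand, the cost points listed above can also be added up in a few lines of code. The following is only a rough sketch, not the calculator itself: the field names are hypothetical, the figures are placeholders rather than industry data, and hard-to-quantify items such as reputation damage or inventory obsolescence are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class TniEvent:
    """Direct cost items for one TNI removal (all fields are illustrative)."""
    replacement_unit: float
    labor_hours: float
    labor_rate: float
    shipping_and_tracking: float
    bench_tests: int
    cost_per_bench_test: float
    downtime_hours: float
    downtime_cost_per_hour: float

    def total(self) -> float:
        """Sum of the direct cost items for a single event."""
        return (self.replacement_unit
                + self.labor_hours * self.labor_rate
                + self.shipping_and_tracking
                + self.bench_tests * self.cost_per_bench_test
                + self.downtime_hours * self.downtime_cost_per_hour)


def annual_tni_cost(event: TniEvent, removals_per_year: int, tni_rate: float) -> float:
    """Yearly cost if a share tni_rate of all removals ends as TNI."""
    return removals_per_year * tni_rate * event.total()


# Placeholder numbers only; substitute your own figures.
event = TniEvent(
    replacement_unit=400.0, labor_hours=2.0, labor_rate=100.0,
    shipping_and_tracking=50.0, bench_tests=2, cost_per_bench_test=75.0,
    downtime_hours=4.0, downtime_cost_per_hour=50.0,
)
print(event.total())                       # 1000.0
print(annual_tni_cost(event, 1000, 0.25))  # 250000.0
```

Even this simplified sum shows how quickly labor, testing, and downtime can rival the price of the replacement unit itself.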
So, how can TNI (and FA) be reduced to a minimum? Certain precautions are already taken in many industries. Examples are to improve documentation, train operators, and calibrate troubleshooting through improved diagnostics tools. However, if the root cause of a failure remains unknown, the efficacy of such measures remains unknown as well. In the worst case, money and time are invested with no effect at all, or perceived effects might be purely coincidental. Thus, the focus has to remain on optimizing the investigation of root causes. This investigation means asking “Why?” until the actual underlying cause has been determined. Not incidentally, this mirrors the mindset of the good quality assurance engineer, who is satisfied only when an answer is found and proper corrective action is settled upon.
While it certainly is not new that data is important for such efforts, the new means to acquire real-world and real-time data will serve as a true game changer, if leveraged properly. The more that is known about how and when a unit failed, the better the chance to precisely determine what has happened. This deepened knowledge of what is going on with supposedly faulty units (and in fact with the fault-free units as well) can, beyond reducing TNI, pave the path to true preventive maintenance and replacement, and move away from expensive fixed schedules. In subsequent steps this can lead to an optimization of many aspects of operation and business.
To give an example, while manufacturers do a lot of research about how certain features are or will be used by drivers, real-world use remained largely opaque until now. These days, actual driving situations can be observed and logged.
More strikingly, intermittent failures, one of the most pressing problem areas of TNI, are a perfect use case for data logging. Data logging eliminates the need to reproduce a failure, more or less faithfully, on a test bench; instead there are records of the real instances of failure, from standard and non-standard operation, with all accompanying conditions and parameters. The best-case scenario here is event-based logging in which the conditions leading up to the event are specifically captured. Such snapshot data of a window before and after a trigger gives engineers the best chance to answer all the whys, to detect failure patterns, and to develop corrective action to permanently resolve the problem.
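The idea of a snapshot window around a trigger can be sketched with a small ring buffer. This is a generic illustration under simplifying assumptions, not a description of how any particular logging product works: the class name, the trigger condition, and the sample values are all hypothetical, and a trigger that fires while a capture is still running is simply folded into that capture.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, Dict, List

Sample = Dict[str, float]


@dataclass
class SnapshotLogger:
    """Keeps a rolling history and captures a window around each trigger."""
    trigger: Callable[[Sample], bool]  # hypothetical trigger condition
    pre: int = 3                       # samples kept from before the trigger
    post: int = 2                      # samples recorded after the trigger
    snapshots: List[List[Sample]] = field(default_factory=list)

    def __post_init__(self) -> None:
        self._history: Deque[Sample] = deque(maxlen=self.pre)
        self._active: List[Sample] = []
        self._remaining = 0

    def _finish(self) -> None:
        self.snapshots.append(self._active)
        self._active = []

    def feed(self, sample: Sample) -> None:
        if self._remaining > 0:
            # A capture is running: collect post-trigger samples.
            self._active.append(sample)
            self._remaining -= 1
            if self._remaining == 0:
                self._finish()
        elif self.trigger(sample):
            # Trigger fired: freeze the pre-trigger history, then keep recording.
            self._active = list(self._history) + [sample]
            self._remaining = self.post
            if self._remaining == 0:
                self._finish()
        self._history.append(sample)


# Example: capture the context around a supply-voltage dip (made-up values).
logger = SnapshotLogger(trigger=lambda s: s["voltage"] < 9.0)
voltages = [12.1, 12.0, 12.2, 11.9, 8.4, 11.8, 12.0, 12.1]
for t, v in enumerate(voltages):
    logger.feed({"t": float(t), "voltage": v})

captured_times = [s["t"] for s in logger.snapshots[0]]
print(captured_times)  # [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
```

Because only the window around the event is retained, the approach keeps storage and transmission needs modest while still preserving the conditions that led up to the fault.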
With the test bench, with all its shortcomings and uncertainties, now being replaced with logging and analysis tools, the choice of the right tool remains important, as always. Sibros’ Deep Logger acts independently from proprietary logging hardware and is a full vehicle life cycle solution. Data can be streamed in real time to the cloud and the logging is fully and dynamically configurable, based on complex conditions or events. This allows immediate vehicle health monitoring on the Sibros Web Portal.
With these features, Deep Logger enables functionalities such as the following:
All this, and more, is achieved with a minimal system load of barely 5% on a 1 GHz QNX machine, even with 100% CAN load on five CAN buses.
The advent of the connected vehicle means that the cost, risk, and long-term consequences of TNI, the uncertainty around mitigation measures, the burden on engineering and business departments, not to speak of the dissatisfaction of end users or customers and the loss of trust in the product, are no longer a matter of fate. With Sibros’ Deep Connected Platform, powerful new tools are available to overcome many of these vexing situations. Data can be logged from standard and non-standard operation, remote diagnostic commands can be sent, and software can be updated in cases where software glitches are found to be the culprits. All this is achieved remotely, over the air. Our solutions reach every ECU and every connected sensor in any vehicle. To live with TNI, like an end user who resignedly restarts a PC that occasionally freezes, is not an option any more. Talk to us today.