From MTBF to HALT Testing: A Dependable Estimate of Product Reliability

At the end of this article is a link to an Excel spreadsheet that allows you to determine the Highly Accelerated Life Testing (HALT) requirements of the tested product based on elevated temperature.  This spreadsheet also provides product reliability prediction based on the Service Life, Time Point of Interest, HALT testing failures, and desired confidence level.

Management needs a dependable estimate of the reliability of the product being manufactured. It also needs to plan for expected field failures so that it can respond promptly to customer requests for return and repair. Failing to plan for field failures creates cost burdens that can do real harm to the bottom line. Most engineers are familiar with the Mean Time Between Failure (MTBF) method of determining product reliability. MTBF is used primarily to set preventive maintenance schedules that avoid catastrophic failures caused by predictable wear-out of piece parts.

In this article, I intend to explain why this method is insufficient for management planning.  I will then talk about HALT testing as an alternative method of experimentally determining the reliability of a product, illustrated by HALT testing examples.

Introduction to MTBF

Mean Time Between Failure (MTBF) is a measure of the reliability of a system or component. It’s a crucial element of maintenance management, representing the average time a system or component will operate before it fails.

MTBF is calculated by dividing the total time of operation by the number of failures that occur during that time. The result is an average value that can be used to estimate how long the system or component is expected to run between failures.

It’s important to note that MTBF is an average time, and does not guarantee that a particular system or component will last for the full MTBF period without failing. The actual time between failures can vary widely, and it is common for failures to occur well before or after the MTBF. Additionally, MTBF does not take into account the severity of the failures or the impact they may have on operations or safety.

MTBF Calculation

The MTBF value is a measure of reliability, but it is not a guarantee of reliability. It measures how frequently failures are expected to occur, but doesn’t necessarily take into account every external factor.

First, let’s define the scope: 

  1. We must define the system or component in question, along with operating conditions, including environmental factors and usage patterns. 
  2. Then, we collect data on the operating time of the system or component, including each operation cycle’s start and end times. 
  3. Then, we record the number of failures that occurred during the operating time.
  4. Finally, we can calculate the MTBF: Divide the total operating time by the number of failures. The result is usually expressed in hours, but can be any unit of time.

For example, let’s say you want to calculate the MTBF of a motor that operates for 8 hours per day, 5 days a week, for a total of 1 year. During this time, the motor fails 4 times. To calculate the MTBF:

Total operating time = 8 hours/day x 5 days/week x 52 weeks = 2,080 hours

Number of failures = 4

MTBF = Total operating time / Number of failures = 2,080 hours / 4 = 520 hours

The MTBF of the motor is 520 hours. This means that, on average, the motor can be expected to operate for 520 hours before it fails. In reality, it may fail much sooner or much later than that, and the figure says nothing about why the motor is failing. Still, the average is a useful starting point: it gives a basic sense of how reliably a system or component is performing and supports trend analysis, which in turn helps us judge the overall effectiveness of our maintenance strategy.
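The motor calculation can be reproduced in a few lines of Python. This is a minimal sketch; the function name `mtbf_hours` is chosen here for illustration and is not part of any library.

```python
def mtbf_hours(total_operating_hours: float, failures: int) -> float:
    """Return MTBF as total operating time divided by the number of failures."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    return total_operating_hours / failures


# Motor example from the text: 8 hours/day, 5 days/week, 52 weeks, 4 failures.
operating_hours = 8 * 5 * 52
motor_mtbf = mtbf_hours(operating_hours, 4)
print(operating_hours, motor_mtbf)  # 2080 520.0
```

The guard against zero failures matters in practice: a unit that has not failed yet has no MTBF by this formula, only a lower bound on it.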

So then, what are the drawbacks of using MTBF as a reliability/planning metric?

  1. MTBF requires actual field use data to be accurately determined.
  2. MTBF only predicts when a unit might fail on average.
    1. Anyone familiar with the bell curve knows that an average leaves many units unaccounted for: some fail far earlier and some far later.
  3. MTBF does not provide a method of predicting field failure return and repair volume over time.
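To see how little the average says about individual units, here is a small sketch that assumes a constant failure rate, that is, exponentially distributed failure times. That model is an assumption made for illustration; the article does not state which distribution applies to a given product.

```python
import math


def survival_fraction(t_hours: float, mtbf_hours: float) -> float:
    """Fraction of units still running at t, ASSUMING a constant failure rate."""
    return math.exp(-t_hours / mtbf_hours)


mtbf = 520.0  # motor example from the text
for t in (130.0, 260.0, 520.0):
    failed = 1.0 - survival_fraction(t, mtbf)
    print(f"by {t:.0f} h: about {failed:.0%} of units have failed")
```

Under this assumption, roughly 63% of units have already failed by the time the MTBF is reached, so planning returns around the MTBF figure alone would miss most of the early failures.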

To provide the metrics needed by management to predict the field return volume, Highly Accelerated Life Testing (HALT) is preferred.

HALT Testing

What is HALT?

Highly Accelerated Life Testing (HALT) is the process of applying elevated stresses to an electronic device to force failures and uncover design and construction weaknesses. The stresses applied are typically well beyond the expected field environment so that failures surface quickly. This enables engineers to optimize designs, repair or replace failed components, and lower product development costs.

HALT Procedures

Setting clear expectations and directives for conducting HALT testing is a multistep process that starts with bringing the design engineers together to:

  • Develop a test plan based on reliability physics, including understanding potential failure modes and mechanisms and clearly defining the objectives.
  • Determine the expected environments, including applicable stresses such as temperature, vibration, and shock.
  • Decide how many devices under test (DUTs) are available for HALT testing. Generally, one to five samples are used.
  • Select the functional tests to be run during testing, such as what the device should be doing, which circuits should be active, and what codes and sensors should be gathering data.
  • Identify which parameters need to be monitored based on the desired functional tests and applications.
  • Define what constitutes a failure. This could include failing a functional test (monitored either continuously or periodically during the test), observing physical damage, failing to remain operational, etc.
  • Consider using reliability simulation software to simulate the vibration and thermal loads so a model can be created that may reach highly accelerated life testing limits. 

In conjunction with developing the foundational outline, two key areas must be addressed.

1. Applicable Stresses

Select the appropriate stresses and stress levels for HALT testing:

  • Vibration
  • High temperature
  • Low temperature
  • Voltage/frequency margining
  • Power cycling
  • Combined stresses (i.e., temperature and vibration)

Choosing the appropriate stresses is dependent upon the application and the environment in which the device operates. Suspect parts or areas of concern within the device can also help drive what stress levels to apply in tests.

2. Step Stress Approach

For each intended stress, clearly delineate:

  • The starting stress point.
  • The amount by which to increment the intended stress in each step.
  • The duration of each step.
  • The device or equipment limit for that stress.

Figure: Step stresses applied in HALT.

Typically, the operating and destruct limits of the device are not known before testing. HALT testing can be used to determine them through the step-stress approach. If a failure occurs during monitoring or functional testing, the stress is reduced until the DUT recovers. The stress level at which such a recoverable failure occurs is known as the operating limit. When the stress is increased beyond the operating limit to the point where the DUT can no longer recover without a repair, the destruct limit has been reached.
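The way the two limits fall out of a step-stress log can be sketched as follows. The `StepResult` record and `find_limits` helper are hypothetical names introduced for illustration, and the log values are made up; a real test log would carry far more detail.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class StepResult:
    stress: float            # stress level applied in this step (e.g. °C or Grms)
    failed: bool             # DUT failed monitoring or a functional test
    recovered: bool = False  # DUT recovered once the stress was reduced


def find_limits(steps: List[StepResult]) -> Tuple[Optional[float], Optional[float]]:
    """Return (operating_limit, destruct_limit) from an ordered step-stress log."""
    operating = None
    destruct = None
    for step in steps:
        if not step.failed:
            continue
        if step.recovered:
            # First recoverable failure marks the operating limit.
            if operating is None:
                operating = step.stress
        else:
            # First failure that needs a repair marks the destruct limit.
            destruct = step.stress
            break
    return operating, destruct


# Hypothetical hot step-stress log, in °C.
log = [
    StepResult(80.0, failed=False),
    StepResult(90.0, failed=False),
    StepResult(100.0, failed=True, recovered=True),
    StepResult(110.0, failed=True, recovered=True),
    StepResult(120.0, failed=True, recovered=False),
]
print(find_limits(log))  # (100.0, 120.0)
```

Either value can come back as `None`, which corresponds to a test that stopped at the maximum stress in the test plan before that limit was found.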

Setting Up a HALT

For accurate results, particular attention must be paid to the highly accelerated life testing configuration:

  • Design a vibration fixture to ensure vibrational energy is being transmitted into the product.
  • Design air ducting to ensure thermal energy is being transmitted into the product. This can include modification of the DUT to allow unimpeded airflow inside the device.
  • Tune the chamber for the sample being tested.
  • Determine locations for thermocouples and accelerometers to monitor temperature and acceleration, respectively.
  • Set up all functional test equipment and cabling.

Conducting a HALT and HALT Testing Examples

HALT testing is comprehensive and encompasses several testing phases, each with specific parameters to follow. 

Thermal Step Stress

Thermal step-stress testing is an important phase of highly accelerated life testing that applies incremental temperature stress levels throughout the product life cycle to identify product failure modes.

  • Connect the power and functional test equipment to the DUTs. Additional product-specific stresses may or may not be applied depending on the application.
  • Begin with cold step stress, followed by hot step stress.
  • Initially use 10 °C increments, decreasing to 5 °C increments as limits are approached.
  • Set the dwell time minimum at 10 minutes plus the time needed to run a functional test. Timing should commence once the temperature being monitored on the DUT has reached its set point.
  • Continue the test until operating and destruct limits are determined, or maximum stress as dictated by the test plan.
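The step sizes above can be turned into a list of chamber setpoints. The sketch below is illustrative only: the function name, the start and stop temperatures, and the `fine_from_c` threshold (the temperature at which the steps shrink from 10 °C to 5 °C) are assumptions, since the point at which limits are "approached" depends on the product and the test plan.

```python
def thermal_steps(start_c, stop_c, fine_from_c, coarse_step=10.0, fine_step=5.0):
    """List setpoints from start_c to stop_c, using finer steps past fine_from_c."""
    direction = 1.0 if stop_c >= start_c else -1.0
    setpoints = []
    temp = start_c
    while (stop_c - temp) * direction >= 0:
        setpoints.append(temp)
        # Switch to the fine increment once the threshold has been reached.
        in_fine_zone = (temp - fine_from_c) * direction >= 0
        temp += direction * (fine_step if in_fine_zone else coarse_step)
    return setpoints


# Cold step stress first, then hot step stress (all values hypothetical).
cold = thermal_steps(start_c=20.0, stop_c=-60.0, fine_from_c=-40.0)
hot = thermal_steps(start_c=30.0, stop_c=100.0, fine_from_c=80.0)

functional_test_min = 5.0  # assumed duration of one functional test
min_dwell_total = (len(cold) + len(hot)) * (10.0 + functional_test_min)
print(cold)
print(hot)
print(min_dwell_total)  # dwell time only; ramp and stabilization time are excluded
```

The dwell total is a lower bound, because each dwell only starts once the temperature monitored on the DUT has reached its setpoint.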

Thermal Shock Cycling

Thermal shock cycling is conducted between the DUT’s operating limits determined above. This phase of HALT testing exposes the DUT to fast thermal transitions, sometimes 60 °C per minute or as fast as the testing equipment/chamber allows.

  • Connect the power and functional test equipment to the DUTs. Additional product-specific stresses may or may not be applied depending on the application.
  • Keep the temperature range between 10 °C below the upper operating limit and 10 °C above the lower operating limit determined during step-stress testing.
  • If the sample cannot withstand maximum thermal transitions, decrease the transition rate by 10 °C per minute until the allowable rate is found.
  • Continue hot and cold thermal transitions with 10-minute dwells at each extreme for five total cycles. 
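As a rough planning aid, the cycling parameters above can be combined into a small calculation. This is a sketch under stated assumptions: the function name and the example operating limits are hypothetical, and each cycle is modeled as two ramps and two dwells with an ideal, constant ramp rate.

```python
def thermal_shock_plan(lower_op_limit_c, upper_op_limit_c,
                       ramp_c_per_min=60.0, dwell_min=10.0,
                       cycles=5, margin_c=10.0):
    """Derive shock-cycle extremes and an idealized total duration in minutes."""
    cold_c = lower_op_limit_c + margin_c  # 10 °C above the lower operating limit
    hot_c = upper_op_limit_c - margin_c   # 10 °C below the upper operating limit
    if hot_c <= cold_c:
        raise ValueError("operating limits are too close for the chosen margin")
    ramp_min = (hot_c - cold_c) / ramp_c_per_min
    # One cycle = ramp up + hot dwell + ramp down + cold dwell.
    total_min = cycles * 2 * (ramp_min + dwell_min)
    return {"cold_c": cold_c, "hot_c": hot_c,
            "ramp_min": ramp_min, "total_min": total_min}


# Hypothetical operating limits of -50 °C and 90 °C from step-stress testing.
plan = thermal_shock_plan(-50.0, 90.0)
print(plan)

# If the sample cannot withstand 60 °C/min, retry at 50 °C/min, and so on.
slower = thermal_shock_plan(-50.0, 90.0, ramp_c_per_min=50.0)
print(slower["total_min"])
```

Real chambers rarely hold a constant ramp rate on the product itself, so the total should be read as a lower bound.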

Vibration Step Stress

An important contributor to HALT testing, vibration step stress testing applies incremental vibrational stress levels to identify product failure modes.

  • Connect the power and functional test equipment to the DUTs. Additional product-specific stresses may or may not be applied (power cycling, line voltage/frequency margining, etc.), depending on the application.
  • Determine the G-level root mean square (Grms) increments, typically ranging from 3-5 Grms on the product.
  • Set the dwell time minimum at 10 minutes plus the time needed to run a functional test.
  • At 30 Grms and above, perform “tickle” vibration between setpoint vibrations. The tickle vibration is performed at 5 Grms while performing the functional check.
  • Continue the test until the operating and destruct limits are determined, or the maximum stress as dictated by the test plan.
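The vibration sequence, including the tickle rule, can be laid out as a simple list of steps. The sketch below assumes the test starts at one increment and stops at a maximum level from the test plan; both choices, and the function name, are illustrative assumptions.

```python
def vibration_steps(step_grms, max_grms, tickle_from_grms=30.0, tickle_grms=5.0):
    """Build the ordered list of (kind, level) vibration steps."""
    plan = []
    level = step_grms
    while level <= max_grms:
        plan.append(("setpoint", level))
        if level >= tickle_from_grms:
            # At 30 Grms and above, the functional check runs at tickle level.
            plan.append(("tickle", tickle_grms))
        level += step_grms
    return plan


for kind, level in vibration_steps(step_grms=5.0, max_grms=40.0):
    print(f"{kind:8s} {level:5.1f} Grms")
```

Below 30 Grms the functional test runs during the setpoint dwell itself, so no separate tickle step appears in the list.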

Combined Vibration and Thermal Shock Testing

This phase combines the vibration step levels and thermal shock cycles established in the earlier phases to stress the product further.

  • Connect the power and functional test equipment to the DUTs. Additional product-specific stresses may or may not be applied (power cycling, line voltage/frequency margining, etc.), depending on the application.
  • Use the vibration destruct limit and divide by 5 to determine the step increase at each of the five thermal cycles for this test.
  • Use the thermal shock cycle limits and ramp rate for the five thermal cycles.
  • At 30 Grms and above, perform a “tickle” vibration between setpoint vibrations (to manifest failure modes not evident under high vibration amplitudes or static conditions). The tickle vibration is performed at 5 Grms while performing the functional check.
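The divide-by-five rule can be illustrated with a short sketch. The function name and the 50 Grms destruct limit are hypothetical values chosen for the example.

```python
def combined_vibration_levels(vibration_destruct_grms, cycles=5):
    """Vibration level for each thermal cycle: destruct limit split into equal steps."""
    step = vibration_destruct_grms / cycles
    return [step * (i + 1) for i in range(cycles)]


# Hypothetical vibration destruct limit of 50 Grms found during step stress.
levels = combined_vibration_levels(50.0)
for cycle, grms in enumerate(levels, start=1):
    note = " (tickle at 5 Grms for the functional check)" if grms >= 30.0 else ""
    print(f"thermal cycle {cycle}: {grms:.0f} Grms{note}")
```

With these numbers the final thermal cycle runs at the vibration destruct limit itself, and the last three cycles fall under the tickle rule.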

Post-HALT Monitoring and Failure Analysis

Once HALT testing is completed, the design engineers’ focus shifts to determining the root cause of every failure and defining corrective actions. This can include identifying the failure site and failure mechanism for each failure mode. Afterward, a verification HALT should be run to evaluate whether the corrective actions fixed the problems.

If you are looking for reliable highly accelerated life testing services, contact FC Engineering Services. Our team of skilled engineers will help you perform a reliable test as well as other services to ensure that you take a quality product to market. 

Get the Spreadsheet

Were the HALT testing examples above useful? Get this Excel spreadsheet to help you to establish the Highly Accelerated Life Testing (HALT) prerequisites for product testing.
