United We Stand: Towards End-to-End Log-based Fault Diagnosis via Interactive Multi-Task Learning

Summary

Issues: Diagnostic Bias and Suboptimal Diagnosis in Log-based Software System Fault Diagnosis
Log-based fault diagnosis is typically divided into two stages: (1) anomaly detection is performed first, and (2) if a fault is detected, root cause localization follows.

Diagnostic Bias: If the detector yields inaccurate results, these inaccuracies will also be passed on to the localizer.
Suboptimal Diagnosis: A system fault may not be simultaneously detected and localized.
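
The two-stage flow, and the way a detector error reaches the localizer, can be illustrated with a minimal sketch. The `detect` and `localize` functions below are hypothetical stand-ins (a keyword check and a first-match search), not the models used by any existing method; only the control flow is the point.

```python
from typing import List, Optional

def detect(sequence: List[str]) -> bool:
    """Hypothetical stage-1 detector: flags a sequence containing 'ERROR'."""
    return any("ERROR" in line for line in sequence)

def localize(sequence: List[str]) -> Optional[int]:
    """Hypothetical stage-2 localizer: index of the first suspicious log."""
    for i, line in enumerate(sequence):
        if "ERROR" in line or "timeout" in line:
            return i
    return None

def two_stage_diagnose(sequence: List[str]) -> Optional[int]:
    # The localizer only runs when the detector fires, so a detector miss
    # silently becomes a localization miss (diagnostic bias).
    if not detect(sequence):
        return None
    return localize(sequence)

faulty = ["job start", "connection timeout", "job retry"]
print(two_stage_diagnose(faulty))  # the detector misses, so nothing is localized
print(localize(faulty))            # the localizer alone would point at index 1
```

In this toy case the fault is localizable but never reaches the localizer, which is the kind of error propagation described above.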

Key technologies: Bridging the gap between anomaly detection and root cause localization through their Bidirectional Interaction and Knowledge Transfer
In our view, the root cause of these issues is that anomaly detection and root cause localization rely on different forms of training data and have distinct diagnostic objectives. Anomaly detection uses anomalous log sequences as training data, with the objective of determining whether a fault has occurred in the system. Root cause localization uses root cause logs as training data, with the objective of localizing the fault that caused the failure. Simply piecing together expert models to construct a fault diagnosis system cannot bridge the gap between the two in terms of data forms and diagnostic objectives, nor can it handle fault diagnosis within a unified framework.

Our proposed method: Chimera
Addressing these issues requires bridging the gap between the two sub-tasks in terms of data forms and diagnostic objectives, and constructing an end-to-end fault diagnosis system. Our approach is to implement interactive multi-task learning between the two sub-tasks. Our key insight is that there is a strong mutual implication between the anomaly detection task and the root cause localization task, and their bidirectional interaction and knowledge transfer can bridge the gaps in data forms and diagnostic objectives.

Key Contributions

We proposed an end-to-end log-based fault diagnosis system, Chimera, which achieves anomaly detection and root cause localization interactively within a unified framework through carefully designed interaction strategies.
We designed a sequence-driven localizer based on the principle of multiple-instance learning, which utilizes only anomalous log sequences for training without the need for root cause logs.
We proposed a multi-task interaction strategy based on disentanglement learning and mutual information theory, which facilitates interaction at the feature and diagnostic result levels, effectively promoting knowledge transfer between tasks.
Evaluations on two public datasets and one industrial dataset demonstrate the significant effectiveness of our approach.

Insight

Two different fault diagnosis deployment paradigms. Existing log-based fault diagnosis methods are deployed in a task-independent paradigm: labeled log sequences and log entries are given as input, the detector and the localizer are trained independently, and the two are then integrated into a fault diagnosis network, which leads to issues such as accumulated diagnostic bias. The proposed log-based fault diagnosis method is deployed in a task-interactive paradigm: labeled log sequences are used to train the detector and the localizer interactively in an end-to-end manner, resulting in better diagnostic performance.

Empirical Study

EQ1: How effective are existing methods in addressing diagnostic bias?

Existing log-based fault diagnosis methods conduct anomaly detection and root cause localization in a task-independent manner, which fails to adequately address the inevitable diagnostic bias in fault diagnosis, thereby severely impacting diagnostic performance.

EQ2: How effective are existing methods in addressing suboptimal diagnosis?

Existing fault diagnosis methods independently perform anomaly detection and root cause localization, neglecting their collaborative relationship, which prevents simultaneous detection and localization of faults, resulting in a large number of undesirable suboptimal diagnoses.

The distribution of diagnostic results from existing methods in performing fault diagnosis on the Thunderbird dataset. The green portion represents Detected but not Localized Fault (DF), the red portion represents Localized but not Detected Fault (LF), and the yellow portion represents Detected and Localized Fault (DLF).
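
The three categories in this figure can be computed from per-fault outcomes with a small tally. The sketch below assumes each faulty sequence has already been reduced to a pair of booleans (detected, localized); the `outcome` and `tally` helpers and the toy inputs are illustrative and are not taken from the paper or its data.

```python
from collections import Counter
from typing import Iterable, Tuple

def outcome(detected: bool, localized: bool) -> str:
    """Map one faulty sequence's result to the categories used above."""
    if detected and localized:
        return "DLF"     # detected and localized
    if detected:
        return "DF"      # detected but not localized
    if localized:
        return "LF"      # localized but not detected
    return "missed"      # neither; not one of the three plotted categories

def tally(results: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count how many faults fall into each category."""
    return Counter(outcome(d, l) for d, l in results)

# Toy (detected, localized) flags for four faulty sequences; not real data.
toy = [(True, True), (True, False), (False, True), (True, True)]
print(tally(toy))
```

DF and LF together make up the suboptimal diagnoses; only DLF counts as a fault that was both detected and localized.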

Method

The proposed end-to-end log-based fault diagnosis pipeline of Chimera. The Chimera pipeline comprises three key stages: Log Preprocessing, Interactive Log Representation Learning (ILRL), and Joint Fault Diagnosis. First, the raw system logs are labeled and parsed into log event sequences, and the corresponding log embeddings are extracted. Second, the ILRL module interactively learns shared and private representations for anomaly detection and root cause localization, and combines them into a log representation for fault diagnosis. Finally, the learned log representation is fed into the localizer and the detector for joint fault diagnosis. Chimera bridges the gap between anomaly detection and root cause localization through their bidirectional interaction and knowledge transfer, achieving more effective end-to-end fault diagnosis.
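
The idea of shared and private representations can be made concrete with a toy sketch. This summary names disentanglement learning and mutual information but gives no formulas, so the code below swaps in a simple, commonly used device (an orthogonality penalty between shared and private features) purely for illustration. It is an assumption, not Chimera's actual objective, and the function names are hypothetical.

```python
from typing import List

Vector = List[float]

def combine(shared: Vector, private: Vector) -> Vector:
    """Concatenate shared and task-private features into one representation."""
    return shared + private

def orthogonality_penalty(shared: Vector, private: Vector) -> float:
    """Squared dot product: zero when the two feature vectors are orthogonal."""
    dot = sum(s * p for s, p in zip(shared, private))
    return dot * dot

shared = [1.0, 0.0]
private_detect = [0.0, 2.0]   # orthogonal to shared: no penalty
private_locate = [0.5, 1.0]   # overlaps with shared: penalized

print(combine(shared, private_detect))
print(orthogonality_penalty(shared, private_detect))
print(orthogonality_penalty(shared, private_locate))
```

Minimizing such a penalty during training would push task-specific information out of the shared features, which is one way to keep the two kinds of representation disentangled.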

Architecture

The workflow of the Sequence-driven Localizer. Given the embeddings of a log sequence, the network outputs a score for each log and locates the root cause log. The localizer is designed based on the principle of multiple-instance learning, and locates the root cause log by comparing the scores of normal log sequences and anomalous log sequences.
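
A minimal sketch of the multiple-instance idea: each log sequence is treated as a bag of logs, only the sequence carries a label, and the highest-scoring log in an anomalous sequence is taken as the root cause candidate. The max-pooling and the hinge-style ranking loss below are standard multiple-instance learning choices assumed here for illustration; the exact scoring network and loss used by Chimera are not specified in this summary.

```python
from typing import List

def sequence_score(log_scores: List[float]) -> float:
    """Bag-level score: the highest per-log score in the sequence."""
    return max(log_scores)

def locate_root_cause(log_scores: List[float]) -> int:
    """Index of the log with the highest score."""
    return max(range(len(log_scores)), key=log_scores.__getitem__)

def mil_ranking_loss(anomalous: List[float], normal: List[float],
                     margin: float = 1.0) -> float:
    """Hinge loss pushing the top anomalous score above the top normal score."""
    return max(0.0, margin - sequence_score(anomalous) + sequence_score(normal))

# Toy per-log scores, as a scoring network might output; not real data.
anomalous_scores = [0.1, 0.9, 0.2]
normal_scores = [0.1, 0.3, 0.2]

print(locate_root_cause(anomalous_scores))
print(mil_ranking_loss(anomalous_scores, normal_scores))
```

Because the loss only needs sequence-level labels, a localizer trained this way does not require root cause log annotations, matching the sequence-driven design described above.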

Experiment Results

RQ1: How effective is Chimera in the anomaly detection and root cause localization?

Chimera achieved the best performance, even when compared to expert models dedicated to anomaly detection or root cause localization.

RQ2: How effective is Chimera in the log-based fault diagnosis?

Compared to the advanced task-independent method built on a pipeline paradigm, Chimera achieved the best performance across three datasets.

RQ3: How do different modules contribute to Chimera?

Ablating each component degraded performance, to varying degrees.

RQ4: How effective is Chimera in addressing diagnostic bias?

Chimera effectively reduced the adverse effects of diagnostic bias, achieving the best performance.

RQ5: How effective is Chimera in addressing suboptimal diagnosis?

Chimera produced the fewest suboptimal diagnoses, with the majority of faults being detected and localized simultaneously.
