Issues: Diagnostic Bias and Suboptimal Diagnosis in Log-based Software System Fault Diagnosis
Log-based fault diagnosis is divided into two stages: 1. anomaly detection is performed first, and 2. if a fault is detected, root cause localization is then conducted.
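The two-stage flow described above can be sketched as follows. This is a minimal illustration of the generic pipeline paradigm, not Chimera or any specific existing method. The names `pipeline_diagnose`, `toy_detect`, and `toy_localize`, and the keyword-based rules inside them, are hypothetical stand-ins for trained expert models.

```python
from typing import Callable, List, Optional


def pipeline_diagnose(
    log_sequence: List[str],
    detect: Callable[[List[str]], bool],
    localize: Callable[[List[str]], int],
) -> Optional[int]:
    """Two-stage diagnosis: localization runs only if detection fires.

    Returns the index of the suspected root cause log, or None when the
    detector reports no fault.
    """
    if not detect(log_sequence):
        # A miss here is final: the localizer is never consulted.
        return None
    return localize(log_sequence)


# Hypothetical stand-ins for the two expert models (assumptions for illustration).
def toy_detect(logs: List[str]) -> bool:
    # Flags a sequence as anomalous if any line contains "ERROR".
    return any("ERROR" in line for line in logs)


def toy_localize(logs: List[str]) -> int:
    # Picks the first line mentioning "failure" as the root cause log.
    return next(i for i, line in enumerate(logs) if "failure" in line)


faulty = ["INFO boot", "ERROR disk failure", "WARN retry"]
missed = ["INFO boot", "WARN disk failure", "WARN retry"]

print(pipeline_diagnose(faulty, toy_detect, toy_localize))
print(pipeline_diagnose(missed, toy_detect, toy_localize))
```

In the second call the localizer could have pointed at the right log line, but the detector's miss prevents it from running at all. This is the kind of one-way dependency between stages that a pipeline cannot correct.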
Key technologies: Bridging the gap between anomaly detection and root cause localization through their Bidirectional Interaction and Knowledge Transfer
In our view, the root cause of these issues is that anomaly detection and root cause localization rely on different forms of training data and have distinct diagnostic objectives. Anomaly detection uses anomalous log sequences as training data, with the diagnostic objective of determining whether a fault has occurred in the system. Root cause localization uses root cause logs as training data, with the objective of localizing the fault that caused the failure. Simply piecing together expert models to construct a fault diagnosis system cannot effectively bridge the gap between the two in terms of data forms and diagnostic objectives, nor can it handle fault diagnosis within a unified framework.
Our proposed method: Chimera
Addressing these issues requires bridging the gap between the two sub-tasks in terms of data forms and diagnostic objectives, and constructing an end-to-end fault diagnosis system. Our approach is to implement interactive multi-task learning between the two sub-tasks. Our key insight is that there is a strong mutual implication between the anomaly detection task and the root cause localization task, and their bidirectional interaction and knowledge transfer can bridge the gaps in data forms and diagnostic objectives.
Existing log-based fault diagnosis methods conduct anomaly detection and root cause localization in a task-independent manner, which fails to adequately address the inevitable diagnostic bias in fault diagnosis, thereby severely impacting diagnostic performance.
Existing fault diagnosis methods independently perform anomaly detection and root cause localization, neglecting their collaborative relationship, which prevents simultaneous detection and localization of faults, resulting in a large number of undesirable suboptimal diagnoses.
The distribution of diagnostic results from existing methods in performing fault diagnosis on the Thunderbird dataset. The green portion represents Detected but not Localized Fault (DF), the red portion represents Localized but not Detected Fault (LF), and the yellow portion represents Detected and Localized Fault (DLF).
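The three outcome categories in the caption can be computed from per-fault results, as in the minimal sketch below. It assumes each fault is summarized by two booleans (whether it was detected, and whether it was localized). The function names, the `MISSED` label for faults that are neither detected nor localized, and the sample data are hypothetical and do not come from the Thunderbird experiments.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def categorize(detected: bool, localized: bool) -> str:
    """Map one fault's diagnosis result to an outcome category."""
    if detected and localized:
        return "DLF"  # Detected and Localized Fault
    if detected:
        return "DF"   # Detected but not Localized Fault
    if localized:
        return "LF"   # Localized but not Detected Fault
    return "MISSED"   # hypothetical label: neither detected nor localized


def outcome_distribution(results: Iterable[Tuple[bool, bool]]) -> Dict[str, float]:
    """Return the fraction of faults falling into each category."""
    counts = Counter(categorize(d, l) for d, l in results)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


# Made-up (detected, localized) pairs, for illustration only.
sample = [(True, True), (True, False), (False, True), (True, True)]
distribution = outcome_distribution(sample)
print(distribution)
```

Under this framing, DF and LF are the suboptimal diagnoses, and a larger DLF share means more faults are detected and localized simultaneously.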
Chimera achieved the best performance even when compared to expert models that address anomaly detection and root cause localization individually.
Compared to the advanced task-independent method built on a pipeline paradigm, Chimera achieved the best performance across three datasets.
Ablating each component resulted in a different level of performance degradation.
Chimera effectively reduced the adverse effects of diagnostic bias, achieving the best performance.
Chimera produced the fewest suboptimal diagnoses, with the majority of faults being detected and localized simultaneously.