S. Kandula, R. Mahajan, P. Verkaik, S. Agarwal, J. Padhye, P. Bahl, “Detailed Diagnosis in Enterprise Networks,” ACM SIGCOMM Conference, (August 2009). [PDF]
Summary
Diagnosing generic and application-specific faults in networked systems is notoriously hard because of the dynamic and complex interactions between components. This paper presents NetMedic, a system that enables detailed diagnosis in small enterprise networks by using the rich (multi-variable) information exposed by the system itself and the applications running on it. NetMedic formulates diagnosis as an inference problem and detects complex, intertwined faults using a simple history-based learning mechanism, without explicit knowledge of the semantics of the collected information and with little application-specific knowledge. It models the network as a dependency graph of fine-grained components and weights each dependency edge by the likelihood that one component is affecting the other. Using these edge weights, NetMedic ranks the possible causes of a fault to find the true culprit. The work is limited to automatic identification; automatic repair is considered out of scope.
NetMedic has two basic properties:
- Details: It can diagnose both application-specific and generic problems with as much specificity as possible. Since most faults are application-specific, digging deep is considered essential for getting close to the context of a fault.
- Application agnosticism: It can diagnose with minimal application-specific knowledge. This property allows NetMedic to function properly even when a new, unknown application is introduced.
Inference in NetMedic depends on the joint history of components that impact one another. For the current states of any two components, NetMedic looks back through the history (from any period) for times when the states were similar, and uses how closely they were correlated then to assign a likelihood to the current situation, in the hope that history will repeat itself. The authors argue that the exact values are not that important; what matters is assigning low weights to enough edges that ranking the possible faults becomes easier. This expectation seems pretty naive, but the authors insist that it is enough, without strong justification or real data.
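To make the history lookup concrete, here is a minimal sketch of how such an edge weight could be computed. It is my own simplification, not the authors' algorithm: the function names, the Euclidean distance, and the `k` and `scale` parameters are all assumptions made for illustration.

```python
from math import sqrt

def distance(a, b):
    """Euclidean distance between two equal-length state vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def edge_weight(src_now, dst_now, history, k=3, scale=1.0):
    """Estimate how well the source's current state explains the destination's.

    history: list of (src_state, dst_state) pairs from past time bins.
    k and scale are hypothetical tuning knobs, not values from the paper.
    """
    if not history:
        return 1.0  # assumption: with no history, do not rule the edge out
    # Pick the k past bins where the source looked most like it does now.
    nearest = sorted(history, key=lambda pair: distance(pair[0], src_now))[:k]
    # Average how far the destination was, in those bins, from its current state.
    avg_gap = sum(distance(dst, dst_now) for _, dst in nearest) / len(nearest)
    # Small gap means history "repeats", so the weight is high; clamp at zero.
    return max(0.0, 1.0 - avg_gap / scale)

# Made-up one-variable states, normalized to [0, 1].
history = [([0.9], [0.8]), ([0.1], [0.1]), ([0.95], [0.85]), ([0.2], [0.15])]
# Destination behaves as it did when the source was last in this state.
likely = edge_weight([0.9], [0.8], history, k=2)
# Destination behaves very differently from those past bins.
unlikely = edge_weight([0.9], [0.1], history, k=2)
```

In this toy history the first query gets a weight close to 1 and the second a much lower one, which is the only property the ranking step needs if the authors' argument about low-weight edges holds.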
The workflow of NetMedic can be divided into three functional pieces:
- Capturing component state: NetMedic captures the state of each component as a multi-variable vector, as opposed to the scalar variables used in existing approaches. It stores the states in one-minute bins to be processed later when creating the dependency graph.
- Generating the dependency graph: NetMedic uses pre-created templates to find the relationships between different types of components and stitches them together to complete the dependency graph. Relying on templates means human intervention is required whenever there is an unknown component without a template.
- Diagnosis: Finally, NetMedic computes the abnormality of states, assigns values to the edges of the dependency graph (details in §5.3.2), and ranks likely causes to identify the true causes of faults. Detecting abnormality from historical information introduces false positives. Based on the abnormality of the components, NetMedic assigns weights to edges using multiple heuristics and threshold constants. At the ranking stage it uses a compound ranking function with two components: the local and the global impact of a cause.
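The ranking step can be sketched as follows. This is an illustration under my own assumptions rather than the paper's exact formulas: I take the impact of one component on another to be the geometric mean of edge weights along the strongest cycle-free path, and the global impact to be the abnormality-weighted sum of impacts on every component. The graph, the abnormality scores, and all names are made up.

```python
from math import prod

def simple_paths(graph, src, dst, seen=()):
    """Yield each cycle-free path from src to dst as a list of edge weights."""
    if src == dst:
        yield []
        return
    visited = seen + (src,)
    for nxt, weight in graph.get(src, {}).items():
        if nxt not in visited:
            for rest in simple_paths(graph, nxt, dst, visited):
                yield [weight] + rest

def impact(graph, src, dst):
    """Score the strongest path by the geometric mean of its edge weights."""
    if src == dst:
        return 1.0
    scores = [prod(p) ** (1.0 / len(p)) for p in simple_paths(graph, src, dst)]
    return max(scores, default=0.0)  # 0.0 when dst is unreachable from src

def rank_causes(graph, abnormality, affected):
    """Order candidate causes by local impact times global impact."""
    ranking = []
    for cause in abnormality:
        if cause == affected:
            continue  # simplification: the affected component is not a candidate
        local = impact(graph, cause, affected)
        global_ = sum(impact(graph, cause, other) * abnormality[other]
                      for other in abnormality)
        ranking.append((cause, local * global_))
    return sorted(ranking, key=lambda item: item[1], reverse=True)

# Hypothetical dependency graph: graph[src][dst] = edge weight in [0, 1].
graph = {
    "sql_server": {"web_app": 0.9},
    "backup_job": {"web_app": 0.2, "host_cpu": 0.8},
    "host_cpu": {"web_app": 0.3},
}
# Hypothetical abnormality scores for the current time bin.
abnormality = {"sql_server": 0.9, "backup_job": 0.2, "host_cpu": 0.3, "web_app": 0.95}
ranking = rank_causes(graph, abnormality, "web_app")
```

With these numbers `sql_server` comes out on top, followed by `backup_job` and `host_cpu`. Note how much of the outcome rests on the edge weights: a handful of mis-weighted edges would reorder the list.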
The authors try to validate their scheme with a limited, controlled evaluation that is severely lacking and does little to convince the reader.
Critique
Except for the facts that this is the first study of faults in enterprise networks (as claimed by the authors) and the first of its kind to use multi-variable states to infer faults, nothing else is very convincing. The model seems too simplistic, with heuristics and constants all over the place. The authors keep arguing that all of this works in real networks, even though they evaluated the system on two primitive, controlled networks. The way they use historical information (linear correlation) also seems pretty naive for real-world scenarios, where the interaction between two components can often be influenced by many others.
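A toy example shows why pairwise correlation can mislead when a third component is involved. The data below is entirely made up: two services that never interact both slow down with the load on a shared disk, and their latencies end up strongly correlated anyway.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-minute load on a shared disk (the hidden third component).
disk_load = [10, 20, 30, 40, 50, 60, 70, 80]
# Two services with no direct dependency; each tracks the disk plus some jitter.
latency_a = [d * 0.5 + j for d, j in zip(disk_load, [1, -1, 2, -2, 1, -1, 2, -2])]
latency_b = [d * 0.3 + j for d, j in zip(disk_load, [-1, 2, -2, 1, -1, 2, -2, 1])]

r = pearson(latency_a, latency_b)  # high, although neither service affects the other
```

A scheme that weights the edge between the two services purely from their joint history would treat them as strongly dependent here, even though the disk is the real culprit.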
Dependence on history also restricts NetMedic's ability to deal with previously unseen faults (considering that it is not a full-fledged learner). It is not clear how the authors handle this issue. In particular, when a new instance of NetMedic starts with no history at all, how long will it take to reach the level of performance shown even in the paper's controlled scenario?
NetMedic seems to work when diagnosing a single fault (in a controlled environment), but does not do well with more than one simultaneous fault, even in the test scenarios. The reliability of the multi-fault evaluation can also be questioned, since it generalizes from one synthetic test case (maybe more that are not shown) and claims NetMedic will perform similarly in other multi-fault scenarios. To do justice to the authors, it should be mentioned that it is extremely difficult to evaluate such systems without long-term deployment in a real setting.
I like the way you picked holes in the work. The big issue is always detecting events that you have not seen before. But if you have the right building blocks and can execute the inference algorithms in the right way, you should be able to detect a class of faults you have not seen before, simply because many of the problems in distributed systems arise from the interaction of components rather than from the components in and of themselves. I think the approach is interesting, but it would be very nice if someone looked into the idea of active experimentation to detect causality.