R. Mahajan, D. Wetherall, T. Anderson, “Understanding BGP Misconfiguration,” ACM SIGCOMM Conference, (August 2002).
Summary
Misconfigurations in BGP tables result in excessive routing load, connectivity disruption, and policy violation. However, misconfigurations are widely prevalent in the Internet without any systematic study of their characteristics. This paper presents an empirical study of the BGP misconfigurations with an aim to find out their frequencies and root causes behind them. The authors analyzed routing table advertisements over a three-week period from 23 vantage points using the RouteViews listener to detect BGP misconfigurations and separately polled the concerned ISPs to verify the collected data with the polled ones. Their final verdict is that the perceived end-to-end connectivity the Internet is surprisingly robust! (Does it sound familiar?) Apart from that, there are several numerical results, most of them providing conservative lower bounds on different configuration errors. They also propose several fixes to abate the sources of misconfigurations.
The authors consider two globally visible BGP misconfigurations (misconfigurations being inadvertent slips and design mistakes):
- Origin misconfiguration: Accidental injection of prefix in global tables due to (i) aggregation errors in rules, (ii) hijacks, and (iii) leakage of private routes. Causes of origin misconfigurations include: (i) self deaggregation, (ii) related origin, and (iii) foreign origin.
- Export misconfiguration: AS-paths in violation of policies of one of the ASes in the path.
The basic assumption behind used for declaring a prefix to be an artifact of error is that misconfigurations appear and disappear really quickly while policy changes happen in human timescale. To be exact, the authors only consider short-lived (less than a day) changes as candidates for misconfigurations, and double-check their observations through email surveys of the concerned ISPs.
From their empirical study the authors discovered several intriguing results:
- Origin misconfigurations cause 13% incident, but result in only 4% connectivity issues (i.e., disrupting connectivity is not easy). Moreover, 95% of the incidents are corrected within less than 10 hours and 95% of connectivity issues are resolved in less than an hour!
- Export misconfigurations do not cause connectivity issues directly, but they bring in extra traffic to cause congestion leading to negligible indirect connectivity errors.
- Overall, 75% of prefix announcements are due to misconfigurations in one way or another.
- Apart from human error, which they set out to prove in the first place, software bugs play a significant role in BGP issues.
Based on email survey results, this paper identifies several sources of both types of misconfigurations (e.g., bugs, filtering error, stale configuration, hijacking, bad ACL etc.). And to check these source, the authors also propose solutions such as UI improvement, high-level language design, database consistency, deployment of secure protocols (e.g., S-BGP) etc.
Critique
One thing noticeable here is that they do not propose any automated monitoring and identifying system that could make life easier for the network operators. May be the results presented here will provide insight into solving some of the issues, but manual methods will probably not be helpful in identifying unknown phenomena. (Just saw that Neil added some links to ML-based studies possibly for automated identification)
Somewhere it is said that temporary route injection could be done to fend off DDoS attacks. Better explanation during discussion might help in that respect. The also authors remark that only a few BGP features are responsible for most of the errors. (Does it follow the Pareto principle?) But they do not particularly point out any of the features that are actually responsible. Is it even possible to figure those out? One more thing mentioned in the paper is that problem with silent failures. How do we make sure that routers do not fail silently? Do we use heartbeat protocols or just dump every action for later inspection?
Not much to say about the status of the inconsistent databases throughout the Internet and inability to incorporate extensions (e.g., S-BGP) into the network. Somethings never change.
Feedback
While this paper aims to untangle mysteries behind BGP misconfigurations, in its heart its an excellent network measurement paper. The way the authors arrange the bits and pieces to build up the whole paper is incredible. (An aside: Ratul was also involved in the RocketFuel paper, another excellent network measurement paper, that got the best student paper award in SIGCOMM 2002.)
Part of the problem is that it is really hard to observe the behavior of the BGP infrastructure from the edges in, especially from a university edge AS. Routeviews helps, but it is limited in the reach of ISPs that peer with it. It is an interesting question as to how to balance global health monitoring with competitive secrecy among the service providers.