Hey, Habr! This is a transcript of the presentation that Qrator Labs' cloud architect, Alexander Zubkov, gave at the RIPE 82 online conference this year.
Hello, everybody!
My name is Alexander Zubkov and today I’d like to talk about routing loops.
As you may know, a routing loop is a situation in which a packet is routed infinitely, or almost infinitely, in a loop. We can get into such a situation during dynamic routing protocol convergence, and for BGP, at Internet scale, convergence can sometimes take minutes.
Another reason for this happening could be stuck routes in the dynamic routing protocols. Also, this could be caused by configuration errors and such loops are persistent until the configuration error is fixed.
And the easiest way, and I think the most popular one, to get into such a situation is with unused IP space or badly routed NAT pools.
For example, a provider assigns and routes some address space to its client, and the client uses only a part of that space. The remaining addresses are routed back to the provider by the client's default route. So, we get a loop here.
And this situation is easy to fix on the client side: null-route the whole address block you receive from the provider. Routes for the addresses in use are more specific, so they still win, and packets to any unused address are dropped instead of looping.
Also, if the provider implements an anti-spoofing policy on the client’s interface, that can break the routing loop caused by a misconfigured client as well. So BCP38 is helpful here.
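To make the null-route fix concrete, here is a minimal sketch of the provider/client scenario as a toy longest-prefix-match simulation. The prefixes, the router names and the helpers `next_hop`, `forward` and `net` are all made up for illustration; this is not how any real router is configured.

```python
import ipaddress

DELIVER, DROP = "deliver", "drop"  # terminal actions, not router names


def next_hop(table, dst):
    """Longest-prefix match: return the action of the most specific route."""
    matches = [(prefix, hop) for prefix, hop in table if dst in prefix]
    return max(matches, key=lambda m: m[0].prefixlen)[1]


def forward(tables, start, dst, ttl=64):
    """Follow next hops until the packet is delivered, dropped, or its TTL runs out."""
    router, hops = start, 0
    while ttl > 0:
        hop = next_hop(tables[router], dst)
        if hop in (DELIVER, DROP):
            return hop, hops
        router, ttl, hops = hop, ttl - 1, hops + 1
    return "ttl expired", hops


def net(prefix):
    return ipaddress.ip_network(prefix)


# Hypothetical setup: the provider routes a /24 to the client,
# but the client only uses the lower /25 and has a default route back.
provider = [(net("203.0.113.0/24"), "client"), (net("0.0.0.0/0"), DROP)]
client_looping = [(net("203.0.113.0/25"), DELIVER), (net("0.0.0.0/0"), "provider")]
# The fix: a null route for the whole assigned block on the client side.
client_fixed = client_looping + [(net("203.0.113.0/24"), DROP)]

unused = ipaddress.ip_address("203.0.113.200")
print(forward({"provider": provider, "client": client_looping}, "provider", unused))
print(forward({"provider": provider, "client": client_fixed}, "provider", unused))
```

In this toy model the first call bounces the packet between the two routers until the TTL expires, while the second one drops it at the client after a single hop. Addresses in the used /25 are still delivered, because the /25 route is more specific than the /24 null route.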
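Why does anti-spoofing help? A looping packet that comes back from the client still carries its original source address, which belongs to some outside host and not to the client. Here is a minimal sketch of such an ingress check; the function name `passes_ingress_filter` and the prefixes are assumptions for illustration, not a real router feature.

```python
import ipaddress


def passes_ingress_filter(src, allowed_prefixes):
    """Accept a packet from the client interface only if its source address
    belongs to one of the prefixes assigned to that client (BCP38 style)."""
    addr = ipaddress.ip_address(src)
    return any(addr in ipaddress.ip_network(p) for p in allowed_prefixes)


client_prefixes = ["203.0.113.0/24"]  # hypothetical block assigned to the client

# Legitimate traffic originated by the client passes.
print(passes_ingress_filter("203.0.113.5", client_prefixes))
# A looped packet returning from the client still has an outside source
# address, so the filter drops it after one round trip.
print(passes_ingress_filter("198.51.100.7", client_prefixes))
```

So even when the client forgets the null route, the packet makes only one trip to the client and back before the provider discards it.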
So, what is the problem with loops? It may be obvious, but, for example, my former internet provider called it a “cosmetic issue”. The main and biggest issue is the possibility of high link utilization.
For example, if a packet with a high TTL reaches a two-hop loop, it bounces back and forth until the TTL runs out, so we get about 100x amplification on that link - rather easy to understand. And such a link could be a target of a DDoS attack. Or if it is a link to your provider, you could get increased bills because of that.
There are some less frequent uses too. For example, some very bad routing loops could be used as a means of DDoS attack against other networks. Other researchers described a method of using routing loops to infer the ability to spoof from a network.
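The amplification arithmetic can be sketched in a few lines. This is a simplified model that assumes every router in the loop decrements the TTL by one and forwards the packet only while the TTL stays above zero; the function name `loop_link_crossings` is made up for this example.

```python
def loop_link_crossings(ttl, loop_hops):
    """Count how many times one packet crosses each directed link of a loop.

    ttl is the TTL the packet has when it reaches the first router of the loop;
    loop_hops is the number of routers in the loop.
    """
    crossings = [0] * loop_hops
    router = 0
    # A router decrements the TTL and forwards only if it is still positive.
    while ttl > 1:
        ttl -= 1
        crossings[router] += 1
        router = (router + 1) % loop_hops
    return crossings


# A packet entering a two-hop loop with a TTL of about 200
# crosses the link roughly 100 times in each direction.
print(loop_link_crossings(201, 2))
```

The shorter the loop and the higher the remaining TTL, the more traffic each link in the loop carries per incoming packet.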
The Risks and Dangers of Amplified Routing Loops (Andree Toonk)
Flooding Attacks by Exploiting Persistent Forwarding Loops (Jianhong Xia, Lixin Gao, Teng Fei)
Weaponizing Middleboxes for TCP Reflected Amplification (Kevin Bock, Abdulrahman Alaraj, Yair Fax, Kyle Hurley, Eric Wustrow, Dave Levin)
“We were unable to terminate the barrage of packets sent to us … the traffic stopped after approximately six days … We believe the reason they finally stopped was because the routing loop eventually dropped a packet.”
Using Loops Observed in Traceroute to Infer the Ability to Spoof (Qasim Lone, Matthew Luckie, Maciej Korczyński, Michel van Eeten)
Hunting down the stuck BGP routes (Ben Cox)
I picked some articles describing such problems. For example, in the part where I put a quotation, a group of researchers found some infinite loops on the Internet, and some of them persisted for six days.
I also made a simple setup consisting of 3 servers and 2 switches to test for potential issues caused by a loop. The loop is between the switches, and here you can take a look at the results.
Two lines highlighted in yellow have the same flood rate: the first is the flood directed at the server, and the second is the flood directed into the loop. As you can see, in the first case the server wasn’t affected at all, while in the second one a flood of slightly more than 400 Mbps saturated a 40G channel, which led to a 5x slowdown.
Another interesting result of the test is highlighted in green: that was when I tried to saturate the loop with small packets, and I got no slowdown whatsoever. I think this is because of how queues are organized in the switches, maybe some fair scheduling among ports or something like that.
But, anyway, we already have issues with large packets, and if you have a server or router, they could have other bottlenecks than switches, so they may be affected too. And they could be affected even more by large packets in a loop.
At Qrator Labs we have a project called Qrator.Radar - you can register there for free, and it’s used to collect different types of BGP issues and other routing problems. After registering you can see what is going on with your autonomous system.
Qrator.Radar provided me with historical data on the loops observed in our system over the last few years. As you can see, the overall trend is downward; I hope this is not just because of migration to IPv6. Right now Radar sees around 22 million loops, but I also did my own research for this presentation.
I scanned the Internet and got 28 million unique replies, which is approximately 1% of all active IPv4 address space. There are actually more loops than that, because not all routers always reply, and many replies get lost too. 1% doesn't seem like much, but these IP addresses are located in 25 000 autonomous systems, which in turn is 35% of all active ASes. Every third AS contains a loop destination. Loops can also be found in the autonomous systems of companies that I think should care the most about connectivity and reachability, like CDN providers and DDoS mitigation services. And those are no small companies; some big names are on the list.
The fun fact is that I sent 4 probes to each IP, but on average I got more than 4 replies for each unique IP. So there are some amplifiers out there, and in some cases I got more than 100 000 replies for a single IP address.
I also tried to count unique loops, which turned out to be a difficult problem, but I estimate that there are at least hundreds of thousands of them. More than half a million router IPs are involved in those loops, and they are located in almost 20 000 ASes.
I found loops from 1 to 34 hops in length. 2 hops is the most popular type, with at least half of all loops being 2 hops long. There are loops spanning up to 5-7 autonomous systems or up to 8 countries. The longest loop I found in terms of duration lasts up to 18 seconds.
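Spotting such amplifiers in scan results comes down to dividing replies by probes. Here is a minimal sketch, assuming the scan output is just a flat list with one probed address per received reply; the function name `reply_amplification` and the input format are assumptions and not the actual tooling used for the scan.

```python
from collections import Counter


def reply_amplification(replies, probes_per_target=4):
    """Map each probed address to (replies received) / (probes sent)."""
    counts = Counter(replies)
    return {target: n / probes_per_target for target, n in counts.items()}


# Hypothetical scan log: 8 replies for one probed address, 2 for another.
log = ["192.0.2.10"] * 8 + ["192.0.2.20"] * 2
factors = reply_amplification(log)

# Anything above 1.0 returned more packets than were sent to it.
amplifiers = {t: f for t, f in factors.items() if f > 1.0}
print(amplifiers)
```

A factor above 1.0 means the target sent back more packets than it received, which is what makes it interesting as an amplifier.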
Here are statistics of loop destinations by country. The United States is the absolute leader, with more than 6 million loop destinations.
And here you’ve got the statistics for Europe only. Germany and Russia are the leaders.
Here you can see the statistics by autonomous system. The two leaders here are Indian National Internet Backbone and Lumen. However, Lumen has several autonomous systems, and if you sum them up, it has more than 600 000 loop destinations.
I also want to show you some of the most interesting loops I found while looking at the collected data. For example, you can see there is a loop of two hops, but one of the routers is too diligent and replies every time. Or, for example, you can see only one router IP answering, but if you look at the RTT you would recognize a clear two-hop pattern there too. There are also loops where you have a loop of several hops, but only two hops reply. All these are of course repeating patterns.
I have also found an example of a 34-hop loop inside one network. Let’s move on.
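One simple way to look for such repeating patterns in a traceroute is to search for the shortest period at the end of the hop list. The sketch below is only an illustration under a few assumptions: hops are given as a list of reply addresses with `None` for silent hops, a silent hop is treated as just another symbol, and a loop is reported only if the pattern repeats at least `min_repeats` times. It would not handle every odd case, such as a router that replies on every hop.

```python
def find_loop_period(hops, min_repeats=3):
    """Return the shortest period with which the tail of the trace repeats,
    or None if no repeating pattern is found."""
    n = len(hops)
    for period in range(1, n // min_repeats + 1):
        tail = hops[n - period * min_repeats:]
        # A cycle made only of silent hops tells us nothing.
        if all(hop is None for hop in tail[:period]):
            continue
        if all(tail[i] == tail[i + period] for i in range(len(tail) - period)):
            return period
    return None


# Hypothetical traces: a plain two-hop loop, and a four-hop loop
# in which only two of the routers reply.
two_hop = ["10.0.0.1", "192.0.2.1"] + ["198.51.100.1", "198.51.100.2"] * 3
partial = ["10.0.0.1"] + ["198.51.100.1", None, None, "198.51.100.9"] * 3

print(find_loop_period(two_hop))
print(find_loop_period(partial))
```

Treating silent hops as symbols instead of wildcards keeps the check strict: a four-hop loop with two silent routers is still reported with a period of four, not two.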
There are also some strange loops, like this one which I called a “flat” loop. It looks like a packet is going back and forth in a row of routers, although that is not the weirdest loop.
I found such a strange loop - I am completely lost on what is happening there. I think someone probably enslaved a network administrator, so he hid such a message to ask for help.
To relieve the pain shown on the previous slides I also wanted to show you some fun traceroutes with artificially created names. They are all dead now with one exception. There was one traceroute where you could see a Christmas tree with the “Let it Snow” lyrics. Instead of this funny traceroute there is a loop now.
Thanks for reading and I hope you’re working on eliminating routing loops from your network.
Video of the presentation is available via this link.