No MBD at TG12?

At The Gathering 2012[1], we have a recurring problem every year: Lots of people want to play their games in LAN mode (which effectively works by broadcasting a request packet and waiting for any unicast replies), but we can't put 5000 people into a single broadcast domain, so the broadcast will only reach the people on their local switch, and they won't see the entire hall's games.

A few years back, I blogged about our solution to this, aptly called MBD (“multi-broadcast daemon”, but also a pun on “minor brain damage”, due to the horribleness of the hack): in the core switches, abuse the DHCP helper function by adding all the game UDP ports to it, then let the DHCP server listen on those ports and send the packets out again by directed broadcast (effectively sending out 160 packets for each one that comes in, modulo rate limiting and a host of extra tweaks).

However, this year we switched from Cisco's 6509 workhorse to the fancy new Nexus 7000 series of datacenter switches. Incidentally, we also did not have properly working MBD in the hall anymore. What happened?

First of all, the Nexus is a much younger platform than the 6509, running NX-OS (an IOS lookalike, but on top of Linux) instead of IOS. This means that a lot of functionality, in particular the kind you don't need in datacenters, is missing. (Some missing functionality is on Cisco's public roadmap; some is not.) In particular, you can't add extra UDP ports to the DHCP helper anymore, so you need some other way of getting at the packets.

After some deliberation, we thought: Hey, maybe we can use ERSPAN[2]? ERSPAN is a relatively new technology from Cisco where you can essentially do “port mirroring” (i.e. see all traffic on a given port, on another port) across entirely different switches in your network. It's intended purely as a debugging feature, of course, but that doesn't mean we can't abuse it. It even has hardware ACLs, so we don't have to flood one machine with all sorts of non-game packets. (We didn't get it to filter ARPs, though, but that's pretty much traffic we can ignore.)

Unfortunately, the other switches couldn't properly decapsulate ERSPAN even though they were supposed to, and we discovered this during the party. This meant we had to decapsulate ERSPAN ourselves. Fortunately, the packets are just Ethernet frames within some weird framing within regular GRE, so we set up a GRE tunnel on the Linux box to each of the core switches, and wrote a C program to pick up the frames, filter out the ones that were UDP, and send them out again (with spoofed source and all) on the loopback interface (a rough sketch of that step follows below). This made MBD receive the packets properly.

However, after a day or two we noticed that behavior was getting erratic; people didn't find as many games as they should have, and the number would vary pretty wildly. Eventually it turned out that the Nexus switches' “auto-CoPP” behavior was probably killing us; each directed-broadcast packet takes a fair amount of CPU time, and as such, the switch started protecting itself from these packets by dropping them on the floor (to avoid saturating its own CPU). We spent some time trying to pace them, but then we'd simply have problems getting all the packets out before the next search request arrived, so no go.

OK, so we needed something else. How about just dropping directed broadcast and asking every single host on the network (a /17; thankfully this is not IPv6!) by unicast? Some modification of the code was in order, and after a while, we were sending out packets... but way too slowly.
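As an aside, here is roughly what that pick-up-and-reinject step could look like. This is a sketch, not the actual TG12 program: it reads the GRE packets straight off a raw socket instead of going through the GRE tunnel devices we actually used, it just hands the inner packets back through a raw socket rather than injecting them specifically on the loopback interface, the GRE/ERSPAN header lengths are assumptions (a real version should parse the GRE flag bits), and error handling is left out.

    /*
     * Rough sketch: grab the GRE (IP protocol 47) packets coming from the core
     * switches, skip the outer IP + GRE + ERSPAN framing, keep only the inner
     * IPv4/UDP packets, and hand them back to the kernel through a raw socket,
     * spoofed source address and all. Header lengths are assumptions. Needs root.
     */
    #include <netinet/in.h>
    #include <netinet/ip.h>
    #include <string.h>
    #include <sys/socket.h>
    #include <sys/types.h>

    #define GRE_HDR_LEN     8   /* 4-byte GRE header + 4-byte sequence number */
    #define ERSPAN_HDR_LEN  8   /* ERSPAN type II header */
    #define ETH_HDR_LEN    14   /* assumes the inner frame is untagged */

    int main(void)
    {
        /* Receives every incoming GRE packet, outer IP header included. */
        int in = socket(AF_INET, SOCK_RAW, IPPROTO_GRE);

        /* For re-injection; IPPROTO_RAW implies IP_HDRINCL, so the inner packet
         * goes out verbatim, spoofed source included. SO_BROADCAST is needed
         * because the game packets are addressed to broadcast addresses. */
        int out = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);
        int one = 1;
        setsockopt(out, SOL_SOCKET, SO_BROADCAST, &one, sizeof(one));

        unsigned char buf[65536];
        for (;;) {
            ssize_t len = recv(in, buf, sizeof(buf), 0);
            if (len < (ssize_t)sizeof(struct iphdr))
                continue;

            struct iphdr *outer = (struct iphdr *)buf;
            unsigned char *eth = buf + outer->ihl * 4 + GRE_HDR_LEN + ERSPAN_HDR_LEN;
            if (eth + ETH_HDR_LEN + sizeof(struct iphdr) > buf + len)
                continue;
            if (((eth[12] << 8) | eth[13]) != 0x0800)
                continue;                        /* only plain IPv4 frames */

            struct iphdr *inner = (struct iphdr *)(eth + ETH_HDR_LEN);
            if (inner->protocol != IPPROTO_UDP)
                continue;                        /* only the UDP game traffic */

            struct sockaddr_in dst;
            memset(&dst, 0, sizeof(dst));
            dst.sin_family = AF_INET;
            dst.sin_addr.s_addr = inner->daddr;  /* typically a broadcast address */
            sendto(out, inner, buf + len - (unsigned char *)inner, 0,
                   (struct sockaddr *)&dst, sizeof(dst));
        }
    }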
We tried converting the code to use threads (and this is Perl!) in various interesting ways, but no go; Net::RawIP is not thread-safe, and would segfault on us. Eventually I split the packet-pushing business out into a separate C program that would take a single packet and then a long list of unicast destinations to send it to (we shouldn't send the packet back to the switch it originated from, and we shouldn't send to all sorts of administrative linknets, so we can't just send to all 32768 addresses blindly; a rough sketch of such a sender is at the very end of this post). This fixed our performance problems, and we sent out 160k packets per second or so... except it didn't work. The games seem to listen for broadcast requests only, ignoring unicast.

OK, so can't we just inject the broadcast packets ourselves, skipping this layer 3 business? Unfortunately, no; even if we pull 160 VLANs through a few switches to the MBD machine, we are not at liberty to make the interfaces into switched interfaces; that would make us very vulnerable to an issue where the table switches eat BPDUs by default, interacting badly with LACP and causing loops, possibly taking down the entire core switch. (This happened last year, although with 6500s; auto-CoPP on the Nexus might save us, but we were not willing to take that risk.)

OK, so maybe we can use bidirectional ERSPAN? Unfortunately, no; a port cannot be both source and destination for ERSPAN at the same time (probably to avoid loops, I'd guess). We could try to push the packets out through some different interface, but then we'd need to switch them back in again, and then we'd be back at the problem with possible loops due to the BPDU filter.

Eventually we raised a case with Cisco TAC to try to figure out why lots of packets were dropped even with CoPP turned off, but as the night progressed, it became pretty clear that our setup (directed broadcast with spoofed source IP addresses!) was so far from anything normal that it really wasn't worth it for what is, ultimately, a minor service.

So, there you go. Three workdays, six different solutions considered and tried, and in the end we were so far into the party that we decided to simply drop it. (This was after spending a day or two staring at packet logs trying to figure out why some versions of Windows Vista couldn't get an IP address from DHCP[3], by the way, so the desire to delve into low-level packet stuff was sort of thin.) And still people complain that we “spent too much time on getting 200 Gbit/sec which we didn't use anyway”. Sheesh. :-)

[1]
[2]
[3]
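And as promised, a rough sketch of the kind of unicast fan-out sender described above. Again, this is not the actual program: the program name, argument layout and the "one destination per line on stdin" format are made up for illustration, and the real thing also knew which linknets and switches to skip.

    /*
     * Rough sketch: build one UDP packet with a spoofed source address and blast
     * it at every unicast destination read from stdin, one IPv4 address per line.
     * Needs root.
     */
    #include <arpa/inet.h>
    #include <netinet/ip.h>
    #include <netinet/udp.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/socket.h>

    int main(int argc, char **argv)
    {
        if (argc != 5) {
            fprintf(stderr, "usage: %s SRC_IP SRC_PORT DST_PORT PAYLOAD < destinations\n", argv[0]);
            return 1;
        }
        const char *payload = argv[4];
        size_t plen = strlen(payload);
        if (plen > 1400) {
            fprintf(stderr, "payload too long\n");
            return 1;
        }

        int fd = socket(AF_INET, SOCK_RAW, IPPROTO_RAW);  /* implies IP_HDRINCL */

        /* Build the IP + UDP packet once; only the destination changes per send. */
        unsigned char pkt[1500];
        memset(pkt, 0, sizeof(pkt));
        struct iphdr *ip = (struct iphdr *)pkt;
        struct udphdr *udp = (struct udphdr *)(pkt + sizeof(*ip));
        memcpy(pkt + sizeof(*ip) + sizeof(*udp), payload, plen);

        ip->version = 4;
        ip->ihl = 5;
        ip->ttl = 64;
        ip->protocol = IPPROTO_UDP;
        ip->tot_len = htons(sizeof(*ip) + sizeof(*udp) + plen);
        ip->saddr = inet_addr(argv[1]);   /* the spoofed source */
        ip->check = 0;                    /* the kernel fills in the IP checksum */

        udp->source = htons(atoi(argv[2]));
        udp->dest = htons(atoi(argv[3]));
        udp->len = htons(sizeof(*udp) + plen);
        udp->check = 0;                   /* optional for UDP over IPv4 */

        char line[64];
        while (fgets(line, sizeof(line), stdin)) {
            line[strcspn(line, "\r\n")] = '\0';

            struct sockaddr_in dst;
            memset(&dst, 0, sizeof(dst));
            dst.sin_family = AF_INET;
            dst.sin_addr.s_addr = inet_addr(line);
            if (dst.sin_addr.s_addr == INADDR_NONE)
                continue;                 /* skip junk lines */

            ip->daddr = dst.sin_addr.s_addr;
            sendto(fd, pkt, sizeof(*ip) + sizeof(*udp) + plen, 0,
                   (struct sockaddr *)&dst, sizeof(dst));
        }
        return 0;
    }

You would then run it as root with something like ./fanout 10.1.2.3 6112 6112 "$payload" < hosts.txt, where every value there is a made-up example.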