Context

I am leasing a Kimsufi dedicated server from OVH, ks3269175.kimsufi.com aka 5.39.82.72. Since early January 2015, TCP connections to that machine (and in particular SSH connections) are sporadically hanging.

Analysis of the issue

This machine is on a network whose default router is 5.39.82.254 (vss-gw-6k.fr.eu).

As far as I was able to determine, this is a virtual router, load balanced using GLBP . The two actual routers are:

  • vss-9b-6k.fr.eu, MAC address 00:07:b4:00:01:01
  • vss-9a-6k.fr.eu, MAC address 00:07:b4:00:01:02

When attempting to establish an SSH connection from the outside to that machine, the first data packet in the connection appears to be dropped if sent through 00:07:b4:00:01:01.

This does not appear to be related to any kind of stateful firewalling system. As an experiment, I wrote a simple Scapy script that loops sending identical TCP segments, one per second, through both of the above MAC addresses, to a remote address outside OVH.

A tcpdump on the dedicated server shows the stream of outgoing packets:

23:04:15.390421 00:22:4d:83:36:80 > 00:07:b4:00:01:01, ethertype IPv4 (0x0800), length 115: 5.39.82.72.2122 > 194.98.77.4.60347: Flags [P.], seq 0:49, ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:04:15.406734 00:22:4d:83:36:80 > 00:07:b4:00:01:02, ethertype IPv4 (0x0800), length 115: 5.39.82.72.2222 > 194.98.77.4.60347: Flags [P.], seq 0:49, ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:04:16.424437 00:22:4d:83:36:80 > 00:07:b4:00:01:01, ethertype IPv4 (0x0800), length 115: 5.39.82.72.2122 > 194.98.77.4.60348: Flags [P.], seq 0:49, ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:04:16.441460 00:22:4d:83:36:80 > 00:07:b4:00:01:02, ethertype IPv4 (0x0800), length 115: 5.39.82.72.2222 > 194.98.77.4.60348: Flags [P.], seq 0:49, ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:04:17.459641 00:22:4d:83:36:80 > 00:07:b4:00:01:01, ethertype IPv4 (0x0800), length 115: 5.39.82.72.2122 > 194.98.77.4.60349: Flags [P.], seq 0:49, ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:04:17.476966 00:22:4d:83:36:80 > 00:07:b4:00:01:02, ethertype IPv4 (0x0800), length 115: 5.39.82.72.2222 > 194.98.77.4.60349: Flags [P.], seq 0:49, ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49

Now observing traffic on the remote machine, we see only those packets that went through 00:07:b4:00:01:02:

23:05:13.322004 IP 5.39.82.72.2222 > 194.98.77.4.60403: Flags [P.], ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:05:14.355176 IP 5.39.82.72.2222 > 194.98.77.4.60404: Flags [P.], ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:05:15.390245 IP 5.39.82.72.2222 > 194.98.77.4.60405: Flags [P.], ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:05:16.426968 IP 5.39.82.72.2222 > 194.98.77.4.60406: Flags [P.], ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:05:17.456869 IP 5.39.82.72.2222 > 194.98.77.4.60407: Flags [P.], ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:05:18.494918 IP 5.39.82.72.2222 > 194.98.77.4.60408: Flags [P.], ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49

Resolution

None so far. OVH has been notified of the problem (TICKET#2015010719008317) and all analysis elements in my possession have been conveyed to them, to no avail so far: the machine has been essentially unusable for the past two months and counting.

Update 2015-03-20 OVH say they identified a problem and are working on a fix. No more SSH failure observed since 2015-03-18 in the afternoon, so apparently the fix did work. Still waiting for a post-mortem explanation as to what went wrong, and why it took them so long to ackonwledge, investigate, and resolve the problem.

Update 2015-04-01 Service remains stable, in that failures are not observed anymore. OVH indicates they are still discussing the underlying issue with Cisco, and the fix is not completed yet.

Update 2015-06-17 Service remains stable. OVH indicates that they have identified the origin of the issue, a fix is available, and they have scheduled its deployment.

Update 2015-09-17 At long last, OVH confirmed that the problem is indeed resolved on their side, and agreed to extend my subscription by 3 months at no cost in compensation, which is decent of them.

Acknowledgements

Thanks to Pierre Beyssac for hinting at GLBP, and to Fabian at OVH support for following up internally on this issue.