Bizarre packet filtering on OVH Kimsufi server
Context
I am leasing a Kimsufi dedicated server from OVH,
ks3269175.kimsufi.com
aka 5.39.82.72
. Since early
January 2015, TCP connections to that machine (and in
particular SSH connections) are sporadically hanging.
Analysis of the issue
This machine is on a network whose default router
is 5.39.82.254
(vss-gw-6k.fr.eu
).
As far as I was able to determine, this is a virtual router, load balanced using GLBP . The two actual routers are:
vss-9b-6k.fr.eu
, MAC address00:07:b4:00:01:01
vss-9a-6k.fr.eu
, MAC address00:07:b4:00:01:02
When attempting to establish an SSH connection from the outside
to that machine, the first data packet in the connection
appears to be dropped if sent through 00:07:b4:00:01:01
.
This does not appear to be related to any kind of stateful firewalling system. As an experiment, I wrote a simple Scapy script that loops sending identical TCP segments, one per second, through both of the above MAC addresses, to a remote address outside OVH.
A tcpdump on the dedicated server shows the stream of outgoing packets:
23:04:15.390421 00:22:4d:83:36:80 > 00:07:b4:00:01:01, ethertype IPv4 (0x0800), length 115: 5.39.82.72.2122 > 194.98.77.4.60347: Flags [P.], seq 0:49, ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:04:15.406734 00:22:4d:83:36:80 > 00:07:b4:00:01:02, ethertype IPv4 (0x0800), length 115: 5.39.82.72.2222 > 194.98.77.4.60347: Flags [P.], seq 0:49, ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:04:16.424437 00:22:4d:83:36:80 > 00:07:b4:00:01:01, ethertype IPv4 (0x0800), length 115: 5.39.82.72.2122 > 194.98.77.4.60348: Flags [P.], seq 0:49, ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:04:16.441460 00:22:4d:83:36:80 > 00:07:b4:00:01:02, ethertype IPv4 (0x0800), length 115: 5.39.82.72.2222 > 194.98.77.4.60348: Flags [P.], seq 0:49, ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:04:17.459641 00:22:4d:83:36:80 > 00:07:b4:00:01:01, ethertype IPv4 (0x0800), length 115: 5.39.82.72.2122 > 194.98.77.4.60349: Flags [P.], seq 0:49, ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:04:17.476966 00:22:4d:83:36:80 > 00:07:b4:00:01:02, ethertype IPv4 (0x0800), length 115: 5.39.82.72.2222 > 194.98.77.4.60349: Flags [P.], seq 0:49, ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
Now observing traffic on the remote machine, we see only those packets that went through 00:07:b4:00:01:02:
23:05:13.322004 IP 5.39.82.72.2222 > 194.98.77.4.60403: Flags [P.], ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:05:14.355176 IP 5.39.82.72.2222 > 194.98.77.4.60404: Flags [P.], ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:05:15.390245 IP 5.39.82.72.2222 > 194.98.77.4.60405: Flags [P.], ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:05:16.426968 IP 5.39.82.72.2222 > 194.98.77.4.60406: Flags [P.], ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:05:17.456869 IP 5.39.82.72.2222 > 194.98.77.4.60407: Flags [P.], ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
23:05:18.494918 IP 5.39.82.72.2222 > 194.98.77.4.60408: Flags [P.], ack 0, win 8192, options [TS val 123 ecr 456,eol], length 49
Resolution
None so far. OVH has been notified of the problem (TICKET#2015010719008317) and all analysis elements in my possession have been conveyed to them, to no avail so far: the machine has been essentially unusable for the past two months and counting.
Update 2015-03-20 OVH say they identified a problem and are working on a fix. No more SSH failure observed since 2015-03-18 in the afternoon, so apparently the fix did work. Still waiting for a post-mortem explanation as to what went wrong, and why it took them so long to ackonwledge, investigate, and resolve the problem.
Update 2015-04-01 Service remains stable, in that failures are not observed anymore. OVH indicates they are still discussing the underlying issue with Cisco, and the fix is not completed yet.
Update 2015-06-17 Service remains stable. OVH indicates that they have identified the origin of the issue, a fix is available, and they have scheduled its deployment.
Update 2015-09-17 At long last, OVH confirmed that the problem is indeed resolved on their side, and agreed to extend my subscription by 3 months at no cost in compensation, which is decent of them.
Acknowledgements
Thanks to Pierre Beyssac for hinting at GLBP, and to Fabian at OVH support for following up internally on this issue.