[NLNOG] Curious problem with connections from Ziggo customers to Linode nodes in some data centers

Stefan van den Oord stefan+nlnog at medicinemen.eu
Thu Aug 24 16:43:28 CEST 2023


Thanks for looking further into this Boudewijn.

> 
> On 24 Aug 2023, at 13:02, Boudewijn Visser (nlnog) <bvisser-nlnog at xs4all.nl> wrote:
> 
> 
> I've had a look at your packet capture.
> It doesn't seem to be an MTU issue.

That’s good to know.

> Filtering for the traffic captured on the server side:
> (ip.src_host == 192.46.232.6 && ip.dst_host == 84.28.119.251 ) ||( ip.dst_host == 192.46.232.6 && ip.src_host == 84.28.119.251 )
> 
> So it seems your Ziggo public IP is 84.28.119.251.
> And filtering the capture taken on the inside (client) side:
> (ip.src_host == 192.46.232.6 && ip.dst_host == 192.168.0.107 ) ||( ip.dst_host == 192.46.232.6 && ip.src_host == 192.168.0.107 )
> 
> I see an OK session using source port 50006, and then a session that seems to have severe packet loss issues with source port 50007.
> 
> See all the TCP retransmissions for the source-port 50007 session - only rarely does a packet get through.
> 
> If you can still use this client (same public IP), try
> curl --local-port 50006 http://192.46.232.6
> curl --local-port 50007 http://192.46.232.6
> 
> That should replicate the problem exactly: the first one always OK, the second one always showing major problems.
> Note: expect some socket timeouts when trying multiple times shortly after each other (bind failure, socket already in use).

That appears to be correct. What does that mean, though?
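
In case it helps with further mapping, this is the kind of sweep I could run from the same client to see which local ports fail (just a rough sketch; the 50000-50020 range, the 5-second timeout and the pause are arbitrary choices on my part):

for p in $(seq 50000 50020); do
  # --local-port pins the source port so each attempt exercises one specific flow
  if curl --silent --max-time 5 --local-port "$p" --output /dev/null http://192.46.232.6; then
    echo "port $p: OK"
  else
    echo "port $p: FAILED"
  fi
  sleep 2   # give the previous socket some time to clear, to avoid bind failures
done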

> And - the specific local port that fails or works very likely also depends on the client source IP.
> 
> Sabri's suggestion of tcp-traceroute is also valuable.

$ tcptraceroute -i en1 192.46.232.6 80
Selected device en1, address 192.168.0.107, port 50296 for outgoing packets
Tracing the path to 192.46.232.6 on TCP port 80 (http), 30 hops max
 1  192.168.178.1  3.926 ms  3.567 ms  3.542 ms
 2  * * *
 3  hvs-rc0002-cr102-et99-251.core.as33915.net (213.51.196.61)  32.259 ms  20.733 ms  26.945 ms
 4  asd-tr0021-cr101-be155-10.core.as9143.net (213.51.158.110)  15.686 ms  20.290 ms  13.737 ms
 5  nl-ams14a-ri1-ae51-0.core.as9143.net (213.51.64.186)  16.997 ms  18.788 ms  20.759 ms
 6  be3065.ccr41.ams03.atlas.cogentco.com (130.117.14.1)  37.960 ms  18.725 ms  20.228 ms
 7  be2813.ccr41.fra03.atlas.cogentco.com (130.117.0.122)  36.390 ms  18.615 ms  33.758 ms
 8  be2501.rcr21.b015749-1.fra03.atlas.cogentco.com (154.54.39.178)  35.337 ms  18.712 ms  23.170 ms
 9  204.130.243.21  21.488 ms  19.246 ms  27.173 ms
10  * * *
11  * * *
12  * * *
13  192-46-232-6.ip.linodeusercontent.com (192.46.232.6) [open]  32.722 ms  23.759 ms *

Just dumping the information here, not sure if this provides insight.
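
If it would help, I could also try pinning the traceroute to the "good" and "bad" source ports and compare where the two paths diverge. If I read the man page correctly, -p sets the local source port on the classic tcptraceroute (on Linux, traceroute -T --sport=... should do roughly the same); please correct me if that option means something else on other builds:

$ tcptraceroute -i en1 -p 50006 192.46.232.6 80   # the flow that works
$ tcptraceroute -i en1 -p 50007 192.46.232.6 80   # the flow that fails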

> (Normally, traceroute is done using UDP (classic Unix, Cisco) or ICMP - but it can be done with TCP too.)
> With some luck, tcp-traceroute may give a hint about the node or path where the failure starts.
> 
> I've done a quick test (I happen to be behind Ziggo at the moment) but a tcp traceroute isn't too conclusive.
> Generally, load balancing within a network is deterministic - based on the ip/port combination, for example.
> 
> IMO, the whole problem still looks like a network link that has severe issues (probably corrupting a large number of packets, which are then dropped at the neighbouring node), with traffic being load balanced over this link.
> So some session flows are impacted and others are not.
> 
> Since it seems limited to Ziggo clients, it would likely be somewhere in the Ziggo network.
> Something at an exchange point is a more remote possibility - depending on which (other) destinations are impacted, it might simply not have been noticed.
> 
> (One caveat: NAT in the Ziggo modem may change the source port, especially with repeated tests.)
> 
> I think that to get anything more, it will take a fairly senior Ziggo network engineer to investigate further.

That’s what I’m afraid of as well, and Ziggo customer support doesn’t even want to (or doesn’t have the expertise to?) talk about the issue, I guess.
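
If I understand the load-balancing point above correctly, the idea is that routers pick one of several parallel links by hashing the flow's addresses and ports, so the same source port always lands on the same (possibly broken) link. A toy illustration of that idea - certainly not Ziggo's actual hash:

flow="84.28.119.251:50007 -> 192.46.232.6:80/tcp"
h=$(printf '%s' "$flow" | cksum | cut -d' ' -f1)   # any deterministic hash of the flow identifiers
echo "link $((h % 4))"   # with e.g. 4 equal-cost links, the same flow picks the same link every time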

My lack of deep networking expertise prevents me from understanding and diagnosing this issue properly. I have installed a workaround by introducing a proxy, so that all our traffic now reaches the Frankfurt server via London. This way our customers can at least use our service again. It would be awesome if someone with more understanding than I have could pick this up, “for the greater good”, and put it to the right people, but I find myself forced to switch priorities now.
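
For anyone curious what the workaround looks like: in essence it is just a TCP relay on a node outside the affected path (London in our case). The general shape is something like the socat line below; the port is only an example and our actual setup differs in the details:

# on the London relay: accept client connections and forward them to the Frankfurt Linode
socat TCP-LISTEN:443,fork,reuseaddr TCP:192.46.232.6:443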

Thanks again for your assistance (all of you)!

