[NLNOG] Curious problem with connections from Ziggo customers to Linode nodes in some data centers

Thu Aug 24 08:01:12 CEST 2023

Thanks Boudewijn!

There was a lively conversation about this on #nlnog yesterday, so I forgot to respond to you. I tried changing the MTU to 1420, but that didn’t make a difference. I also did a packet capture, between server 192.46.232.6 and client 192.168.0.107. The command used on the server was:

tcpdump -Aennvvi eth0 -w server.pcap port not 22

And on the client (because I was connected through VNC):

 sudo tcpdump -Aennvvi en1 -w client.pcap port not 22 and port not 5900

During this capture I made two requests (using curl) to http://192.46.232.6/; the first one succeeded and the second one I aborted after half a minute. The result is here: http://192.46.232.6/client+server.pcap

I lack the experience to analyse this properly. Do you see any clues in it?

-- 
Stefan van den Oord
CTO @ Medicine Men B.V.

Not in the office on Wednesdays

Regulierenring 22
3981 LB Bunnik
The Netherlands

> On 23 Aug 2023, at 15:30, Boudewijn Visser (nlnog) <bvisser-nlnog at xs4all.nl> wrote:
> 
> Hi Stefan,
>  
> Some guesses at a possible cause:
>  
> An MTU issue somewhere in this path, possibly limited to one member of a link bundle somewhere in the path.
> Given that you see this limited to Ziggo users, likely within the Ziggo network.
>  
> You might try to limit the MTU on your server to something like 1420 bytes.
> If that fixes the problem you have a clear indication that this is the problem.
>  
> Normally path MTU discovery should work (not to mention the expectation that the whole internet is OK up to 1500 bytes), but path MTU discovery assumes that routers accurately know the MTU of the links to their neighbours.
> If a link between two routers drops packets that are smaller than the MTU configured on the router interfaces, this behaviour is very hard to detect: the link has become a black hole for oversized packets, with no alert whatsoever.
>  
> It gets worse when "the" link between two routers is a bundle and one member of this bundle has errors or perhaps does not support the full MTU size.
> Usually traffic is balanced across the link members based on a hash of source IP, destination IP and source/destination ports.
> That means the same client and the same destination may, depending on 'chance' (the source port here), encounter the problem link or not.
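
To illustrate the hashing described above, here is a minimal sketch. It assumes a bundle of four members and uses CRC32 as a stand-in hash; real routers use vendor-specific hash functions and inputs, so the member chosen here says nothing about any actual network. The client address, the bundle size and the "bad" member index are all hypothetical.

```python
import zlib

def pick_member(src_ip: str, dst_ip: str, src_port: int, dst_port: int,
                members: int) -> int:
    """Map a flow to a bundle member index using a stand-in hash (CRC32)."""
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % members

CLIENT = "203.0.113.10"   # hypothetical public client address (documentation range)
SERVER = "192.46.232.6"   # server address from this thread
MEMBERS = 4               # hypothetical bundle size
BAD_MEMBER = 2            # hypothetical member that drops large packets

# Same client and server; only the ephemeral source port changes per connection.
affected_ports = [
    port for port in range(50000, 50100)
    if pick_member(CLIENT, SERVER, port, 80, MEMBERS) == BAD_MEMBER
]
print(f"{len(affected_ports)} of 100 source ports hash onto member {BAD_MEMBER}")
```

A given flow always lands on the same member, which would explain why one connection hangs while the next one, from a different source port, works fine.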
>  
> Make sure that your server logging captures not only the source IP but also the source port.
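
On the client side, the source port can be recorded as well. The following is a rough sketch, not a finished tool: the function names `fetch_once` and `failed_ports` are made up for this example, and it assumes plain HTTP on port 80 as in the curl test.

```python
import socket
import time

def fetch_once(host: str, port: int = 80, path: str = "/",
               timeout: float = 10.0) -> dict:
    """Do one plain HTTP/1.0 GET; report the local source port and the outcome."""
    result = {"src_port": None, "ok": False, "bytes": 0, "seconds": 0.0, "error": ""}
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            # The ephemeral port chosen by the OS for this connection.
            result["src_port"] = sock.getsockname()[1]
            request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
            sock.sendall(request.encode("ascii"))
            # Read until the server closes the connection.
            while chunk := sock.recv(65536):
                result["bytes"] += len(chunk)
        result["ok"] = True
    except OSError as exc:  # timeouts are OSError subclasses too
        result["error"] = type(exc).__name__
    result["seconds"] = round(time.monotonic() - start, 3)
    return result

def failed_ports(results: list) -> list:
    """Source ports of the attempts that connected but did not complete."""
    return sorted(r["src_port"] for r in results
                  if not r["ok"] and r["src_port"] is not None)
```

Calling `fetch_once("192.46.232.6")` a few dozen times and passing the results to `failed_ports` gives a list of source ports that can be matched against the server-side capture. Note that a NAT router on the client side may rewrite the source port, so the port seen by the server can differ.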
>  
> If you have a knowledgeable 'friendly user' who has the problem on Ziggo and whom you can work with for troubleshooting, I suggest a packet capture of their traffic (and 'all ICMP') on your end, and ideally also on the user side.
>  
> You want to capture 'all ICMP' (not filtered on source IP), as any path MTU (MTU too big) ICMP messages are sourced from "some router IP along the path".
>  
> It is also helpful to do a full traceroute from both ends, as traffic may flow differently on the forward and return paths.
>  
> Do you have any indication whether it is "all Ziggo", or perhaps limited to some IP ranges from Ziggo?
>  
> You can use ping to find out manually which MTU is allowed towards clients, and vice versa.
> [I always need to think and double-check whether the size argument is the payload or the full IP packet, and how big the Ethernet headers are again.]
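
For reference, the arithmetic for the ping size can be sketched as follows. It assumes IPv4 without IP options and the Linux iputils `ping`, where `-s` is the ICMP payload size and `-M do` sets the Don't Fragment bit. The Ethernet header (14 bytes) is not part of the IP MTU, so it does not enter the calculation.

```python
IPV4_HEADER = 20  # bytes, IPv4 header without options
ICMP_HEADER = 8   # bytes, ICMP echo header

def ping_payload_for_mtu(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER

def linux_ping_command(host: str, mtu: int) -> str:
    """Build a Linux (iputils) ping command that probes the given MTU with DF set."""
    return f"ping -M do -c 3 -s {ping_payload_for_mtu(mtu)} {host}"

for mtu in (1500, 1492, 1420):
    print(mtu, ping_payload_for_mtu(mtu), linux_ping_command("192.46.232.6", mtu))
```

So a full 1500-byte packet corresponds to `-s 1472`, and an MTU of 1420 to `-s 1392`. On macOS the Don't Fragment bit is set with `-D` instead of `-M do`.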
>  
> Best regards, Boudewijn
>  
>> On 23-08-2023 14:18 CEST, Stefan van den Oord <stefan+nlnog at medicinemen.eu> wrote:
>>  
>>  
>> Dear NLNOG community,
>>  
>> I’d like to present a problem that we’re experiencing. To us it is very strange, and we are out of ideas. Solutions would of course be much appreciated, but so would ways of diagnosing this and work-arounds.
>>  
>> Background: we’re a small Dutch company developing the Viduet platform: a platform to help chronically ill patients better manage their wellbeing together with their care providers. Our product is web-based and we also have a mobile app.
>>  
>> The problem: for almost two weeks we have been getting reports from users who sometimes (!) get connection timeouts in their browsers/apps when they connect to our web platform. We have narrowed down the potential sources of the issue and found a small setup that reproduces it:
>> - We set up a clean Linode node in Frankfurt (smallest type, shared CPU) with Apache (just `apt install apache2`). Requesting the default Apache index.html using `curl -i http://172.104.202.142` (or http://172.104.202.142/large.html) times out in more than 1 in 10 tries for some users, who have in common that they are all customers of the Ziggo internet provider.
>> - Doing the same with a Linode node in Paris gives the same result.
>> - Doing the same with a Linode node in London works as you would expect, so it does not show this strange behaviour.
>> - Running `mtr -rwzbc100 172.104.202.142` (the Frankfurt node mentioned above) shows no packet loss and nothing out of the ordinary.
>> - Running nginx instead of Apache makes no difference: same issue.
>> - Forcing Apache to use HTTP/1.0 makes no difference: same issue (but frankly I don’t think it goes wrong at this level of the protocol stack).
>> - There seems to be a relationship with the content length of the HTTP response. For shorter HTML files like the index.html on the Frankfurt node, it sometimes times out, sometimes succeeds after a delay of a few seconds, and sometimes returns as fast as you’d expect. With slightly larger HTML files it just times out.
>> - I’m not sure whether this is relevant at all, but when using HTTPS instead of HTTP, the problem also manifests as TLS handshake errors.
>>  
>> We have been in touch with Linode support and they found nothing out of the ordinary on their side. They point to the Ziggo network, saying:
>>  
>>> The reason that the issue only exists to the Frankfurt data center and not London is likely because the problem is related to the particular route that the traffic takes from one place to the other. While the MTRs you shared look good, I did find evidence in another ticket of a particular hop within Ziggo's network showing issues, but we can't say for sure what the issue is. Here is the information about the hop in case that helps with your communication with Ziggo:
>>> AS33915  asd-tr0021-cr101-be64.core.as9143.net (213.51.64.193)
>> 
>> Again, any help is much appreciated!
>>  
>> Kind regards,
>>  
>> —  
>> Stefan van den Oord 
>> CTO @ Medicine Men B.V. 
>> 
>> Not in the office on Wednesdays 
>> 
>> Regulierenring 22 
>> 3981 LB Bunnik 
>> The Netherlands 
>> +31 85 1307020
>>  
>> _______________________________________________ 
>> NLNOG mailing list 
>> NLNOG at nlnog.net 
>> http://mailman.nlnog.net/listinfo/nlnog
