[NLNOG] Curious problem with connections from Ziggo customers to Linode nodes in some data centers
Stefan van den Oord
stefan+nlnog at medicinemen.eu
Fri Aug 25 17:12:08 CEST 2023
Hi Sabri,
My colleague provided the results of your suggestion:
10000: TCP connection attempts: 1 | Successful connections: 0 | Failed: 1 (100.00%)
10001: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10002: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10003: TCP connection attempts: 1 | Successful connections: 0 | Failed: 1 (100.00%)
10004: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10005: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10006: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10007: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10008: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10009: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10010: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10011: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10012: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10013: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10014: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10015: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10016: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10017: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10018: TCP connection attempts: 1 | Successful connections: 0 | Failed: 1 (100.00%)
10019: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10020: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10021: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10022: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10023: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10024: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10025: TCP connection attempts: 1 | Successful connections: 0 | Failed: 1 (100.00%)
10026: TCP connection attempts: 1 | Successful connections: 0 | Failed: 1 (100.00%)
10027: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10028: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
[etc]
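In this excerpt, 5 of the 29 source ports shown failed (10000, 10003, 10018, 10025 and 10026). To pull the failing ports out of the full log, something like the following minimal Python sketch might do. It assumes the exact line format shown above; the names `SAMPLE`, `LINE_RE` and `failing_ports` are made up for illustration, and `SAMPLE` holds only the first four lines of the output.

```python
import re

# First four lines of the output above; in practice, read the whole log file.
SAMPLE = """\
10000: TCP connection attempts: 1 | Successful connections: 0 | Failed: 1 (100.00%)
10001: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10002: TCP connection attempts: 1 | Successful connections: 1 | Failed: 0 (0.00%)
10003: TCP connection attempts: 1 | Successful connections: 0 | Failed: 1 (100.00%)
"""

# "<source port>: TCP connection attempts: N | Successful connections: N | Failed: N"
LINE_RE = re.compile(
    r"^(\d+): TCP connection attempts: (\d+) \| "
    r"Successful connections: (\d+) \| Failed: (\d+)"
)


def failing_ports(text):
    """Return (tested_ports, failed_ports) parsed from the loop's output."""
    tested, failed = [], []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a result line
        port = int(match.group(1))
        tested.append(port)
        if int(match.group(4)) > 0:
            failed.append(port)
    return tested, failed


tested, failed = failing_ports(SAMPLE)
print(f"{len(failed)}/{len(tested)} source ports failed: {failed}")
```

A list like this makes it easy to pick source ports for the traceroutes and to check whether the same ports fail again on a later run.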
Traceroute on failed port:
traceroute -n -T --sport=10042 172.104.202.142 80
traceroute to 172.104.202.142 (172.104.202.142), 30 hops max, 60 byte packets
1 * 192.168.178.1 1.764 ms 1.614 ms
2 * * *
3 * * *
4 213.51.158.110 8.425 ms 8.810 ms 16.566 ms
5 213.51.64.186 10.506 ms * *
6 * * *
7 * * *
8 * 154.54.39.178 27.068 ms *
9 * 204.68.252.42 28.244 ms *
10 * * *
11 * * *
12 * * *
13 * * *
14 * * *
15 172.104.202.142 24.962 ms 1042.669 ms 23.458 ms
Traceroute on successful port:
traceroute -n -T --sport=10047 172.104.202.142 80
traceroute to 172.104.202.142 (172.104.202.142), 30 hops max, 60 byte packets
1 192.168.178.1 2.580 ms * 2.081 ms
2 * * *
3 213.51.196.61 10.775 ms * 15.674 ms
4 213.51.158.110 12.520 ms * *
5 213.51.64.186 14.735 ms 13.270 ms 13.815 ms
6 * * 130.117.14.1 14.223 ms
7 * * *
8 154.54.39.178 19.727 ms 21.422 ms *
9 * * *
10 * * *
11 * * *
12 * * *
13 172.104.202.142 23.457 ms 23.198 ms 21.411 ms
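To compare the two traces a bit more systematically, here is a rough Python sketch that lists which hop addresses answered in each trace and at which hop the destination replied. It only includes the hops above that returned at least one answer; the helper names (`parse_hops`, `first_hop_with`, `responders`) are hypothetical.

```python
import re

DEST = "172.104.202.142"

# Responding hops only, copied from the traceroutes above (all-star hops omitted).
FAILED_TRACE = """\
 1 * 192.168.178.1 1.764 ms 1.614 ms
 4 213.51.158.110 8.425 ms 8.810 ms 16.566 ms
 5 213.51.64.186 10.506 ms * *
 8 * 154.54.39.178 27.068 ms *
 9 * 204.68.252.42 28.244 ms *
15 172.104.202.142 24.962 ms 1042.669 ms 23.458 ms
"""

OK_TRACE = """\
 1 192.168.178.1 2.580 ms * 2.081 ms
 3 213.51.196.61 10.775 ms * 15.674 ms
 4 213.51.158.110 12.520 ms * *
 5 213.51.64.186 14.735 ms 13.270 ms 13.815 ms
 6 * * 130.117.14.1 14.223 ms
 8 154.54.39.178 19.727 ms 21.422 ms *
13 172.104.202.142 23.457 ms 23.198 ms 21.411 ms
"""

IP_RE = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")


def parse_hops(text):
    """Map hop number -> set of IPv4 addresses that answered at that hop."""
    hops = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts or not parts[0].isdigit():
            continue
        hops[int(parts[0])] = set(IP_RE.findall(line))
    return hops


def first_hop_with(hops, ip):
    """Lowest hop number at which `ip` answered, or None."""
    hits = [n for n, ips in hops.items() if ip in ips]
    return min(hits) if hits else None


def responders(hops):
    """All addresses that answered anywhere in the trace."""
    return set().union(*hops.values())


failed_hops, ok_hops = parse_hops(FAILED_TRACE), parse_hops(OK_TRACE)
print("destination hop (failing port):", first_hop_with(failed_hops, DEST))
print("destination hop (working port):", first_hop_with(ok_hops, DEST))
print("only in failing trace:", sorted(responders(failed_hops) - responders(ok_hops)))
print("only in working trace:", sorted(responders(ok_hops) - responders(failed_hops)))
```

Hops that show only stars may simply not have answered, so an address missing from one trace does not prove a different path. The destination answering at hop 15 on the failing port versus hop 13 on the working one does suggest that the two source ports take different paths.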
--
Stefan van den Oord
CTO @ Medicine Men B.V.
Not in the office on Wednesdays
Regulierenring 22
3981 LB Bunnik
The Netherlands
+31 85 1307020
OpenPGP Key <https://keys.openpgp.org/vks/v1/by-fingerprint/676D4DA93671DFA4CA44D13D6232002ADF3E0504>
> On 23 Aug 2023, at 20:59, Sabri Berisha <sabri at cluecentral.net> wrote:
>
> Hi,
>
> In the past I've used nping to troubleshoot (L3) ECMP issues. For example:
>
> for i in `seq 10000 60000` ; do /bin/echo -n "$i: " ; nping -c 1 -g $i --tcp-connect -p 80 --dest-ip 172.104.202.142 | grep Failed ; done
>
> If you get a failure, use: traceroute -n -T --sport=<whatever source port is failing> 172.104.202.142 80
>
> This will not work very well on link aggregation; it's most effective on IP ECMP.
>
> Thanks,
>
> Sabri
>
>
> ----- On Aug 23, 2023, at 6:30 AM, Boudewijn Visser (nlnog) <bvisser-nlnog at xs4all.nl> wrote:
> Hi Stefan,
>
> Some guesses for a possible cause :
>
> An MTU issue somewhere in this path, possibly limited to one member of a link bundle.
> Given that you see this limited to Ziggo users, likely within the Ziggo network.
>
> You might try to limit the MTU on your server to something like 1420 bytes.
> If that fixes the problem, you have a clear indication that the MTU is the cause.
>
> Normally path MTU discovery should work (not to mention the expectation that the whole Internet handles packets up to 1500 bytes), but path MTU discovery assumes that routers accurately know the MTU of the links to their neighbours.
> If a link between two routers drops packets smaller than the MTU configured on the router interfaces, this is very hard to detect: the link becomes a black hole for oversized packets, with no alert whatsoever.
>
> It gets worse when "the" link between two routers is a bundle and one member of this bundle has errors or does not support the full MTU size.
> Usually traffic is balanced across the member links based on a hash of source IP, destination IP and source/destination ports.
> That means the same client and the same destination may, depending on 'chance' (the source port here), hit the problem link or not.
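As a toy model of the hashing described here, the Python sketch below picks a member link from the flow's 5-tuple. Everything in it is an assumption for illustration: real routers use vendor-specific hash functions and fields, CRC32 merely stands in for them, the bundle size of 4 and the "bad" member index are invented, and 192.0.2.10 is a documentation address used as a placeholder client.

```python
import zlib

MEMBERS = 4      # assumed number of links in the bundle / ECMP group
BAD_MEMBER = 2   # hypothetical member that silently drops (large) packets
TCP = 6          # IP protocol number for TCP


def member_for(src_ip, dst_ip, proto, sport, dport, members=MEMBERS):
    """Pick a member link from a hash of the 5-tuple (CRC32 as a stand-in)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
    return zlib.crc32(key) % members


def failing_sports(src_ip, dst_ip, dport, sports):
    """Source ports whose flow would land on the bad member in this model."""
    return [p for p in sports
            if member_for(src_ip, dst_ip, TCP, p, dport) == BAD_MEMBER]


ports = range(10000, 10029)
bad = failing_sports("192.0.2.10", "172.104.202.142", 80, ports)
print(f"{len(bad)} of {len(ports)} source ports hash onto the bad member: {bad}")
```

In this model a given source port always lands on the same member, so the same ports fail on every run while others never do. That is the pattern a per-source-port test is meant to expose.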
>
> Make sure your server logging captures the source port as well as the source IP.
>
> If you have a knowledgeable 'friendly user' that has the problem on Ziggo and that you can work with for troubleshooting I suggest a packet capture of their traffic (and 'all icmp' ) on your end, and ideally also on the user side.
>
> You want to capture 'all icmp' (not filtered on source IP), as any path MTU (mtu too big) ICMP messages are sourced from "some router IP along the path".
>
> Also helpful to do a full traceroute from both ends - as traffic may flow differently on the forward and return path .
>
> Do you have any indication it is "all Ziggo" , or perhaps limited to some IP ranges from Ziggo ?
>
> You can use ping to find out manually which MTU is allowed towards clients, and vice versa.
> [I always need to think and double-check whether the size argument is the payload or the full IP packet, and how big the Ethernet headers are again.]
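On the size question: with Linux iputils ping, `-s` is the ICMP payload size and `-M do` sets the don't-fragment bit, so the IPv4 and ICMP headers have to be subtracted from the MTU; the Ethernet header does not count towards the IP MTU. A small Python sketch of the arithmetic, assuming IPv4 without options and TCP without options (function names are made up):

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header
TCP_HEADER = 20   # bytes, assuming no TCP options


def ping_payload_for_mtu(mtu):
    """Largest `ping -s` value that still fits in one packet of size `mtu`."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def tcp_mss_for_mtu(mtu):
    """Largest TCP segment payload that fits in one packet of size `mtu`."""
    return mtu - IPV4_HEADER - TCP_HEADER


for mtu in (1500, 1420):
    print(f"MTU {mtu}: ping -M do -s {ping_payload_for_mtu(mtu)}, "
          f"TCP MSS {tcp_mss_for_mtu(mtu)}")
# MTU 1500: ping -M do -s 1472, TCP MSS 1460
# MTU 1420: ping -M do -s 1392, TCP MSS 1380
```

So `ping -M do -s 1472 <host>` probes a full 1500-byte path, and stepping the size down until replies come back gives a rough idea of where packets start to disappear.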
>
> Best regards, Boudewijn
>
> Op 23-08-2023 14:18 CEST schreef Stefan van den Oord <stefan+nlnog at medicinemen.eu>:
>
>
> Dear NLNOG community,
>
> I’d like to present to you a problem that we’re experiencing. To us it is very strange, we are out of ideas. Of course solutions would be much appreciated, but also ways of diagnosing this and work-arounds are very much appreciated.
>
> Background: we’re a small Dutch company developing the Viduet platform: a platform to help chronically ill patients better manage their wellbeing together with their care providers. Our product is web-based and we also have a mobile app.
>
> The problem: for almost two weeks we have been getting reports from users who sometimes (!) get connection timeouts in their browsers/apps when they connect to our web platform. We have narrowed down the potential sources of the issue and found a small setup that reproduces it:
> We set up a clean Linode node in Frankfurt (smallest type, shared CPU) with Apache (just `apt install apache2`). Requesting the default Apache index.html using `curl -i http://172.104.202.142` (or http://172.104.202.142/large.html) times out in more than 1 in 10 tries for some users, who have in common that they are all customers of the Ziggo internet provider.
> Doing the same with a Linode node in Paris has the same result.
> Doing the same with a Linode node in London works as you would expect, so does not have this strange behaviour.
> Running `mtr -rwzbc100 172.104.202.142` (the Frankfurt node mentioned above) shows no packet loss, nothing out of the ordinary.
> Running nginx instead of apache makes no difference: same issue
> Forcing apache to use http/1.0 makes no difference: same issue (but frankly I don’t think it goes wrong on this level of the protocol stack)
> There seems to be a relationship with the content length of the HTTP response. For shorter HTML files like the index.html on the Frankfurt node, it sometimes times out, and sometimes succeeds after a delay of a few seconds (and sometimes returns as fast as you’d expect). Using slightly larger HTML files it just times out.
> I’m hesitating whether this is relevant at all, but when using HTTPS instead of HTTP, the problem also manifests in TLS handshake errors.
>
> We have been in touch with Linode support and they found nothing out of the ordinary on their side. They point to the Ziggo network, saying:
>
> The reason that the issue only exists to the Frankfurt data center and not London is likely because the problem is related to the particular route that the traffic takes from one place to the other. While the MTRs you shared look good, I did find evidence in another ticket of a particular hop within Ziggo's network showing issues, but we can't say for sure what the issue is. Here is the information about the hop in case that helps with your communication with Ziggo:
> AS33915 asd-tr0021-cr101-be64.core.as9143.net (213.51.64.193)
> Again, any help is much appreciated!
>
> Kind regards,
>
> —
> Stefan van den Oord
> CTO @ Medicine Men B.V.
>
> Not in the office on Wednesdays
>
> Regulierenring 22
> 3981 LB Bunnik
> The Netherlands
> +31 85 1307020
>
> _______________________________________________
> NLNOG mailing list
> NLNOG at nlnog.net
> http://mailman.nlnog.net/listinfo/nlnog
>