[Nlnog] Re: PAIX Outages (fwd)

Sten Spans sten at blinkenlights.nl
Thu Apr 28 21:45:06 UTC 2005

Extreme is on a roll lately :)

Sten Spans

"There is a crack in everything, that's how the light gets in."
Leonard Cohen - Anthem

---------- Forwarded message ----------
Date: Thu, 28 Apr 2005 13:51:54 -0400
From: Richard A Steenbergen <ras at e-gerbil.net>
To: Jay Patel <jpatel70 at gmail.com>
Cc: nanog at merit.edu
Subject: Re: PAIX Outages

On Wed, Apr 27, 2005 at 10:45:15AM -0400, Jay Patel wrote:
> I have heard rumors that S&D has been having persistent switch
> problems with their switches at PAIX (Palo Alto), and I was kind of
> wondering if anyone actually cared?

Personally I tend to suspect the general lack of uproar is a rather
unfortunate (for them) sign that PAIX is no longer relevant when it comes
to critical backbone infrastructures.

It looks like different folks have been seeing different levels of outages
depending upon which switch/card they are connected to, but I haven't been
able to find anyone who has seen fewer than 30 hits between April 16th and
the two this morning. Our ports have seen just under 28 hours of total
downtime so far this month, while some lucky people have only seen around
6 hours.

I'm not sure if anyone at S&D or Extreme actually has any real idea what
the problem is with these current switches, but given this amount of
downtime, they should have replaced every last component by now. If Extreme
can't fix them, there should be a pile of Black Diamonds sitting on the
curb waiting for trash day. In fact, 9/10ths of the way through writing
this e-mail, I got a call from S&D stating that they are doing exactly
that. :)

In the meantime, here are some of the more interesting snippets of what
has been tried on the current switches:

16 Apr 2005 20:19:53 GMT
We are currently experiencing some problems with 2 network cards in our
Palo Alto peering switch. This might be causing possible service
degradations. Switch Engineers are expecting new cards to replace the 2
suspected faulty network cards. These cards should be arriving in or
around 1 hour. Right after the cards arrive, we will be scheduling an
emergency maintenance window to get these cards replaced.

19 Apr 2005 14:16:07 GMT
The Purpose of this Emergency Maintenance window is for Switch Engineers
to replace a faulty processor module card affecting the Bay Area Peering
customers. The estimated down time will be 15 minutes.
(Actual downtime several hours)

19 Apr 2005 19:27:49 GMT
This is the final update regarding the problems experienced today with the
peering fabric. Our Switch Engineers corrected the problems during the
emergency maintenance window by replacing two line cards and 2 processor
cards in the Palo Alto switch. All peering sessions should be restored at
this time.

22 Apr 2005 21:56:15 GMT
The purpose of this emergency maintenance window is for engineers to
replace defective power supply units on the Paix Switch. No impact to your
services is expected.

24 Apr 2005 21:25:48 GMT
Our Switch Engineers will be conducting an emergency processor card
replacement at the Palo Alto site. The expected downtime while this
maintenance is being conducted will be 2 hours.

24 Apr 2005 21:36:18 GMT
Our Switch Engineers will be conducting an emergency chassis replacement
at the Palo Alto site. The expected downtime while this maintenance is
being conducted will be 3 hours.

25 Apr 2005 19:17:41 GMT
Our engineers have escalated the problems with the peering switch in Palo
Alto to 3rd level support at Extreme, the switch vendor. More details will
follow as they become available.

26 Apr 2005 03:00:34 GMT
Our Switch Engineers have advised us that the switch has been migrated to
a different power bus to rule out any power variables. Power is being
monitored for the next 24 hours.

28 Apr 2005 13:33:05 GMT
At approximately 6:05 AM local time, the peering switch rebooted itself.
Our switch engineers are investigating this issue and believe all sessions
are back to normal at this time. More details will be provided as they
become available.

When I see a stable switching platform going forward, and some service
credits for the massive outages we've all endured so far, I'll probably be
a lot less cranky about the entire situation. Until then I have to say, if
they keep this up they are going to need to change their name to "Switch
or Data".

Oh well, at least this didn't happen during the S&D sponsored NANOG. :)

Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)
