
FS#2571 — FS#6533 — general routing

Attached to Project— Network
Incident
Whole Network
CLOSED
100%
We have had a general routing problem.
We are looking for the origin of the problem.

Apparently, a card in one of the two routers in Roubaix began to malfunction without crashing completely. As a result, it isolated part of the network and split it into separate Paris, Roubaix and London parts.

We have powered the card off and are checking the logs to understand how a single card could trigger such a problem.
Date:  Monday, 16 April 2012, 11:17AM
Reason for closing:  Done
Comment by OVH - Wednesday, 28 March 2012, 08:38AM

One of the two main routers in Roubaix, rbx-g1-a9, is down, and the second has a defective card.


Comment by OVH - Wednesday, 28 March 2012, 08:39AM

LC/0/0/CPU0:Mar 28 04:18:20 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab9f8, bit 4294967295, ext info 0x05cab9f8 0x000082d9 0x00000047 0xffffffff, action 0 (Fix)
LC/0/0/CPU0:Mar 28 04:18:20 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab9f8, bit 4294967295, ext info 0x05cab9f8 0x000082d9 0x00000047 0xffffffff, action 0 (Fix)
LC/0/1/CPU0:Mar 28 04:18:20 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab960, bit 4294967295, ext info 0x05cab960 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
LC/0/0/CPU0:Mar 28 04:18:21 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab9f8, bit 4294967295, ext info 0x05cab9f8 0x00


Comment by OVH - Wednesday, 28 March 2012, 08:42AM

Two 24x10G cards on rbx-g1-a9 crashed.
One 24x10G card on rbx-g2-a9 crashed as well.


Comment by OVH - Wednesday, 28 March 2012, 09:00AM

Murphy's law of the problems that never happen.

Something caused the simultaneous failure of cards of the same type in two different routers.
It is a hardware/software bug on the new 24x10G cards of the Cisco ASR 9010. The other cards, the 8x10G ones, remained up.

We have opened a TAC case to request replacement of the three cards that crashed, but we must find the origin of the problem to prevent it from happening again, because with the same hardware and the same software, the same cause will produce the same problem.


Comment by OVH - Wednesday, 28 March 2012, 09:10AM

The problem started at 04:37 on rbx-g2-a9, on card 0/1:

Mar 28 04:37:01 rbx-g2-a9.fr.eu 377642: LC/0/1/CPU0:Mar 28 02:37:04 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7b, bit 4294967295, ext info 0x05c69a7b 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:01 rbx-g2-a9.fr.eu 377643: LC/0/1/CPU0:Mar 28 02:37:04 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7b, bit 4294967295, ext info 0x05c69a7b 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:01 rbx-g2-a9.fr.eu 377644: LC/0/1/CPU0:Mar 28 02:37:04 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7a, bit 4294967295, ext info 0x05c69a7a 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:01 rbx-g2-a9.fr.eu 377645: LC/0/1/CPU0:Mar 28 02:37:04 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7b, bit 4294967295, ext info 0x05c69a7b 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:01 rbx-g2-a9.fr.eu 377646: LC/0/1/CPU0:Mar 28 02:37:04 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7a, bit 4294967295, ext info 0x05c69a7a 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)

One second later, the same problem hit the other router, rbx-g1-a9, on card 0/0:

Mar 28 04:37:02 rbx-g1-a9.fr.eu 8963: LC/0/0/CPU0:Mar 28 02:36:46 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab924, bit 4294967295, ext info 0x05cab924 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:02 rbx-g2-a9.fr.eu 377749: LC/0/1/CPU0:Mar 28 02:37:04 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7b, bit 4294967295, ext info 0x05c69a7b 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:02 rbx-g1-a9.fr.eu 8964: LC/0/0/CPU0:Mar 28 02:36:46 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab924, bit 4294967295, ext info 0x05cab924 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:02 rbx-g2-a9.fr.eu 377750: LC/0/1/CPU0:Mar 28 02:37:04 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7b, bit 4294967295, ext info 0x05c69a7b 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:02 rbx-g2-a9.fr.eu 377751: LC/0/1/CPU0:Mar 28 02:37:04 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7b, bit 4294967295, ext info 0x05c69a7b 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:02 rbx-g1-a9.fr.eu 8965: LC/0/0/CPU0:Mar 28 02:36:46 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab924, bit 4294967295, ext info 0x05cab924 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:02 rbx-g2-a9.fr.eu 377752: LC/0/1/CPU0:Mar 28 02:37:04 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7a, bit 4294967295, ext info 0x05c69a7a 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:02 rbx-g1-a9.fr.eu 8966: LC/0/0/CPU0:Mar 28 02:36:46 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:02 rbx-g2-a9.fr.eu 377753: LC/0/1/CPU0:Mar 28 02:37:04 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7b, bit 4294967295, ext info 0x05c69a7b 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:02 rbx-g2-a9.fr.eu 377754: LC/0/1/CPU0:Mar 28 02:37:04 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7a, bit 4294967295, ext info 0x05c69a7a 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:37:02 rbx-g1-a9.fr.eu 8967: LC/0/0/CPU0:Mar 28 02:36:46 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)

72 seconds later, the problem hit both cards 0/0 and 0/1 on rbx-g2-a9:

Mar 28 04:38:14 rbx-g2-a9.fr.eu 21106: LC/0/1/CPU0:Mar 28 02:37:57 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05caba05, bit 4294967295, ext info 0x05caba05 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:14 rbx-g2-a9.fr.eu 21107: LC/0/0/CPU0:Mar 28 02:37:57 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:14 rbx-g2-a9.fr.eu 21108: LC/0/0/CPU0:Mar 28 02:37:57 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:14 rbx-g2-a9.fr.eu 21109: LC/0/1/CPU0:Mar 28 02:37:57 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05caba05, bit 4294967295, ext info 0x05caba05 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:14 rbx-g2-a9.fr.eu 21110: LC/0/1/CPU0:Mar 28 02:37:57 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05caba04, bit 4294967295, ext info 0x05caba04 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:14 rbx-g2-a9.fr.eu 21111: LC/0/0/CPU0:Mar 28 02:37:57 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:14 rbx-g2-a9.fr.eu 21112: LC/0/0/CPU0:Mar 28 02:37:58 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab924, bit 4294967295, ext info 0x05cab924 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:14 rbx-g2-a9.fr.eu 21113: LC/0/0/CPU0:Mar 28 02:37:58 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab924, bit 4294967295, ext info 0x05cab924 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:15 rbx-g2-a9.fr.eu 21114: LC/0/0/CPU0:Mar 28 02:37:58 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab924, bit 4294967295, ext info 0x05cab924 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:15 rbx-g2-a9.fr.eu 21115: LC/0/1/CPU0:Mar 28 02:37:57 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05caba05, bit 4294967295, ext info 0x05caba05 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:15 rbx-g2-a9.fr.eu 21116: LC/0/1/CPU0:Mar 28 02:37:57 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05caba05, bit 4294967295, ext info 0x05caba05 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:15 rbx-g2-a9.fr.eu 21117: LC/0/0/CPU0:Mar 28 02:37:58 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:15 rbx-g2-a9.fr.eu 21118: LC/0/0/CPU0:Mar 28 02:37:58 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab924, bit 4294967295, ext info 0x05cab924 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:15 rbx-g2-a9.fr.eu 21119: LC/0/1/CPU0:Mar 28 02:37:57 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05caba05, bit 4294967295, ext info 0x05caba05 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:15 rbx-g2-a9.fr.eu 21120: LC/0/0/CPU0:Mar 28 02:37:58 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:15 rbx-g2-a9.fr.eu 21121: LC/0/0/CPU0:Mar 28 02:37:58 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)

Then, 27 seconds later, it hit both cards 0/0 and 0/1 on rbx-g1-a9:

Mar 28 04:38:43 3|rbx-g2-a9 394682: LC/0/1/CPU0:Mar 28 02:38:43 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69abe, bit 4294967295, ext info 0x05c69abe 0x000082d9 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g1-a9 25894: LC/0/0/CPU0:Mar 28 02:38:26 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g2-a9 394683: LC/0/1/CPU0:Mar 28 02:38:43 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69abf, bit 4294967295, ext info 0x05c69abf 0x000082d9 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g1-a9 25895: LC/0/1/CPU0:Mar 28 02:38:26 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05caba04, bit 4294967295, ext info 0x05caba04 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g2-a9 394684: LC/0/1/CPU0:Mar 28 02:38:43 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7a, bit 4294967295, ext info 0x05c69a7a 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g1-a9 25896: LC/0/0/CPU0:Mar 28 02:38:26 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g2-a9 394685: LC/0/1/CPU0:Mar 28 02:38:43 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7b, bit 4294967295, ext info 0x05c69a7b 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g1-a9 25897: LC/0/1/CPU0:Mar 28 02:38:26 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab960, bit 4294967295, ext info 0x05cab960 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g2-a9 394686: LC/0/1/CPU0:Mar 28 02:38:43 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7a, bit 4294967295, ext info 0x05c69a7a 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g1-a9 25898: LC/0/0/CPU0:Mar 28 02:38:26 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab924, bit 4294967295, ext info 0x05cab924 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g2-a9 394687: LC/0/1/CPU0:Mar 28 02:38:43 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7a, bit 4294967295, ext info 0x05c69a7a 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g1-a9 25899: LC/0/1/CPU0:Mar 28 02:38:26 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05caba04, bit 4294967295, ext info 0x05caba04 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g2-a9 394688: LC/0/1/CPU0:Mar 28 02:38:43 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69abf, bit 4294967295, ext info 0x05c69abf 0x000082d9 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g1-a9 25900: LC/0/0/CPU0:Mar 28 02:38:26 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab924, bit 4294967295, ext info 0x05cab924 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g2-a9 394689: LC/0/1/CPU0:Mar 28 02:38:43 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7a, bit 4294967295, ext info 0x05c69a7a 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g1-a9 25901: LC/0/1/CPU0:Mar 28 02:38:26 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05caba04, bit 4294967295, ext info 0x05caba04 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g2-a9 394690: LC/0/1/CPU0:Mar 28 02:38:43 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69abe, bit 4294967295, ext info 0x05c69abe 0x000082d9 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g1-a9 25902: LC/0/0/CPU0:Mar 28 02:38:26 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g2-a9 394691: LC/0/1/CPU0:Mar 28 02:38:43 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69abf, bit 4294967295, ext info 0x05c69abf 0x000082d9 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g1-a9 25903: LC/0/1/CPU0:Mar 28 02:38:26 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 2, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05caba04, bit 4294967295, ext info 0x05caba04 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g2-a9 394692: LC/0/1/CPU0:Mar 28 02:38:43 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7a, bit 4294967295, ext info 0x05c69a7a 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g1-a9 25904: LC/0/0/CPU0:Mar 28 02:38:26 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)
Mar 28 04:38:43 3|rbx-g2-a9 394693: LC/0/1/CPU0:Mar 28 02:38:43 UTC: prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7b, bit 4294967295, ext info 0x05c69a7b 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)

We are facing a hardware/software bug that affects IOS XR 4.2.0 and the A9K-TR-24x10GE cards.
We have escalated all of this to TAC in order to find the origin of the problem and the solution.
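
The excerpts above are hard to compare by eye. As a minimal sketch, assuming the syslog layout shown in this task (hostname, sequence number, then LC/<rack>/<slot>/CPU0), the Python below counts the double-bit ECC messages per router, card and NP. The regular expression and the names ECC_RE, SAMPLE_LINES and count_ecc_errors are illustrative only; this is not an OVH or Cisco tool. Lines without a hostname prefix (as in the 08:39AM comment) are deliberately not matched.

```python
import re
from collections import Counter

# Assumed layout, taken from the lines quoted in this task:
#   "<date> <host>[.domain] <seq>: LC/<rack>/<slot>/CPU0:... NP <n>, ... addr 0x..."
ECC_RE = re.compile(
    r"(?P<host>rbx-g\d-a9)\S*\s+\d+:\s+"
    r"LC/(?P<card>\d+/\d+)/CPU0:.*?"
    r"Double-bit ECC error detected: NP (?P<np>\d+),.*?"
    r"addr (?P<addr>0x[0-9a-f]+)"
)

# Two lines copied from the excerpts above (one per hostname format).
SAMPLE_LINES = [
    (
        "Mar 28 04:37:01 rbx-g2-a9.fr.eu 377642: LC/0/1/CPU0:Mar 28 02:37:04 UTC: "
        "prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 6, "
        "block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05c69a7b, "
        "bit 4294967295, ext info 0x05c69a7b 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)"
    ),
    (
        "Mar 28 04:38:43 3|rbx-g1-a9 25894: LC/0/0/CPU0:Mar 28 02:38:26 UTC: "
        "prm_server_ty[295]: prm_ser_check: Double-bit ECC error detected: NP 1, "
        "block 0xb (SRCH), offset 72, memid 539, name SEARCH_EXT_MEM, addr 0x05cab925, "
        "bit 4294967295, ext info 0x05cab925 0x000082b5 0x00000047 0xffffffff, action 0 (Fix)"
    ),
]


def count_ecc_errors(lines):
    """Count ECC error lines per (router, card, NP); unmatched lines are ignored."""
    counts = Counter()
    for line in lines:
        match = ECC_RE.search(line)
        if match:
            counts[(match["host"], match["card"], int(match["np"]))] += 1
    return counts


if __name__ == "__main__":
    for key, total in sorted(count_ecc_errors(SAMPLE_LINES).items()):
        print(key, total)
```

Grouping by router and card in this way makes it easier to see which line cards were reporting errors in each burst.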


Comment by OVH - Wednesday, 28 March 2012, 19:34PM

We had already run into this issue and reported it to TAC. Cisco TAC worked on it and prepared an SMU, a software patch for the IOS XR version we are running. This small patch will be integrated into the next release.

We are retrieving it now, and tonight we will start maintenance on these 2 routers to apply the software patch, which requires reloading each router afterwards. We are not going to do this during the day.


Comment by OVH - Wednesday, 28 March 2012, 19:39PM

We worked with Cisco today on the issues we faced. We have to apply urgent fixes to the routers. These fixes will be deployed tonight:
00:00 on rbx-g1
01:00 on rbx-g2


Comment by OVH - Wednesday, 28 March 2012, 20:39PM

Hello,
We had a routing problem last night, caused by a software bug that affected the 2 main routers in Roubaix. These Cisco ASR 9010 routers aggregate the bandwidth of the Roubaix datacenters (RBX1 RBX2 RBX3 RBX4 RBX5) and connect them to Paris, Brussels, Amsterdam, London and Frankfurt. In short, they are the heart of routing in Roubaix.

This bug is known and is clearly linked to the new cards that we installed at the end of January (24x10G per slot). For a reason that appears random, the card detects RAM ECC errors and stops routing packets. Despite this, the card is not declared as failed and stays in the router as if it were healthy.
The other routers keep sending it packets, but nothing on the other end handles them. That causes a major problem and the network no longer works correctly.
The worst kind of failure: one that is not clean.

Last night, three 24x10G cards on 2 ASR 9010 routers hit this bug at almost the same time. This broke the network into 3 pieces: USA/London/Amsterdam/Warsaw, Roubaix, and Paris/Frankfurt/Madrid/Milan, with packets being sucked into Roubaix. Normally the traffic would have been rerouted, but here it was drawn in and blocked in Roubaix.

As a result, we could not use the network itself to manage it and retrieve the logs from all the routers to find the origin of the problem.
We fell back to the old way, using rescue/external connections to log in to each backbone router and check whether it was the source of the issue.
This took time, since 2 routers were broken and it took us a while to understand that the problem came not only from rbx-g2-a9 but also from rbx-g1-a9.
Once we had restarted the 3 cards, everything went back to normal within 5 minutes.

Three weeks ago, we had already opened a ticket with Cisco about the RAM ECC issue. Cisco worked on it and provided, this morning, the software patch to apply to these routers to fix the problem. We are going to start the operation tonight. No outage is expected.

We will also look at how to improve the management of our routers in case the whole backbone goes down for a reason that "will never happen".
We know how to deal with this case, but it takes long. Very long.

In any case, the outage lasted around 1h22, whereas a 99.9% availability commitment gives us "the right" to 43 minutes of downtime per month. Penalties apply once the allowed time is exceeded.
For example, for OVH dedicated servers (SD) it is 5% per hour of unavailability.
We are going to set up a URL so that you can trigger the SLA and send us the request to have the 5% credited on your service. It will be posted in this task:
http://status.ovh.net/?do=details&id=2571

It is never pleasant to write this kind of email, but when we are not good, well, we take responsibility and apologize.

We do apologize once again.

Regards,
Octave
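
The 43-minute figure in the email above follows from simple arithmetic. As a hedged sketch, assuming a 30-day month and taking the outage as 1h22 (82 minutes), the Python below reproduces the allowance and the time spent beyond it. The function names are illustrative. The credit itself is not computed, because the email does not say how partial hours are counted.

```python
def allowed_downtime_minutes(availability, days=30):
    """Downtime budget in minutes for a given availability over `days` days."""
    return (1.0 - availability) * days * 24 * 60


def excess_minutes(outage_minutes, availability, days=30):
    """Minutes of outage beyond the budget (0 if the outage fits within it)."""
    return max(0.0, outage_minutes - allowed_downtime_minutes(availability, days))


if __name__ == "__main__":
    outage = 1 * 60 + 22  # 1h22, as stated in the email
    print(round(allowed_downtime_minutes(0.999), 1))  # 43.2
    print(round(excess_minutes(outage, 0.999), 1))    # 38.8
```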


Comment by OVH - Wednesday, 28 March 2012, 22:14PM

Both patches:
CSCty46761
CSCtx89601

asr9k-px-4.2.0.CSCtx89601-1.0.0
asr9k-px-4.2.0.CSCty46761-1.0.0


Comment by OVH - Thursday, 29 March 2012, 00:30AM

We started deploying the patch.

We are isolating rbx-g1-a9 from the network.


Comment by OVH - Thursday, 29 March 2012, 01:19AM

Routing is provided by rbx-g2. We are applying the patches.
A full reload of rbx-g1 will be performed. No impact on traffic is expected, since routing is handled by rbx-g2.


Comment by OVH - Thursday, 29 March 2012, 01:19AM

Wed Mar 28 22:31:25.042 UTC
Install operation 6 '(admin) install activate
disk0:asr9k-px-4.2.0.CSCty46761-1.0.0 disk0:asr9k-px-4.2.0.CSCtx89601-1.0.0'
started by user 'gui' via CLI at 22:31:25 UTC Wed Mar 28 2012.
Info: This operation will reload the following nodes in parallel:
Info: 0/RSP0/CPU0 (RP) (SDR: Owner)
Info: 0/RSP1/CPU0 (RP) (SDR: Owner)
Info: 0/0/CPU0 (LC) (SDR: Owner)
Info: 0/1/CPU0 (LC) (SDR: Owner)
Info: 0/2/CPU0 (LC) (SDR: Owner)
Info: 0/3/CPU0 (LC) (SDR: Owner)
Info: 0/4/CPU0 (LC) (SDR: Owner)
Info: 0/5/CPU0 (LC) (SDR: Owner)
Info: 0/6/CPU0 (LC) (SDR: Owner)
Info: 0/7/CPU0 (LC) (SDR: Owner)


Comment by OVH - Thursday, 29 March 2012, 01:22AM

Patches are applied and the router is stable.
However, we have a problem with BGP. One of the sessions to RF-2 (BGP route reflector) does not come up in v4, and another one to rf-1 does not come up in v6. We are looking into this more closely before proceeding further.


Comment by OVH - Thursday, 29 March 2012, 02:04AM

All BGP sessions are up. The work continues with rbx-g2.


Comment by OVH - Thursday, 29 March 2012, 02:06AM

rbx-g2 is now isolated from the network. Routing is now provided by rbx-g1. The router will be reloaded as part of applying the patches.


Comment by OVH - Thursday, 29 March 2012, 02:08AM

This time, there is no problem with BGP. The router is up and stable. We are re-enabling traffic on it.


Comment by OVH - Thursday, 29 March 2012, 02:39AM

Both routers, rbx-g1 and rbx-g2, are working properly. The patches have been applied.
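
The maintenance above followed a one-router-at-a-time pattern: isolate, patch and reload, check BGP, put traffic back, then move to the other router. As a rough sketch of that ordering only, the Python below models it with caller-supplied callbacks. The function rolling_patch and its four callbacks are hypothetical placeholders, not OVH tooling or Cisco commands. Unlike the real operation, which waited for the BGP sessions to come up, this sketch simply stops at the first router that fails its check.

```python
def rolling_patch(routers, isolate, patch_and_reload, sessions_up, restore):
    """Patch routers one at a time.

    Returns (patched, failed): the routers restored to service, and the name
    of the router that failed its health check (None if all succeeded).
    All four callbacks are hypothetical and take the router name.
    """
    patched = []
    for name in routers:
        isolate(name)           # move traffic to the other router first
        patch_and_reload(name)  # apply the patches, then full reload
        if not sessions_up(name):
            # Leave this router isolated and do not touch the next one,
            # so at least one router keeps carrying traffic.
            return patched, name
        restore(name)           # re-enable traffic on the patched router
        patched.append(name)
    return patched, None


if __name__ == "__main__":
    steps = []
    result = rolling_patch(
        ["rbx-g1", "rbx-g2"],
        isolate=lambda r: steps.append(("isolate", r)),
        patch_and_reload=lambda r: steps.append(("patch", r)),
        sessions_up=lambda r: True,
        restore=lambda r: steps.append(("restore", r)),
    )
    print(result)
    print(steps)
```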


Comment by OVH - Monday, 16 April 2012, 11:17AM

To apply for the SLA, please visit https://www.ovh.co.uk/managerv3/sla-list.pl