OVHcloud Network Status

FS#4408 — ldn-1-6k
Incident Report for Network & Infrastructure
Resolved
The router is down.

Update(s):

Date: 2010-07-25 22:59:00 UTC
This will be fixed with the BGP collector router, which has been ordered and should arrive in 5 weeks. We will then have fewer BGP sessions per router, and only simple BGP.


Date: 2010-07-24 21:52:44 UTC
We have restored all sessions on fra-5. It is stable.

We believe this is a problem of memory exhaustion and memory fragmentation that dates from when we established the security via "london/amsterdam" and "paris/frankfurt".
The ldn, ams and fra routers have consumed more memory because of the new information, and we are visibly reaching the upper limits. On ldn, for example, 73 MB out of 1 GB remains free, but only 53 MB of that is unfragmented.
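
The figures quoted for ldn can be put into proportion with a little arithmetic. The sketch below is only illustrative: it assumes that 1 GB means 1024 MB and that the 53 MB figure is the part of the free memory still usable as contiguous space. Neither assumption comes from the router output, and the function name is hypothetical.

```python
# Figures quoted for ldn in the update above.
# Assumption: units are MB, with 1 GB = 1024 MB.
TOTAL_MB = 1024
FREE_MB = 73
UNFRAGMENTED_MB = 53


def memory_summary(total_mb, free_mb, unfragmented_mb):
    """Return how much memory is free and how much of it is fragmented."""
    fragmented_mb = free_mb - unfragmented_mb
    return {
        "free_pct": 100.0 * free_mb / total_mb,
        "fragmented_mb": fragmented_mb,
        "fragmented_pct_of_free": 100.0 * fragmented_mb / free_mb,
    }


summary = memory_summary(TOTAL_MB, FREE_MB, UNFRAGMENTED_MB)
for key, value in summary.items():
    print(f"{key}: {value:.1f}")
```

Under these assumptions, about 7% of the memory is free, and 20 MB of that free memory is fragmented.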


Date: 2010-07-24 21:04:47 UTC
On ldn-1-6k, in the crashinfo:
Jul 24 19:05:24 GMT: %C6K_PLATFORM-SP-2-PEER_RESET: SP is being reset by the RP

Date: 2010-07-24 20:28:12 UTC
We isolated all the sessions on fra-5 and disconnected them all.
We are saving the configuration, then rebooting.


Date: 2010-07-24 20:26:04 UTC
fra-5 is down again. It is a memory problem. We are doing a hard reboot.



Date: 2010-07-24 19:48:40 UTC
Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 718: Jul 24 20:37:26 GMT: %SYS-2-MALLOCFAIL: Memory allocation of 64 bytes failed from 0x420B35A8, alignment 8
Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 719: Pool: Processor Free: 0 Cause: Not enough free memory
Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 720: Alternate Pool: None Free: 0 Cause: No Alternate pool
Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 721: -Process= "Tag Control", ipl= 0, pid= 278
Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 722: -Traceback= 4102AD28 410315F0 420B35B0 420B4960 420BBF90 421EFA60 420BD978 420B7760 420BB770

Date: 2010-07-24 19:48:29 UTC
Jul 24 21:37:20 40g.fra-5-6k.routers.chtix.eu 707: Jul 24 20:36:57 GMT: %IPACCESS-2-NOMEMORY: Alloc fail for acl-config buffer. Disabling distributed mode on lc
Jul 24 21:37:20 40g.fra-5-6k.routers.chtix.eu 708: Jul 24 20:36:57 GMT: %IPACCESS-2-NOMEMORY: Alloc fail for acl-config buffer. Disabling distributed mode on lc
Jul 24 21:37:20 40g.fra-5-6k.routers.chtix.eu 709: Jul 24 20:36:58 GMT: %FIB-3-NOMEM: Malloc Failure, disabling DCEF
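
The messages pasted in this report all carry a code of the form %FACILITY-SEVERITY-MNEMONIC, sometimes with an extra sub-facility such as SP or DFC4, where a lower severity digit means a more serious event. As a rough sketch of how such lines could be sorted by seriousness, the following uses a regular expression written only against the messages visible in this report. The names `MSG_RE` and `parse_message` are hypothetical, and the pattern is not claimed to cover every message format.

```python
import re
from collections import Counter

# %FACILITY[-SUBFACILITY]-SEVERITY-MNEMONIC: as seen in this report.
MSG_RE = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)"
    r"(?:-(?P<sub>[A-Z0-9_]+))?"
    r"-(?P<severity>[0-7])"
    r"-(?P<mnemonic>[A-Z0-9_]+):"
)


def parse_message(line):
    """Return (facility, sub-facility or None, severity, mnemonic), or None."""
    match = MSG_RE.search(line)
    if match is None:
        return None
    return (
        match.group("facility"),
        match.group("sub"),
        int(match.group("severity")),
        match.group("mnemonic"),
    )


# Sample lines copied from the updates in this report.
SAMPLE = [
    "Jul 24 20:36:57 GMT: %IPACCESS-2-NOMEMORY: Alloc fail for acl-config buffer. Disabling distributed mode on lc",
    "Jul 24 20:36:58 GMT: %FIB-3-NOMEM: Malloc Failure, disabling DCEF",
    "Jul 24 20:37:26 GMT: %SYS-2-MALLOCFAIL: Memory allocation of 64 bytes failed from 0x420B35A8, alignment 8",
    "Jul 24 20:31:11 GMT: %TFIB-DFC4-7-SCANSABORTED: TFIB scan not completing. MAC string updated.",
]

parsed = [parse_message(line) for line in SAMPLE]
by_severity = Counter(entry[2] for entry in parsed if entry is not None)
for severity in sorted(by_severity):
    print(f"severity {severity}: {by_severity[severity]} message(s)")
```

The sub-facility group is optional so that the same pattern handles both %SYS-2-MALLOCFAIL and %TFIB-DFC4-7-SCANSABORTED.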

Date: 2010-07-24 19:48:21 UTC
We are bringing the cards up one by one:
fra-5-6k(config)#no power en module 2
fra-5-6k(config)#no power en module 7
fra-5-6k(config)#no power en module 8
fra-5-6k(config)#no power en module 9

Date: 2010-07-24 19:47:59 UTC
Jul 24 21:33:07 160G.rbx-1-6k.routers.ovh.net 48924: Jul 24 20:32:47 GMT: %DIAG-SP-3-TEST_FAIL: Module 9: TestMacNotification{ID=13} has failed. Error code = 0x1

Date: 2010-07-24 19:47:50 UTC
Jul 24 21:32:47 40g.fra-5-6k.routers.chtix.eu 418: Jul 24 20:32:27 GMT: %C6KPWR-SP-4-DISABLED: power to module in slot 8 set off (Module Failed SCP dnld)

Date: 2010-07-24 19:47:39 UTC
fra-5: there are still some problems:
Jul 24 20:30:53 GMT: %TFIB-SP-7-SCANSABORTED: TFIB scan not completing. MAC string updated.
-Traceback= 40E40578 40E40904 40F1664C 40E18AD8 40E19078 40DFF760 40DFFB7C 40DFFE58 40E00AD8
Jul 24 20:31:11 GMT: %TFIB-DFC4-7-SCANSABORTED: TFIB scan not completing. MAC string updated.
-Traceback= 20F6AE38 20F6B1C4 2103E87C 20F43398 20F43938 20F2A020 20F2A43C 20F2A718 20F2B398
Jul 24 20:31:14 GMT: %TFIB-DFC1-7-SCANSABORTED: TFIB scan not completing. MAC string updated.
-Traceback= 20F6AE38 20F6B1C4 2103E87C 20F43398 20F43938 20F2A020 20F2A43C 20F2A718 20F2B398
Jul 24 20:31:15 GMT: %TFIB-DFC5-7-SCANSABORTED: TFIB scan not completing. MAC string updated.

Date: 2010-07-24 19:44:46 UTC
We have reverted a queue modification on the 10G ports to restore the old values. We had made the change earlier this week to increase the buffers on the ports.
Apparently the router did not handle the option correctly.



Date: 2010-07-24 19:39:59 UTC
fra-5-6k: it is back, but the cards are not yet properly up.
ams-1-6k: it is back; likewise, it has rebooted a card again.
ldn-1-6k: it crashed; we are fixing it over a serial cable, boot in progress.
vss-2-6k: proxy ARP has been restored.

This is the worst backbone crash we have ever had at OVH ...
It was a domino effect across routers that had not been rebooted for a long time and whose RAM was fragmented.

Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622981: Pool: Processor Free: 30087848 Cause: Memory fragmentation
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622982: Alternate Pool: None Free: 0 Cause: No Alternate pool
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622983: -Process= "IP RIB Update", ipl= 0, pid= 164
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622984: -Traceback= 4102AD28 41030958 410433E0 413C2D10 42289224 406417AC 42305768 409D2680 40983230 40983350
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622985: Jul 24 19:21:07 GMT: %FIB-3-NORPXDRQELEMS: Exhausted XDR queuing elements while preparing message for slot/cpu 1/0
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622986: -Process= "IP RIB Update", ipl= 0, pid= 164
Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622987: -Traceback= 413C2DE0 42289224 406417AC 42305768 409D2680 40983230 40983350
Jul 24 20:21:46 40g.fra-5-6k.routers.chtix.eu 623015: Jul 24 19:21:11 GMT: %FIB-3-NOMEM: Malloc Failure, disabling DCEF
Jul 24 20:27:34 40g.fra-5-6k.routers.chtix.eu 623147: Jul 24 19:27:15 GMT: %C6KFIB-4-DISABLED: Hardware FIB forwarding disabled, reverting to only software forwarding.
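
The "Pool: ... Free: ... Cause: ..." lines in this report show two different failure modes: one where about 30 MB is still free but the cause is fragmentation, and one where nothing is free at all. A minimal sketch for pulling those fields out of such lines is given below. The names `POOL_RE` and `parse_pool_line` are hypothetical, and the pattern is based only on the lines pasted in this report.

```python
import re

# Matches "Pool: <name> Free: <bytes> Cause: <text>" but skips the
# "Alternate Pool:" lines, which describe a different pool.
POOL_RE = re.compile(
    r"(?<!Alternate )Pool: (?P<pool>\w+)\s+Free: (?P<free>\d+)\s+Cause: (?P<cause>.+)$"
)


def parse_pool_line(line):
    """Return (pool name, free bytes, cause) or None if the line does not match."""
    match = POOL_RE.search(line)
    if match is None:
        return None
    return match.group("pool"), int(match.group("free")), match.group("cause")


# Lines copied from the updates in this report.
LINES = [
    "Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622981: Pool: Processor Free: 30087848 Cause: Memory fragmentation",
    "Jul 24 20:21:29 40g.fra-5-6k.routers.chtix.eu 622982: Alternate Pool: None Free: 0 Cause: No Alternate pool",
    "Jul 24 21:37:55 40g.fra-5-6k.routers.chtix.eu 719: Pool: Processor Free: 0 Cause: Not enough free memory",
]

for line in LINES:
    parsed = parse_pool_line(line)
    if parsed is not None:
        pool, free_bytes, cause = parsed
        print(f"{pool}: {free_bytes / 2**20:.1f} MiB free, cause: {cause}")
```

In the first line an allocation failed even though roughly 28.7 MiB was still free, which is the signature of fragmentation rather than outright exhaustion.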

It is time to deploy a new generation of routers.
This was planned, but only for September (the hardware has to be available first).


Date: 2010-07-24 19:01:56 UTC
Proxy ARP has been disabled on vss-2.

Date: 2010-07-24 18:57:53 UTC
ams-1 went down. The router has just come back up.

Date: 2010-07-24 18:54:46 UTC
We isolated fra-5.


Date: 2010-07-24 18:52:16 UTC
Jul 24 20:28:13 40g.fra-5-6k.routers.chtix.eu 623150: Jul 24 19:27:53 GMT: %FIB-2-FIBDOWN: CEF has been disabled due to a low memory condition.
Jul 24 20:28:13 40g.fra-5-6k.routers.chtix.eu 623151: It can be re-enabled by configuring "ip cef [distributed]"

Date: 2010-07-24 18:29:54 UTC
fra-5 and th1-1 are failing. Not enough CPU.
We have disabled MPLS across the whole backbone.
Posted Jul 24, 2010 - 18:20 UTC