OVHcloud Network Status

FS#5851 — vss-5b-6k RBX4
Scheduled Maintenance Report for Network & Infrastructure
Completed
We have a strange problem on the router: 10G ports go down one after another and then come back, without any clear reason. No CRC errors, no problems on the links. And it affects almost all links on each card.

The consequence is small service cuts here and there: each cut lasts a matter of seconds, not long enough for PMO / PSO / HSRP / GLBP to be recalculated.

Update(s):

Date: 2011-10-02 23:05:18 UTC
The CPU of the routers over the last 2 days. "it's getting better" (c)

----9999999999999999999999999999999999999999999999999999999999999999999999
----9999999999999999999999999999999999999999999999999999999999999999999999
100-******************#*#*************************************************
-90-******************###*************************************************
-80-*********#*##*########***#********************************************
-70-*******#####################*######***********************************
-60-****#*##############################**********************************
-50-****#################################*******######*#******************
-40-######################################################################
-30-######################################################################
-20-######################################################################
-10-######################################################################
---0....5....1....1....2....2....3....3....4....4....5....5....6....6....7.
-------------0----5----0----5----0----5----0----5----0----5----0----5----0
CPU% per hour (last 72 hours)
* = maximum CPU% # = average CPU%

Date: 2011-10-02 23:03:53 UTC
A human error in the network configuration hit the following prefixes:
46.105.116.0/24
46.105.117.0/24
46.105.118.0/24
46.105.119.0/24
176.31.226.0/24
176.31.227.0/24

It can be explained by the fatigue of our team members.
We have been on this task for 48 hours :(

Date: 2011-10-02 22:31:49 UTC
After we set the port channels to static mode, the routers became stable and are no longer overloading. They are running at 30% CPU.
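
As an aside, a minimal sketch of what this change looks like on a Cisco IOS box (interface and channel-group numbers are hypothetical): with LACP, bundle membership depends on periodic LACPDUs that an overloaded supervisor can fail to process in time; static mode bundles the port unconditionally, with no protocol left to time out.

  ! Before (hypothetical): LACP-negotiated bundle; missed LACPDUs
  ! under CPU load tear the member out of the port channel.
  interface TenGigabitEthernet1/1
   channel-group 10 mode active
  !
  ! After: static bundling, nothing to negotiate or time out.
  interface TenGigabitEthernet1/1
   channel-group 10 mode on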

We now have an attack on the networks passing through the router, so we are going to identify the target of the attack and then block it upstream.
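
The report doesn't say how the blocking is done; one common way to drop traffic to an identified target at the edge is a static blackhole route, sketched here with a documentation-range address standing in for the real target:

  ! Send everything destined to the attacked IP to the bit bucket
  ip route 192.0.2.10 255.255.255.255 Null0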

Date: 2011-10-02 21:44:36 UTC
A few days ago we strengthened the security of our routers to protect against attacks on our infrastructure.
We noted that our monitoring system had been detecting an attack for some time, and that the target was a router at OVH.
How could that happen? A typing mistake in the protection.
We just fixed it. It is now locked down, and we now have strong proof that the attacks are the origin of the problems we encountered over the last few days:

http://status.ovh.co.uk/?do=details&id=1833
http://status.ovh.co.uk/?do=details&id=1865

Looking at the graphs, we noted that the attack amounts to about 5 Gbps of traffic:

http://yfrog.com/z/nxxpbbsj
http://yfrog.com/z/nwr8ejwj
http://yfrog.com/z/kjwq0bej

The easiest way is to send an email to noc@ovh.net with the origin of the problem.

Date: 2011-10-02 15:42:50 UTC
The router will find stability when the ports stop flapping.
Without that, it's not possible.
Anyway.

Date: 2011-10-02 15:40:00 UTC
We checked the HSRP/GLBP parameters to prevent the router from going down when the ports flap.
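
The exact parameters aren't given; as an illustration, this is the kind of IOS tuning that keeps a seconds-long flap from triggering a gateway failover (interface names, group numbers, addresses and timer values are all hypothetical):

  interface Vlan100
   ! HSRP: hellos every 5 s, peer declared dead only after 15 s,
   ! so a cut of a few seconds does not cause a failover.
   standby 1 ip 192.0.2.1
   standby 1 timers 5 15
   standby 1 preempt delay minimum 60
  !
  ! The GLBP equivalent on subnets that use it:
  interface Vlan200
   glbp 1 ip 192.0.2.129
   glbp 1 timers 5 15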

Date: 2011-10-02 15:26:50 UTC
Switching to another type of port channel seems to fix the flapping problem. We are now applying this change to the switches at the heart of the router.

Date: 2011-10-02 15:18:51 UTC
We deactivated flow control on the router's ports (it was the default configuration).
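
On IOS this is roughly the following (interface name hypothetical; whether the send direction is also configurable depends on the line card). 802.3x pause frames let a congested device pause its neighbour's ports, which can itself look like a flap:

  interface TenGigabitEthernet1/1
   ! Stop honouring pause frames from the peer
   flowcontrol receive off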

Date: 2011-10-02 14:45:11 UTC
We are changing the port-channel type on the uplinks of the 2 routers.

Date: 2011-10-02 14:31:38 UTC
We are synchronising the configuration between the 2 routers, then we will relaunch the IP failover system.

The router seems to be stable, but we still have a flap on vss-5b.
Oops, there's a flap on vss-5a ...
Help!

Date: 2011-10-02 14:11:56 UTC
Switch done.

Date: 2011-10-02 14:06:08 UTC
We added a new 6704 card and we are going to move some of the flapping ports onto this new card.

Date: 2011-10-02 13:44:29 UTC
We reopened a TAC ticket with Cisco to understand where the problem comes from.

In parallel, we are switching the configuration of A to the same type as B. That's almost 1h30 of work. Let's go !!!

Date: 2011-10-02 13:21:55 UTC
We restart the ports of the 100M servers.
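
"Restarting" a batch of access ports is typically just a bounce over an interface range, along these lines (the range is hypothetical):

  interface range FastEthernet3/1 - 48
   shutdown
   no shutdown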

Date: 2011-10-02 13:21:27 UTC
It's done and the ports are flapping again.
What's that ???

Date: 2011-10-02 13:20:24 UTC
It's breathing.

We are going to check whether the attack was over IPv6 or not.

We opened this maintenance task:
http://status.ovh.net/?do=details&id=1877

and we are going to reactivate IPv6 on vss-5b.

Date: 2011-10-02 12:28:55 UTC
It's done.

Date: 2011-10-02 12:28:43 UTC
We are going to deactivate IPv6 on vss-5b.
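
A minimal sketch of what deactivating IPv6 on a routed interface looks like in IOS (interface and prefix are hypothetical):

  interface Vlan100
   no ipv6 address 2001:db8:100::1/64   ! remove the configured prefix
   no ipv6 enable                       ! stop IPv6 processing entirely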

Date: 2011-10-02 12:28:09 UTC
It's done. All the 100M servers are now solely on vss-5A.
vss-5B is catching its breath and vss-5A is OK. The change
means that the B router of each subnet is now down. Maybe
the attack is aimed at that IP?

We'll give B some 20 minutes to see whether the ports flap or not.
If they don't flap, we'll know for sure. Then we switch all of this
to vss-6B.


Date: 2011-10-02 12:21:40 UTC
We are preparing the migration of all the 100Mbps servers
from vss-5AB to the new vss-6AB.

We are cutting these servers off on the B side; A is still routing.
We are going to recable the ports of vss-5B onto vss-6B and
then reactivate the routing on vss-6B with its IP failover.

Date: 2011-10-02 12:08:29 UTC
We think the port problem comes from LACP going down
because the router is overloaded. But why only on B and not on A?
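
The theory is checkable: LACP declares a member dead after three missed LACPDUs (90 s at the slow rate, 3 s at the fast rate), and a saturated supervisor can easily miss them. Standard IOS show commands to confirm it (the channel-group number is hypothetical):

  show processes cpu sorted       ! is the control plane pegged?
  show lacp counters              ! are LACPDUs being missed?
  show etherchannel 10 summary    ! which members fell out of the bundle?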

We shut down one of the 3 route reflectors on A and B.

Date: 2011-10-02 12:05:58 UTC
B done.

We are tackling the same thing on A.

Then we will restart the IP failover system to execute everything
that is pending.

Date: 2011-10-02 12:03:36 UTC
We are going to change the way the servers are configured
on the router, to get back to the same kind of configuration as RBX1/2/3.

Date: 2011-10-02 12:01:21 UTC
We took 6 of the 10G links that have been flapping for 1 hour, and
we are checking those links end to end to understand why they flap.

Date: 2011-10-02 12:00:00 UTC
We again changed the MAC aging time on the infrastructure to keep
the MAC entries on the ports from expiring.
This fixes a lot of problems.
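
We read "the time" as the MAC address-table aging time; raising it keeps entries alive across short flaps instead of letting them expire and forcing unicast flooding. A sketch (value and VLAN hypothetical, the Catalyst default is 300 s; the command spelling varies slightly across IOS versions):

  mac address-table aging-time 14400            ! global, in seconds
  mac address-table aging-time 14400 vlan 100   ! or per VLAN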

Date: 2011-10-02 11:58:46 UTC
The problem is still there. The origin of the problem is not known for
the time being.

We have 2 different problems:

- On router B, the ports are sometimes interrupted, without any explanation.

- The servers do not ping well, and this has nothing to do with the first problem.

We have several leads that we are going to follow for the 2 problems.

Date: 2011-10-02 04:03:45 UTC
The SUP has been replaced, but the problem continues.
We searched for the origin of this trouble but we didn't find it.
Meanwhile, A is not doing well at all.
We put the traffic back on the 2 routers; both are running, but not at their best.

We will keep searching for the origin of this overload in order to determine the kind of attack we are facing.
At the same time, we are thinking about the port cuts.

Date: 2011-10-02 00:59:25 UTC
The router is back, but it is not doing any better. We will change the router's SUP card.

Meanwhile, vss-5a-6k continues routing, but it has problems. We will plan the switching of some networks from vss-5AB to vss-6AB in order to let vss-5AB operate normally. We are seeing an unusual number of attacks that we must study to determine what they are.
That will take time, and during that time the router has to keep operating properly.

Date: 2011-10-02 00:52:44 UTC
We cut HSRP and GLBP on the B.
The router should be given around 10-15 seconds to take the subnets' traffic over again (with default timers, an HSRP/GLBP peer is only declared down after a 10-second holdtime).

Date: 2011-10-02 00:47:58 UTC
We are preparing the reboot.

No more traffic on the B; vss-5a provides the routing.
The IP failovers have been cut over; vss-5a provides the routing.
Posted Oct 01, 2011 - 23:04 UTC