Web Cloud Status - FS#10957

OVHcloud Web Hosting Status

FS#10957 — class 4

Scheduled Maintenance Report for Web Cloud

Completed

We're setting up a new class 4 infrastructure for the interconnections tonight, in order to reduce the load and spread the impact if a problem were to arise on the interconnection cards.

The hardware is en route.

Update(s):

Date: 2014-06-24 04:28:37 UTC
Hello,

One week later than the date by which we expected to be finished with the VoIP problems, we are closing the work task related to the "dead packet".
It is clearly an exceptional case, close to a zero-day bug.
The likelihood of hitting such a problem: once in a lifetime.
We will provide further information about the packet type that affects the PTG, but only in 1 or 2 months, once Cirpack's clients have updated their software.

This week, we will validate the compensation for all the works and faults we have had over the last 18 months. It will not restore confidence, but I would like to be able to say one day that there was a before and an after June 2014.
Trust will return with time. A lot of time. We are here anyway (we have proved that in managing this problem) and we will be here tomorrow with VoIP. We have a long list of new services that will be deployed in a few weeks. Yes, a really long list.
We will provide you with an 18-month roadmap of the services we want to offer you. We also want to incorporate the value-added services that our partners are building on our VoIP directly within our control.

We do not expect any further work, and we do not see any risk areas where our infrastructure would prevent us from providing the service.
This incident forced us to bring forward all the work that had been scheduled for late August. It was painful, but it is done.

Sorry again for this problem and the others we have had.

Best regards,
Octave

Date: 2014-06-23 08:55:07 UTC
The gateways are up-to-date.

Date: 2014-06-23 08:52:49 UTC
We have blocked the traffic on one of our Telco gateways in order to update the gateway.
The update ran successfully.

The first tests are fine.

The gateway was put back into service in order for us to also block and update the second gateway.

Date: 2014-06-23 08:44:09 UTC
We have received the new TelcoBridge software version. We are going to start tests and we will deploy it during the night.

Date: 2014-06-21 00:27:46 UTC
We will analyse the SIG traffic to cross-check the codec configurations against the custom configuration of each line.
If the configurations differ, we will adapt the line configuration to make it compatible with the client's configuration. In the event of a mismatch (2 phones on one SIP account with 2 different configurations), we will notify the client about the situation.

This problem can cause blank calls (calls with no audio).
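
As an illustration only, the cross-check described above could be sketched as follows. The function name, the data shapes and the three action labels ("ok", "adapt", "notify") are assumptions made for this example and do not describe our actual tooling.

```python
from collections import defaultdict


def plan_codec_actions(line_config, observed):
    """Compare each line's configured codecs with what its phones advertise.

    line_config: {sip_account: [codec, ...]} as configured on our side.
    observed:    iterable of (sip_account, device_id, [codec, ...]) tuples,
                 hypothetically extracted from signalling traces.
    Returns {sip_account: (action, codecs)}.
    """
    by_account = defaultdict(dict)
    for account, device, codecs in observed:
        by_account[account][device] = tuple(codecs)

    actions = {}
    for account, devices in by_account.items():
        distinct = set(devices.values())
        if len(distinct) > 1:
            # Two phones on one SIP account with different configurations:
            # nothing can be adapted safely, the client has to be notified.
            actions[account] = ("notify", None)
            continue
        seen = distinct.pop()
        if tuple(line_config.get(account, ())) != seen:
            # The line configuration differs from the phone: adapt the line.
            actions[account] = ("adapt", list(seen))
        else:
            actions[account] = ("ok", list(seen))
    return actions
```

A line whose single phone offers PCMA then G729 while the line is configured as G729 then PCMA would come back as "adapt"; an account with two phones offering different lists would come back as "notify".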

Date: 2014-06-20 17:18:41 UTC
The TelcoBridge patch is ready.
It is currently going through a battery of tests to validate load handling and non-regression.
We expect to install the patch as soon as it is received, either tonight or Saturday night.

Date: 2014-06-20 17:12:27 UTC
Hello,

This is where we are now:

We are waiting for TelcoBridge to deliver the patch this afternoon.
This patch should correct the following problems that remain with outgoing calls:

- no ringback tone in certain cases, depending on the order of codecs
- call forwarding disconnects a call
- putting on hold disconnects a call

We are also looking at all the tickets posted on incoming and outgoing calls that had audio problems either straightaway or after a few minutes.
For these cases, it is necessary to open incident tickets, which enable us to cross-check as much information as possible and capture fragments.

Many thanks for your reports.

Regarding the Cirpack card, the patch has done its job and there has not been a crash since 6:00 yesterday morning. It is stable.

Date: 2014-06-20 11:10:05 UTC
Hello,
The dead packet has been identified and filtered
on the devices. This packet has had no further
impact on production since 6:00 this morning.

The incoming blank communication problem: we have activated
the dump to debug some customers that were having this issue.
We're looking into the cause. Standard tickets must be
opened and we will look at your details individually.
We believe that it's not related but we are flat out.

The few FAX issues have been fixed. We are more
flexible on negotiation.

Some codec renegotiations are being processed.
We will force the renegotiation now and we will change
the conf to accept the 1st codec proposed. It will not
be changed again, though we may offer another.
Working on it now.

We're waiting for the TelcoBridge patch to manage blind
transfers, which will also fix the ringback bug.
The patched release has not passed all the tests.
They are about to recode the patch and have it quality
approved for regression testing.

The end is near!

Regards
Octave

Date: 2014-06-20 10:49:58 UTC
Hello,
Yesterday, Cirpack was able to create a UDP packet that
crashes a PTG card very very quickly. We have tested
this packet on the PTG in our lab and it generates
the exact same error log that we see during these crashes.

A patch has been applied to protect the CPU and the PTG when
this type of packet is presented, and to log the details of the IP/DST
so that we can locate it in the dumps.

The patch was applied this morning at around 5:30-5:50.
All we can do now is wait and see if that fixes the issue,
or if any other strange packets are found.

Regards
Octave

Date: 2014-06-20 10:08:19 UTC
The patches have been successfully applied. The current configuration means that
PTGs will reboot in the previous version for security purposes.

Date: 2014-06-20 09:57:41 UTC
We will reboot the remaining PTGs.

Date: 2014-06-20 09:51:23 UTC
We are starting the intervention: one PTG will be patched and rebooted first.
Then we will proceed with the others at the same time. The intervention should
take 30 minutes. Incoming and outgoing calls will be down for 08 numbers and portability
during the relaunch.

Date: 2014-06-20 09:50:00 UTC
Cirpack has identified the dead packet that was causing a PTG to crash.
We have redone the tests internally and successfully crashed the PTG in
our lab.

Tonight we will apply a patch to create even more logs on crash.
The maintenance will be carried out at 5:30.

Date: 2014-06-20 09:49:21 UTC
We will apply a new patch in the morning on all PTGs, in order to:
* protect the PTG against the packets in question
* further increase the verbosity level of the packet

We will install the patch tomorrow morning from 5:30 on all cards.

Date: 2014-06-20 09:47:16 UTC
The excessive blocking of INVITE packets that was detected on our VAC has been corrected.
It's no longer necessary to reduce the list of codecs.

Date: 2014-06-20 09:44:59 UTC
The call redirection problem has been identified and corrected.
We also successfully reproduced this issue towards the Bouygues Telecom network.
This problem has not been detected on calls forwarded to other French mobile operators.
The cause of the issue was an additional message in the call signalling on the telecom network; we have removed this message to correct the issue.

Date: 2014-06-20 09:42:27 UTC
Cirpack has delivered and installed the patch that corrects the new bug encountered on the version introduced on Saturday
and that improves the usage logs. Applying this patch caused a general crash of the cards during the unplanned update at 14:47.

In all the crashes we detected the same 3 numbers plus 1 internal number in particular. These numbers have been blocked
and we have not had any new crashes this afternoon (installation of the Cirpack patch accepted).
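
A minimal sketch of how numbers that keep coming back across crashes can be picked out is shown below. It assumes that, for each crash, a set of the numbers with calls in progress has already been extracted from the traces; the function name and the `min_share` parameter are hypothetical.

```python
from collections import Counter


def recurring_numbers(crashes, min_share=1.0):
    """Return the numbers present in at least `min_share` of the crashes.

    crashes: list of sets, one per crash, each holding the numbers that
             had a call in progress when the card went down (assumed input).
    min_share=1.0 keeps only the numbers seen in every single crash.
    """
    counts = Counter()
    for numbers in crashes:
        counts.update(set(numbers))  # count each number once per crash
    threshold = min_share * len(crashes)
    return sorted(n for n, c in counts.items() if c >= threshold)
```

With three crashes involving {A, B, C}, {A, B, D} and {A, B, E}, only A and B are returned, which is the kind of short list that can then be blocked or traced.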

Excessive blocking of INVITE packets by our VAC was detected on some IPBXs.
The workaround is to reduce the list of codecs offered, for example:
- on Aastra: "SIP use basic codecs: 1"
- on Asterisk: reduce the list of codecs in the sip.conf file

Date: 2014-06-20 09:41:33 UTC
Hello,

We updated the C4A to a new version on Saturday
and there is a new bug on this version :(
Cirpack has almost finished the patch and we will
then apply it to the PTGs. This bug caused
1 crash yesterday and 2 this morning.

The patch also includes a log which is more
precise in terms of PTG saturation level with
the exact message that causes everything to malfunction.

On the input side, we have deployed 6 PTGs so that the number
of calls per card is lower and we have fewer packets to analyse.
That said, we cannot locate the packets that are messing
everything up. There is still plenty of comm.

Cirpack tried to reproduce the sequences provided in
all scenarios and it does not cause their infrastructure
to crash.

So we are now adding 2 PTGs on C4A and we will move 2 circuits
onto these 2 PTGs. 60 simultaneous calls. We hope that they won't
crash in the afternoon, with few calls being in progress. So we
will really have very few packets to analyse. If it's still too
much, we will change to 1 circuit on the PTG.

On the TelcoBridge side, we're waiting for a solution from them,
from their R&D dept. We tried to outwit the TG with a special
Cirpack conf, it worked but it affected some phones in some cases.
We have taken a step backwards.

Regards
Octave

Date: 2014-06-20 09:38:04 UTC
Results of Monday:
Hello,
We worked hard this weekend to avoid
infrastructure outage on Monday.
This is not a great success even
though it's an improvement.

We have 2 bugs on outgoing call transfers,
we are working on it. The new device set up
this weekend advertises the UPDATE capability,
though it does not handle it well.
We are looking at how to disable it.
Either there is a command (which we have not found)
or we will have to patch and reboot. We are waiting. It is a
Canadian office, so they are working right now...

There was also the bug causing the PTG cards
to crash, we had 6 or 7 crashes in the afternoon.
It was a little bit better than last week because
it only affected the input, but still, we had these
lousy crashes. Yeah :( With logs and dumps we found
where it can come from and everything was turned off
to prevent it from happening again, around 18:50,
awaiting Cirpack to reproduce the bug and patch
their system.

The problem is related to unconditional transfers from an
OVH number to external numbers that run FAX. The T38 renegotiation
is very violent with DTMF between the input and output
infrastructure, causing internal flooding of UDP packets
(100 pps in 2 to 3 seconds), and it killed the PTG. We have been
offering this service for quite some time, so this doesn't
explain why it started causing issues on Wednesday to the
point of crashing everything. Astounding.
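
For readers who want to spot this kind of burst in their own captures, here is a rough sketch of a sliding-window rate check per flow. It is an assumption-laden example and not the mechanism used on the PTG: the class name is hypothetical, and the 100 pps and 2 second defaults are simply borrowed from the figures quoted above.

```python
from collections import defaultdict, deque


class FloodDetector:
    """Flag a flow once its average packet rate over a window is exceeded."""

    def __init__(self, max_pps=100.0, window_s=2.0):
        self.window_s = window_s
        # Number of packets allowed inside one window before flagging.
        self.max_packets = max_pps * window_s
        self._seen = defaultdict(deque)  # flow -> recent packet timestamps

    def observe(self, flow, timestamp):
        """Record one packet; return True if the flow is now flooding."""
        recent = self._seen[flow]
        recent.append(timestamp)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and timestamp - recent[0] > self.window_s:
            recent.popleft()
        return len(recent) > self.max_packets
```

Feeding it `(flow, timestamp)` pairs read from a dump, where `flow` could be a source and destination IP pair, gives a first indication of which transfers generate the renegotiation burst.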

We didn't keep our word about everything working properly
from Monday. This is true and I can't deny it. We all know
what it means and there's nothing more to say about it. Even if
we boosted the infrastructure reconfiguration and device
switchover this weekend, we would have had the same results
and impacts as today. Perhaps less than we had last week but
we would still have not kept our word. Checkmate. Failed.
Sorry for the failure. We did our best.

Tonight and tomorrow, we will keep fighting the darn bug
that crashes the PTG and 2 bugs on output/transfers.
And we will fix it. There's no other option to choose
and we will put all our effort into restabilising.

We are waiting for the feedback from Cirpack and TelcoBridge
for these bugs and we expect to have positive answers and no
new problems. We are also preparing an order for TelcoBridge
to replace the Cirpack on Class 4 if we find that there's
no hope of fixing this bug. Delivery and setup time
is not great but it remains an option that won't take long to
bring about.

Thank you for your patience, at least what is left of it.

Kind regards,
Octave

Date: 2014-06-20 09:28:51 UTC
18:50: we disabled it on TB4A/B/C

so unconditional FAX transfers from an
OVH number to an external number will
no longer work until a solution is found
by Cirpack.

Date: 2014-06-20 09:01:46 UTC
We are starting to see some progress.

The problem occurs with some FAX
redirections which try to renegotiate
the T38 very violently. The PTG cards are
hit by a renegotiation flood and crash.

We just disabled the forced T38 in output.

Date: 2014-06-18 06:59:52 UTC
We had several consecutive crashes on C4A, only incoming calls are affected.
Outgoing calls work without issues.
The manufacturer has full traces of these crashes, thanks to the patches and the serial cable that were set up this weekend.

Date: 2014-06-18 06:25:25 UTC
At 13:08 the PTG on C4A crashed (outgoing calls). The difference this time:
bladectrl crashed outright instead of freezing.

Date: 2014-06-18 06:23:25 UTC
10:45
We still have 2 problems to fix:
- When an outgoing call is made and then transferred to another extension, there is no
ringback (no ringing while waiting) and sometimes a blank call
- When you make an outgoing call, sometimes you do not have a ringback

We reactivated dumps for all VoIP traffic and we can now identify the problem.

Date: 2014-06-18 06:20:35 UTC
15 sec downtime, human error on ACL manipulation.

Date: 2014-06-18 06:19:24 UTC
We have had a problem of intermittent sound
due to the latency between RBX and P19.
We fixed it.

There remains the problem of "lack of ringtone
with G729 codecs during call transfer
without confirmation".

Date: 2014-06-18 06:17:41 UTC
All planned works are finished and
we're waiting for the first peak on Monday
10:00-11:00 to confirm that the class 4
infrastructure is working correctly.

Date: 2014-06-18 06:15:34 UTC
Outbound call transfers have been fixed.

There are no known problems at present.

Date: 2014-06-18 06:14:56 UTC
The issues of calls being dropped after call waiting have been corrected for all C5s.

Date: 2014-06-18 06:09:32 UTC
Calls to short numbers of lines on C5C have been corrected.

Date: 2014-06-18 06:08:20 UTC
3) TB4B is in production with an SFR interconnection,
we will reconfigure all outgoing voice in 1 hr
via this device and we will ask you to confirm
that there are no more issues with ringback.

done
All output is passing via TB4B. If you have 2 minutes
to test a call from your phone to an external
number and confirm that it's working properly in all
scenarios, that would help us. Thanks in advance.

Date: 2014-06-18 06:07:34 UTC
E) We will upgrade C4A to the new software version
that allows PTG crash logs to be obtained. We
will do it tonight. This will allow us to have
the information in the event of a crash.

done

we have nearly finished the C4A infrastructure
upgrade with the chassis and all voice cards in
France and Europe.

C4A manages incoming calls and short numbers.
Can you check in different scenarios that you can
call your OVH phone numbers? Thanks in advance.
It's very important.

Date: 2014-06-18 06:06:54 UTC
B) We have a list of 30 numbers which keep coming back
in the 7 crashes we have had. We will contact the 6 customers
and block their input. They will still be able to make calls
but not receive them.

done

Date: 2014-06-18 06:06:30 UTC
4) On TB4C, we will move an interconnection which is currently connected on C4B, which will no longer be used.

done

Date: 2014-06-18 05:48:55 UTC
E) We will upgrade C4A to the new software version allowing PTG crash logs to be obtained. We will do it tonight. This will allow us to have the information in case of a crash.

http://status.ovh.co.uk/?do=details&id=7081

planned for 21:00.

Date: 2014-06-18 05:47:54 UTC
> D) We will activate dump IPs on IPs using
> these 30 numbers.

done

Date: 2014-06-18 05:45:25 UTC
> E) We will upgrade C4A to the new software version
> that allows PTG crash logs to be obtained. We
> will do it tonight. This will allow us to have
> the information in the event of a crash.

update is planned for 21:00, it will take 7 to 12 minutes
to reboot all C4A devices.

Date: 2014-06-18 05:43:05 UTC
Here are the points we will work on:

1) We will finish switching the remaining 150 MGCP lines from c4a to c5a. The phones do not recognise the new configuration (firewall?), so
we will move the IP directly.
At the same time we switched to 1007
C5a.

All subscribers, i.e. the 130K/140K SIP/MGCP lines,
will be 100% on the 3 infrastructures c5a/c5b/c5c.

2) We have new equipment that handles
class 4, i.e. the interconnection with France Telecom,
SFR, Completel, DTAG, BT, Telefonica and Belgacom.
It is TelcoBridges. Two months ago we ordered 4 chassis (instead of
Cirpack) and we received them one month ago.

Since then, we have been running tests, and it is going well.
We detected a bug this week and received the patch tonight.
However, we cannot use it for France Telecom yet: it must be
certified with FT and the other incumbent operators in Europe. This will take several months. But we can
use it for all outgoing calls and special numbers.
So we have set up 4 new chassis:
TB4A, TB4B, TB4C and TB4D. TB4D is our spare.

3) TB4B is in production with an SFR interconnection. In 1 hour we will reconfigure all outgoing voice via this equipment and you will be asked to confirm that there are no more problems with ringback.

4) We will move onto TB4C an interconnection that is currently connected to C4B. C4B will no longer be used.

5) On TB4A we will migrate an SFR interconnection; this will take 3-4 days because we have to move it circuit by circuit. This is not serious: TB4B and TB4C
can take all outgoing calls without problems.

6) We will have incoming calls on C4A
and outgoing calls on TB4A/TB4B/TB4C.
We will therefore have no more crashes for outgoing calls, since everything goes out via TB4.

7) We can still have crashes on C4A related to customers who are not doing things properly. To avoid this happening as early as Monday, here is the list of actions:

A) We have installed an anti-DDoS system that protects us against DDoS attacks and filters out anything that looks suspicious.

B) We have the list of 30 numbers that keep appearing in
the 7 crashes we had. We will contact the 6 customers and block their input. They will still be able to make calls but not receive them.

C) We will work with these clients starting
next week, and only at night, to see whether they can generate calls that make the input infrastructure crash.

D) We will activate IP dumps on the IPs using these 30 numbers.

E) We will upgrade C4A to the new software version that provides logs of PTG crashes. We will do it tonight. This will give us the information in case of a crash.

F) If we cannot reproduce the bug and everything is
stable for 10 days, we will move an interconnection onto C4B and, one morning within 2 weeks, we will use it to carry as much voice traffic as possible.
The goal will be to crash the C4B infrastructure and obtain logs that will allow the bug to be fixed.

8) We will prepare the compensation for these 18 months; it will appear on the June invoice.

9) Together with Cirpack, we will build a test infrastructure to qualify the patches they provide to us. We want to test 5K simultaneous calls as follows:
SIP-C5X-C4X=e1=TBX-SIP

With such a test infrastructure we believe we can keep the infrastructure in production while we keep correcting small bugs here and there.
Then we will set up a qualification/stress-test infrastructure for your own infrastructures:
- You set up a new Asterisk
- You want to stress-test/qualify your infrastructure, so you click to launch calls in/out
- It lets you see whether your Asterisk holds up, and lets us see whether it communicates correctly with us

Date: 2014-06-15 05:02:45 UTC
The new binary has been installed on the PTG. It gives us a more effective statistics and reporting system.

Date: 2014-06-15 05:01:51 UTC
In less than an hour, Cirpack will provide us with a version of a binary to install on the PTG in order to have a more effective statistics and reporting system.

Date: 2014-06-13 12:34:36 UTC
We're routing each C5 one by one to the C4B in order to detect whether the problem is arising specifically via one machine.
If the card crashes, the C5 in question could be to blame.
If the three C5s pass this stage, then the issue is coming from somewhere else.
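
The isolation procedure above boils down to a simple loop, sketched here under assumptions: `crashes_when_routed` is a hypothetical callback standing in for "route this C5 alone via the C4B, wait, and report whether the card crashed".

```python
def isolate_faulty_c5(c5_names, crashes_when_routed):
    """Route each C5 on its own via the C4B and collect the ones that crash it.

    c5_names:            names of the C5s to test, one at a time.
    crashes_when_routed: hypothetical callback, True if the card crashed
                         while only that C5 was routed through the C4B.
    An empty result means the issue is coming from somewhere else.
    """
    suspects = []
    for name in c5_names:
        if crashes_when_routed(name):
            suspects.append(name)
    return suspects
```

For example, `isolate_faulty_c5(["C5A", "C5B", "C5C"], probe)` returns `["C5B"]` if only the run with C5B ends in a crash, and an empty list if all three pass.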

Date: 2014-06-13 12:08:30 UTC
Here is the list of actions established, on our side as well as Cirpack's:
- we're analysing the traces in order to detect calls in progress when the cards crash.
- we're also analysing the debug logs of a card from a previous crash.
- we're also looking at previous results with the data collected from the PTG when the crashes occurred, so as to detect calls which are causing the card fault. This analysis will be carried out on the data for up to 3 days ago.
- we're also bearing in mind the possibility of a bad calculation of circuits available on the cards. This action is in progress at the Cirpack R&D department.

The current lead is poor handling of voice streams.

Date: 2014-06-13 11:05:23 UTC
The problem with premium rate numbers has been fixed on all the C5s.

Date: 2014-06-13 09:32:05 UTC
There was an issue with intermittent sound due to
latency between RBX and P19.
It has been fixed.

Date: 2014-06-13 08:01:29 UTC
All planned works have been completed and
we are waiting for the first peak on Mon 10am-11am
to confirm that the class 4 infrastructure is working
properly.

Date: 2014-06-13 07:54:18 UTC
Outbound transfers have been fixed.

We are not aware of any more issues in progress.

Date: 2014-06-13 07:54:13 UTC
The first tests were conclusive.
We have set up a fall-back system between the two C4s.
A call that cannot be connected on one C4 will automatically pass to the other.

We're now applying this configuration on all C5s.
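
The fall-back behaviour can be pictured with the sketch below. It is only an illustration: `CallFailed`, `place_call` and the gateway callables are hypothetical names, and the real routing is done in the C5 configuration, not in Python.

```python
class CallFailed(Exception):
    """Raised when a gateway cannot connect a call (hypothetical)."""


def place_call(number, gateways):
    """Try each gateway in order and return the first successful result.

    gateways: ordered list of callables taking the dialled number and
              either returning a result or raising CallFailed.
    """
    last_error = None
    for gateway in gateways:
        try:
            return gateway(number)
        except CallFailed as error:
            last_error = error  # remember the failure and try the next C4
    raise CallFailed(f"no gateway could connect {number}") from last_error
```

Called with the two C4s as `[c4a, c4b]`, a call that fails on the first is retried on the second, and the caller only sees an error if both fail.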

Date: 2014-06-13 07:53:32 UTC
The interconnection has been mounted on the new C4B infrastructure.
We're currently performing tests on C5C to the external network via C4B.

Date: 2014-06-13 07:53:08 UTC
The interconnections are up.
We're finalising the class 4 configuration.

Date: 2014-06-13 07:52:56 UTC
The chassis problem has been fixed.
We're starting to deploy our configuration.

Date: 2014-06-13 07:52:35 UTC
The installation of the new cards is not finished.
It's taking much longer than expected.
We're still waiting for our manufacturer to finish the configuration.

Date: 2014-06-13 07:43:59 UTC
The chassis has been mounted and the ping between C4A and C4B is working.
The Cirpack technician is mounting the controller. Installation of the applications and lines on the new machine will follow.

On our side, we have prepared the configuration of C4B to route calls towards the interconnection, as well as the DSP loading configuration.

Date: 2014-06-13 07:39:04 UTC
The hardware has arrived.
We're starting the installation.

Posted Jun 13, 2014 - 07:38 UTC