Hi,
I decided to explain in a bit more detail what happened during that Wednesday night when we released the bad definitions that started flagging thousands of innocent programs as Trojans.
Normally, we have two definition updates a day: usually one in the morning and one in the afternoon/evening (unless there’s an emergency). The actual release process is well defined and features multiple QA checks that ensure the definitions we roll out don’t cause any [major] problems. For example, every definition we push out has to pass a false positive (FP) test on our extensive cleansets. The cleansets currently contain terabytes of data from hundreds of thousands of applications (we run many tests in parallel, but the test still takes at least an hour to complete). Every single FP on this test set is a reason for the definitions to go back to the virus lab and be revised (and after a fix is made, a new full cleanset test is performed, until everything is fine).
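To give a rough idea of what such a gate looks like, here is a minimal sketch, not our actual tooling: the scan_with_definitions helper, the paths, and the degree of parallelism are all made-up assumptions. The point is simply that the candidate definitions are run against the whole cleanset and a single FP blocks the release:

```python
#!/usr/bin/env python3
"""Illustrative release gate: block a definitions build on any cleanset FP.

This is a simplified sketch, not Avast's real pipeline. The scanner call,
directory layout and VPS naming below are hypothetical.
"""
import sys
from concurrent.futures import ProcessPoolExecutor
from pathlib import Path

CLEANSET_ROOT = Path("/data/cleansets")       # hypothetical location of known-clean files
VPS_CANDIDATE = Path("/build/vps-candidate")  # hypothetical definitions build under test
WORKERS = 16                                  # tests run in parallel, as described above


def scan_with_definitions(vps: Path, sample: Path) -> bool:
    """Return True if `sample` is flagged as malware by the candidate VPS.

    Placeholder for the real scanner invocation; always returns False here.
    """
    return False


def scan_chunk(samples: list[Path]) -> list[Path]:
    """Scan a batch of clean samples and return those wrongly flagged (FPs)."""
    return [s for s in samples if scan_with_definitions(VPS_CANDIDATE, s)]


def main() -> int:
    samples = [p for p in CLEANSET_ROOT.rglob("*") if p.is_file()]
    chunks = [samples[i::WORKERS] for i in range(WORKERS)]

    false_positives: list[Path] = []
    with ProcessPoolExecutor(max_workers=WORKERS) as pool:
        for flagged in pool.map(scan_chunk, chunks):
            false_positives.extend(flagged)

    if false_positives:
        # Any single FP sends the definitions back to the virus lab for revision.
        print(f"REJECTED: {len(false_positives)} false positive(s), "
              f"e.g. {false_positives[0]}")
        return 1

    print("Cleanset test passed - candidate may be released.")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```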
Now, given what I’ve just described, how could it happen that we released definitions that produced so many FPs? Were we so unlucky that none of the affected applications was included in the cleanset? (i.e. is the cleanset really that poor?)
No. In fact, an analysis done later showed that with the definitions in question (VPS 091203-0), we detected over 50 thousand unique samples from the cleansets as viruses!
The problem was that the FP test was not performed at all before the definitions were pushed out.
On December 2, at roughly 9pm, we had a normal (scheduled) VPS update, 091202-1. The update was working fine for most users; no FPs or anything. However, due to a bug in it, the update wasn’t working correctly on some Avast v5.0 (beta) installations. On those computers, the avast service wouldn’t start after a reboot. Remember that avast 5 is still in beta, and bugs like this can (and do) occur.
Soon after releasing 091202-1, we noticed the problems with v5 and, after some analysis, a decision was made to release another update that would fix the problem. It was around 1am local time and the situation was a bit stressful, because v5 users were experiencing the issue and something had to be done fast. One of the people not normally responsible for releasing VPS updates (but equipped with the knowledge of how it’s technically done) went ahead and released the out-of-band update. Unfortunately, he didn’t follow the prescribed process and used the wrong input files to generate the VPS - files that had only been prepared for testing, but were never actually tested.
Anyway, after the update was released (at around 12:30am GMT, i.e. 1:30am local time here in Prague), there was still a chance to get some early warning that the update was a fiasco and needed to be rolled back immediately. The irony is that the person kept checking for at least one more hour whether anything was wrong, but the internal systems used to flag anomalies (such as increased load on the FP reporting servers) weren’t showing anything special at that time. Had he checked the forum, he would certainly have noticed the buzz that had just started here, but unfortunately, he didn’t.
The responsible people were not alerted until 5:15am local time, when the problem was already massive. It took another 75 minutes to release the cure.
What’s the conclusion? We will certainly be improving the process further so that such a thing is no longer possible. In fact, this is our first major issue of this type, so we feel that even the current process works well, but only if it’s strictly followed. We need to make sure it is really enforced in every possible case.
Furthermore, we’re thinking about some additional early warning systems. If, for example, the evangelists here on the forum had a phone number to call in case of emergency, the problem could have been contained much, much faster and the harm done would have been incomparably smaller. Automated alerting systems have their place, but in many cases a human decision is the best. And it’s better to be alerted falsely ten times than not to be alerted at all.
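Just to illustrate the kind of early warning I mean (the metric source, thresholds and paging mechanism below are all invented for the example, not a description of our systems): watch the FP-report volume, and when it jumps well above normal, get a human on the phone instead of relying on dashboards alone:

```python
#!/usr/bin/env python3
"""Illustrative early-warning loop: page a human when FP reports spike.

All names, numbers and the notification mechanism here are hypothetical.
"""
import time

BASELINE_RATE = 20.0     # typical FP reports per minute (made-up figure)
ALERT_MULTIPLIER = 5.0   # alert aggressively - ten false alarms beat one missed one
CHECK_INTERVAL_S = 60


def current_fp_report_rate() -> float:
    """Return FP reports per minute from the reporting servers (stubbed here)."""
    return 0.0


def page_on_call_human(message: str) -> None:
    """Notify a real person (phone/SMS); stubbed as a print for this sketch."""
    print(f"ALERT -> on-call: {message}")


def main() -> None:
    while True:
        rate = current_fp_report_rate()
        if rate > BASELINE_RATE * ALERT_MULTIPLIER:
            page_on_call_human(
                f"FP report rate at {rate:.0f}/min (baseline {BASELINE_RATE:.0f}/min); "
                "possible bad VPS, consider immediate rollback."
            )
        time.sleep(CHECK_INTERVAL_S)


if __name__ == "__main__":
    main()
```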
The overall process will also be completely revised, and crisis management plans will be defined. We plan to do this over the next week, and I’ll share the outcome with you.
Looking back, we feel really sorry for what happened. We have learned a lot from this incident and are making sure it will never, ever happen again.
So, if you believe in second chances, please stay with avast. We screwed up and we know it, but we have to look forward and keep fighting. The virus writers don’t sleep.
Thanks
Vlk