An attempt to explain what went on that Wed night (a follow-up on the FP issue)

Hi,

I decided to explain in a bit more detail what happened during that Wednesday night when we released the bad definitions that started flagging thousands of innocent programs as Trojans.

Normally, we have two definition updates a day: usually one in the morning and one in the afternoon/evening (unless there’s some emergency). The actual release process is well defined and features multiple QA checks that ensure that the definitions we roll out don’t cause any [major] problems. For example, every set of definitions that we push out has to pass a false positive (FP) test on our extensive cleansets. The cleansets currently contain terabytes of data from hundreds of thousands of applications (we run many tests in parallel, but the test still takes at least an hour to complete). Every single FP on this test set is a reason for the definitions to go back to the virus lab and be revised (and after a fix is made, a new full cleanset test is performed, until all is fine).
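As a rough illustration of the gate described above, here is a minimal sketch in Python. It is not the actual release tooling: the names `scan`, `find_false_positives` and `may_release` are hypothetical, and `scan` stands in for whatever runs a candidate set of definitions against one clean file and reports whether it was flagged.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def find_false_positives(
    scan: Callable[[str], bool],
    cleanset: Iterable[str],
    workers: int = 8,
) -> List[str]:
    """Return every clean sample that `scan` flags as malicious.

    `scan` is a hypothetical callable: it takes a sample path and returns
    True when the candidate definitions detect it as malware.
    """
    samples = list(cleanset)
    # Scan samples in parallel; pool.map keeps results in input order.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        verdicts = list(pool.map(scan, samples))
    return [sample for sample, flagged in zip(samples, verdicts) if flagged]


def may_release(scan: Callable[[str], bool], cleanset: Iterable[str]) -> bool:
    """Allow a release only if the cleanset test produced zero FPs."""
    return not find_false_positives(scan, cleanset)


if __name__ == "__main__":
    clean_files = ["notepad.exe", "calc.exe", "setup.exe"]
    # A toy scanner that wrongly flags one clean file.
    bad_scan = lambda path: path == "calc.exe"
    print(find_false_positives(bad_scan, clean_files))
    print(may_release(bad_scan, clean_files))
```

The point of the sketch is that the gate is binary: a single flagged clean sample blocks the release, and the check only protects anyone if it is actually run before publishing.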

Now, given what I’ve just described, how could it happen that we released definitions that produced so many FPs? Were we so unlucky that none of the affected applications was included in the cleanset (i.e. is the cleanset so poor)?

No. In fact, an analysis done later showed that with the definitions in question (VPS 091203-0), we detected over 50 thousand unique samples from the cleansets as viruses!

The problem was that the FP test was not performed at all before the definitions were pushed out.

On December 2, at roughly 9pm, we had a normal (scheduled) VPS update, 091202-1. The update was working fine for most users, with no FPs or anything. However, due to a bug in it, the update wasn’t working correctly in some Avast v5.0 (beta) installations. On these computers, the avast service wouldn’t start after a reboot. Remember that avast 5 is still in beta and bugs like this can (and do) occur.

Soon after releasing 091202-1, we noticed the problems with v5 and, after doing some analysis, a decision was made to release another update that would fix the problem. It was around 1am local time and the situation was a bit stressful because v5 users were experiencing the issue and something had to be done fast. One of the people not normally responsible for releasing VPS updates (but equipped with the knowledge of how it’s technically done) went ahead and released the out-of-band update. Unfortunately, he didn’t follow the prescribed process and used the wrong input files to generate the VPS: files that had just been prepared for testing but were never really tested. :frowning:

Anyway, after the update was released (at around 12:30am GMT, i.e. 1:30am local time here in Prague) there was still a chance to get some early warnings that the update was a fiasco and needed to be rolled back immediately. The irony is that the person kept checking for at least one more hour whether anything was wrong, but the internal systems used to flag anomalies (such as increased load on the FP reporting servers) weren’t showing anything special at that time. Had he checked the forum, he would certainly have noticed the buzz that had just started here, but unfortunately, he didn’t do so.

The responsible people were not alerted until 5:15am local time, when the problem was already massive. It took 75 more minutes to release the cure.

What’s the conclusion? We will certainly be improving the process further so that such a thing is not possible anymore. In fact, this is our first major issue of this type, so we feel that even the current process works well, but only if it’s strictly followed. But we need to make sure that it is really enforced in every possible case.

Furthermore, we’re thinking of some additional early warning systems. If for example the evangelists here on the forum had a phone number to call in case of emergency, the problem could have been contained much much faster and the harm done would be incomparably smaller. Automated alerting systems have their place, but in many cases, a human decision is the best. And better to be alerted falsely ten times than not alerted at all.

The overall process will also be completely revised, and crisis management plans defined. We plan to do this over the next week, and I’ll be sharing the outcome of this with you.

Looking back, we feel really sorry for what happened. We have learned a lot from this incident and are making sure it will never, ever happen again.

So, if you believe in second chances, please stay with avast. We screwed up and we know it, but we have to look forward and keep fighting. The virus writers don’t sleep.

Thanks
Vlk

OK VLK, thank you very much for taking the time to post this. I requested it in another thread and I’m glad you did it :wink: … as I was also wondering why an update was released in the middle of the night, which isn’t usual with avast, especially when an update was released just a few hours before. Now I see what happened…
As far as I’m concerned, I consider such errors human, and I won’t stigmatize Avast for this. So, np here, sticking to and with Avast :wink:

Thank you, Vlk! That was about what I figured, in that I knew there MUST be something strange that had happened somewhere in the processing, because that third update came through only a short time after my second update! The little notice came up, and I immediately remarked to my son, “Wow! Avast! NEVER updates three times a day…there must be something strange going on!” Then just as immediately, the popups began…YIKES!

Thankfully, I didn’t delete anything, and (even though it didn’t help things later) I was able to restore everything from the chest (about 10 items). I am now going to do a full uninstall and clean install, because I’m having internet browser problems when the Standard Shield is active. I have a feeling that will cure my final problem after the big snafu.

At any rate, I want to thank you all for being so quick to work this out, and I’m totally confident that any new system you put in place will be great. In all the years I have been using Avast!, I have never, ever had this type of problem before, and here at the forum, I’ve found it very easy to get questions answered and help quickly delivered. You and the team are very, very friendly and efficient. I would never leave you just because this happened. I put my trust in your product a long time ago, and I don’t believe it was misplaced.

Thank you again.

Let yesterday stay in the past. You guys at Avast are the best. Especially the employee who learned from the experience and taught everyone else that even the best can make mistakes. :wink:

Vlk,

Thank you for taking the time to explain what happened. I will continue to use Avast.

I am staying with Avast :wink: We are all human and can make mistakes, but it takes a big man to say sorry Vlk

Thx Vlk. But I found your decision to remedy the avast! 5 update problems a bit strange. avast! 5 is still in beta, so even a major bug is excusable. Also, fewer users use it compared to the stable 4.8.

Vlk, I accept the detailed information. Mistakes can happen.

With an update frequency of twice a day, a 3rd update seemed like a natural thing to do (an easy fix). And, of course, if it were executed correctly, there would be no problem.

We can speculate whether it was a right or wrong decision but I don’t think it really matters.

Might I just insert that Vlk didn’t create the problem, nor has he laid the blame on anyone specific. He’s simply stated what happened, and has apologized for it.

Vlk, would you please read this:

http://forum.avast.com/index.php?topic=51745.msg437873#msg437873

I still can’t use any internet browser with the Standard Shield activated.

Thanks.

Hi and thanks, Vlk.

In one of the zillion threads relating to this (sorry can’t find it easily, but you may have already seen it), there was an interesting suggestion for a preferably-automatic work-around, in effect permitting the user to “downgrade” back to the previous installed version of the database. I agreed that it might be an idea for your crew to look into, although I agree the repair you did was admirably prompt.

Hi MikeBCda,

Well that could be a good idea that avast could come up with a sort of system snapshot with a good functioning version of avast5 to go back to whenever an incident of this magnitude might affect us (hopefully never),

polonus

A good thing would be at least to generate a Windows restore point (just) before an update is applied. Tens of programs do that at setup time (sometimes initiated by Windows itself, sometimes by the programs), and Windows Defender as well as MSE do it too when they get updated (also manually :wink: )…so why not avast?

the problem that remains being if system files necessary for the restore to complete have been sent to Chest ;D …restore them first… yeah… :slight_smile: that’s a case per case situation, can’t give here the universal solution.

Vlk - Thank you for taking the time to explain what happened. As a “regular” user this gives me peace of mind to know the details and realize the likelihood is small this will happen again anytime soon.

I’m not sure how many other companies would do this. Covering up mistakes seems much too frequent these days with all products and services.

John in STL

Thanks for the explanation Vlk.
As usual, we can trust a company that acknowledges its mistakes.
A telephone number would allow the Evangelists to raise the alarm.

Thank you very much for the explanation, Vlk. :slight_smile:
I will continue to use avast! as my antivirus program.

Thanks Vlk for explaining, luckily no harm was done here on 3 systems and my aunt had no problems either (notebook was off) :stuck_out_tongue:
I’m glad you guys found out what caused it and are taking measures so that it can never ever happen again.
I will continue to use and spread Avast!

Thanks for the detailed explanation Vlk.
Leave avast! ??? Who, me ???
I don’t think that’s ever very likely to happen. :slight_smile:

I for one will be staying with Avast!.

As a person who works in a managerial position, I have an intimate understanding of processes and the repercussions of not following them. Unfortunately, people in these positions find that they more often than not end up “People-Proofing” the system (whatever that system may be). Rarely is the fault in the process itself. More often than not it’s the “Human-Factor”, i.e. the person/people who DID NOT follow said process.

I, like many, was bit by this particular issue. However once the problem was properly understood, the fix was relatively easy. A simple roll-back to Tuesday and then manual updating of my system.

However, I did learn an invaluable lesson, one that frankly I should have already known. That is, always check the simple stuff first.

I was freaking out when a quick scan with the faulty update informed me of multiple win32 infections. I was doing boot-time scans, full scans w/archiving in safe mode, etc. Finally, I just walked away. The next day in school, I discussed the issue with several different professors. I received recommendations from “Wipe the whole drive and start over” to complex, in depth system fixes.

The last professor I spoke with asked me, “Did you check your anti-virus provider’s website to see if there’s been any issue with the program?” LOL Duh…
As humans are imperfect, so shall be the products of their labors. With that in mind I will continue to use Avast! and recommend it to all of my friends. I have been more satisfied with Avast! than any other Malware fighter I have ever used to date.

~Tuebor~
Philo
Loyal Avast! User

Thank you for the explanation Vlk :slight_smile:

It must be awful for you guys too, especially for the colleague this happened to :-\

Greetz, Red.