17

When you roll out changes to a live web site, how do you go about checking that the live system is working correctly? Which tools do you use? Who does it? Do you block access to the site for the testing period? What amount of downtime is acceptable?


13 Answers

13

I tend to do all of my testing in another environment (not the live one!). This allows me to push the updates to the live site knowing that the code should be working OK, and I just do sanity testing on the live data - making sure I didn't forget a file somewhere, and that nothing weird went wrong.

So proper testing in a testing or staging environment, then just trivial sanity checking. No need for downtime.

Answered 2008-09-15T21:55:17.870
7

Lots of good advice already.

As people have mentioned, if you don't have a single shared component involved, it's simple to just phase in changes by upgrading one app server at a time. But that's rarely the case, so let's ignore that and focus on the difficult bits.

Usually there is a db in there which is common to everything else. So that means downtime for the whole system. How do you minimize that?

Automation. Script the entire deployment procedure. This (especially) includes any database schema changes. This (especially) includes any data migration you need between versions of the schema.
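For illustration, here is a minimal sketch of what "script the entire deployment procedure" can mean in practice. Every file and command name in it (the SQL files, deploy_code.sh, smoke_test.py, the database name) is a hypothetical placeholder for whatever your project actually uses:

```python
#!/usr/bin/env python3
"""Illustrative deployment driver: every step is scripted, none are manual.

All file names and commands below are placeholders for this sketch.
"""
import subprocess
import sys
import time

STEPS = [
    # (description, command)
    ("apply schema changes",  ["psql", "-f", "schema/001_add_index.sql", "mydb"]),
    ("migrate existing data", ["psql", "-f", "schema/001_backfill.sql", "mydb"]),
    ("push new application",  ["./deploy_code.sh", "release-42.tar.gz"]),
    ("run smoke tests",       ["python", "smoke_test.py", "--env", "production"]),
]

def main() -> int:
    for description, command in STEPS:
        started = time.monotonic()
        print(f"==> {description}: {' '.join(command)}")
        result = subprocess.run(command)
        elapsed = time.monotonic() - started
        if result.returncode != 0:
            # Fail fast: a human decides whether to roll back or fix forward.
            print(f"FAILED after {elapsed:.1f}s: {description}", file=sys.stderr)
            return result.returncode
        print(f"    ok ({elapsed:.1f}s)")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

The point is that the sequence is recorded and repeatable, so the rehearsals described below exercise exactly what will later run against production.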

Quality control. Make sure there are tests. Automated acceptance tests (what the user sees and expects from a business logic / experience perspective). Consider having test accounts in the production system which you can script to test read-only activities. If you don't interact with other external systems, consider doing write activities too. You may need to filter out test account activity in certain parts of the system, especially if they deal with money and accounting. Bean counters get upset, for good reasons, when the beans don't match up.
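As a sketch of that read-only probe, assuming a hypothetical site URL, login form, test-account credentials, and page list (none of these names come from the answer itself):

```python
"""Read-only production probe using a dedicated test account (hypothetical URLs)."""
import sys
import requests

BASE = "https://www.example.com"          # assumption: your production site
TEST_USER = {"username": "qa-probe", "password": "not-a-real-secret"}

READ_ONLY_PAGES = [
    "/account/overview",
    "/orders/history",
    "/search?q=probe",
]

def main() -> int:
    session = requests.Session()
    login = session.post(f"{BASE}/login", data=TEST_USER, timeout=10)
    if login.status_code != 200:
        print(f"login failed: {login.status_code}", file=sys.stderr)
        return 1
    failures = 0
    for path in READ_ONLY_PAGES:
        response = session.get(f"{BASE}{path}", timeout=10)
        # Only GETs on a throwaway account: nothing here touches money or stock.
        if response.status_code != 200 or "error" in response.text.lower():
            print(f"FAIL {path}: HTTP {response.status_code}", file=sys.stderr)
            failures += 1
        else:
            print(f"ok   {path}")
    return 1 if failures else 0

if __name__ == "__main__":
    sys.exit(main())
```

Run it right after the rollout; because the account is a throwaway and the script only reads after logging in, it creates no beans for the bean counters to argue about.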

Rehearse. Deploy in a staging environment which is as close to identical to production as possible. Do this with production data volumes, and production data. You need to feel how long an ALTER TABLE takes. And you need to check that the ALTER TABLE works both structurally and against all the foreign keys in the actual data.

If you have massive data volumes, schema changes will take time. Maybe more time than you can afford to be down. One solution is to use phased data migrations, so that the schema change is populated with "recent" or "current" (let's say one or three months old) data during the downtime, and the data for the remaining five years can trickle in after you are online again. To the end user things look ok, but some features can't be accessed for another couple of hours/days/whatever.
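A rough sketch of that trickle-in backfill, using sqlite3 only so the example is self-contained; the orders_v1/orders_v2 tables and the migrated flag are invented for the illustration:

```python
"""Backfill historical rows in small batches so the big copy happens after go-live."""
import sqlite3
import time

BATCH_SIZE = 1000        # small enough that each batch holds locks only briefly
PAUSE_SECONDS = 0.5      # breathing room for live traffic between batches

def backfill(connection: sqlite3.Connection) -> None:
    while True:
        # Newest rows first, so the most useful history shows up earliest.
        ids = [row[0] for row in connection.execute(
            "SELECT id FROM orders_v1 WHERE migrated = 0 "
            "ORDER BY created_at DESC LIMIT ?", (BATCH_SIZE,))]
        if not ids:
            break
        placeholders = ",".join("?" * len(ids))
        with connection:                       # one short transaction per batch
            connection.execute(
                f"INSERT INTO orders_v2 (id, customer_id, total_cents, created_at) "
                f"SELECT id, customer_id, total_cents, created_at "
                f"FROM orders_v1 WHERE id IN ({placeholders})", ids)
            connection.execute(
                f"UPDATE orders_v1 SET migrated = 1 WHERE id IN ({placeholders})", ids)
        time.sleep(PAUSE_SECONDS)              # give live traffic room to breathe
```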

Answered 2008-09-15T22:34:22.350
4

At work, we spend a period of time with the code frozen in the test environment. Then after a few weeks of notice, we take the site down at midnight Friday night, work through the night deploying and validating, and bring it back up late Saturday morning. Traffic statistics showed us this was the best time frame to do it.

Answered 2008-09-15T22:00:32.057
3

If you have a set of load-balanced servers, you can take them offline one at a time and update each separately. No downtime for the users!

Answered 2008-09-15T21:58:09.643
3

At the last place where I worked, QA would perform testing in the QA Environment. Any major problems would be fixed, tested, and verified before rolling out.

After the build has been certified by QA, the production support team pushes the code to the staging environment, where the client looks at the site and verifies that everything is as desired.

The actual production rollout occurs during off-hours (after 9 p.m. if it is an emergency night push, or from 5 a.m. to 8 a.m. if it is a normally scheduled rollout).

The site is hosted on multiple servers, which are load balanced using an F5 Load Balancer:

  • A couple of the servers are removed from production,
  • code is installed, and
  • a cursory check is performed before putting them back in the pool.

This is repeated until all of the servers are upgraded to the latest code and allows the site to remain up the whole time.
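A sketch of that pull/install/check/return loop. The load balancer is driven here through hypothetical lb_disable.sh / lb_enable.sh wrappers (the real calls depend on your F5 setup), and the /healthz endpoint and host names are also assumptions:

```python
"""Rolling upgrade: drain one server at a time, deploy, verify, re-enable."""
import subprocess
import sys
import time
import urllib.request

SERVERS = ["web01.internal", "web02.internal", "web03.internal"]  # hypothetical hosts
RELEASE = "release-42.tar.gz"

def healthy(host: str) -> bool:
    """Cursory check: the app answers on its health endpoint (an assumed URL)."""
    try:
        with urllib.request.urlopen(f"http://{host}/healthz", timeout=5) as response:
            return response.status == 200
    except OSError:
        return False

def run(command: list[str]) -> None:
    subprocess.run(command, check=True)       # raise and stop the rollout on any failure

for host in SERVERS:
    run(["./lb_disable.sh", host])            # pull the box out of the pool
    run(["ssh", host, f"./install_release.sh {RELEASE}"])
    time.sleep(5)                             # let the app start up
    if not healthy(host):
        print(f"{host} failed its check; leaving it out of the pool", file=sys.stderr)
        sys.exit(1)
    run(["./lb_enable.sh", host])             # put it back before touching the next one
```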

This process is ideal, but there are cases when the database needs to be upgraded as well. If so, there are two options, depending on whether the new database will break the site.

If the new database is incompatible with the existing front end, you have no real choice but to have a window of time where the site is down.

If the new database is compatible with the existing front end, you can still push the code out without any actual downtime, but this requires two production database servers.

  • All traffic is routed to the second DB and the first DB server is pulled.
  • The first DB is upgraded and after verification is complete, put back in production.
  • All traffic is routed to the first DB and second DB is pulled.
  • The second DB is upgraded and after verification is complete, put back in production.
  • The next step is to perform the partial upgrades as described above.

So to summarize:

  • When you roll out changes to a live web site, how do you go about checking that the live system is working correctly? In the best case, this is done incrementally.

  • Which tools do you use? Manual checks to verify the code is installed correctly, along with some basic automated tests using any automation tool; we used Selenium IDE.

  • Who does it? The DBA performs the DB upgrades, tech support/system admins pull and push the servers and install the code, and QA or production support performs the manual tests and/or runs the automated tests.

  • Do you block access to the site for the testing period? If possible, this should be avoided at all costs, especially, as Gilles mentioned earlier, if it is a paid site.

  • What amount of downtime is acceptable? Downtime should be restricted to times when users are least likely to use the site, and should take less than 3 hours.
    Note: 3 hours is very generous. With practice and rehearsal, as jplindstrom mentioned, the team will have the whole process down and can sometimes get in and out in less than an hour.

Hope this helps!

Answered 2011-10-10T17:32:02.863
2

Have a cute, disarming image and/or backup page. Some sites implement simple JavaScript games to keep you busy while waiting for the update.

E.g., the fail whale.

-Adam

Answered 2008-09-15T22:02:44.017
1

Some of that depends on whether you're updating a database as well. In the past, if the DB was being updated we took the site down for a planned (and published) maintenance period - usually something really off-hours where the impact was minimal. If the update doesn't involve the DB then, in a load-balanced environment, we'd take one box out of the mix, deploy, and test. If that was successful, it went back into the mix and the other box (assuming two boxes) was taken out and updated/tested.

Note: we're NOT testing the code here, just that the deployment went smoothly, so downtime either way was minimal. As has been mentioned, the code should have already passed testing in another environment.

Answered 2008-09-15T21:59:10.357
1

IMHO long downtimes (hours) are acceptable for a free site. If you educate your users enough, they'll understand that it's a necessity. Maybe give them something to play with until the website comes back up (e.g. a Flash game, a live webcam feed showing the dev team at work, etc.). For a website that people pay to access, a lot of people are going to waste your time with complaints if you feed them regular downtime. I'd avoid downtime like the plague and roll out updates really slowly and carefully if I were running a service that charges users.

In my current setup I have a secondary website connected to the same database and cache as the live copy to test my changes.

I also have several "page watcher" scripts running on cron jobs that use regular expressions to check that the website is rendering key pages properly.
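Such a page watcher can be very small. In this sketch, the URLs and the regular expressions they are expected to match are placeholders:

```python
"""Cron-driven page watcher: fetch key pages and grep them for expected content."""
import re
import sys
import urllib.request

# Hypothetical pages paired with a pattern that should appear when they render correctly.
WATCHED_PAGES = {
    "https://www.example.com/":         re.compile(r"<title>Example Home</title>"),
    "https://www.example.com/products": re.compile(r'class="product-list"'),
    "https://www.example.com/login":    re.compile(r'<form[^>]+action="/login"'),
}

failures = []
for url, pattern in WATCHED_PAGES.items():
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError as error:
        failures.append(f"{url}: {error}")
        continue
    if not pattern.search(body):
        failures.append(f"{url}: expected pattern not found")

if failures:
    # Cron mails the job's output to its owner, so printing and exiting non-zero
    # is all the alerting this needs.
    print("\n".join(failures), file=sys.stderr)
    sys.exit(1)
```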

Answered 2008-09-15T22:03:46.027
1

The answer is that "it depends". First of all, on the kind of environment you are releasing into. Is it a "hello, world" type of website on a shared host somewhere, or google.com with half a million servers? Is there typically one user per day, or more like a couple million? Are you publishing HTML/CSS/JPG, or is there a big hairy backend with SQL servers, middle-tier servers, distributed caches, etc.?

In general -- if you can afford to have separate environments for development, QA, staging, and production -- do have them. If you have the resources, create the ecosystem so that you can build the complete installable package with 1 (one) click. And make sure that the same binary install can be successfully installed in DEV/QA/STAGE/PROD with another single click. There's tons of stuff written on this subject, and you need to be more specific with your question to get a reasonable answer.
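As a rough illustration of the "one click to build, one click to install anywhere" idea: the package is built once and the only thing that varies per environment is which config it is pointed at. The hosts, config files, and install.sh script here are assumptions for the sketch, not a prescription:

```python
"""Build once, then install the same artifact into any environment (illustrative)."""
import subprocess
import sys

ENVIRONMENTS = {
    # environment name -> hypothetical target host and config file
    "dev":   {"host": "dev.internal",   "config": "config/dev.ini"},
    "qa":    {"host": "qa.internal",    "config": "config/qa.ini"},
    "stage": {"host": "stage.internal", "config": "config/stage.ini"},
    "prod":  {"host": "www.internal",   "config": "config/prod.ini"},
}

def build() -> str:
    """One click to build: produces a single versioned artifact."""
    subprocess.run(["tar", "czf", "release-42.tar.gz", "app/"], check=True)
    return "release-42.tar.gz"

def install(artifact: str, environment: str) -> None:
    """One click to install: the artifact never changes, only the config does."""
    target = ENVIRONMENTS[environment]
    subprocess.run(["scp", artifact, target["config"],
                    f"{target['host']}:/opt/site/"], check=True)
    subprocess.run(["ssh", target["host"], "/opt/site/install.sh"], check=True)

if __name__ == "__main__":
    install(build(), sys.argv[1] if len(sys.argv) > 1 else "dev")
```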

Answered 2008-09-15T22:04:51.017
0

Run your main server on a port other than 80. Stick a lightweight server (e.g. nginx) in front of it on port 80. When you update your site, start another instance on a new port. Test. When you are satisfied that it has been deployed correctly, edit your proxy config file and reload it. In nginx's case, this results in zero downtime and no failed requests, and it can also provide performance improvements over the more typical Apache-only hosting option.
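A sketch of scripting that handover: check the new instance on its port, rewrite the proxy_pass port in the nginx config, then reload. The config path, port numbers, and proxy_pass pattern are assumptions about your particular setup, not nginx requirements:

```python
"""Switch an nginx front end from the old app instance to a new one (illustrative)."""
import re
import subprocess
import sys
import urllib.request

CONFIG_PATH = "/etc/nginx/conf.d/site.conf"   # assumed location of the proxy config
NEW_PORT = 8081                               # the freshly started instance

# Refuse to switch unless the new instance already answers correctly.
with urllib.request.urlopen(f"http://127.0.0.1:{NEW_PORT}/", timeout=5) as response:
    if response.status != 200:
        sys.exit(f"new instance on :{NEW_PORT} is not healthy, aborting")

with open(CONFIG_PATH) as handle:
    config = handle.read()

# Point proxy_pass at the new port, e.g. "proxy_pass http://127.0.0.1:8080;" -> ":8081;".
config = re.sub(r"(proxy_pass\s+http://127\.0\.0\.1:)\d+;", rf"\g<1>{NEW_PORT};", config)

with open(CONFIG_PATH, "w") as handle:
    handle.write(config)

subprocess.run(["nginx", "-t"], check=True)            # validate the edited config
subprocess.run(["nginx", "-s", "reload"], check=True)  # graceful reload: no dropped requests
```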

Of course, this is no substitute for a proper staging server, it is merely a 'polite' way of performing the handover with limited resources.

Answered 2008-09-15T22:00:12.400
0

To test everything as well as possible on a separate dev site before going live, I use Selenium (a web page tester) to run through all the navigable parts of the site, fill dummy values into forms, check that those values appear in the right places as a result, etc.

It's powerful enough to check a lot of JavaScript and dynamic stuff too.
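Roughly what such a run looks like with today's Selenium WebDriver Python bindings; the site URL, navigation paths, form field names, and dummy values are placeholders for whatever the site under test actually uses:

```python
"""Selenium smoke run: click through key pages, submit a form, check the result."""
from selenium import webdriver
from selenium.webdriver.common.by import By

DEV_SITE = "https://dev.example.com"          # assumption: your separate dev/staging copy

driver = webdriver.Firefox()                  # any WebDriver-supported browser works
try:
    # Walk the main navigation and make sure every page at least renders.
    for path in ("/", "/products", "/contact"):
        driver.get(DEV_SITE + path)
        assert "error" not in driver.title.lower(), f"{path} rendered an error page"

    # Fill a form with dummy values and check they come back out where expected.
    driver.get(DEV_SITE + "/contact")
    driver.find_element(By.NAME, "name").send_keys("Test User")
    driver.find_element(By.NAME, "message").send_keys("dummy message 12345")
    driver.find_element(By.CSS_SELECTOR, "form button[type=submit]").click()
    assert "dummy message 12345" in driver.page_source, "submitted value did not appear"
finally:
    driver.quit()
```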

Then a quick run-through with Selenium again after upgrading the live site verifies that the update worked and that there are no missing links or database errors.

It's saved me a few times by catching subtle errors that I would have missed just manually flicking through.

Also, if you put the live site behind some sort of "reverse proxy" or load balancer (if it's big), that makes it easy to switch back to the previous version if there are problems.

Answered 2008-09-15T22:17:17.430
0

The only way to make it transparent to your users is to put the site behind a load-balanced proxy. You take one server down while you update another. Then when you're done updating, you put the updated one online and take the other one down. That's how we do it.

If you have any sort of "beta" build, don't roll it out on the live server. If you have a live, busy site, chances are people are going to pound on it and break something.

This is a typical high-availability setup; to maintain high availability you'll need 3 servers minimum: 2 live ones and 1 testing server, plus any other extra servers if you want to have a dedicated DB or something.

Answered 2008-09-15T22:21:28.590
0

Create a host class and deploy your live site on that host class. By "host class" I mean a set of hosts with load balancing set up, where it's easy to add and remove hosts from the class.

When you are done with beta testing and ready for production, there's no need to take your site down: just remove some hosts from the production host class, add them to a new host class, deploy your latest code there, and test properly. Once you are sure everything is working fine, gradually move all your hosts over and point to the new host class as the production host class (or move them back to the one you were using initially). The whole idea behind this is to make sure you are testing your deployment on the production boxes where your site will actually be running after deployment, because deployment issues are scary and hard to debug.

Answered 2012-01-09T11:31:49.310