maintenance - The Neglected Stakeholder a.k.a the System Administrator

Question

Some time ago I came to realize that almost every customer project that I have been working on so far has neglected an important group of stakeholders: the system administrators.

These silent heroes are usually only involved at the end of a project and are left with an executable black box of bits that they have to install, support, and maintain for years to come. Whenever an issue occurs with this black box, they have to resolve it using whatever scraps of information and tool support the black box or the underlying platform makes available, and if that is not sufficient, they have to improvise.

If they had been involved as stakeholders in the project from the beginning, they would have had a chance to predict potential problems and inform the project team about them. But reality is different, and even though I as a developer would love to involve the system administrator as an extra stakeholder, external factors may prevent this from happening.

In these situations I would like to help our silent heroes as well as I can. So my question is:

What would a system administrator wish from us developers when we develop the systems they will have to maintain?

If you are a system administrator please tell a war story about a difficult problem you once had and what developers could have done to make it easier for you to solve it.

score 9 · Accepted Answer

Various things, including (but unlikely to be limited to) these, which are not in priority order:

No requirement to use privileged install
Option to use privileged install
Option for distributed install (so it can be installed on a server and used on other machines)
Clean uninstall
Sensible upgrade patterns
Option to choose install location
Minimal dependencies on other software
Minimal scattering of data around the system (don't dump stuff in /etc, /usr/lib, /var/adm, ...)
No ever-growing logs
Silent install
Scripted install
Online documentation (on machine - as well as on internet)
Man pages perhaps
Easy to configure
Easy to make accessible to end users
No security risks
No special users or groups (or limited number - at most one special user, one special group is a target, though not always attainable)
Either no 'phone home' functionality or only if explicitly configured (must not be default)
Good logging of diagnostics when there is a problem
Good tech support available if there is a problem
No requirement to get activation code during install
No requirement to reboot the machine after an install
Ability to parallel run old and new versions
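
As an illustration of the "no ever-growing logs" and "good logging of diagnostics" items above, here is a minimal sketch using the `RotatingFileHandler` from Python's standard library. The logger name, size limit, and backup count are placeholder assumptions, not recommendations for any particular product.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(path, max_bytes=1_000_000, backups=5, name="myapp"):
    """Return a logger whose log file is capped in size and rotated.

    All defaults here are illustrative assumptions; "myapp" is a placeholder.
    Call this once per logger name, since each call attaches a new handler.
    """
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Once the file would exceed max_bytes it is renamed to path.1, path.2, ...
    # and at most `backups` old files are kept, so disk usage stays bounded.
    handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
    )
    logger.addHandler(handler)
    return logger
```

With these defaults the log directory never holds more than six files of roughly 1 MB each, which an administrator can plan for. On Unix systems, handing rotation over to the platform's own log rotation tooling is an equally reasonable choice.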

A lot depends on what the software is and how it is used. The requirements for a GUI program that works on Windows, Linux and MacOS X are radically different from the requirements for a network daemon - but the goal should still be stable, reliable, easily managed software.

Bear in mind that there are big differences between software prepared by an in-house department for use within one company and software prepared for use by customers external to the company that develops the software.

score 5 · Accepted Answer

When a problem inevitably occurs, pay attention to what the sysadmin says and believe him. Don't just dismiss it out of hand if it doesn't fit with your initial assessment.

War story: About 6 years ago, I was sysadminning for a smallish manufacturing company and they decided to buy some software to handle scheduling of preventive maintenance on their equipment. One of its features was importing maintenance requests from email, but we had occasional problems with errors talking to the mail server during this process, and I was eventually called in to take a look at it during a phone call with the developer. The conversation involved multiple iterations of:

Developer: I've never heard of anyone having that kind of trouble talking to the mail server. It has to be a firewall issue.

Me: I'm logged into the firewall, running a packet sniffer, and watching your app's traffic pass through without any problems. It's getting through the firewall just fine.

Developer: No, no - it has to be a firewall issue.

(In the end, it turned out that the problem was that the app opened a POP3 connection, read all the mail, waited for the user to schedule the tasks, then sent a POP command to delete the mail after all requests had been scheduled. If the user took more than 15 minutes to do the scheduling, the POP connection timed out and the app wasn't able to recover, so it died instead. And then the user had to repeat the scheduling, meaning it would probably take long enough to time out again...)
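
One way the timeout described above could have been avoided is to keep the POP3 session short: download the mail, disconnect, let the user take as long as needed, then reconnect only to delete what was scheduled. The following is a minimal sketch of that idea using Python's `poplib`, not the vendor's actual fix. `connect` is a hypothetical zero-argument callable that returns a logged-in `poplib.POP3` (or `POP3_SSL`) object, and the sketch assumes the server supports the `UIDL` command.

```python
import poplib  # `connect` is expected to return a logged-in poplib.POP3 object


def fetch_all(connect):
    """Download every message, then close the session straight away.

    Returns a dict mapping each message's unique id to its raw bytes.
    """
    conn = connect()
    try:
        messages = {}
        _, listing, _ = conn.uidl()        # lines look like b"1 <unique-id>"
        for line in listing:
            number, uid = line.split()
            _, body_lines, _ = conn.retr(int(number))
            messages[uid] = b"\r\n".join(body_lines)
        return messages
    finally:
        conn.quit()                        # nothing was marked for deletion


def delete_processed(connect, processed_uids):
    """Open a fresh session and delete only the messages already scheduled."""
    conn = connect()
    try:
        deleted = 0
        _, listing, _ = conn.uidl()        # message numbers may have changed
        for line in listing:
            number, uid = line.split()
            if uid in processed_uids:
                conn.dele(int(number))     # marks the message for deletion
                deleted += 1
        return deleted
    finally:
        conn.quit()                        # QUIT is what commits the deletions
```

Because messages are matched by unique id rather than by message number, the second session still deletes the right mail even if new requests arrived while the user was scheduling.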

score 2 · Accepted Answer

System administrators generally want the following:

Transparency into the system's operation. So some sort of GUI that shows system settings and perhaps a history of system problems, as well as lists of what the system has processed correctly.
A clear context-sensitive escalation path for problems. By this I mean that each problem type has some notes about fixing, and a person or team who can be contacted if the problem can't be fixed quickly and escalation is required.
To be proactive, i.e. able to inform end-users about a system problem before an end-user informs them. So some sort of immediate alerting for any system problem where that's feasible.
Not to be flooded by alerts. So once an alert has arrived, no more alerts for the same problem; just another message when the system is operational again.
Detailed logging using something like the event log (in Windows) for deeper investigation of a problem.
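
The "alert once, then report recovery" behaviour described above can be sketched in a few lines of Python. The class name `AlertGate`, the `notify` callable, and the message wording are all hypothetical; a real system would plug in its own mail, paging, or event-log mechanism.

```python
class AlertGate:
    """Send one alert when a problem starts and one notice when it clears."""

    def __init__(self, notify):
        self._notify = notify    # hypothetical callable, e.g. send mail or page
        self._active = set()     # keys of problems that were already reported

    def report(self, key, healthy, detail=""):
        """Record the latest check result for `key`.

        Returns True if a message was sent, False if it was suppressed.
        """
        if not healthy and key not in self._active:
            self._active.add(key)
            self._notify(f"ALERT {key}: {detail}")
            return True
        if healthy and key in self._active:
            self._active.remove(key)
            self._notify(f"RECOVERED {key}")
            return True
        return False             # repeat of a known state: stay quiet


# Usage: repeated failures of the same check produce a single alert.
sent = []
gate = AlertGate(sent.append)
gate.report("mail-import", healthy=False, detail="POP3 timeout")
gate.report("mail-import", healthy=False, detail="POP3 timeout")  # suppressed
gate.report("mail-import", healthy=True)
```

After this sequence `sent` holds exactly two messages: the initial alert and the recovery notice.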

score 2 · Accepted Answer

I think a combination of the following:

1) Threshold of capacity -> What machines does it take to run this software and what metrics should be used to determine when this number may change, e.g. going from 2 to 3 database servers or going from 10 to 15 webservers. How beefy does the hardware need to be and does one part matter more than another, e.g. does CPU matter more than RAM, what about hard drive configuration and space?

2) Cookbook style troubleshooting -> If something goes wrong, how easily can it be categorized as a code, data, or network error?

3) Diagram of environments -> What do the dev, test and production instances of this software look like? Are these, and possibly other environments, running right now?

4) Maintenance -> Are there log files to parse into reports, weekly error logs to send around, or some kind of housekeeping to do with the software, e.g. rebooting the server weekly?

5) Security -> Are there accounts to be created and managed, and is there a security policy outlining who has what level of authority on the system?

Those would be the main ones that come to my mind.

score 1 · Accepted Answer

That the system just works so that he can go home to kids.

Answered 2008-11-21T00:06:21.273

score 1 · Accepted Answer

Every project should have capacity planning alongside its system architecture. System administrators should be involved in the capacity planning process as well as in the final review of the system architecture. This will help them better understand the system and be prepared for deployment and support.

score 1 · Accepted Answer

Well-documented dependencies that come packaged with the software, if my home admin experiences are anything to go by.

score 1 · Accepted Answer

Well, more a horror story than a war story: maintaining an application that for no apparent reason demands to be run under an administrator user account.

A few random things I think would be nice to have in an application:

Meaningful command line arguments
Some sort of scripting capabilities (if appropriate)
Any kind of progress indicator for long running operations
Error logging
Consistent UI
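
To make a couple of these wishes concrete, here is a minimal sketch of self-documenting command line arguments and a progress indicator using Python's `argparse`. The tool name `importer`, the option names, and the placeholder work items are all hypothetical.

```python
import argparse
import sys


def build_parser():
    parser = argparse.ArgumentParser(
        prog="importer",  # hypothetical tool name
        description="Import queued requests without needing the GUI.",
    )
    parser.add_argument("--config", required=True,
                        help="path to the configuration file")
    parser.add_argument("--dry-run", action="store_true",
                        help="report what would be done, change nothing")
    parser.add_argument("--quiet", action="store_true",
                        help="suppress progress output (useful for cron jobs)")
    return parser


def main(argv=None, items=("a", "b", "c")):
    """Process `items` (placeholder work) and return a process exit code."""
    args = build_parser().parse_args(argv)
    for done, item in enumerate(items, start=1):
        if not args.dry_run:
            pass  # the real work on `item` would go here
        if not args.quiet:
            # Progress goes to stderr so stdout stays clean for piping.
            print(f"[{done}/{len(items)}] {item}", file=sys.stderr)
    return 0  # a non-zero code would signal failure to calling scripts
```

Running the tool with `--help` then prints the option list for free, and a script entry point would typically end with `sys.exit(main())` so that the exit code reaches the calling shell or scheduler.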

score 1 · Accepted Answer

Easy package maintenance!

It should be brain-dead simple to install and upgrade the software, and that goes for dependencies as well. If there are a lot of dependencies and sub-dependencies, and you aren't inclined to master the nuances of each operating system's package management methodology, it would be nice to offer a package version with all the requisite dependencies bundled together into a giant tarball. Run the script, chuck it all into /usr/local/yourproject, and tell them where the startup/shutdown/restart script is.
