About Me

Experienced Information Technology leader, author, system administrator, and systems architect.

Thursday, January 16, 2014

Effective System Monitoring

In order to maintain a reliable IT environment, every enterprise needs to set up an effective monitoring regime.

A common mistake by new monitoring administrators is to alert on everything. This is an ineffective strategy for several reasons. For starters, it may result in higher telecom charges for passing large numbers of alerts. Passing tons of irrelevant alerts will impact team morale. And, no matter how dedicated your team is, you are guaranteed to reach a state where alerts will start being ignored because "they're all garbage anyway."

For example, it is common for non-technical managers to want to send alerts to the systems team when system CPU hits 100%. But, from a technical perspective, this is absurd:

  • You are paying for a certain system capacity. Some applications (especially ones with extensive calculations) will use the full capacity of the system. This is a GOOD thing, since it means the calculations will be done sooner.
  • What is it you are asking the alert recipient to do? Re-start the system? Kill the processes that are keeping the system busy? If there is nothing for the systems staff to do in the immediate term, it should be reported in a summary report, not alerted.
  • If there is an indication (beyond a busy CPU) that there is a runaway process of some sort, the alert needs to go to the team that would make that determination and take necessary action.

In order to be effective, a monitoring strategy needs to be thought out. You may end up monitoring a lot of things just to establish baselines or to view growth over time. Some things you monitor will need to be checked out right away. It is important to know which is which.

Historical information should be logged and retained for examination on an as-needed basis. It is wise to set up automated regular reports (distributed via email or web) to keep an eye on historical system trends, but there is no reason to send alerts on this sort of information.

Availability information should be characterized and handled in an appropriate way, probably through a tiered system of notifications. Depending on the urgency, an item may show up on a monitoring console, be rolled up in a daily summary report, or be paged out to the on-call person. Some common types of information in this category include:

  • "Unusual" log messages. Defining what counts as "unusual" takes time, because whatever reporting system is being used has to be tuned. Some common tools include logwatch, swatch, and logcheck. Even though it takes time, your team will need to customize this list on their own systems.
  • Hardware faults. Depending on the hardware and software involved, the vendor will have provided monitoring hooks to allow you to identify when hardware is failing.
  • Availability failures. This includes things like ping monitoring or other types of connection monitoring that give a warning when a needed resource is no longer available.
  • Danger signs. Typically, this will include anything that your team has identified that indicates that the system is entering a danger zone. This may mean certain types of performance characteristics, or it may mean certain types of system behavior.
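As a rough illustration of the tiered approach described above, here is a minimal Python sketch. The three urgency levels, their names, and the mapping of each level to a destination are my own assumptions for the example; your team would define its own tiers.

```python
from enum import IntEnum


class Urgency(IntEnum):
    # Hypothetical urgency levels; not taken from any particular tool.
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def destination(urgency: Urgency) -> str:
    """Pick a notification tier for an event based on its urgency."""
    if urgency >= Urgency.HIGH:
        return "page on-call"           # needs a person right away
    if urgency == Urgency.MEDIUM:
        return "monitoring console"     # visible, but nobody is woken up
    return "daily summary report"       # reviewed on a regular schedule


if __name__ == "__main__":
    for level in Urgency:
        print(level.name, "->", destination(level))
```

The point of writing the mapping down in one place is that the decision about which tier an event belongs to is made once, in advance, rather than by whoever happens to be holding the pager.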

Alerting Strategy

Alerts can come in different shapes, depending on the requirements of the environment. It is very common for alerts to be configured to be sent to a paging queue, which may include escalations beyond a single on-call person.

(If possible, configure escalations into your alerting system, so that you are not dependent on a single person's cell phone for the availability of your entire enterprise. A typical escalation procedure would be for an unacknowledged alert to be sent up a defined chain of escalation. For example, if the on-call person does not respond in 15 minutes, an alert may go to the entire group. If the alert is not acknowledged 15 minutes after that, the alert may go to the manager.)
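The escalation example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the chain, the names in it, and the function `notified_after` are hypothetical, and a real alerting system would have its own way of expressing this.

```python
from datetime import timedelta
from typing import List, Tuple

# Hypothetical chain matching the example in the text:
# (time the alert has gone unacknowledged, who is notified at that point)
ESCALATION_CHAIN: List[Tuple[timedelta, str]] = [
    (timedelta(minutes=0), "on-call person"),
    (timedelta(minutes=15), "entire group"),
    (timedelta(minutes=30), "manager"),
]


def notified_after(unacknowledged_for: timedelta) -> List[str]:
    """Return everyone who should have been notified so far for an alert
    that has gone unacknowledged for the given length of time."""
    return [who for delay, who in ESCALATION_CHAIN
            if unacknowledged_for >= delay]


if __name__ == "__main__":
    print(notified_after(timedelta(minutes=20)))
    # prints ['on-call person', 'entire group']
```

Once an alert is acknowledged, the escalation stops; the sketch leaves that bookkeeping to the caller.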

In some environments, alerts are handled by a round-the-clock team that is sometimes called the Network Operations Center (NOC). The NOC will coordinate response to the issue, including an evaluation of the alert and any necessary escalations.

Before an alert is configured, the monitoring group should first make sure that the alert meets three important criteria. The alert should be:

  1. Important. If the issue being reported does not have an immediate impact, it should be included in a summary report, not alerted. Prioritize monitoring, alerting, and response by the level of risk to the organization.
  2. Urgent. If the issue does not need to have action taken right away, report it as part of a summary report.
  3. Actionable. If no action can be taken by the person who receives the alert, the alert should be redefined so that it goes to the right person. (Or perhaps the issue should be reported in a summary report rather than sent through the alerting system.)
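One way to keep the team honest about these three criteria is to write them down as an explicit check. The Python sketch below is a hypothetical illustration; the `Issue` class, its field names, and the returned strings are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    # Hypothetical fields, one per criterion in the list above.
    important: bool
    urgent: bool
    actionable_by_recipient: bool


def disposition(issue: Issue) -> str:
    """Decide what to do with an issue before an alert is configured."""
    if not (issue.important and issue.urgent):
        return "summary report"
    if not issue.actionable_by_recipient:
        # Important and urgent, but aimed at the wrong person.
        return "send to the right person"
    return "alert"


if __name__ == "__main__":
    # A busy CPU with nothing for the systems staff to do right away.
    busy_cpu = Issue(important=False, urgent=False,
                     actionable_by_recipient=False)
    print(disposition(busy_cpu))  # prints "summary report"
```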

Tuesday, January 14, 2014

Courage in a Corporate Setting

Sometimes leaders need to leave their comfortable "safe zones" in order to be effective. The reality is that the bulk of our jobs can be done by almost anyone. Most decisions we make are of the "no brainer" variety, especially as we become more experienced and comfortable in our role as leaders. But there are a few decisions that we need to make where we really earn the money and privileges that come with a management role.

When we voluntarily step outside of our comfort zone to do what we know to be right, we demonstrate the courage that distinguishes between someone who is a leader and someone who is merely a boss.

Battlefield analogies are very common when we speak about courage. An article by Peter Voyer in Ivey Business Journal suggests some important leadership traits that translate from the battlefield to a corporate setting:

  • Don't ask subordinates to do something you would not do. Not only should you be willing to work alongside your team, you should be seen as someone who engages in the task at hand. (Of course, the way you engage the project will be somewhat different than the tasks you would assign a junior team member, but nobody on your team should feel like you are unwilling to dirty your hands to make the project succeed.)
  • Demonstrate moral fiber. You can lose years of built-up moral capital in a split second with a morally dubious decision.
  • React quickly, decisively, and fairly when presented with a moral question.
  • Maintain dignity and respect within and between groups.

I recently saw an article about a manager who was allegedly fired because he stood up for an Indian employee's right to earn the same salary as American employees with a similar job. This is the sort of courage we need if we want to be leaders and not merely bosses. Who do you want to see when you look in the mirror in the morning?

Great leaders earn the loyalty of the people who work with them. They earn loyalty by demonstrating loyalty. This doesn't mean that you cover for one of your subordinates who does something wrong; it does not help someone's development to infantilize them. But make sure that the consequences are fair and are implemented with the long-term development of your employee in mind. This may mean that you stand up for someone who has made a mistake and demand fair treatment for that person. Yes, this is uncomfortable, but it is part of how you become the manager you want to be.

Make sure you stay informed of your team's progress towards goals, and work with them to overcome obstacles. This does not mean that you do your team's work for them; it means that you provide a sounding board. Sometimes a problem is escalated to you if it is something that requires a manager's approval or advice; make sure that you do what you need to do promptly, then return the task to its rightful owner.

Maintain your integrity. Make the best decisions you can, and abide by the results of those decisions. Don't pass the blame. Instead, identify how to fix the situation and propose solutions.

Demonstrate courage by making the right decisions, even when they are hard. Anybody can be a great boss when the going is easy. Being a great leader comes from doing the right things even when they are not easy.

Monday, January 13, 2014

The Promise and Peril of Self-Driving Cars

The research by Google and others into self-driving cars has been intriguing. The vast majority of traffic accidents are the fault of drivers, and being able to eliminate human error would be a huge win for traffic safety.

But if computers are driving cars, we have to take a serious look at information security in the context of a self-driving automobile. Unfortunately, most current automation does not have adequate safeguards to protect from malicious inputs.

In particular, components do not check or validate that commands are being issued from an appropriate source. Security researchers have demonstrated that they are able to issue commands to a Toyota Prius to control steering, braking, acceleration, and dashboard displays. They were also able to disable a Ford Escape's brakes at slow speed.

Ford and Toyota both point out that the researchers were connecting directly to the car's CAN (Controller Area Network), which limits the impact of some of their demonstrations. But keep in mind that wireless interfaces on on-board systems, such as Bluetooth controllers on sound systems and telematics units for satellite roadside assistance services, may provide an entry point into the automobile. Any wireless connection that allows access to a component connected to the CAN is a possible entry point for malicious code.

The sorts of security measures we use for other network-connected items would still work inside a car. Provide air gaps between components that don't need to be connected. And provide for validation and authentication of commands from components that do need to be connected.
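To make "validation and authentication of commands" a little more concrete, here is a minimal Python sketch of one common approach: the sender attaches a keyed message authentication code (HMAC) and a counter to each command, and the receiver rejects anything with a bad tag or a stale counter. This is an illustration only, not how any manufacturer does it. The `sign` function, the `Receiver` class, the shared key, and the command bytes are all assumptions, and a classic CAN frame carries at most 8 bytes of data, so a real in-vehicle scheme would need a different message layout.

```python
import hashlib
import hmac


def sign(key: bytes, counter: int, command: bytes) -> bytes:
    """Compute an authentication tag over the counter and the command."""
    message = counter.to_bytes(8, "big") + command
    return hmac.new(key, message, hashlib.sha256).digest()


class Receiver:
    """Hypothetical component that only acts on authenticated commands."""

    def __init__(self, key: bytes) -> None:
        self.key = key
        self.last_counter = -1  # highest counter accepted so far

    def accept(self, counter: int, command: bytes, tag: bytes) -> bool:
        if counter <= self.last_counter:
            return False  # replayed or stale message
        expected = sign(self.key, counter, command)
        if not hmac.compare_digest(expected, tag):
            return False  # sender does not hold the shared key
        self.last_counter = counter
        return True


if __name__ == "__main__":
    key = b"k" * 32  # placeholder key for the example only
    receiver = Receiver(key)
    tag = sign(key, 1, b"BRAKE")
    print(receiver.accept(1, b"BRAKE", tag))  # prints True
    print(receiver.accept(1, b"BRAKE", tag))  # prints False (replay)
```

The counter matters as much as the tag: without it, an attacker who can listen on the bus could simply record a valid command and send it again later.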

I remember discussions about PC security in the early days of the Internet, when most computer viruses were still spread by injudicious insertion of floppy disks. Way back when, we were told that PCs didn't need to have security programmed in from the ground up. I'm hoping we learn from the history of those poor decisions. A Blue Screen of Death is one thing, but a traffic fatality is another.