ITD trying to make Larry, Curly and Moe less interdependent

The convenience of e-mail is a wonderful thing. People who had never used their computers as much more than large paperweights have become almost addicted to those alluring beeps that signal an incoming message from across the hall or around the world.

Your boss may be out of town, but finding out that last crucial bit of information to meet a deadline is still possible because she checks her e-mail each day, even from her seaside hotel room. Your e-mail has been sent; you come in the next morning, eager to check incoming messages so you can make your noontime presentation.

You sign on with your name and password; the almost instant response tells you that you have logged on to the server named "Curly." But just as you try to access your mail box, everything comes to a standstill. Half an hour later, you try again and get the message "gateway not responding." Another hour later, the same message appears. You call the number you have for your boss; there's no answer. The sweat begins to pop out on your brow. The clock is ticking; the meeting looms.

It is not much consolation to know that you are not alone in your dilemma. Close to 20,000 faculty, staff and students at Emory have user IDs for Dooley or Eagle accounts. "Most of our clients utilize the cluster to do e-mail," said Mike Stephens, an operating systems analyst in the Information Technology Division. The "cluster" refers to three servers, playfully named "Larry," "Curly" and "Moe," that reside in the North Decatur Building.

The number of users has mushroomed to 20,000 from approximately 4,000 three years ago, and that kind of growth brings complications. According to Stephens, both the level and continuity of usage are increasing. "During drop/add this fall, there were 425 users at one time on Larry and Curly -- 50 more than ever before."

Several months ago, said Stephens, there were two disk failures, which caused lengthy downtimes for the cluster. In the last several weeks, there have been problems with two additional servers that house NIS+, a program that keeps user lists and passwords and enables users to log on to any of the three machines in the cluster with the same password and the same home page. According to Stephens, "the downside of this setup, which provides a higher level of security, and an enormous amount of flexibility and power, is that if NIS+ is not working, you can't log in anywhere."

That issue is simply a symptom of the larger problems that cause the system to go down as often as it does. According to Stephens, there is a high level of interdependence in the system, which means that if one crucial part fails, the entire system ceases to operate. Part of making the system more dependable, said Stephens, is making it less interdependent and eliminating these "single points of failure." That will involve purchasing additional servers, which is currently in the works, and splitting those servers out from the main system. Adding more servers and reconfiguring the system will allow Stephens and his co-workers to ensure that two critical functions do not have to reside on the same machine.
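The idea of a "single point of failure" can be illustrated with a small sketch. This is only a toy model under stated assumptions: the dependency maps and the component names (`nis_plus`, `mail_disk`, and so on) are hypothetical and are not a description of how ITD has actually configured the cluster. It simply checks which components, if they failed alone, would leave no host usable.

```python
# Toy model of a "single point of failure". All component names and
# dependency maps below are illustrative assumptions, not the real layout.

def usable_hosts(depends_on, failed):
    """Return the hosts whose every dependency is still up."""
    return sorted(
        host for host, deps in depends_on.items() if not (deps & failed)
    )

def single_points_of_failure(depends_on):
    """Return components whose failure alone leaves no usable host."""
    components = set().union(*depends_on.values())
    return sorted(
        c for c in components if not usable_hosts(depends_on, {c})
    )

# Hypothetical "interdependent" layout: every host needs the same two things.
shared = {
    "larry": {"nis_plus", "mail_disk"},
    "curly": {"nis_plus", "mail_disk"},
    "moe": {"nis_plus", "mail_disk"},
}

# Hypothetical "split out" layout: each host has its own copy of each piece.
split = {
    "larry": {"nis_plus_a", "mail_disk_a"},
    "curly": {"nis_plus_b", "mail_disk_b"},
    "moe": {"nis_plus_c", "mail_disk_c"},
}

print(single_points_of_failure(shared))  # both shared components are listed
print(single_points_of_failure(split))   # empty list: no single failure stops every host
```

In the shared layout, losing either component stops every host at once. In the split layout, any one failure takes out only the host that depended on it.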

Adding staff will help as well, Stephens hopes. The cluster has two full-time system administrators, Stephens and Louis Leon, and one part-time person. Two junior positions are currently open. "We probably need about eight people to do the job right," said Stephens.

Also in the works, he said, are fallback plans that would let the system keep functioning, if not fully then at least as a stop-gap measure, until problems can be solved, rather than leaving the entire system unavailable.

There are other less technical reasons why e-mail sometimes fails to work. "If people let mail pile up," said Stephens, "the disk gets full, and nobody gets mail." Once a month, 150-300 users with more than one megabyte of mail receive a message that their mail has been moved to another location.
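A monthly check of this kind might look something like the sketch below. It is not the administrators' actual procedure: the function name, the sample user IDs and sizes, and the choice to count a megabyte as 1,048,576 bytes are all assumptions made for illustration.

```python
# Sketch of a mailbox size check. The limit, names and sample data are
# hypothetical; the article says only "more than one megabyte of mail".

ONE_MEGABYTE = 1024 * 1024  # assumption: a megabyte counted as 2**20 bytes

def over_limit(mailbox_sizes, limit=ONE_MEGABYTE):
    """Return user IDs whose mailbox is strictly larger than the limit."""
    return sorted(
        user for user, size in mailbox_sizes.items() if size > limit
    )

# Made-up mailbox sizes in bytes, keyed by made-up user IDs.
sample = {
    "user_a": 250_000,
    "user_b": 3_500_000,
    "user_c": ONE_MEGABYTE,  # exactly at the limit, so not flagged
    "user_d": 1_200_000,
}

print(over_limit(sample))
```

Only the users strictly above the limit are flagged, which matches the "more than one megabyte" wording.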

Even when e-mail is working, delivery time can range from several seconds to several hours just to go from one end of campus to the other. "A message from the administration building being sent to the hospital might have to go through five or six different machines before it reaches its destination," said Stephens. Emory Hospital, the law school, the business school and several other areas are on their own systems, separate from the Dooley/Eagle cluster.

Response time also varies depending on the number of active users and whether back-ups are running. Because the cluster operates 24 hours a day, seven days a week, said Stephens, there is never a time when it is free of users and back-ups can be run conveniently.

Beyond more hardware and staffing, Stephens said that what the e-mail administrators need from the Emory community is "time and patience and, most of all, trust. Trust is what we're fighting hardest for, and it's what we lose the most of in episodes like we've had this past month."

-- Nancy M. Spitler