Troubleshooting is a form of problem solving: the systematic search for the source of a problem so that it can be solved. It is often a process of elimination, in which potential causes of a problem are ruled out one by one. Troubleshooting is used in many fields, such as system administration and electronics.
In general, troubleshooting is the identification or diagnosis of "trouble" in a system. The problem is initially described as symptoms of malfunction, and troubleshooting is the process of determining the causes of these symptoms.
A system can be described in terms of its expected or intended behavior (usually, for artificial systems, its purpose). Events or inputs to the system are expected to generate specific results or outputs. (For example, selecting the "print" option from various computer applications is intended to result in hardcopy emerging from some specific device.) Any unexpected, particularly undesirable, behavior is a symptom, and troubleshooting is the process of isolating its specific cause or causes. Frequently the symptom is a failure to observe any results. (Nothing was printed, for example.)
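The idea of a symptom as a gap between expected and observed behavior can be sketched in a few lines of Python. This is only an illustration: the function name find_symptoms and the event and result labels are hypothetical and do not come from any real tool.

```python
from typing import Dict, List, Optional


def find_symptoms(expected: Dict[str, str],
                  observed: Dict[str, Optional[str]]) -> List[str]:
    """Compare expected results with observed ones and list the mismatches.

    A missing (or None) observation is reported separately, since
    "nothing happened" is the most common kind of symptom.
    """
    symptoms = []
    for event, want in expected.items():
        got = observed.get(event)
        if got is None:
            symptoms.append(
                f"{event}: expected {want!r}, but no result was observed")
        elif got != want:
            symptoms.append(f"{event}: expected {want!r}, observed {got!r}")
    return symptoms


# Hypothetical example: the user selected "print" and nothing came out.
expected = {"print": "hardcopy"}
observed: Dict[str, Optional[str]] = {}
for symptom in find_symptoms(expected, observed):
    print(symptom)
```

The output of such a comparison is only the starting point; it names the symptom, not its cause.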
Most discussion of troubleshooting, and especially training in formal troubleshooting procedures, is highly domain-specific. The bulk of the material is relevant to a particular field of endeavor (such as automotive repair, computer hardware services, or software systems support). However, troubleshooting has common elements regardless of the specifics.
Any system can be described in terms of its components or subsystems, and each subsystem can be described in terms of its expected behavior. The system's response to an input can therefore be described as a cascade of inputs and results among its components. (For example, selecting the "print" option in a computer application may cause the software to call on a separate utility, such as lpr on a UNIX system; that in turn might open, read, and parse a number of configuration files, which might direct it to perform some form of hostname resolution via DNS, NIS, or LDAP, and then initiate a TCP/IP connection to a specific network device, and so on.)
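As a rough sketch of how such a cascade can be walked, the Python below runs one check per stage, in order, and reports the first stage that fails. The stage names and the lambda checks are hypothetical stand-ins for real tests; a real chain would differ from system to system.

```python
from typing import Callable, List, Optional, Tuple

# A stage is a (name, check) pair; the check returns True when the stage works.
Stage = Tuple[str, Callable[[], bool]]


def first_failing_stage(stages: List[Stage]) -> Optional[str]:
    """Run each stage's check in order and return the first failing stage.

    A check that raises an exception is treated as a failure.
    Returns None when every stage passes.
    """
    for name, check in stages:
        try:
            ok = check()
        except Exception:
            ok = False
        if not ok:
            return name
    return None


# Hypothetical stand-ins for the stages of the print example.
stages: List[Stage] = [
    ("application invokes the print utility", lambda: True),
    ("print utility parses its configuration files", lambda: True),
    ("printer hostname resolves", lambda: False),  # simulated fault
    ("TCP/IP connection to the printer opens", lambda: True),
]

print(first_failing_stage(stages))
```

Checking stages in the order the data flows means that a failure in an early stage is not mistaken for a fault in a later one that depends on it.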
The domain-specific knowledge that drives the troubleshooting process is the understanding of these systems in terms of the interactions and dependencies among their subsystems and components. In particular, the specialist can enumerate the components and knows a set of procedures for testing many of them in isolation from the system as a whole. (For example, the systems administrator may know which configuration files lpr is trying to parse and may read them manually, check their permissions, or assume the identity of the user who is experiencing the problem and manually run an lpr command from the system's shell prompt; this may isolate the problem to the application's configuration, the user's preference settings, the workstation's configuration or network settings, the network's name services domain, or the printer's configuration or hardware.)
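Two of the isolated checks described above, reading a configuration file and resolving a hostname, might look like this in Python. This is a minimal sketch: the function names are hypothetical, and the path and hostname in the demo are placeholders, not the files lpr actually reads on any given system.

```python
import os
import socket


def check_readable(path: str) -> bool:
    """Return True if path is an existing file the current user can read."""
    return os.path.isfile(path) and os.access(path, os.R_OK)


def check_resolves(hostname: str) -> bool:
    """Return True if hostname resolves to at least one address."""
    try:
        return len(socket.getaddrinfo(hostname, None)) > 0
    except OSError:  # socket.gaierror is a subclass of OSError
        return False


if __name__ == "__main__":
    # Placeholder values; substitute the real configuration file and
    # printer hostname for the system being examined.
    print("config readable:", check_readable("/etc/printcap"))
    print("host resolves:", check_resolves("localhost"))
```

Running the checks as the affected user rather than as an administrator matters, because file permissions and per-user settings are among the causes being tested.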
Well-designed systems have designated "test points" or monitoring instrumentation. (For example, most printers have indicator lights which change color or blink, or LCD panels which display messages for detectable problems: paper jams, empty paper trays, network or other cable disconnection, etc. As another example, UNIX and Linux systems support system call tracing through commands such as truss, strace, and ktrace.)
Usually, troubleshooting is applied to something that has suddenly stopped working, since its previously working state forms the expectations about its continued behavior. The initial focus is therefore often on recent changes to the system or to the environment in which it exists. (For example, a printer that "was working when it was plugged in over there.") However, there is a well-known principle that correlation does not imply causation. (For
Answered by Nishu's at 11:21 AM on August 16, 2008