Sending process signals to the File Server Process can change its
behavior in the following ways:

   Process       Signal  OS       Result
   -------------------------------------------------------------------
   File Server   XCPU    Unix     Prints a list of client IP
                                  addresses.

   File Server   USR2    Windows  Prints a list of client IP
                                  addresses.

   File Server   POLL    HPUX     Prints a list of client IP
                                  addresses.

   Any server    TSTP    Any      Increases the debug level by a power
                                  of 5 -- 1, 5, 25, 125, etc.
                                  This has the same effect as the
                                  -d XXX command-line option.

   Any server    HUP     Any      Resets the debug level to 0.

   File Server   TERM    Any      Runs minor instrumentation over
                                  the list of descriptors.

   Other servers TERM    Any      Causes the process to quit.

   File Server   QUIT    Any      Causes the File Server to quit;
                                  the BOS Server knows to expect this.

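For example, to make a Unix file server print its list of client IP
addresses, send the fileserver process an XCPU signal. The C<pgrep>
invocation here is just one illustrative way of finding the PID:

   % kill -XCPU `pgrep fileserver`
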
The basic metric of whether an AFS file server is doing well is the
number of connections waiting for a thread, which can be found by
running the following command:

   % rxdebug <server> | grep waiting_for | wc -l

Each line returned by C<rxdebug> that contains the text "waiting_for"
represents a connection that's waiting for a file server thread.

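To check several file servers at once, a small shell loop works. The
host names below are hypothetical, 7000 is the standard fileserver
port, and C<grep -c> counts matching lines just as C<grep | wc -l>
does:

   % for fs in fs1.example.com fs2.example.com; do
         printf '%s: ' $fs
         rxdebug $fs 7000 | grep -c waiting_for
     done
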
If the blocked connection count is ever above 0, the server is having
trouble replying to clients in a timely fashion. If it gets above 10 or
so, users will notice the slowness. The total number of connections, by
contrast, is mostly irrelevant: it increases essentially monotonically
for as long as the server has been running and drops back to zero when
the server is restarted.

The most common cause of a rising blocked connection count on a server
is some process somewhere performing an abnormal number of accesses to
that server and its volumes. If multiple servers show a non-zero
blocked connection count, the most likely explanation is a volume
replicated between those servers that is absorbing an abnormally high
access rate.

To get an access count for all the volumes on a server, run:

   % vos listvol <server> -long

and save the output in a file. The results will look like B<vos
examine> output for each volume on the server. Look for lines like:

   40065 accesses in the past day (i.e., vnode references)

and watch for volumes with an abnormally high number of accesses.
Anything over 10,000 is fairly high, although volumes like root.cell
and others close to the root of the cell will have that many hits
routinely. Anything over 100,000 is generally abnormally high. The
count resets about once a day.

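A quick way to surface the busiest volumes is to pair each access line
with the volume header above it. This is a sketch that assumes the
output format shown above, where each volume's header line ends in
"On-line":

   % vos listvol <server> -long |
       awk '/On-line/ {vol = $1}
            /accesses in the past day/ {print $1, vol}' |
       sort -rn | head

This prints the ten highest access counts along with their volume
names.
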
Another approach that can narrow down the possibilities when multiple
servers are having trouble is to look at just the replicated volumes on
one of those servers. Run:

   % vos listvldb -server <server>

where <server> is one of the servers having problems, to refresh the
VLDB cache, and then run:

   % vos listvldb -server <server> -part <partition>

to get a list of all volumes on that server and partition, including
every other server holding replicas of those volumes.

Once the volume causing the problem has been identified, the best way
to deal with it is to move that volume to another server with a low
load, or to stop whatever runaway program is accessing that volume
unnecessarily. Often the volume name alone is enough to tell what's
going on.

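A read/write volume can be moved while it is in use. The general form
of the command is as follows; all of the arguments are placeholders:

   % vos move <volume> <fromserver> <frompartition> <toserver> <topartition>
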
If you still need additional information about who's hitting that
server, sometimes you can guess at it from the failed callbacks in the
F<FileLog> log in F</var/log/afs> on the server, or from the output of:

   % /usr/afsws/etc/rxdebug <server> -rxstats

but the best way is to turn on debugging output from the file server.
(Warning: this generates a lot of output in F<FileLog> on the AFS
server.) To do this, log on to the AFS server, find the PID of the
fileserver process, and run:

   % kill -TSTP <pid>

where <pid> is the PID of the file server process. This raises the
debugging level, so you'll start seeing what people are actually doing
on the server. You can do this up to three more times to get even more
output if needed. To reset the debugging level back to 0, use the
following command (it will NOT terminate the file server):

   % kill -HUP <pid>

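Putting these steps together, a typical debugging session might look
like this. C<pgrep> is illustrative (use C<ps> and C<grep> if it isn't
available), and the F<FileLog> path is the one given above:

   % pid=`pgrep fileserver`
   % kill -TSTP $pid                  # raise debug level to 1
   % kill -TSTP $pid                  # raise it again, to 5
   % tail -f /var/log/afs/FileLog     # watch what clients are doing
   % kill -HUP $pid                   # reset debug level to 0
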
The debugging level on the File Server should be reset back to normal
when debugging is no longer needed; otherwise, the AFS server may well
fill its disks with debugging output.

The lines of the debugging output that are most useful for debugging
load problems are:

   SAFS_FetchStatus,  Fid = 2003828163.77154.82248, Host 171.64.15.76
   SRXAFS_FetchData, Fid = 2003828163.77154.82248

(The example above is partly truncated to highlight the interesting
information.) The Fid identifies the volume and the vnode within that
volume; the volume is the first long number. So, for example, this was:

   % vos examine 2003828163
   pubsw.matlab61                2003828163 RW    1040060 K  On-line
       afssvr5.Stanford.EDU /vicepa
       RWrite  2003828163 ROnly  2003828164 Backup  2003828165
       MaxQuota    3000000 K
       Creation    Mon Aug  6 16:40:55 2001
       Last Update Tue Jul 30 19:00:25 2002
       86181 accesses in the past day (i.e., vnode references)

       RWrite: 2003828163    ROnly: 2003828164    Backup: 2003828165
       number of sites -> 3
          server afssvr5.Stanford.EDU partition /vicepa RW Site
          server afssvr11.Stanford.EDU partition /vicepd RO Site
          server afssvr5.Stanford.EDU partition /vicepa RO Site

and from the Host information one can tell what system is accessing
that volume.

Note that the output of L<vos_examine(1)> also includes the access
count, so once the problem volume has been identified, B<vos examine>
can be used to see whether the access count is still increasing. Also
remember that you can run B<vos examine> on the read-only volume (e.g.,
pubsw.matlab61.readonly) to see the access counts on the read-only
replica on each of the servers where it is located.
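
For instance, with the volume from the example above:

   % vos examine pubsw.matlab61.readonly

should show each read-only site together with its own access count.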