Sending process signals to the File Server Process can change its
behavior in the following ways:

   Process       Signal  OS       Result
   -------------------------------------------------------------------
   File Server   XCPU    Unix     Prints a list of client IP
                                  addresses.

   File Server   USR2    Windows  Prints a list of client IP
                                  addresses.

   File Server   POLL    HPUX     Prints a list of client IP
                                  addresses.

   Any server    TSTP    Any      Increases the debug level by a power
                                  of 5 -- 1, 5, 25, 125, etc.
                                  This has the same effect as the
                                  -d XXX command-line option.

   Any server    HUP     Any      Resets the debug level to 0.

   File Server   TERM    Any      Runs minor instrumentation over
                                  the list of descriptors.

   Other servers TERM    Any      Causes the process to quit.

   File Server   QUIT    Any      Causes the File Server to quit;
                                  the BOS Server knows to expect this.

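For example, to make a Unix file server print its list of client IP
addresses, send the fileserver process an XCPU signal. The C<pgrep>
invocation here is just one illustrative way of finding the PID:

   % kill -XCPU `pgrep fileserver`
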
The basic metric of whether an AFS file server is doing well is the
number of connections waiting for a thread, which can be found by
running the following command:

   % rxdebug <server> | grep waiting_for | wc -l

Each line returned by C<rxdebug> that contains the text "waiting_for"
represents a connection that's waiting for a file server thread.

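To check several file servers at once, a small shell loop works. The
host names below are hypothetical, 7000 is the standard fileserver
port, and C<grep -c> counts matching lines just as C<grep | wc -l>
does:

   % for fs in fs1.example.com fs2.example.com; do
         printf '%s: ' $fs
         rxdebug $fs 7000 | grep -c waiting_for
     done
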
If the blocked connection count is ever above 0, the server is having
trouble replying to clients in a timely fashion. If it gets above 10 or
so, users will notice the slowness. The total number of connections, by
contrast, is mostly irrelevant: it increases essentially monotonically
for as long as the server has been running and drops back to zero when
the server is restarted.

The most common cause of a rising blocked connection count on a server
is some process somewhere performing an abnormal number of accesses to
that server and its volumes. If multiple servers show a non-zero
blocked connection count, the most likely explanation is a volume
replicated between those servers that is absorbing an abnormally high
access rate.

To get an access count for all the volumes on a server, run:

   % vos listvol <server> -long

and save the output in a file. The results will look like B<vos
examine> output for each volume on the server. Look for lines like:

   40065 accesses in the past day (i.e., vnode references)

and watch for volumes with an abnormally high number of accesses.
Anything over 10,000 is fairly high, although volumes like root.cell
and others close to the root of the cell will have that many hits
routinely. Anything over 100,000 is generally abnormally high. The
count resets about once a day.

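A quick way to surface the busiest volumes is to pair each access line
with the volume header above it. This is a sketch that assumes the
output format shown above, where each volume's header line ends in
"On-line":

   % vos listvol <server> -long |
       awk '/On-line/ {vol = $1}
            /accesses in the past day/ {print $1, vol}' |
       sort -rn | head

This prints the ten highest access counts along with their volume
names.
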
Another approach that can narrow down the possibilities when multiple
servers are having trouble is to look at just the replicated volumes on
one of those servers. Run:

   % vos listvldb -server <server>

where <server> is one of the servers having problems, to refresh the
VLDB cache, and then run:

   % vos listvldb -server <server> -part <partition>

to get a list of all volumes on that server and partition, including
every other server holding replicas of those volumes.

Once the volume causing the problem has been identified, the best way
to deal with it is to move that volume to another server with a low
load, or to stop whatever runaway program is accessing that volume
unnecessarily. Often the volume name alone is enough to tell what's
going on.

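A read/write volume can be moved while it is in use. The general form
of the command is as follows; all of the arguments are placeholders:

   % vos move <volume> <fromserver> <frompartition> <toserver> <topartition>
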
If you still need additional information about who's hitting that
server, sometimes you can guess at it from the failed callbacks in the
F<FileLog> log in F</var/log/afs> on the server, or from the output of:

   % /usr/afsws/etc/rxdebug <server> -rxstats

but the best way is to turn on debugging output from the file server.
(Warning: this generates a lot of output in F<FileLog> on the AFS
server.) To do this, log on to the AFS server, find the PID of the
fileserver process, and run:

   % kill -TSTP <pid>

where <pid> is the PID of the file server process. This raises the
debugging level, so you'll start seeing what people are actually doing
on the server. You can do this up to three more times to get even more
output if needed. To reset the debugging level back to 0, use the
following command (it will NOT terminate the file server):

   % kill -HUP <pid>

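Putting these steps together, a typical debugging session might look
like this. C<pgrep> is illustrative (use C<ps> and C<grep> if it isn't
available), and the F<FileLog> path is the one given above:

   % pid=`pgrep fileserver`
   % kill -TSTP $pid                  # raise debug level to 1
   % kill -TSTP $pid                  # raise it again, to 5
   % tail -f /var/log/afs/FileLog     # watch what clients are doing
   % kill -HUP $pid                   # reset debug level to 0
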
The debugging level on the File Server should be reset back to normal
when debugging is no longer needed; otherwise, the AFS server may well
fill its disks with debugging output.

The lines of the debugging output that are most useful for debugging
load problems are:

   SAFS_FetchStatus,  Fid = 2003828163.77154.82248, Host 171.64.15.76
   SRXAFS_FetchData, Fid = 2003828163.77154.82248

(The example above is partly truncated to highlight the interesting
information.) The Fid identifies the volume and the vnode within that
volume; the volume is the first long number. So, for example, this was:

   % vos examine 2003828163
   pubsw.matlab61                2003828163 RW    1040060 K  On-line
       afssvr5.Stanford.EDU /vicepa
       RWrite  2003828163 ROnly  2003828164 Backup  2003828165
       MaxQuota    3000000 K
       Creation    Mon Aug  6 16:40:55 2001
       Last Update Tue Jul 30 19:00:25 2002
       86181 accesses in the past day (i.e., vnode references)

       RWrite: 2003828163    ROnly: 2003828164    Backup: 2003828165
       number of sites -> 3
          server afssvr5.Stanford.EDU partition /vicepa RW Site
          server afssvr11.Stanford.EDU partition /vicepd RO Site
          server afssvr5.Stanford.EDU partition /vicepa RO Site

and from the Host information one can tell what system is accessing
that volume.

Note that the output of L<vos_examine(1)> also includes the access
count, so once the problem volume has been identified, B<vos examine>
can be used to see whether the access count is still increasing. Also
remember that you can run B<vos examine> on the read-only volume (e.g.,
pubsw.matlab61.readonly) to see the access counts on the read-only
replica on each of the servers where it is located.
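
For instance, with the volume from the example above:

   % vos examine pubsw.matlab61.readonly

should show each read-only site together with its own access count.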