Commit | Line | Data |
---|---|---|
805e021f CE |
1 | Rx protocol specification draft |
2 | Nickolai Zeldovich, kolya@MIT.EDU | |
3 | ||
4 | Introduction | |
5 | ============ | |
6 | ||
7 | Rx is a client-server RPC protocol, an extended and combined version | |
8 | of the older R and RFTP protocols. This document describes Rx, but | |
9 | the details of Rx security protocols (such as Rxkad) are not specified. | |
10 | ||
11 | Rx communicates via UDP datagrams on a user-specified port. Rx also | |
12 | provides for multiplexing of Rx services on a single port, via a | |
13 | 16-bit service ID. Akin to a port number, it identifies a particular | |
14 | Rx service that's listening on a given port. Therefore, an Rx | |
15 | service is identified by a triple of <IP address; UDP port number; | |
16 | Rx service ID>. | |
17 | ||
18 | The protocol is connection-oriented -- a client and a server must | |
19 | first hand-shake and establish a connection before Rx calls can be | |
20 | made. Said hand-shaking is implicit upon the first request if no | |
21 | authentication is desired, or can consist of a pair of Challenge | |
22 | and Response requests in order to establish authentication between | |
23 | the client and the server. | |
24 | ||
25 | Protocol Overview | |
26 | ================= | |
27 | ||
28 | As mentioned above, Rx uses UDP/IP datagrams on a user-specified | |
29 | port to communicate. An optional user-selectable authentication | |
30 | and encryption method can be used to achieve desired security. | |
31 | Each Rx server may provide multiple services, specified by the | |
32 | Service ID. This allows for service multiplexing, much in the | |
33 | same way as UDP port numbers allow for multiplexing of UDP | |
34 | datagrams addressed to the same host. | |
35 | ||
36 | Each client and server pair that want to communicate using Rx must | |
37 | establish an Rx connection, which can be thought of as a context | |
38 | for all subsequent Rx activity between these two parties. An Rx | |
39 | connection can only be associated with a single Rx service. | |
40 | ||
41 | Each Rx connection context contains multiple channels, which are | |
42 | used for data transmission and actually performing an RPC call. | |
43 | The channels are independent of each other, allowing multiple | |
44 | RPC calls to be performed to the same Rx server simultaneously. | |
45 | ||
46 | An Rx call involves the transmission of call arguments over an Rx | |
47 | channel to the server and reception of the reply data. For each | |
48 | Rx call, an available Rx channel must be allocated exclusively to | |
49 | that call. The channel cannot be used for anything else until the | |
50 | call completes. After call completion, the channel may be reused | |
51 | for subsequent Rx calls. | |
52 | ||
53 | Rx Connections | |
54 | ============== | |
55 | ||
56 | This section makes many references to fields of an Rx header; see | |
57 | the ``Packet Formats'' section for specific layout of the Rx header. | |
58 | ||
59 | The connection epoch is a unique value chosen by Rx on startup and | |
60 | used by the peer both to identify connections to this host, and | |
61 | to detect when this host's Rx restarts. An Rx connection between | |
62 | two hosts is identified by: | |
63 | ||
64 | { Epoch, Connection ID, Peer IP, Peer Port }, | |
65 | if the high bit of the epoch (+) is not set | |
66 | { Epoch, Connection ID }, | |
67 | if the high bit of the epoch (+) is set | |
68 | ||
69 | This means that if the high epoch bit is set, the recipient of a | |
70 | packet should accept packets for this Rx connection from any IP | |
71 | address and port number. Conversely, if the high bit is not set, | |
72 | the IP and port number must be the same in order for packets to | |
73 | be properly recognized as being part of the same connection. | |
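
The identification rule above can be sketched as a lookup-key function. This is only an illustration: it assumes the epoch is a 32-bit value (so the high bit is 0x80000000), and the names connection_key and EPOCH_HIGH_BIT are hypothetical, not taken from any Rx implementation.

```python
from typing import Tuple

# Assumption: the epoch is a 32-bit field, so its high bit is bit 31.
EPOCH_HIGH_BIT = 0x80000000


def connection_key(epoch: int, conn_id: int, peer_ip: str, peer_port: int) -> Tuple:
    """Return the tuple that identifies an Rx connection."""
    if epoch & EPOCH_HIGH_BIT:
        # High bit set: accept packets for this connection from any
        # IP address and port, so the peer address is not part of the key.
        return (epoch, conn_id)
    # High bit clear: the peer IP and port must match as well.
    return (epoch, conn_id, peer_ip, peer_port)
```

With such a key, two packets carrying a high-bit epoch map to the same connection even when they arrive from different source addresses.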
74 | ||
75 | Connection ID is chosen by the client that establishes the connection. | |
76 | The last two bits of the same 32-bit field are used by Rx to multiplex | |
77 | between 4 parallel calls on the same connection. Each one of them is | |
78 | called an Rx channel, and therefore the field is denoted "Channel ID". | |
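
As a rough sketch of how the combined field might be split, the following assumes that "the last two bits" means the two least significant bits of the 32-bit field. The names CHANNEL_MASK and split_connection_field are hypothetical.

```python
from typing import Tuple

# Assumption: the channel occupies the two least significant bits.
CHANNEL_MASK = 0x3


def split_connection_field(field: int) -> Tuple[int, int]:
    """Split the 32-bit field into (connection ID, channel ID)."""
    channel_id = field & CHANNEL_MASK
    # Clear the channel bits and keep the result within 32 bits.
    connection_id = field & ~CHANNEL_MASK & 0xFFFFFFFF
    return connection_id, channel_id
```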
79 | ||
80 | Call number identifies a particular call within a channel (so there | |
81 | are four call numbers associated with an Rx connection). Each new | |
82 | call should start with a higher number than the previous call, and | |
83 | typically this is just the previous call number + 1. The initial | |
84 | call number must be non-zero, since call number zero indicates a | |
85 | connection-only Rx packet (see below). The call number is chosen | |
86 | by the peer initiating the call. Although only one call can use | |
87 | a channel at one time, the call number allows peers to distinguish | |
88 | packets on the same channel that belong to different calls. | |
89 | ||
90 | The sequence number is similar to the sequence number in TCP, but | |
91 | instead of bytes it counts packets within a call. Sequence numbers | |
92 | always start with 1 at the beginning of each call, and are incremented | |
93 | by 1 for each additional packet sent. Retransmissions in Rx are done | |
94 | on a packet-by-packet basis, identified by these sequence numbers. | |
95 | ||
96 | Every outgoing packet associated with a certain connection is stamped | |
97 | with a serial number in the serial field, and the serial number is | |
98 | incremented by 1 for every packet sent. This is used by the flow | |
99 | control mechanisms (described below). The serial number for a | |
100 | connection should start out with 1 (i.e., the first packet sent | |
101 | should have a serial number of 1.) | |
102 | ||
103 | Service ID identifies a particular Rx service running on a given | |
104 | host/port combination. This is analogous to how UDP port numbers | |
105 | allow multiplexing packets to a single IP address. Note that once | |
106 | an Rx connection has been created, the service ID may not be changed; | |
107 | existing implementations cache the service ID value for a given | |
108 | connection, and will ignore service ID values in subsequent packets. | |
109 | ||
110 | The Checksum field allows for an optional packet checksum. A zero | |
111 | checksum field value means that checksums are not being computed. | |
112 | An Rx security protocol (identified by the security field, described | |
113 | below) may choose to use this field to transport some checksum of | |
114 | the packet that is computed and verified by it (for example, rxkad | |
115 | uses this field for a cryptographic header checksum). Rx itself | |
116 | makes no use of the checksum field. | |
117 | ||
118 | The status field allows for additional user flags to be transported | |
119 | with each packet. These have no significance to the protocol itself. | |
120 | These flags are associated with a call rather than an individual | |
121 | packet. | |
122 | ||
123 | The security field specifies the type of security in use on this | |
124 | connection. These values don't have a defined mapping in the Rx | |
125 | protocol but rather are mapped to specific Rx security types by | |
126 | the application using Rx. | |
127 | ||
128 | An Rx security protocol can use the checksum field as described | |
129 | above, and can also modify the packet payload in any way, for | |
130 | instance by encrypting the contents or adding headers or trailers | |
131 | specific to the security protocol (although the end result must | |
132 | be a properly sized packet that Rx will be able to transmit.) | |
133 | ||
134 | The "Flags" field consists of a number of single-bit flags with | |
135 | meanings as follows. The actual bit values are defined below, | |
136 | in the ``Protocol Constants'' section. | |
137 | ||
138 | * CLIENT-INITIATED | |
139 | This packet originated from an Rx client (as opposed | |
140 | to server). To avoid packet loops, a server should | |
141 | always clear the CLIENT-INITIATED flag on any packets | |
142 | it sends, and discard incoming packets without the | |
143 | CLIENT-INITIATED flag. | |
144 | ||
145 | * REQUEST-ACK | |
146 | Sender is requesting acknowledgement of this packet, | |
147 | via an Ack packet response. | |
148 | ||
149 | * LAST-PACKET | |
150 | This packet is the last packet in this call from the | |
151 | sender. | |
152 | ||
153 | NOTE: some older Rx implementations, which do not | |
154 | support the trailing packet size fields in Rx Ack | |
155 | packets, use the LAST-PACKET flag for computing the | |
156 | MTU. In particular, when a DATA packet with the | |
157 | REQUEST-ACK flag but without the LAST-PACKET flag | |
158 | is received, the MTU is adjusted down to the size | |
159 | of that packet. | |
160 | ||
161 | * MORE-PACKETS | |
162 | More packets are going to be following this one. This | |
163 | flag is set on all but the last packet by the sender | |
164 | transmitting a list of packets at once, for possible | |
165 | optimization at the receiver end. | |
166 | ||
167 | * SLOW-START-OK | |
168 | In an ack packet, indicates that the sender of this | |
169 | packet supports the slow-start mechanism, described | |
170 | below under ``Flow Control''. | |
171 | ||
172 | * JUMBO-PACKET | |
173 | In a data packet, indicates that this packet is part | |
174 | of a jumbogram, and is not the last one. See the | |
175 | ``Jumbograms'' section below for more details. | |
176 | ||
177 | Packet Types | |
178 | ============ | |
179 | ||
180 | The "Type" field indicates the contents of this packet. Actual | |
181 | values are specified in the ``Protocol Constants'' section. | |
182 | This section describes the simpler packet types, and subsequent | |
183 | sections cover more complex packet types in more detail. | |
184 | ||
185 | Certain packet types are connection-only requests (that is, they | |
186 | are not associated with an RPC call). A connection-only request | |
187 | is indicated by a zero call number. Valid packet types in a | |
188 | connection-only context are Abort, Challenge, Response, Debug, | |
189 | Version, and the parameter exchange packet types. All other | |
190 | packets can only be used in the context of a call. Additionally, | |
191 | Abort can be used both in a connection and call context. | |
192 | ||
193 | The payload of the packet following the header depends on the | |
194 | Type field, as follows: | |
195 | ||
196 | * DATA type (Standard data packet) | |
197 | ||
198 | The payload of a data packet is simply the Rx payload, | |
199 | corresponding to the sequence number and call specified | |
200 | in the header. The actual data that is transmitted in | |
201 | Rx data packets is described below. | |
202 | ||
203 | The receipt of a data packet by a client implicitly | |
204 | acknowledges that the server has received and processed | |
205 | all the packets that have been transmitted to it as | |
206 | part of this call. | |
207 | ||
208 | * ACK type (Acknowledgement of received data) | |
209 | ||
210 | An acknowledgement packet provides information about | |
211 | which packets were or were not received by the peer, | |
212 | and other useful parameters. The semantics of these | |
213 | packets are described below in the ``Call Layer'' | |
214 | section. | |
215 | ||
216 | * BUSY type (Busy response) | |
217 | ||
218 | When a client tries to start a new call on a channel | |
219 | which the server still considers active, a busy response | |
220 | is returned. The call and channel number in the packet | |
221 | header indicate which call is being rejected. This packet | |
222 | type has no payload associated with it. | |
223 | ||
224 | * ABORT type (Abort packet) | |
225 | ||
226 | Indicates that the relevant connection or call (if the | |
227 | call number field is non-zero) has encountered an error | |
228 | and has been terminated. The payload of the packet has | |
229 | a network-byte-order 32-bit user error code. | |
230 | ||
231 | * ACKALL type (Acknowledgement of all packets) | |
232 | ||
233 | An acknowledge-all packet indicates the obvious: the peer | |
234 | wants to acknowledge the receipt of all packets sent to | |
235 | it. This could be used, for example, when a connection | |
236 | is being closed and the client wants to ensure that no | |
237 | retransmissions are attempted after it exits. | |
238 | ||
239 | There is no payload associated with an acknowledge-all | |
240 | packet. | |
241 | ||
242 | * CHALLENGE, RESPONSE types (Challenge request/response) | |
243 | ||
244 | The payload of the packet is security-layer-specific | |
245 | data, and is used to authenticate an Rx connection. | |
246 | ||
247 | Perhaps this should include a reference to some spec | |
248 | on rxkad (or rxkad should just be added to this spec.) | |
249 | ||
250 | * DEBUG type (Debug packet) | |
251 | ||
252 | Rx supports an optional debugging interface; see the | |
253 | ``Debugging'' section below for more details. | |
254 | ||
255 | * PARAMS types (Parameter exchange) | |
256 | ||
257 | These types were assigned in AFS 3.2 but never used for | |
258 | anything, and therefore have no protocol significance | |
259 | at this time. | |
260 | ||
261 | * VERSION type (Get AFS version) | |
262 | ||
263 | If a server receives a packet with a type value of 13, and | |
264 | the client-initiated flag set, it should respond with a | |
265 | 65-byte payload containing a string that identifies the | |
266 | version of AFS software it is running. The response should | |
267 | not have the client-initiated flag set. | |
268 | ||
269 | Nothing should respond to a version packet without the | |
270 | client-initiated flag, to avoid infinite packet loops. | |
271 | ||
272 | Call Layer | |
273 | ========== | |
274 | ||
275 | The call layer provides a reliable data transport over an | |
276 | Rx channel, and is used by the RPC layer to make Rx calls. | |
277 | One of the most important pieces of the call layer is the | |
278 | Rx acknowledgement packet. The acknowledgement packet is | |
279 | used by Rx to determine when retransmissions are needed, | |
280 | as well as to determine the proper transmission / receiving | |
281 | parameters to use (such as the transmit window size and | |
282 | jumbogram length, described in more detail below). | |
283 | ||
284 | A new call is established by the client simply sending a | |
285 | data packet to the server on an available channel. Either | |
286 | side can indicate that they have no more data to send by | |
287 | setting the LAST-PACKET flag in their last Rx packet. The | |
288 | call remains open until the upper layer informs Rx that it | |
289 | is done with the call. (The upper layer in this case would | |
290 | most likely be the Rx RPC layer.) | |
291 | ||
292 | The structure of an Rx acknowledgement packet is described | |
293 | in the Packet Formats section. We will refer to particular | |
294 | fields of the acknowledgement packet here by names. | |
295 | ||
296 | The <Buffer Space> field specifies the number of packets that | |
297 | the sender of the acknowledgement is willing to provide for | |
298 | receiving packets for this call. The sender, presumably, | |
299 | should not send packets beyond the number specified here, | |
300 | without receiving further acknowledgement allowing it. | |
301 | ||
302 | The <Max Skew> field indicates the maximum packet skew that | |
303 | the sender of this packet has seen for this call. If a | |
304 | packet is received N packets later than expected (based | |
305 | on the packet's serial number, i.e. if the last received | |
306 | packet's serial number is N higher than this packet's), | |
307 | then it is defined to have a skew of N. This can be used | |
308 | to avoid retransmission because of packet reordering. | |
309 | ||
310 | The <First Sequence> number specifies the sequence number of | |
311 | the first packet that is being explicitly acknowledged (either | |
312 | positively or negatively) by this packet. All packets with | |
313 | sequence numbers smaller than this are implicitly acknowledged. | |
314 | ||
315 | The <Reserved> field, previously used to indicate the previous | |
316 | received packet, is no longer used. It should be set to zero | |
317 | by the sender and not interpreted by the receiver. | |
318 | ||
319 | The <Serial Number> field indicates the serial number of the | |
320 | packet which has triggered this acknowledgement, or zero if there | |
321 | is no such packet (i.e. the ack packet was delayed and should not | |
322 | be used for round-trip time computation). The receiver should | |
323 | note that any transmitted packets with a serial number less than | |
324 | this, which are not acknowledged by this packet, are likely lost | |
325 | or reordered. Thus, these packets should be retransmitted, after | |
326 | a possible delay to allow for packet reordering (as measured by | |
327 | packet skew). | |
328 | ||
329 | The trailing fields after the variable-length acknowledgements | |
330 | section are not always 32-bit aligned with respect to the packet, | |
331 | and aren't always present. (Their presence depends on the Rx | |
332 | version of the peer.) The maximum and recommended packet sizes | |
333 | are, respectively, the largest possible packet size that the peer | |
334 | is willing to accept from us, and the size of the packet they | |
335 | would prefer to receive. In absence of these fields, it should | |
336 | be assumed that the maximum allowed packet size is 1444 bytes. | |
337 | ||
338 | The receive window size indicates the size of the ACK sender's | |
339 | receive window, in packets. Its use is described below in | |
340 | the "Flow Control" section. If this field is absent, the | |
341 | implementation must assume a maximum window size of 15 packets; | |
342 | older implementations that do not support this trailing field | |
343 | only allow for a window of 15 packets. | |
344 | ||
345 | The "Max Packets per Jumbogram" field indicates how many packets | |
346 | the ACK sender is willing to receive in a jumbogram (also | |
347 | described below). All packets in a jumbogram are always of the | |
348 | same size (except the last one), regardless of the maximum and | |
349 | recommended packet sizes described above. | |
350 | ||
351 | The <Reason> field specifies a particular type of an ack packet. | |
352 | Valid reason codes are specified in the ``Packet Formats and | |
353 | Protocol Constants'' section; their meanings are as follows: | |
354 | ||
355 | REQUESTED | |
356 | Acknowledgement was requested. The peer received | |
357 | a packet from us with the acknowledgement-requested | |
358 | flag set, and is acknowledging it. | |
359 | ||
360 | DUPLICATE | |
361 | A duplicate packet was received. The duplicate | |
362 | packet's serial number is in the <Serial> field. | |
363 | ||
364 | OUT-OF-SEQUENCE | |
365 | A packet was received out of sequence. The serial | |
366 | number of said packet is in the <Serial> field. | |
367 | ||
368 | WINDOW-EXCEEDED | |
369 | A packet was received but exceeded the current | |
370 | receive window, and was dropped. | |
371 | ||
372 | NO-SPACE | |
373 | A packet was received, but no buffer space was | |
374 | available and therefore it was dropped. | |
375 | ||
376 | PING | |
377 | This is a keep-alive packet, used to verify that | |
378 | the peer is still alive. If the REQUEST-ACK flag | |
379 | in the Rx packet is set, the recipient of this | |
380 | packet should reply with a PING-RESPONSE packet. | |
381 | ||
382 | PING-RESPONSE | |
383 | This is a response to a keep-alive ack (ping). | |
384 | ||
385 | DELAYED | |
386 | A delayed acknowledgement, usually because a certain | |
387 | amount of time has passed since the receipt of the | |
388 | last packet and there are outstanding unacknowledged | |
389 | packets. Should not be used for RTT computation. | |
390 | ||
391 | OTHER | |
392 | Un-delayed general acknowledgement, which does not | |
393 | fall in any of the above categories. | |
394 | ||
395 | A peer should never delay the transmission of an ack packet | |
396 | in response to a received packet unless it sets the delayed | |
397 | ack type field. This is because ack packets (except for | |
398 | delayed ones) are used for RTT computation by Rx. | |
399 | ||
400 | All acknowledgement packets should have the REQUEST-ACK | |
401 | flag in the Rx header turned off, except for PING type | |
402 | ack packets. | |
403 | ||
404 | The <Ack Count> field specifies the number of bytes following | |
405 | in the acknowledgements section. Each of those bytes indicates | |
406 | the acknowledgement status corresponding to a sequence number | |
407 | between firstSequence and firstSequence+ackCount-1 inclusive. | |
408 | There can be up to 255 bytes in the acknowledgements section. | |
409 | Typically the ack count is the receive window size of the | |
410 | ack packet sender, and the individual packet status bytes | |
411 | correspond to the packets in the current receive window. | |
412 | The values in each of those bytes can be as follows: | |
413 | ||
414 | 0 Explicit negative acknowledgement: packet with the | |
415 | corresponding sequence number has not been received | |
416 | or has been dropped. | |
417 | 1 Explicit acknowledgement: packet with the corresponding | |
418 | sequence number has been received but not processed by | |
419 | the application yet. | |
420 | ||
421 | It's important to note the distinction between packets with | |
422 | sequence numbers before firstSequence, between firstSequence | |
423 | and firstSequence+ackCount-1, and those with sequence numbers | |
424 | of at least firstSequence+ackCount. Those in the first category | |
425 | have been passed up to the application level and the sender | |
426 | (recipient of this ack) can recycle packets with such sequence | |
427 | numbers. | |
428 | ||
429 | Packets in the second category are individually acknowledged | |
430 | in the acknowledgements section, either as being queued for | |
431 | the application or not received. The recipient of the ack | |
432 | should keep all packets with sequence numbers in this range, | |
433 | but avoid retransmitting the positively acknowledged ones. | |
434 | Negatively acknowledged packets should be retransmitted. | |
435 | A more detailed explanation of the retransmit strategy is | |
436 | given below. | |
437 | ||
438 | Packets in the third category are not acknowledged at all, | |
439 | and the recipient of the ack should assume no knowledge | |
440 | of their state. Since the Rx receive window should not | |
441 | exceed the size of an ack packet, the sender shouldn't | |
442 | have transmitted any packets in this category anyway. | |
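
The three categories above can be illustrated with a small classifier. This is a sketch only: the function name classify_sequence and the returned labels are hypothetical, and acks stands for the raw bytes of the acknowledgements section (so ackCount is len(acks)).

```python
def classify_sequence(seq: int, first_sequence: int, acks: bytes) -> str:
    """Classify a sequence number against a received ack packet."""
    if seq < first_sequence:
        # First category: already passed up to the application;
        # the sender may recycle the packet.
        return "delivered"
    index = seq - first_sequence
    if index < len(acks):
        # Second category: individually acknowledged.
        # 1 = received but not yet processed, 0 = not received or dropped.
        return "received" if acks[index] == 1 else "missing"
    # Third category: the ack says nothing about this packet.
    return "unknown"
```

For example, with firstSequence 5 and status bytes 1, 0, 1, packet 6 is reported missing and is a candidate for retransmission, while packet 8 falls outside the acknowledged range.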
443 | ||
444 | * Round-trip time computation | |
445 | ||
446 | To determine when packet retransmission is necessary, Rx | |
447 | computes some statistics about the round-trip time between | |
448 | the two hosts: exponentially-decaying averages of the | |
449 | round-trip time and the standard deviation thereof. Each | |
450 | acknowledgement packet which mentions a specific packet in | |
451 | the <Serial> field and is not delayed is used to update the | |
452 | round-trip statistics. First, the round-trip time for this | |
453 | packet (R) is computed as the difference between the arrival | |
454 | time of the ack packet and the time we transmitted the | |
455 | packet with the serial number specified in <Serial>. | |
456 | ||
457 | Next, the round-trip time average and standard deviation | |
458 | values are updated. For instance, this algorithm could | |
459 | be used: | |
460 | ||
461 | RTTdev = RTTdev * (3/4) + |RTTavg - R| / 4 | |
462 | RTTavg = RTTavg * (7/8) + R / 8 | |
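
A direct transcription of the two update rules might look like the sketch below. The function name update_rtt is hypothetical, and the units (seconds, as floats) are an assumption. The deviation is updated first, so that it uses the average from before this sample, matching the order in which the formulas are written.

```python
from typing import Tuple


def update_rtt(rtt_avg: float, rtt_dev: float, r: float) -> Tuple[float, float]:
    """Fold one round-trip sample r into the running statistics."""
    # RTTdev = RTTdev * (3/4) + |RTTavg - R| / 4, using the old average.
    new_dev = rtt_dev * 0.75 + abs(rtt_avg - r) / 4
    # RTTavg = RTTavg * (7/8) + R / 8
    new_avg = rtt_avg * 0.875 + r / 8
    return new_avg, new_dev
```

Starting from an average of 100 ms and a deviation of 20 ms, a 140 ms sample moves the average to 105 ms and the deviation to 25 ms.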
463 | ||
464 | * Packet retransmission | |
465 | ||
466 | In order to support reliable data transport, Rx must retransmit | |
467 | packets that are lost in the network. This must not be done | |
468 | too early, otherwise we might retransmit a packet whose first | |
469 | copy is still in transit, thereby wasting bandwidth. | |
470 | ||
471 | Rx computes a retransmit timeout value T, and retransmits any | |
472 | packet which hasn't been positively acknowledged since last | |
473 | transmission for at least T seconds. This timeout could be | |
474 | computed as follows from the round-trip statistics above: | |
475 | ||
476 | T = RTTavg + 4 * RTTdev + 0.350 | |
477 | ||
478 | This allows the packet to be up to 4 deviations late and still | |
479 | not be retransmitted. The 350 msec fudge factor is used to | |
480 | compensate for bursty networks, though it is likely becoming | |
481 | less relevant (and accurate) with time. | |
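
The timeout formula and the retransmission test can be sketched as follows. The names retransmit_timeout, needs_retransmit and FUDGE_SECONDS are hypothetical, and all times are assumed to be in seconds.

```python
FUDGE_SECONDS = 0.350  # the 350 msec fudge factor from the text


def retransmit_timeout(rtt_avg: float, rtt_dev: float) -> float:
    """T = RTTavg + 4 * RTTdev + 0.350"""
    return rtt_avg + 4 * rtt_dev + FUDGE_SECONDS


def needs_retransmit(now: float, last_sent: float, acked: bool, timeout: float) -> bool:
    """True if a packet has gone unacknowledged for at least T seconds."""
    return (not acked) and (now - last_sent >= timeout)
```

With an average of 105 ms and a deviation of 25 ms, T comes to 0.555 seconds, of which the fudge factor is by far the largest part.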
482 | ||
483 | A more clever algorithm could take into account the maximum | |
484 | packet skew rate, and improve the retransmission strategy to | |
485 | take into account the likelihood that a given packet has | |
486 | been reordered, and give it extra time before retransmission. | |
487 | ||
488 | * Keepalive and Timeout | |
489 | ||
490 | The upper layer (either the Rx RPC layer or the application) | |
491 | has to specify a timeout, T, to the call layer. If the peer | |
492 | is not heard from within T seconds, the call layer declares | |
493 | the call to be dead and propagates the error to the upper | |
494 | layer. | |
495 | ||
496 | In order to determine whether the peer is still alive or not, | |
497 | keepalive requests are used. These take the form of ack PING | |
498 | and PING-RESPONSE packets. When the client has not received | |
499 | any response from the server, either to the original request | |
500 | or the keepalive requests, in T seconds, the call times out. | |
501 | ||
502 | The following strategy may be used to determine when to send | |
503 | keepalive requests: | |
504 | ||
505 | Compute a keepalive timeout, KT = T/6 | |
506 | ||
507 | If the call was initiated KT seconds ago, or KT | |
508 | seconds have passed since the last keepalive | |
509 | request transmission, send a keepalive packet. | |
510 | ||
511 | This strategy limits the number of transmitted keepalive | |
512 | packets to a fixed number in the case of a dead server, | |
513 | and proportional to the real timeout in case of a slow | |
514 | server. It also allows up to 5 keepalives to be dropped | |
515 | before the server is erroneously declared dead. | |
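
One possible reading of this strategy is sketched below. It assumes that, before any keepalive has been sent, the KT interval is measured from the start of the call; the function names and the use of None for "no keepalive sent yet" are hypothetical.

```python
from typing import Optional


def keepalive_interval(call_timeout: float) -> float:
    """KT = T / 6"""
    return call_timeout / 6


def keepalive_due(now: float, call_start: float,
                  last_keepalive: Optional[float], call_timeout: float) -> bool:
    """True if a keepalive (ack PING) should be sent now."""
    kt = keepalive_interval(call_timeout)
    # Assumption: measure from the call start until the first keepalive,
    # and from the most recent keepalive afterwards.
    reference = call_start if last_keepalive is None else last_keepalive
    return now - reference >= kt


def call_timed_out(now: float, last_heard: float, call_timeout: float) -> bool:
    """True if the peer has not been heard from within T seconds."""
    return now - last_heard >= call_timeout
```

With T = 60 seconds this sends a keepalive every 10 seconds, so up to five of them can be lost before the sixth interval ends and the call is declared dead.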
516 | ||
517 | * Flow Control | |
518 | ||
519 | Every Rx client or server has associated with each Rx call a | |
520 | receive and transmit window. These windows indicate the number | |
521 | of packets that haven't been fully acknowledged (that | |
522 | is, not read by the peer's application) that an Rx sender can | |
523 | have outstanding at any time. A sender's transmit window may | |
524 | never be greater than its peer's receive window for that call. | |
525 | The receive windows are exchanged via the "Receive Window Size" | |
526 | parameter in an Ack packet. | |
527 | ||
528 | Rx ``sliding windows'' are similar to those used by TCP, except | |
529 | they measure packets rather than bytes. Also, in TCP the window | |
530 | effectively applies to bytes in flight between the two peers, | |
531 | whereas in Rx the window applies to packets between the user | |
532 | applications. For example, a transmit window of 8 on a certain | |
533 | Rx connection means that at most 8 packets can be transmitted | |
534 | and not yet read by the peer's application at any time. The | |
535 | sequence number of the first packet that hasn't been read by | |
536 | the application is indicated by the First Sequence field of | |
537 | an Ack packet. | |
538 | ||
539 | The selection of initial window sizes isn't strictly defined | |
540 | by the Rx protocol, but here are a few things that one might | |
541 | want to consider when choosing initial windows: | |
542 | ||
543 | * A useful strategy can be to advertise a small receive | |
544 | window until the application starts reading data, and | |
545 | advertise a larger window afterwards. | |
546 | ||
547 | * The transmit window should be initially a conservative | |
548 | small value. Once an Ack packet is received, the peer's | |
549 | advertised receive window can be used to choose a better | |
550 | transmit window. | |
551 | ||
552 | Rx uses the slow start, congestion avoidance, and fast recovery | |
553 | algorithms[6]. The algorithms are modified to work in the context | |
554 | of Rx packet-based transmission windows, and are described below. | |
555 | ||
556 | These algorithms require two additional variables to be maintained | |
557 | for each active Rx call: a congestion window, cwind, and a slow | |
558 | start threshold, ssthresh. | |
559 | ||
560 | Define a "negative ack" as an Ack packet that contains a negative | |
561 | acknowledgement followed by a positive one. Similarly, define a | |
562 | "positive ack" to be any Ack that is not negative. Upon receiving | |
563 | three negative acks for a call in a row since the last congestion | |
564 | avoidance attempt (if any), the Rx protocol enters congestion | |
565 | avoidance for that Rx call. | |
566 | ||
567 | * Slow start, congestion avoidance, and fast recovery algorithms | |
568 | ||
569 | First, the congestion window, cwind, is initialized to 1. | |
570 | The number of unread transmitted packets is now limited not | |
571 | only by the transmission window, but also by the congestion | |
572 | window. The latter limit is a little different: Rx may | |
573 | send up to cwind packets (by sequence number) past the last | |
574 | contiguous positively acknowledged packet. For example, | |
575 | if an Ack packet indicates that packets 1, 2 and 8 were | |
576 | received, and cwind is 2, Rx may transmit packets 3 and 4. | |
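
The congestion-window limit in the example above can be reproduced with a short sketch. It considers only the cwind limit (not the transmit window), assumes sequence numbers start at 1 as described earlier, and uses a hypothetical function name, sendable_packets.

```python
from typing import List, Set


def sendable_packets(received: Set[int], cwind: int) -> List[int]:
    """Sequence numbers that the congestion window allows to be sent."""
    # Find the last contiguous positively acknowledged packet.
    last_contiguous = 0
    while last_contiguous + 1 in received:
        last_contiguous += 1
    # Up to cwind packets past that point, skipping any already received.
    window = range(last_contiguous + 1, last_contiguous + cwind + 1)
    return [seq for seq in window if seq not in received]
```

Calling sendable_packets({1, 2, 8}, 2) yields packets 3 and 4, matching the example in the text.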
577 | ||
578 | When congestion occurs (indicated by a negative ack or a | |
579 | packet retransmission timeout), Rx enters congestion avoidance | |
580 | and fast recovery. The slow-start threshold, ssthresh, is | |
581 | set to half of the effective transmission window (minimum of | |
582 | cwind and transmit window), but no less than 2 packets. | |
583 | ||
584 | If triggered by a negative ack, any negatively acknowledged | |
585 | packets should be retransmitted as soon as possible (i.e. | |
586 | window-permitting). | |
587 | ||
588 | If triggered by a retransmission timeout, the congestion | |
589 | window is reset to a single packet. | |
590 | ||
591 | When in fast-recovery mode, every additional negative ack | |
592 | packet received causes cwind to be increased by one packet. | |
593 | A positive ack packet causes cwind to be set to ssthresh, | |
594 | and terminates fast recovery. At this point we are back | |
595 | to congestion avoidance, since the cwind is half the original | |
596 | transmission window. | |
597 | ||
598 | When packet acknowledgements are received, the congestion | |
599 | window should be increased. If cwind is less than ssthresh, | |
600 | cwind should be increased by 1 for each newly acknowledged | |
601 | packet. If cwind is at least ssthresh, cwind is increased | |
602 | by 1 for each newly received Ack packet. | |
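
The threshold and window-growth rules can be sketched as two small functions. This is an interpretation rather than a definitive implementation: "half" is taken as integer division, fast-recovery bookkeeping is left out, and the function names are hypothetical.

```python
from typing import Tuple


def on_congestion(cwind: int, transmit_window: int, timed_out: bool) -> Tuple[int, int]:
    """Return (cwind, ssthresh) after congestion is detected."""
    # ssthresh is half of the effective window, but no less than 2.
    effective = min(cwind, transmit_window)
    ssthresh = max(effective // 2, 2)  # assumption: round down
    if timed_out:
        # A retransmission timeout resets the congestion window.
        cwind = 1
    return cwind, ssthresh


def grow_cwind(cwind: int, ssthresh: int, newly_acked: int) -> int:
    """Return the new cwind after one Ack packet is processed."""
    if cwind < ssthresh:
        # Slow start: one step per newly acknowledged packet.
        return cwind + newly_acked
    # Congestion avoidance: one step per Ack packet received.
    return cwind + 1
```

For instance, with cwind 16 and a transmit window of 8, congestion sets ssthresh to 4, and a timeout additionally drops cwind back to 1.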
603 | ||
604 | The size of the receive window should not grow past the size of | |
605 | an Rx ack packet (which can acknowledge up to 255 packets at a | |
606 | time.) | |
607 | ||
608 | Debugging | |
609 | ========= | |
610 | ||
611 | Rx provides for an optional debugging interface, using the Debug AFS | |
612 | packet type, allowing remote Rx clients to query an Rx server for | |
613 | some Rx protocol statistics. Not all implementations are required | |
614 | to implement this interface. Some parts of this interface may also | |
615 | be specific to a particular implementation of Rx. In order to prevent | |
616 | packet loops, a server should only reply to debug packets with the | |
617 | client-initiated flag set. | |
618 | ||
619 | The payload of a debug request packet is always the same; both of | |
620 | the 32-bit quantities are in network byte order: | |
621 | ||
622 | 0 1 2 3 | |
623 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
624 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
625 | | Debug Type | | |
626 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
627 | | Debug Index | | |
628 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
629 | ||
630 | The debug type indicates the kind of debug information being sent | |
631 | or requested, and determines the format of the rest of the packet. | |
632 | The debug index allows some debug types to export array-like data, | |
633 | indexed by this field. The following debug types are defined for | |
634 | the Transarc implementation: | |
635 | ||
636 | 0x01 Retrieve basic connection statistics | |
637 | 0x02 Get information about some connections | |
638 | 0x03 Get information about all connections | |
639 | 0x04 Get all Rx stats | |
640 | 0x05 Get all peers of this server | |
641 | ||
642 | The index field in the debug packet indicates which element of the | |
643 | debug information the client wants to access, in cases where there | |
644 | are multiple entries in question. | |
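As a minimal sketch, the eight-octet debug request payload can be built and decoded as shown here. The constant and function names are illustrative only and do not come from rx/rx.h.

```python
import struct

# Debug types listed above for the Transarc implementation
# (constant names are hypothetical).
DEBUG_BASIC_STATS = 0x01
DEBUG_SOME_CONNECTIONS = 0x02
DEBUG_ALL_CONNECTIONS = 0x03
DEBUG_RX_STATS = 0x04
DEBUG_PEERS = 0x05


def pack_debug_request(debug_type, index=0):
    """Encode Debug Type and Debug Index as two 32-bit big-endian words."""
    return struct.pack("!II", debug_type, index)


def unpack_debug_request(payload):
    """Decode the first eight octets of a debug payload."""
    if len(payload) < 8:
        raise ValueError("debug payload shorter than 8 octets")
    debug_type, index = struct.unpack("!II", payload[:8])
    return debug_type, index
```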
645 | ||
646 | The responses to each of those debug queries contain the following | |
647 | information: | |
648 | ||
649 | 1. Retrieve basic connection stats | |
650 | ||
651 | An array of general statistics about packet allocation, | |
652 | server performance, and so on. The first octet in this | |
653 | response represents the debug protocol version being used | |
654 | by the server. See RX_DEBUGI_VERSION* in rx/rx.h. | |
655 | ||
656 | 2, 3. Get information about connections | |
657 | ||
658 | Both of these calls return a struct rx_debugConn (see | |
659 | rx/rx.h), indexed by the "index" field. | |
660 | ||
661 | The first version of the debug call (type 2) only retrieves | |
662 | information about connections which are deemed interesting, | |
663 | that is, connections which are active, or about to be | |
664 | reaped. | |
665 | ||
666 | The end of the list is signaled by a response where the | |
667 | connection ID value is 0xFFFFFFFF. | |
668 | ||
669 | 4. Get Rx stats | |
670 | ||
671 | This call returns a struct rx_stats to the client in network | |
672 | byte order, containing various statistics about the state of | |
673 | Rx on the server (see rx/rx.h). | |
674 | ||
675 | 5. Get all Rx peers | |
676 | ||
677 | Similar to the connection requests above (2, 3), this call | |
678 | returns all the Rx peers of the server (in a network-byte-order | |
679 | struct rx_debugPeer), indexed by the index field in the request. | |
680 | End of list is indicated by a host value of 0xFFFFFFFF. (These | |
681 | are the first 4 octets.) | |
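A client might walk the peer list by increasing the index until the end-of-list marker appears. The sketch below assumes a hypothetical callable, query(index), which sends a type 5 debug request and returns the raw response payload. How the request is transported is left out, and the remainder of struct rx_debugPeer is not decoded.

```python
import struct

END_OF_LIST = 0xFFFFFFFF  # host value that marks the end of the peer list


def iter_debug_peers(query):
    """Yield raw peer records until the end-of-list marker is seen.

    `query` is a hypothetical callable: query(index) -> response bytes.
    """
    index = 0
    while True:
        response = query(index)
        if len(response) < 4:
            raise ValueError("debug response shorter than 4 octets")
        # The host value occupies the first 4 octets, network byte order.
        (host,) = struct.unpack("!I", response[:4])
        if host == END_OF_LIST:
            return
        yield response
        index += 1
```

The same loop shape would apply to the connection queries (types 2 and 3), except that the marker is carried in the connection ID field, whose offset is given by struct rx_debugConn.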
682 | ||
683 | In response to unknown requests, the server returns 0xFFFFFFF8 in the | |
684 | debug type field. | |
685 | ||
686 | XXX The response interface should probably be fixed | |
687 | to include a fixed header that indicates whether | |
688 | the request was successfully completed. | |
689 | ||
690 | Jumbograms | |
691 | ========== | |
692 | ||
693 | To be able to transmit more data in a single packet, Rx supports | |
694 | ``jumbograms'', which are single UDP datagrams containing multiple | |
695 | sequential Rx DATA packets. In a jumbogram, all packets except the | |
696 | last one must be of a fixed maximal size (1412 bytes). Because all | |
697 | the packets in the jumbogram are sequential, only one full header | |
698 | is needed. Here is what a jumbogram could look like: | |
699 | ||
700 | +-----------+---------------+--------------+---------------+ | |
701 | | Rx header | 1412 byte pkt | Short header | 1412 byte pkt | -> | |
702 | +-----------+---------------+--------------+---------------+ | |
703 | ||
704 | +--------------+- -+-----------------------+ | |
705 | -> | Short header | ... | <= 1412 byte last pkt | | |
706 | +--------------+- -+-----------------------+ | |
707 | ||
708 | Every Rx packet in a jumbogram except the first one must be preceded | |
709 | by the short Rx header, and all packets except the last one must have | |
710 | the Jumbogram Rx flag set in their respective headers. The number of | |
711 | packets in a jumbogram may not exceed the peer's advertised Max Packets | |
712 | Per Jumbogram value in the Ack packet. | |
713 | ||
714 | The maximum number of packets per jumbogram should be assumed to be 1 | |
715 | (i.e., no jumbograms) unless explicitly specified otherwise by an Ack | |
716 | packet. If an Ack packet is received without the packet-per-jumbogram | |
717 | field, it might indicate that the peer is now running a version of Rx | |
718 | that does not support jumbograms, and therefore no jumbograms should | |
719 | be sent until they are explicitly enabled again. | |
720 | ||
721 | The short header in a jumbogram has the following makeup: | |
722 | ||
723 | 0 1 | |
724 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 | |
725 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
726 | | Flags | Reserved | | |
727 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
728 | | Checksum | | |
729 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
730 | ||
731 | All the packets in the jumbogram have the same Rx header fields | |
732 | (from the full Rx header) except for Flags, Checksum, Sequence, | |
733 | and Serial. The flags and checksum field for subsequent packets | |
734 | are taken from the short header preceding that packet in the | |
735 | jumbogram. The sequence and serial numbers are assumed to be | |
736 | consecutive, incrementing by 1 per packet from the first packet in | |
737 | the jumbogram (i.e., the one carrying the full Rx header). | |
738 | ||
739 | Retransmitted packets should not be sent in a jumbogram. | |
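The jumbogram layout above can be illustrated with a rough receiver-side sketch that splits one UDP datagram into its constituent packets. It relies on the 28-octet full Rx header and the 0x20 JUMBO-PACKET flag value given in the Packet Formats section. The function name and the tuple layout are assumptions made for this example, and the packet type is not checked.

```python
import struct

RX_HEADER_SIZE = 28        # seven 32-bit words in the full Rx header
JUMBO_PAYLOAD_SIZE = 1412  # fixed size of every packet except the last
SHORT_HEADER_SIZE = 4      # flags, reserved octet, 16-bit checksum
FLAG_JUMBO_PACKET = 0x20


def split_jumbogram(datagram):
    """Return a list of (sequence, serial, flags, checksum, payload)."""
    if len(datagram) < RX_HEADER_SIZE:
        raise ValueError("datagram shorter than an Rx header")
    (_epoch, _cid, _call, seq, serial,
     _type, flags, _status, _security,
     checksum, _service) = struct.unpack(
        "!IIIIIBBBBHH", datagram[:RX_HEADER_SIZE])
    offset = RX_HEADER_SIZE
    packets = []
    # A set jumbo flag means another packet follows in this datagram.
    while flags & FLAG_JUMBO_PACKET:
        end = offset + JUMBO_PAYLOAD_SIZE
        if len(datagram) < end + SHORT_HEADER_SIZE:
            raise ValueError("truncated jumbogram")
        packets.append((seq, serial, flags, checksum, datagram[offset:end]))
        # The short header supplies flags and checksum for the next packet.
        flags, _reserved, checksum = struct.unpack(
            "!BBH", datagram[end:end + SHORT_HEADER_SIZE])
        offset = end + SHORT_HEADER_SIZE
        # Sequence and serial numbers are consecutive within a jumbogram.
        seq += 1
        serial += 1
    last = datagram[offset:]
    if len(last) > JUMBO_PAYLOAD_SIZE:
        raise ValueError("last packet larger than 1412 octets")
    packets.append((seq, serial, flags, checksum, last))
    return packets
```

A datagram whose first header lacks the jumbo flag simply comes back as a single packet.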
740 | ||
741 | RPC Layer | |
742 | ========= | |
743 | ||
744 | This section discusses how an RPC call is made using the Rx protocol. | |
745 | There are two common ``types'' of Rx calls: simple and streaming. | |
746 | These mostly reflect a difference in the upper-level API rather than | |
747 | in the Rx protocol. A simple Rx call has a fixed number of input | |
748 | variables and a fixed number of output variables. A streaming Rx | |
749 | call, in addition to the above, allows the user to send and receive | |
750 | arbitrary amounts of data (whose length should be specified as a | |
751 | fixed-length argument). | |
752 | ||
753 | In either case, an Rx call consists of two basic stages: the client | |
754 | sending its data to the server, and the server sending its response | |
755 | back to the client. No data can be sent by the client in the | |
756 | same call after the server has started sending its response. | |
757 | ||
758 | Each remote function call associated with a particular Rx service | |
759 | (identified by the IP-port-serviceId triplet, as mentioned above) | |
760 | is assigned a 32-bit integer opcode number. To make a simple Rx | |
761 | call, the caller must transmit the opcode number followed by the | |
762 | expected arguments for that call over an Rx channel using XDR | |
763 | encoding. The callee uses XDR to unmarshall the opcode and input | |
764 | arguments, performs a function call corresponding to that opcode | |
765 | and arguments, and then uses XDR to encode the return values back | |
766 | to the caller. The caller then uses XDR to receive the output | |
767 | variables. | |
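For illustration, the request of a simple call might be marshalled as in this sketch. The encoders follow the standard XDR rules for 32-bit unsigned integers and variable-length opaque data. The helper names are hypothetical, and the opcode in the usage note is invented.

```python
import struct


def xdr_uint32(value):
    """XDR unsigned integer: 4 octets, big-endian."""
    return struct.pack("!I", value)


def xdr_opaque(data):
    """XDR variable-length opaque: length word, data, zero padding to 4."""
    padding = (-len(data)) % 4
    return xdr_uint32(len(data)) + data + b"\x00" * padding


def marshal_call(opcode, *encoded_args):
    """Opcode first, followed by the already XDR-encoded arguments."""
    return xdr_uint32(opcode) + b"".join(encoded_args)


def unmarshal_opcode(request):
    """Split a request into its opcode and the remaining argument bytes."""
    if len(request) < 4:
        raise ValueError("request shorter than an opcode")
    (opcode,) = struct.unpack("!I", request[:4])
    return opcode, request[4:]
```

For example, marshal_call(1234, xdr_uint32(7), xdr_opaque(b"abcde")) yields 20 octets: the opcode, one integer argument, and a padded five-octet opaque argument.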
768 | ||
769 | For streaming calls which send data from the caller to the callee, | |
770 | the convention is to include the length of the data to be sent as | |
771 | one of the fixed-length arguments, and send the variable-length | |
772 | data immediately after the fixed-length portion. For streaming | |
773 | calls which receive data, the convention is for the callee to first | |
774 | reply with a fixed-length field specifying the number of bytes it's | |
775 | about to send, and then send those bytes. Upon completion of the | |
776 | streaming part of the call, the output arguments are sent back to | |
777 | the caller in fixed-length XDR form, as with simple calls. | |
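The streaming conventions could look roughly like this on the wire. Several details are assumed, since the text above does not fix them: the length is a 32-bit XDR unsigned integer, it is the call's only fixed-length argument, and the streamed octets are sent without padding. The function names are hypothetical.

```python
import struct


def build_streaming_request(opcode, data):
    """Fixed-length portion (opcode, data length), then the raw data."""
    return struct.pack("!II", opcode, len(data)) + data


def parse_streaming_reply(reply, output_size):
    """Split a reply into the streamed data and the fixed-length outputs.

    `output_size` is the size in octets of the XDR-encoded output
    arguments, which the caller knows from the call's definition.
    """
    if len(reply) < 4:
        raise ValueError("reply shorter than its length field")
    # The callee first announces how many octets it is about to send.
    (length,) = struct.unpack("!I", reply[:4])
    end = 4 + length
    if len(reply) < end + output_size:
        raise ValueError("truncated streaming reply")
    return reply[4:end], reply[end:end + output_size]
```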
778 | ||
779 | Packet Formats and Protocol Constants | |
780 | ===================================== | |
781 | ||
782 | * Rx packet | |
783 | ||
784 | Every simple Rx packet has an Rx header, of the form below. | |
785 | All quantities are in network byte order. | |
786 | ||
787 | 0 1 2 3 | |
788 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
789 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
790 | |+| Connection Epoch | | |
791 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
792 | | Connection ID | * | | |
793 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
794 | | Call Number | | |
795 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
796 | | Sequence Number | | |
797 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
798 | | Serial Number | | |
799 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
800 | | Type | Flags | Status | Security | | |
801 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
802 | | Checksum | Service ID | | |
803 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
804 | | Payload .... | |
805 | +-+-+-+-+- | |
806 | ||
807 | [*] The field marked with * is the Channel ID. The last | |
808 | two bits of the connection ID are used to multiplex | |
809 | between 4 parallel calls. | |
810 | ||
811 | [+] The bit marked with + is used to indicate that only | |
812 | the connection ID should be used to identify this | |
813 | connection, and sender host/port should not be used. | |
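A minimal sketch of encoding and decoding this header follows. The field names in the tuple are chosen for this example and are not taken from any implementation. The [+] bit is read as the most significant bit of the first word, and the Channel ID as the low two bits of the Connection ID word, as drawn in the diagram.

```python
import struct
from collections import namedtuple

# Five 32-bit words, four octets, two 16-bit fields: 28 octets in total.
RX_HEADER = struct.Struct("!IIIIIBBBBHH")

RxHeader = namedtuple(
    "RxHeader",
    "epoch cid call_number sequence serial "
    "type flags status security checksum service_id")


def pack_header(header):
    """Encode an RxHeader in network byte order."""
    return RX_HEADER.pack(*header)


def unpack_header(packet):
    """Return (RxHeader, payload) for a raw Rx packet."""
    if len(packet) < RX_HEADER.size:
        raise ValueError("packet shorter than an Rx header")
    fields = RX_HEADER.unpack(packet[:RX_HEADER.size])
    return RxHeader(*fields), packet[RX_HEADER.size:]


def channel_id(header):
    """The last two bits of the connection ID select one of 4 calls."""
    return header.cid & 0x3


def connection_id_only(header):
    """True if the [+] bit (top bit of the epoch word) is set."""
    return bool(header.epoch & 0x80000000)
```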
814 | ||
815 | The values for the Flags field are defined as follows: | |
816 | ||
817 | 0000 0001 CLIENT-INITIATED | |
818 | 0000 0010 REQUEST-ACK | |
819 | 0000 0100 LAST-PACKET | |
820 | 0000 1000 MORE-PACKETS | |
821 | 0001 0000 - Reserved - | |
822 | 0010 0000 SLOW-START-OK (in Ack packets) | |
823 | 0010 0000 JUMBO-PACKET (in Data packets) | |
824 | ||
825 | Commonly, but not necessarily, the following value mappings | |
826 | for the Security field are used: | |
827 | ||
828 | 0 No security or encryption | |
829 | 1 bcrypt security, only used in AFS 2.0 | |
830 | 2 "krb4" rxkad | |
831 | 3 "krb4" rxkad with encryption (sometimes) | |
832 | ||
833 | The following packet type values are defined: | |
834 | ||
835 | 1 DATA Standard data packet | |
836 | 2 ACK Acknowledgement of received data | |
837 | 3 BUSY Busy response | |
838 | 4 ABORT Abort packet | |
839 | 5 ACKALL Acknowledgement of all packets | |
840 | 6 CHALLENGE Challenge request | |
841 | 7 RESPONSE Challenge response | |
842 | 8 DEBUG Debug packet | |
843 | 9 PARAMS Exchange of parameters | |
844 | 10 PARAMS Exchange of parameters | |
845 | 11 PARAMS Exchange of parameters | |
846 | 12 PARAMS Exchange of parameters | |
847 | 13 VERSION Get AFS version | |
848 | ||
849 | * Rx acknowledgement packet | |
850 | ||
851 | 0 1 2 3 | |
852 | 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 | |
853 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
854 | | Buffer Space | Max Skew | | |
855 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
856 | | First Sequence | | |
857 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
858 | | Reserved | | |
859 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
860 | | Serial | | |
861 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
862 | | Reason | Ack Count | Acknowledgements ... | |
863 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ .. | |
864 | ||
865 | ... -+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
866 | ... Acks | Reserved | Reserved | | |
867 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
868 | | Maximum Packet Size | | |
869 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
870 | | Recommended Packet Size | | |
871 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
872 | | Receive Window Size | | |
873 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
874 | | Max Packets per Jumbogram | | |
875 | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | |
876 | ||
877 | Note that the trailing fields can have arbitrary alignment, | |
878 | determined by the number of individual acks in the packet. | |
879 | There are three reserved octets between the variable acks | |
880 | section and the start of the trailing fields; they also have | |
881 | no particular alignment. | |
882 | ||
883 | The valid values for the Reason code are: | |
884 | ||
885 | 1 REQUESTED | |
886 | 2 DUPLICATE | |
887 | 3 OUT-OF-SEQUENCE | |
888 | 4 WINDOW-EXCEEDED | |
889 | 5 NO-SPACE | |
890 | 6 PING | |
891 | 7 PING-RESPONSE | |
892 | 8 DELAYED | |
893 | 9 OTHER | |
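The acknowledgement layout can be decoded along these lines. This is a sketch under two assumptions: trailing fields may be missing from the end of the packet when the peer is an older implementation, and a missing Max Packets per Jumbogram field is treated as 1, following the rule in the Jumbograms section. The dictionary keys are names invented for this example.

```python
import struct

# Buffer Space, Max Skew, First Sequence, Reserved, Serial, Reason, Ack Count
ACK_FIXED = struct.Struct("!HHIIIBB")  # 18 octets
ACK_RESERVED = 3  # reserved octets between the acks and the trailing fields
TRAILER_NAMES = (
    "max_packet_size",
    "recommended_packet_size",
    "receive_window_size",
    "max_packets_per_jumbogram",
)


def parse_ack(payload):
    """Decode the payload of an Rx ACK packet into a dictionary."""
    if len(payload) < ACK_FIXED.size:
        raise ValueError("ack payload too short")
    (buffer_space, max_skew, first_sequence, _reserved,
     serial, reason, ack_count) = ACK_FIXED.unpack_from(payload)
    acks_end = ACK_FIXED.size + ack_count
    if len(payload) < acks_end:
        raise ValueError("ack payload shorter than its Ack Count")
    ack = {
        "buffer_space": buffer_space,
        "max_skew": max_skew,
        "first_sequence": first_sequence,
        "serial": serial,
        "reason": reason,
        "acks": list(payload[ACK_FIXED.size:acks_end]),  # one octet each
    }
    # The trailing fields follow the reserved octets with no alignment.
    offset = acks_end + ACK_RESERVED
    for name in TRAILER_NAMES:
        if len(payload) < offset + 4:
            break
        (ack[name],) = struct.unpack_from("!I", payload, offset)
        offset += 4
    # Assume no jumbograms unless the peer explicitly advertises them.
    ack.setdefault("max_packets_per_jumbogram", 1)
    return ack
```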
894 | ||
895 | Acknowledgements | |
896 | ================ | |
897 | ||
898 | Jeffrey Hutzelman <jhutz@cmu.edu> reviewed an early draft of this | |
899 | specification, and provided much appreciated feedback on technical | |
900 | details as well as document structuring. | |
901 | ||
902 | Love Hornquist-Astrand <lha@stacken.kth.se> made many corrections | |
903 | to this specification, especially regarding backwards-compatibility | |
904 | with older Rx implementations. | |
905 | ||
906 | References | |
907 | ========== | |
908 | ||
909 | [1] /afs/sipb.mit.edu/contrib/doc/AFS/hijacking-afs.ps.gz | |
910 | ||
911 | [2] OpenAFS: src/rx/ | |
912 | ||
913 | [3] /afs/sipb.mit.edu/contrib/doc/AFS/ps/rx-spec.ps | |
914 | ||
915 | [4] ftp://ftp.stacken.kth.se/pub/arla/prog-afs/shadow/doc/r.vdoc | |
916 | ||
917 | [5] ftp://ftp.stacken.kth.se/pub/arla/prog-afs/shadow/doc/rx.mss | |
918 | ||
919 | [6] http://web.mit.edu/rfc/rfc2001.txt | |
920 | ||
921 | $Id: rx-spec,v 1.22 2002/10/20 06:46:00 kolya Exp $ |