Commit | Line | Data |
---|---|---|
805e021f CE |
1 | This file contains a threading analysis of Ubik prepared by |
2 | Jeffrey Hutzelman and sent to the openafs-devel@openafs.org | |
3 | mailing list in February 2011, archived at: | |
4 | https://lists.openafs.org/pipermail/openafs-devel/2011-February/018329.html | |
5 | ||
6 | A while ago, Jeff Altman asked me to do an analysis of the Ubik code with | |
7 | repsect to threading, to try to determine how ready the code is to be moved | |
8 | from an LWP-based environment to one using preemptive POSIX threads, and | |
9 | what it will take to get it the rest of the way there. Now that the work | |
10 | is complete, I'm posting the complete analysis and my recommendations for | |
11 | discussion and as a backdrop for changes I expect to see proposed in the | |
12 | near future. | |
13 | ||
14 | This work was funded by Your File System, Inc. | |
15 | ||
16 | -- Jeff | |
17 | ||
18 | INTRODUCTION | |
19 | ||
20 | This document describes the structure of Ubik, with an eye toward | |
21 | thread-safety and concurrency. It begins by discussing the elements, | |
22 | usage, and locking considerations of major shared data structures. This | |
23 | is followed by a discussion of each major subsystem. The emphasis is on | |
24 | deficiencies in thread-safety and the changes needed to correct them. | |
25 | ||
26 | Most of this document focuses on the user-mode Ubik server code, which | |
27 | comprises the bulk of the Ubik library code and also of the complexity. | |
28 | A separate section near the end is dedicated to Ubik client code, which | |
29 | itself is intended primarily for use in user-mode applications. The | |
30 | OpenAFS cache manager includes its own mechanisms for accessing | |
31 | replicated services, including the Ubik-managed VLDB, and for tracking | |
32 | the servers that provide them. | |
33 | ||
34 | Occasionally, issues related to concurrency, performance, and modularity | |
35 | (particularly as regards support for supporting multiple independent | |
36 | databases in a single server process) are also discussed; however, these | |
37 | discussions are peripheral and are included only to record data obtained | |
38 | while examining the question of thread-safety and as a basis for future | |
39 | work. Similarly, throughout this document are detailed discussions of | |
40 | some parts of the code and certain data structures which, while not | |
41 | always directly relevant to thread-safety, were produced in the course | |
42 | of analyzing the Ubik library code. These are included to provide | |
43 | additional background and as a basis for future work in documentation of | |
44 | Ubik library internals and interfaces. | |
45 | ||
46 | ||
47 | MAJOR DATA STRUCTURES | |
48 | ||
49 | This section calls out the major shared data structures used in the Ubik | |
50 | server code and details what they are used for and how shared access to | |
51 | them is protected. Much of this text is probably suitable as a starting | |
52 | point for expanding on the documentation found in comments in ubik.p.h. | |
53 | ||
54 | struct ubik_dbase - database | |
55 | ||
56 | This structure contains nearly everything Ubik knows about an open | |
57 | database (for an explanation of "nearly", see the section below on | |
58 | globals), including database-wide parameters and state and a list of | |
59 | active transactions. | |
60 | ||
61 | The following structure elements are set only when the database | |
62 | structure is being initialized, after which they are never modified: | |
63 | ||
64 | + pathName - The base path used to construct database filenames | |
65 | + physical later method pointers (read, write, truncate, sync, | |
66 | stat, open, setlabel, getlabel, getnfiles), used to manipulate | |
67 | the physical database files on disk. These currently always | |
68 | point to the methods defined in phys.c, and in fact other | |
69 | modules contain knowledge of the internal implementation of that | |
70 | module. | |
71 | ||
72 | The primary synchronization mechanism used to protect this structure | |
73 | is the database lock (versionLock), which is manipulated via the | |
74 | DBHOLD() and DBRELE() macros. This lock protects the following | |
75 | structure elements, as well as a number of globals as described | |
76 | elsewhere in this document: | |
77 | ||
78 | + activeTrans - list of active transactions | |
79 | + version - database version (*) | |
80 | + tidCounter - transaction ID of latest transaction (*) | |
81 | + writeTidCounter - transaction ID of latest write transaction (*) | |
82 | + flags (*) | |
83 | + readers | |
84 | ||
85 | (*) These fields are currently protected by the database lock, except | |
86 | that the beacon thread does not hold that lock when examining them, | |
87 | or the global 'ubik_epochTime', which holds the corresponding epoch. | |
88 | The recommendation is for these fields, along with ubik_epochTime, to | |
89 | be additionally protected by a second lock, allowing the beacon | |
90 | thread to examine them without holding the database lock; see the | |
91 | section on the BEACON package for details. | |
92 | ||
93 | The condition variable 'version_cond' is used to signal to that the | |
94 | database version may have changed; it is broadcast in udisk_commit(), | |
95 | in SDISK_SendFile(), and from the recovery thread; it is monitored by | |
96 | ubik_WaitVersion(), which can be called by an application to wait for | |
97 | a database version change (this is not currently used in OpenAFS). | |
98 | This CV is associated with the database lock. When LWP is used, this | |
99 | condition is signalled on &ubik_dbase->version. | |
100 | ||
101 | The condition variable 'flags_cond' is used by udisk_end() to signal | |
102 | that DBWRITING flag has been cleared. This wakes threads waiting in | |
103 | ubik.c:BeginTrans() to begin a new transaction. This CV is | |
104 | associated with the database lock. When LWP is used, this condition | |
105 | is signalled on &ubik_dbase->flags. | |
106 | ||
107 | When LWP is used, signals are also sometimes sent on &urecovery_state | |
108 | and on &ubik_amSyncSite. However, nothing ever waits on these, and | |
109 | no corresponding condition variables exist when pthreads is used. | |
110 | ||
111 | The 'cachedVersion' field holds the database version represented by | |
112 | the application's cache; this is compared against the current | |
113 | database version to discover if the cache is up to date. Both | |
114 | cachedVersion and the application cache itself are protected by | |
115 | cache_lock, an AFS reader/writer lock. | |
116 | ||
117 | When the application wishes to use the cache, it calls | |
118 | ubik_CacheUpdate(), which verifies that the cache is current, | |
119 | updating it if necessary (under a write lock), and then returns with | |
120 | the application cache read-locked. The application can then rely on | |
121 | the cache's contents not to change until it ends the transaction by | |
122 | calling ubik_EndTrans() or ubik_AbortTrans(). When one of these | |
123 | functions is called, the read lock is dropped, and a write lock may | |
124 | be acquired temporarily to update both cachedVersion and the cache | |
125 | (in the case of committing a write transaction) or to clear | |
126 | cachedVersion, indicating the cache is invalid (in the case of | |
127 | aborting a transaction). | |
128 | ||
129 | In the case of ubik_EndTrans() on a write transaction, the cache is | |
130 | held write-locked until udisk_commit() has completed; this is also | |
131 | done in SDISK_Commit(), which insures that the contents of the | |
132 | on-disk database and page cache cannot change while an active | |
133 | transaction holds the cache lock. Note that both the recovery thread | |
134 | and SDISK_SendFile() may replace the database contents wholesale; | |
135 | however, these operations first abort all outstanding transactions. | |
136 | ||
137 | Prior to the introduction of ubik_BeginTransReadAnyWrite(), an | |
138 | application could choose to manage its own cache, updating it during | |
139 | write transactions and when, at the beginning of a read transaction, | |
140 | it was discovered to be out of date (due to a write done by another | |
141 | server). Under this model, the application uses Ubik transaction | |
142 | locks to protect not only the database but also its cache. However, | |
143 | the use of ubik_BeginTransReadAnyWrite() allows some transactions to | |
144 | proceed and read the database without acquiring any locks, which | |
145 | makes it unsafe to depend on Ubik transaction locks to protect the | |
146 | application cache. Applications which make use of this mode must set | |
147 | ubik_SyncWriterCacheProc to point to a procedure to be called to | |
148 | update the cache during ubik_EndTrans(), after udisk_commit() has | |
149 | succeeded and before the cache_lock is released. These applications | |
150 | must then refrain from ever updating the cache other than during that | |
151 | procedure or during the callback passed to ubik_CheckCache(). | |
152 | ||
153 | ||
154 | struct ubik_trans - active transaction | |
155 | ||
156 | This structure represents the state of an active transaction, | |
157 | including transaction metadata and state and a list of all changes | |
158 | made as part of the transaction. | |
159 | ||
160 | Transactions are organized into a linked list referenced by the | |
161 | corresponding database structure, protected by the database lock. | |
162 | Each transaction contains an upward pointer to the database, | |
163 | a transaction type, and a transaction ID, all of which are immutable | |
164 | for as long as the transaction exists. | |
165 | ||
166 | A transaction created via the REMOTE interface exists until ended by | |
167 | a call to SDISK_Abort() or SDISK_ReleaseLocks() or aborted in | |
168 | urecovery_CheckTid() due to a transaction ID mismatch or creation of | |
169 | a new transaction (note that in this last case, cleanup will be | |
170 | deferred if SDISK_Lock() is blocked on this transaction). In these | |
171 | cases, ubik_currentTrans is cleared at the same time, under the | |
172 | database lock, and further REMOTE calls will fail gracefully until | |
173 | a new transaction is started. Thus, REMOTE transactions may be | |
174 | considered valid only for as long as the database lock is held. | |
175 | ||
176 | Transactions created via ubik_BeginTrans(), ubik_BeginTransReadAny(), | |
177 | or ubik_BeginTransReadAnyWrite() exist until ended by a call to | |
178 | ubik_AbortTrans() or ubik_EndTrans(). They are never destroyed from | |
179 | within the UBIK package. Code which discovers a transaction by | |
180 | traversing the database's active transaction list may access the | |
181 | transaction only as long as the database lock is held, since once it | |
182 | is released, an application thread may end the transaction. | |
183 | ||
184 | Each transaction contains the following fields, which are protected | |
185 | by the database lock: | |
186 | ||
187 | + locktype - type of lock held by this transaction | |
188 | + minCommitTime - unused | |
189 | + seekFile - file number of database file pointer | |
190 | + seekPos - position of database file pointer | |
191 | + flags | |
192 | ||
193 | File writes are represented by 'iovec_info', an XDR vector structure | |
194 | (type iovec_wrt, a vector of struct ubik_iovec) containing the file, | |
195 | position, and length of each write, and 'iovec_data', an XDR opaque | |
196 | containing the corresponding data. These vectors and their contents | |
197 | are protected by the database lock. | |
198 | ||
199 | File truncations are represented by a linked list of struct | |
200 | ubik_trunc, each of which contains a file number and the new size of | |
201 | that file. This list, and the records contained in it, are protected | |
202 | by the database lock. | |
203 | ||
204 | While transactions are currently protected by the database lock, it | |
205 | may be desirable in the future to introduce a per-transaction lock, | |
206 | in order to facilitate increased concurrency. | |
207 | ||
208 | ||
209 | struct ubik_server - server status | |
210 | ||
211 | This structure is used to track the set of peer servers, including | |
212 | state related to elections and recovery. The global 'ubik_servers' | |
213 | points to a linked list of these structures, which is constructed | |
214 | during initialization. Once the list has been constructed, no | |
215 | entries are ever added or removed; in addition, the following fields | |
216 | are not changed after startup: | |
217 | ||
218 | + addr[0] - server's primary address | |
219 | + magic - true if this is the magic server | |
220 | + isClone - true if this server is a clone | |
221 | ||
222 | The following fields are used by the BEACON package to track beacons | |
223 | to this server and their results. Currently, they are sometimes | |
224 | accessed under the database lock, and sometimes without benefit of | |
225 | any lock. They should probably be protected by the same lock as | |
226 | globals private to the BEACON package. | |
227 | ||
228 | + lastVoteTime - last time this server voted yes for us | |
229 | + lastBeaconSent - last time a beacon was sent to this server | |
230 | + lastVote - true if its late vote response was yes | |
231 | + up - true if this server is believed to be up | |
232 | + beaconSinceDown - true if a beacon has successfully been sent to | |
233 | this server since the last time it was down | |
234 | ||
235 | The following fields are used by the RECOVERY package to track the | |
236 | state of each server. They are also examined and modified by the | |
237 | various ContactQuorum_* functions. Currently, they are sometimes | |
238 | accessed under the database lock, and sometimes without benefit of | |
239 | any lock. They should be fully protected by the database lock. | |
240 | ||
241 | + version - server's database version | |
242 | + currentDB - true if server's database is up-to-date | |
243 | ||
244 | The addr[] array is used to track all known addresses for the remote | |
245 | server; this list is used when attempting to contact a down server | |
246 | via an alternate address, and to identify the source of incoming | |
247 | server-to-server RPCs. The first element of this array is always the | |
248 | primary address and does not change after initialization; other | |
249 | elements are updated during initialization and when a remote server | |
250 | calls DISK_UpdateInterfaceAddr(). | |
251 | ||
252 | The system normally maintains an active Rx connection to the VOTE and | |
253 | DISK services on each server; pointers to these are kept in the | |
254 | 'vote_rxcid' and 'disk_rxcid' fields. | |
255 | ||
256 | Currently, the addr[] array, 'vote_rxcid', and 'disk_rxcid' are not | |
257 | protected by any lock. Because these are shared among a number of | |
258 | subsystems and it is usually desirable to avoid blocking on the | |
259 | database lock when making RPCs, it is recommended to introduce a new | |
260 | lock (the "server address lock") to protect these fields and the | |
261 | globals 'ubikSecClass' and 'ubikSecIndex', which contain the security | |
262 | classes used for setting up such connections. The new lock would | |
263 | need to be held in the following cases: | |
264 | + Before beginning any RPC using 'vote_rxcid' or 'disk_rxcid', for | |
265 | the time it takes to call rx_GetConnection() or rx_NewCall() | |
266 | + For the duration of ubikGetPrimaryInterfaceAddr() | |
267 | + In ubeacon_Interact(), when building the set of connections to be | |
268 | used for multi_VOTE_Beacon(). | |
269 | + In updateUbikNetworkAddress(), first while constructing the set of | |
270 | connections to be used for multi_DISK_UpdateInterfaceAddr(), and | |
271 | then again each time a server's list of addresses is updated. | |
272 | However, the lock must NOT be held while waiting for the RPC to | |
273 | complete, to avoid deadlock when multiple servers are starting at | |
274 | the same time. | |
275 | + For the duration of SDISK_UpdateInterfaceAddr() | |
276 | + For the duration of ubeacon_ReinitServer(), with the optional | |
277 | exception of the call to afsconf_UpToDate(). | |
278 | + In src/ubik/recovery.c:DoProbe(), first while constructing the set | |
279 | of connections to be used for multi_DISK_Probe() and, if the probe | |
280 | is successful, again while destroying the old connections for the | |
281 | server under test and replacing them with new ones. See the | |
282 | section on the RECOVERY package for additional notes. | |
283 | ||
284 | ||
285 | GLOBAL VARIABLES | |
286 | ||
287 | The library has an unfortunate number of global variables which are not | |
288 | part of any data structure. Nonetheless, some of these are | |
289 | database-specific; see MULTI-DATABASE SUPPORT, below. | |
290 | ||
291 | Globals private to a particular subsystem are discussed in the | |
292 | description of that subsystem. This section discusses global variables | |
293 | which are shared and/or part of external interfaces. | |
294 | ||
295 | The following globals are used to convey configuration parameters from | |
296 | the application to the Ubik library, and are intended to be set before | |
297 | the first call to ubik_ServerInit(). They are not modified by Ubik, and | |
298 | are not protected by locks: | |
299 | ||
300 | + ubik_debugFlag - print debug messages (unused) | |
301 | + ubikPrimaryAddrOnly - use only primary interface | |
302 | + ubik_nBuffers - number of DISK package page cache buffers | |
303 | + ubik_SRXSecurityProc - callback to create server security class | |
304 | + ubik_SRXSecurityRock - argument for ubik_SRXSecurityProc | |
305 | + ubik_CRXSecurityProc - callback to create client security class | |
306 | + ubik_CRXSecurityRock - argument for ubik_CRXSecurityProc | |
307 | + ubik_SyncWriterCacheProc - callback to update application cache | |
308 | + ubik_CheckRXSecurityProc - callback to check if caller is OK | |
309 | + ubik_CheckRXSecurityRock - argument for ubik_CheckRXSecurityProc | |
310 | ||
311 | The following globals contain values which are computed during Ubik | |
312 | initialization and not modified thereafter. They are not protected by | |
313 | any lock, which is safe so long as suitable memory barriers exist across | |
314 | creation of threads and Rx services (see the notes on initialization in | |
315 | the UBIK package section). Note that some of these values are | |
316 | database-specific and properly belong in the ubik_dbase structure: | |
317 | ||
318 | + ubik_servers - linked list of servers (see above) | |
319 | + amIClone - true if this server is a clone | |
320 | + ubik_callPortal - UDP port used for server-to-server RPCs | |
321 | + ubik_quorum - number of servers to make a quorum | |
322 | + ubik_dbase - pointer to the (only) database (see above) | |
323 | + ubik_host[] - array of local addresses | |
324 | + ubik_sc[] - security classes for Ubik RPC services | |
325 | ||
326 | The following globals are currently protected by the database lock, | |
327 | though in most cases there are existing references which do not hold the | |
328 | lock. Later sections of this document contain recommendations in each | |
329 | case to establish new locks to protect these or to correct cases where | |
330 | they are accessed without the correct lock. | |
331 | ||
332 | + ubik_currentTrans - pointer to REMOTE's current transaction | |
333 | + ubik_epochTime - epoch for transaction IDs | |
334 | + urecovery_state - recovery state bits | |
335 | + ubik_dbVersion - sync site's database version | |
336 | + ubik_dbTid - sync site's transaction ID | |
337 | + ubikSecIndex - security class index for making Ubik RPCs | |
338 | + ubikSecClass - security class for making Ubik RPCs | |
339 | ||
340 | The global 'ubik_amSyncSite' belongs to the BEACON package, and should | |
341 | be protected by the same lock as that package's private globals. In | |
342 | fact, it should probably be made private; the only references outside | |
343 | BEACON are in SVOTE_Debug() and SVOTE_DebugOld() and can easily be moved | |
344 | into ubeacon_Debug(). | |
345 | ||
346 | ||
347 | MAJOR SUBSYSTEMS | |
348 | ||
349 | This section describes each of the major Ubik server subsystems. For | |
350 | each subsystem, a brief overview is provided, along with a description | |
351 | of that subsystem's handling of synchronization (or lack thereof). | |
352 | Subsystems are described beginning with the lowest layer and working | |
353 | toward the highest. | |
354 | ||
355 | PHYS - Physical I/O | |
356 | ||
357 | This package is responsible for handling I/O to database files. It | |
358 | maintains a private, global array of MAXFDCACHE(4) fdcache | |
359 | structures, each representing an open file descriptor on a database | |
360 | file, either presently in use or idle and available for use. Entries | |
361 | are indexed by database file ID, and each contain a file descriptor | |
362 | number and a refcount. There is also a global 'inited' flag, and | |
363 | a buffer used to construct pathnames when opening database files. | |
364 | ||
365 | The static function uphys_open() takes a database pointer and fileid | |
366 | and attempts to open the corresponding file. On success, it returns | |
367 | a file descriptor; on failure, it returns the same as open(2) (with | |
368 | errno set). This prefers to reuse an already-open fd whose cache | |
369 | refcount is 0; failing that, it opens a new fd and attempts to enter | |
370 | it in the cache. It prefers first to use an unused cache entry, then | |
371 | to reclaim an idle one whose refcount is 0; if neither is possible, | |
372 | the newly-opened fd is returned without caching. Note that the | |
373 | refcount on an active entry is always exactly 1; the same entry | |
374 | cannot be referenced more than once. | |
375 | ||
376 | The function uphys_close() releases a file descriptor opened by | |
377 | uphys_open(). If the file descriptor was cached, the cache entry is | |
378 | dereferenced but the file is left open for future use; 0 is returned. | |
379 | If the file descriptor is not found in the cache, then the return | |
380 | value is that of an immediate call to close(2). This function also | |
381 | handles cleanup when a cached fd is closed after the file to which it | |
382 | refers is invalidated by uphys_invalidate(). This function is not | |
383 | static, but may as well be -- it is used only by other functions in | |
384 | the PHYS package. | |
385 | ||
386 | Functions in this package are called via method pointers in the | |
387 | database structure, which are set during initialization and not | |
388 | changed thereafter. These functions must be called with the database | |
389 | lock held; this protects the file descriptor cache and other globals | |
390 | mentioned above, and prevents conflicting access to database files. | |
391 | ||
392 | The requirement that functions in this package be called with the | |
393 | database lock held means that concurrent access to database files is | |
394 | not possible. This situation could be improved by the addition of | |
395 | a new global lock, which would be held for most of the duration of | |
396 | uphys_open(), uphys_close(), and uphys_invalidate(), but need _not_ | |
397 | be held during actual file accesses. This would eliminate the | |
398 | requirement that calls to this package be made with the database lock | |
399 | held, allowing multiple calls to be active concurrently. However, it | |
400 | would still be necessary for higher layers to employ appropriate | |
401 | synchronization as needed to maintain database consistency. | |
402 | ||
403 | ||
404 | DISK - Logical Database I/O and Transaction Management | |
405 | ||
406 | This package is responsible for logical database file I/O, logging, | |
407 | and transaction management. It maintains a private cache of database | |
408 | file page buffers. This entire cache, including buffer space, is | |
409 | allocated and initialized the first time a transaction is started; | |
410 | the global initd flag keeps track of whether or not this has been | |
411 | done. The size of this cache defaults to 20 pages, but may be | |
412 | overridden by setting the global 'ubik_nBuffers' (all OpenAFS Ubik | |
413 | services do this); this global is not protected by any lock, but is | |
414 | read exactly once, when the cache is initialized. | |
415 | ||
416 | Except for udisk_Debug() (see below), all udisk_* interfaces require | |
417 | a pointer either to the database or to an active transaction, and | |
418 | must be called with the database lock held. This protects the DISK | |
419 | package's accesses of the active transaction list and associated | |
420 | state variables as well as its calls into the physical access layer, | |
421 | including insuring that operations requiring multiple calls into that | |
422 | layer are performed atomically. | |
423 | ||
424 | In addition, the database lock protects the page cache hash table and | |
425 | LRU queue, the buffer control structures, buffer contents, the | |
426 | truncation record free list (freeTruncList), and global statistics. | |
427 | However, it may be desirable in the future to introduce a separate | |
428 | lock to protect these structures, in order to facilitate multiple | |
429 | database support. | |
430 | ||
431 | udisk_LogOpcode, udisk_LogEnd, udisk_LogTruncate, udisk_LogWriteData | |
432 | ||
433 | These are internal interfaces for writing to the transaction log. | |
434 | They each construct a log record and then directly call the | |
435 | physical layer to append it to the log; the log file is _not_ | |
436 | cached by the page buffer cache, and these functions do not touch | |
437 | the cache. Each of these must be atomic with respect to other | |
438 | writes to the log file, which is accomplished by insuring they are | |
439 | called only with the database locked. | |
440 | ||
441 | DInit, DRead, DTrunc, DRelease, DFlush, DAbort, DSync, DNew | |
442 | ||
443 | These are internal interfaces for accessing the page buffer cache. | |
444 | They must be called with the database lock held. | |
445 | ||
446 | DRead() and DNew() return a pointer to the page data buffer, not | |
447 | to the buffer control structure; this pointer is later passed to | |
448 | DRelease() to release the buffer. This mechanism depends on the | |
449 | fact that buffer control structures are allocated in the same | |
450 | order as the actual page data buffers. | |
451 | ||
452 | DRead() guarantees that a read transaction will always see a clean | |
453 | copy of the requested page, without any modifications made by an | |
454 | in-progress write transaction. However, this is true only if the | |
455 | next layer insures that once DRead() is called with write intent | |
456 | on a page, that no further calls to DRead() are made for that page | |
457 | until the buffer is written and DRelease() is called with the | |
458 | dirty flag. Presently, this is the case because udisk_read() and | |
459 | udisk_write(), which are the only callers of DRead(), are both | |
460 | called with the database lock held. | |
461 | ||
462 | udisk_read, udisk_truncate, udisk_write, udisk_begin, udisk_commit, | |
463 | udisk_abort, udisk_end, udisk_Invalidate | |
464 | ||
465 | These are external interfaces provided by the DISK package to | |
466 | higher layers. Presently, the locking requirements of these | |
467 | functions and the lower-layer functions they call are satisfied by | |
468 | the simple rule that these must always be called with the database | |
469 | lock. In the future, support for multiple databases and/or | |
470 | a greater level of concurrency may require relaxing this rule | |
471 | and/or acquiring additional locks within these functions. | |
472 | ||
473 | In udisk_commit(), when committing the first write transaction | |
474 | after becoming sync site, the database is relabelled. In this | |
475 | case, the UBIK_RECLABELDB recovery state bit should be set before | |
476 | propagating the label change to other servers, rather than after. | |
477 | This is because because ContactQuorum_DISK_Setversion() will | |
478 | release the database lock, at which point the recovery state may | |
479 | be cleared by urecovery_ResetState() running in another thread. | |
480 | It is important that a relabelling which occurs before recovery | |
481 | state is cleared not result in the UBIK_RECLABELDB recovery state | |
482 | bit being set after; otherwise, the server may fail to correctly | |
483 | relabel the database the next time it becomes sync site. | |
484 | ||
485 | udisk_Debug | |
486 | ||
487 | The udisk_Debug() interface is called as part of preparing the | |
488 | response to Ubik debugging requests. Unlike other udisk_* APIs, | |
489 | this interface is not called on a specific database and does not | |
490 | require the database lock to be held. It reports the version data | |
491 | of the single database identified by the global 'ubik_dbase' and | |
492 | traverses the buffer cache collecting statistics, both without | |
493 | benefit of holding any locks. This may produce inconsistent data | |
494 | but does not normally risk damaging data structures or accessing | |
495 | invalid or uninitialized memory (but see below), and avoiding | |
496 | locking may be of greater benefit than guaranteeing the return of | |
497 | consistent data. | |
498 | ||
499 | The global 'ubik_dbase' pointer is set when the first (only) | |
500 | database is initialized, and the buffer cache data structures are | |
501 | allocated the first time a transaction is started, either locally | |
502 | or via the REMOTE package. Once set, the pointers used by | |
503 | udisk_Debug() are never changed, and the data structures they | |
504 | point to remain valid and are never freed. The buffer cache is | |
505 | traversed by iterating over the array of buffer cache elements | |
506 | (this is done in other places as well), not by following a linked | |
507 | list or other data structure that may reordered, and so should not | |
508 | be dangerous. | |
509 | ||
510 | At a minimum, then, it should be arranged that udisk_Debug() | |
511 | cannot be called before the database structure and page cache have | |
512 | been initialized. This means that both of these actions must be | |
513 | taken before any incoming RPCs are accepted. Practically, this | |
514 | means that DInit() should be renamed udisk_init(), and called from | |
515 | ubik_ServerInitCommon() during initialization. | |
516 | ||
517 | Full correctness demands that udisk_Debug actually hold the | |
518 | database lock while copying the database version and examining the | |
519 | page cache. However, it may be desirable to sacrifice this | |
520 | correctness in order to reduce interaction between the debugging | |
521 | interfaces and the rest of the system. | |
522 | ||
523 | Helper functions: | |
524 | + unthread - remove transaction from activeTrans | |
525 | + Dlru, Dmru - move page to front/end of LRUQ | |
526 | + MatchBuffer - check if a buffer's dbase/file/page match | |
527 | + FixupBucket - move page to the right bucket | |
528 | + newslot - find and allocate an unused buffer | |
529 | + DedupBuffer - throw away stale copies after page is flushed | |
530 | + GetTrunc, PutTrunc - trunc record allocation | |
531 | + FindTrunc - find active trunc record for file | |
532 | + DoTruncs - apply truncations | |
533 | ||
534 | ||
535 | LOCK - Database Transaction Locking | |
536 | ||
537 | This package is responsible for managing database locks held as part | |
538 | of a transaction. It supports a single read/write lock, which is | |
539 | represented by the global variable 'rwlock', a struct Lock. The | |
540 | global 'rwlockinit' records whether Lock_Init() has been called on | |
541 | this lock, and is protected by the database lock. Properly, this | |
542 | lock should be treated as belonging to the database. | |
543 | ||
544 | ulock_getLock() expects to be called with the database lock held; it | |
545 | depends on this to protect access to fields within the transaction | |
546 | structure as well as to the global 'rwlockinit'. However, note that | |
547 | the database lock must be temporarily released while obtaining the | |
548 | read/write lock, and so ulock_getLock() should not be called in the | |
549 | middle of a critical section. The 'await' parameter may in theory be | |
550 | set to 0 to effect a non-blocking lock; however, this depends on | |
551 | (incorrect) knowledge of the internals of struct Lock. Fortunately, | |
552 | existing Ubik code always sets await=1. This feature should be fixed | |
553 | or removed. | |
554 | ||
555 | ulock_relLock() releases the read/write lock. It expects to be | |
556 | called with the database lock held. Unlock ulock_getLock(), this | |
557 | function does not release the database lock. | |
558 | ||
559 | ulock_Debug() reports the state of the read/write lock, for debugging | |
560 | purposes. It depends on knowledge of the internals of struct Lock, | |
561 | and accesses both the Lock structure and the global rwlockinit | |
562 | without benefit of holding any lock. | |
563 | ||
564 | It would probably be best to eliminate 'rwlockinit' in favor of | |
565 | always initializing the global 'rwlock' during startup; ulock_Debug() | |
566 | would then have no need to examine the flag. Further, ulock_Debug() | |
567 | should at a minimum use the appropriate locks when looking inside | |
568 | struct Lock; ideally, knowledge of that structure's internals should | |
569 | be abstracted into appropriate interfaces in src/lwp/lock.[ch]. | |
570 | ||
571 | ||
572 | VOTE - Voting Logic and VOTE_* RPCs | |
573 | ||
574 | This package is responsible for keeping track of who is currently | |
575 | sync site and deciding how to vote and whether this server should try | |
576 | to become sync site. It provides the VOTE RPC package, which is | |
577 | primarily responsible for processing vote requests from peer servers | |
578 | (VOTE_Beacon), but also provides debugging interfaces and an | |
579 | interface (VOTE_GetSyncSite) to allow clients to discover the current | |
580 | sync site. | |
581 | ||
582 | Voting decisions are made using state tracked in a set of global | |
583 | variables, listed below. Currently there is no locking protecting | |
584 | these variables; LWP-based Ubik depends on the cooperative nature of | |
585 | LWP and on the fact that routines which examine and manipulate them | |
586 | do not block. Introduction of a new "vote lock" is recommended to | |
587 | protect the following globals: | |
588 | ||
589 | + ubik_lastYesTime - last time VOTE_Beacon voted "yes" | |
590 | + lastYesHost - host voted for at ubik_lastYesTime | |
591 | + lastYesClaim - time lastYesHost started its voting cycle | |
592 | + lastYesState - whether lastYesHost claimed to be sync site | |
593 | + lowestHost - lowest host seen | |
594 | + lowestTime - last time lowestHost was updated/heard from | |
595 | + syncHost - sync host, or 0 if none known | |
596 | + syncTime - last time syncHost was updated | |
597 | + ubik_dbVersion - sync site's database version | |
598 | + ubik_dbTid - sync site's transaction ID | |
599 | ||
600 | The vote lock should be initialized in uvote_Init() and held for the | |
601 | remainder of that function and also during of uvote_ShouldIRun(), | |
602 | uvote_GetSyncSite(). It should also be held for the bulk of | |
603 | SVOTE_Beacon() -- specifically, it should be acquired just before | |
604 | beginning the lowest-host calculation, and released just before | |
605 | calling urecovery_CheckTid(), in the event of a yes vote or else just | |
606 | before returning. | |
607 | ||
608 | In addition, the vote lock would protect the globals 'ubik_dbVersion' | |
609 | and 'ubik_dbTid', which represent this server's knowledge of the | |
610 | state of the database on the sync site. This variable is set in two | |
611 | places in the REMOTE package, which change the local and sync site | |
612 | database versions at the same time. It is also examined in one place | |
613 | in the REMOTE package, and compared to the database version in the | |
614 | RECOVERY package. It is imperative that when the local and remote | |
615 | versions are changed at the same time, these changes become visible | |
616 | to the RECOVERY package at the same time. Thus, both the changes and | |
617 | the comparison must be done under both locks. For this purpose, | |
618 | three new functions should be provided by the VOTE package: | |
619 | ||
620 | + A function to safely set 'ubik_dbVersion' | |
621 | + A function to compare 'ubik_dbVersion' to another value | |
622 | + A function to check whether this is sync site and compare | |
623 | 'ubik_dbVersion' to another value, without releasing the vote lock | |
624 | in between (to be used in recovery). | |
625 | ||
626 | Existing accesses to 'ubik_dbVersion' in REMOTE and RECOVERY, all of | |
627 | which are already made under the database lock, would be replaced | |
628 | with calls to the new accessors. | |
629 | ||
630 | The result of these changes will require in several places that the | |
631 | vote lock be acquired while the database lock is already held. | |
632 | Therefore, the vote lock MUST NOT be held while acquiring the | |
633 | database lock. | |
634 | ||
635 | SVOTE_Beacon() currently calls urecovery_CheckTid() without benefit | |
636 | of the database lock, which must be held when calling that function. | |
637 | This is fairly simple to fix; however, the vote lock must be released | |
638 | first. One simple approach would be to move the call to | |
639 | urecovery_CheckTid() to after the point where the vote lock will be | |
640 | released, wrapping it in DBHOLD/DBRELE, and conditionalizing the | |
641 | whole thing on 'vote' being nonzero. | |
642 | ||
643 | Separately, it may be worth examining whether it is appropriate to | |
644 | call urecovery_CheckTid() only when voting yes, and not whenever | |
645 | a beacon is received from a server claiming to be sync site. | |
646 | ||
647 | As with the debug functions in other packages, SVOTE_Debug() and | |
648 | SVOTE_DebugOld() access global variables belong to this and other | |
649 | modules without benefit of the appropriate locks. There is also very | |
650 | little abstraction here. It is probably desirable to introduce Debug | |
651 | interfaces for the VOTE and RECOVERY packages, to be called from | |
652 | these functions. Full correctness requires... | |
653 | ||
654 | + Holding the vote lock while accessing the VOTE package globals | |
655 | described above. | |
656 | + Holding the database lock while accessing the database flags and | |
657 | tidCounter, ubik_epochTime, ubik_currentTrans, and urecovery_state | |
658 | ||
659 | ||
660 | BEACON - elections, tracking whether this is sync site | |
661 | ||
662 | This package is responsible for running elections, collecting votes, | |
663 | and deciding whether and for how long this is the sync site. It also | |
664 | contains routines responsible for setting up the list of servers and | |
665 | acquiring alternate addresses for other servers during startup. | |
666 | ||
667 | For this purpose, it maintains a number of global variables and | |
668 | ubik_server structure fields. Currently, these variables and fields | |
669 | are sometimes accessed under the database lock, and sometimes without | |
670 | benefit of any lock. In order to allow this package to operate with | |
671 | a minimum of disruption due to database accesses, it is recommended | |
672 | that a new lock be established to protect these. It is particularly | |
673 | desirable to prevent loss of sync due to the potentially long-lived | |
674 | RPCs the RECOVERY package must make under the database lock when | |
675 | transferring full database copies between servers; for this reason, | |
676 | it is important that both the BEACON thread and SVOTE_Beacon() be | |
677 | able to operate without acquiring the database lock. The global | |
678 | variables and ubik_server structure fields to be protected by the new | |
679 | lock are the following: | |
680 | ||
681 | + ubik_amSyncSite - true if this is the sync site | |
682 | + syncSiteUntil - time until which this is the sync site | |
683 | + ts->lastVoteTime - last time this server voted yes for us | |
684 | + ts->lastBeaconSent - last time a beacon was sent to this server | |
685 | + ts->lastVote - true if its late vote response was yes | |
686 | + ts->up - true if this server is believed to be up | |
687 | + ts->beaconSinceDown - true if a beacon has successfully been sent | |
688 | to this server since the last time it was down | |
689 | ||
690 | The new lock would need to be held in the following cases: | |
691 | + In ubeacon_AmSyncSite(), from the first time 'ubik_amSyncSite' is | |
692 | examined until after the potential call to urecovery_ResetState(). | |
693 | + In ubeacon_Interact(), when building the list of servers to which | |
694 | to send beacons (to protect access to the 'up' flag). Note that | |
695 | the server address lock must also be held during this loop, and | |
696 | must be acquired _after_ the beacon lock. | |
697 | + In ubeacon_Interact(), when updating status associated with | |
698 | a server after receiving its beacon response. | |
699 | + In ubeacon_Interact(), when updating globals after an election. | |
700 | However, it must be released before calling urecovery_ResetState(). | |
701 | + In updateUbikNetworkAddress(), when marking a server down. | |
702 | + In the various ContactQuorum_* functions, when checking whether | |
703 | a server is up before calling it, and again when marking a server | |
704 | down after a call fails. | |
705 | + In ubik_EndTrans, when examining a server's 'beaconSinceDown' flag | |
706 | and 'lastBeaconSent' timestamp to determine whether it has recently | |
707 | gone down, requiring a brief hold until it either comes back up or | |
708 | times out. | |
709 | ||
710 | The following global variables are set during initialization and do | |
711 | not change thereafter: | |
712 | ||
713 | + nServers - number of voting servers | |
714 | + amIMagic - true if this server gets the extra half vote | |
715 | + ubik_singleServer - true if this is the only server | |
716 | ||
717 | Currently, all initialization of this module is done _after_ the Rx | |
718 | services have been created and an Rx server thread started. This is | |
719 | done to avoid a situation in which all servers are blocked in | |
720 | updateUbikNetworkAddress(), which is responsible for exchanging | |
721 | addresses with peer servers, while none of them is prepared to | |
722 | handled the DISK_UpdateInterfaceAddr call made by that function. | |
723 | ||
724 | When using LWP, this early initialization of Rx is not a problem, | |
725 | because the various initialization routines don't actually block until | |
726 | updateUbikNetworkAddress() waits for RPCs to complete. However, when | |
727 | using pthreads, it is a problem, because it means that incoming RPCs | |
728 | may be processed when initialization is not complete. | |
729 | ||
730 | To address this problem and insure orderly initialization, it is | |
731 | recommended to change the initialization sequence so that most work, | |
732 | including the bulk of BEACON package initialization, happens before | |
733 | Ubik's Rx services have been created. The only work that should be | |
734 | deferred until after that point is updateUbikNetworkAddress(), | |
735 | followed by starting the long-running threads belonging to the BEACON | |
736 | and RECOVERY packages. This requires updateUbikNetworkAddress() to | |
737 | be exported (and perhaps to undergo a name change). | |
738 | ||
739 | The major part of the work of this package is done in the function | |
740 | ubeacon_Interact(), which is the body of a long-running thread known | |
741 | as the beacon thread. This thread is responsible for periodically | |
742 | sending "beacons" to other servers to request votes (when the sending | |
743 | server is the sync site, these also include how long it will remain | |
744 | so, the database version, and the active transaction ID). Beacons | |
745 | are sent only from servers that are already sync site or believe they | |
746 | are the best candidate, as determined by uvote_ShouldIRun(). | |
747 | ||
748 | On the sync site, the beacon thread is responsible for insuring that | |
749 | new votes are collected frequently enough to avoid loss of quorum. | |
750 | This means it must not block for an extended time; therefore, it must | |
751 | not acquire the database lock, which in other threads may sometimes | |
752 | be held during long-running operations (most notably, database file | |
753 | transfers done under this lock by the recovery thread). Instead, | |
754 | a new lock is proposed (the "version lock"), which is used to allow | |
755 | the beacon thread to examine (but not modify) certain fields without | |
756 | holding the database lock. | |
757 | ||
758 | The version lock should be acquired _in addition to_ the database | |
759 | lock when modifying 'ubik_epochTime' or the database 'version', | |
760 | 'flags', 'tidCounter', and 'writeTidCounter' fields; such | |
761 | modifications occur at the following locations: | |
762 | + in ubik.c:BeginTrans(), when beginning a new transaction | |
763 | + in udisk_begin(), when setting DBWRITING | |
764 | + in udisk_commit(), when updating the database version at the end of | |
765 | a write transaction (note this includes the case of relabelling the | |
766 | database on the first write transaction after becoming sync site) | |
767 | + in udisk_end(), when clearing DBWRITING | |
768 | + in recovery.c:InitializeDB(), when setting the initial version of | |
769 | an empty database | |
770 | + in urecovery_Interact(), when labelling a new database the first | |
771 | time quorum is established | |
772 | + in urecovery_Interact(), after fetching the latest database from | |
773 | another server (whether successful or not; there are two cases) | |
774 | + in SDISK_SendFile(), when updating the database version both before | |
775 | and after receiving a new database from the sync site. Note the | |
776 | lock must _not_ be held during the transfer. | |
777 | + in SDISK_SetVersion() | |
778 | ||
779 | The version lock should be acquired by the beacon thread while | |
780 | examining these fields, shortly before calling multi_VOTE_Beacon(). | |
781 | Since it must release the lock before making that call, the database | |
782 | version will need to be explicitly copied into a local variable, as | |
783 | is already done with the current transaction ID. | |
784 | ||
785 | Note that if UBIK_PAUSE is defined, the DBVOTING flag poses an | |
786 | additional problem, as it must be modified by the beacon thread. If | |
787 | it is desirable to continue to support this variant, then it becomes | |
788 | necessary for all accesses to the database flags to be protected by | |
789 | the version lock. Other UBIK_PAUSE code may not have been reviewed | |
790 | in depth and may pose additional problems. | |
791 | ||
792 | The function ubeacon_AmSyncSite() is called both from the beacon | |
793 | thread and from various other points to determine whether they are | |
794 | running on the sync site, and thus whether various operations are | |
795 | safe and appropriate. Calls to this function from outside the beacon | |
796 | thread (often, via urecovery_AllBetter()) must be made with the | |
797 | database lock held, and for operations modifying the database or | |
798 | relying on its contents to remain constant, this lock must not then | |
799 | be released until the operation is complete. The does not guarantee | |
800 | that this will remain sync site for the duration of the operation, | |
801 | but does insure that any operations done after or as a result of | |
802 | losing sync will in fact occur after the operation holding the lock. | |
803 | ||
804 | Of course, ubeacon_Interact() itself cannot call ubeacon_AmSyncSite() | |
805 | with the database lock held, since it must not block on that lock | |
806 | when running on the sync site (otherwise sync could be lost, as | |
807 | described above). So, that portion of ubeacon_AmSyncSite() other | |
808 | than the call to urecovery_ResetState() should be abstracted out into | |
809 | a new function, which presumably returns separate flags indicating | |
810 | both whether it is in fact running on the sync site (which becomes | |
811 | the return value of ubeacon_AmSyncSite()) and whether a call to | |
812 | urecovery_ResetState() is indicated. When the latter is set, the | |
813 | beacon thread can acquire the database lock (since, if this is not | |
814 | the sync site, blocking on this lock is not a problem) and call | |
815 | urecovery_ResetState(). | |
816 | ||
817 | This difference in behavior is acceptable because the required | |
818 | invariant is that when 'ubik_amSyncSite' is cleared, it cannot become | |
819 | true again until urecovery_state has also cleared. This insures that | |
820 | once a server discovers it is not sync site, urecovery_AllBetter() | |
821 | cannot succeed on the basis of being sync site until sync is regained | |
822 | and recovery has run, insuring the database is up to date. Since | |
823 | only the beacon thread ever sets 'ubik_amSyncSite' true, that thread | |
824 | can meet the invariant by calling urecovery_ResetState() before doing | |
825 | anything else, while other threads must continue to hold the beacon | |
826 | lock to prevent the beacon thread from setting 'ubik_amSyncSite'. | |
827 | ||
828 | ubeacon_Debug() reports the value of syncSiteUntil, without benefit | |
829 | of obtaining any locks. It should also report the value of | |
830 | 'ubik_amSyncSite', allowing that global to be made private to the | |
831 | BEACON package. Full correctness requires that the beacon lock be | |
832 | held while accessing these globals. | |
833 | ||
834 | ||
835 | RECOVERY - Consistency, Down Server Recovery, Propagation, Log Replay | |
836 | ||
837 | This package is responsible for tracking whether the local database | |
838 | is current and usable and for discovering when a server comes back up | |
839 | after having been marked down. On the sync site, it is also | |
840 | responsible for tracking the database versions on remote servers and | |
841 | for recovering from inconsistencies. | |
842 | ||
843 | To carry out these tasks, the RECOVERY package maintains a single | |
844 | global, 'urecovery_state'. This is not currently effectively | |
845 | protected by any lock; however, it should be protected by the | |
846 | database lock. This is particularly important because high-level | |
847 | routines that manipulate the database depend on a check against this | |
848 | field to determine that the database is up-to-date and safe to | |
849 | modify, and that it will remain so until the operation is complete. | |
850 | ||
851 | The main work of this package is done by urecovery_Interact(), which | |
852 | is the body of a long-running thread known as the recovery thread. | |
853 | This thread wakes up periodically to discover if any down servers | |
854 | have come back up and, on the sync site, to locate the latest | |
855 | available database, fetch it if necessary, and make sure it has been | |
856 | distributed to all servers. This work is done in several steps, | |
857 | taken on in sequence every few seconds: | |
858 | ||
859 | + The first part of urecovery_Interact() is to identify servers which | |
860 | are believed to be down, and determine whether any of them have | |
861 | come back up. This is done every 30 seconds, even when this is not | |
862 | the sync site. Currently, the database lock is held for the | |
863 | duration of this loop, released only in DoProbe() while actually | |
864 | waiting for the DISK_Probe() RPC to complete. This is mostly | |
865 | unnecessary; the server probe loop needs this lock only to examine | |
866 | the database 'currentDB' flag and to clear the UBIK_RECFOUNDDB bit, | |
867 | and DoProbe() doesn't need it at all. However, the beacon lock is | |
868 | needed to examine and update the database 'up' flag, and DoProbe() | |
869 | needs to hold the server address lock when examining server | |
870 | addresses and, if the server comes back up, to replace connections. | |
871 | See the section on struct ubik_server for more details on this | |
872 | lock. | |
873 | ||
874 | + The next step is to determine whether this is the sync site, and | |
875 | update the UBIK_RECSYNCSITE bit appropriately. The database lock | |
876 | should be held for the duration of this step, as | |
877 | ubeacon_AmSyncSite() may end up calling urecovery_ResetState(), | |
878 | which requires that lock be held. If this is not the sync site, | |
879 | the cycle ends here. | |
880 | ||
881 | + Step three is to determine the location of the latest database. | |
882 | This is done by making DISK_GetVersion() calls to all active | |
883 | servers, keeping track of the newest version found and the server | |
884 | which reported it. Once this is done, the best version is compared | |
885 | to the local version, update the UBIK_RECHAVEDB bit (so that it is | |
886 | true if and only if the local version is the latest) set the | |
887 | UBIK_RECFOUNDDB bit to indicate the latest database has been | |
888 | located, and clear the UBIK_RECSENTDB bit to force any servers with | |
889 | out-of-date databases to be updated. | |
890 | ||
891 | In this section, it is not necessary to hold any locks while | |
892 | collecting versions on remote servers, though as before, it is | |
893 | necessary to hold the beacon lock while examining the server 'up' | |
894 | flag, and the address lock to obtain a reference on each connection | |
895 | as it is used (but this lock should not be held during the RPC!). | |
896 | Once all remote versions have been fetched, the database lock is | |
897 | needed while examining the local version and updating | |
898 | 'urecovery_state'. Note that write transactions may happen while | |
899 | versions are collected, resulting in some servers having newer | |
900 | versions than were detected. However, if this happens, the result | |
901 | is always consistent and ends with the sync site having the latest, | |
902 | canonical version: | |
903 | ||
904 | + If the actual latest version X.N before the write was in an epoch | |
905 | created by the current sync site, then that version must | |
906 | necessarily already be present on the sync site, since writes are | |
907 | always committed first on the sync site. Thus, when the sync | |
908 | site updates the version counter, it produces a new version X.N+1 | |
909 | which no other server can believe it already has; thus, there is | |
910 | no possibility of inconsistency. | |
911 | ||
912 | + If the actual latest version X.N before the write was in an epoch | |
913 | not created by the sync site, then it is possible that after | |
914 | quorum was established but before any writes happened, a new | |
915 | server came online which had a newer version X.N+1 (that was | |
916 | committed to less than a quorum of sites before a crash). If | |
917 | a write transaction begins before recovery detects this fact and | |
918 | fetches the version X.N+1 database, then the new write will start | |
919 | with the version X.N database, essentially creating a version | |
920 | fork. However, when this happens, the resulting database will be | |
921 | version Y.1 (with Y>X), because the first write on a new sync | |
922 | site always results in a new database. Thus, the version X.N+1 | |
923 | database will no longer be latest (because the local version, | |
924 | which is examined last, is Y.1), and will end up being discarded. | |
925 | This is no worse than if the server with the version X.N+1 | |
926 | database had not come up until completely after the transaction | |
927 | resulting in version Y.1. | |
928 | ||
929 | + Step four is to fetch the latest database, if the sync site does | |
930 | not already have it. Since this step replaces the database | |
931 | wholesale, there can be no active transactions while it runs, and | |
932 | all other database activity must be locked out for the duration of | |
933 | the transfer. This means the database lock must be held for this | |
934 | entire step, starting from the check that the UBIK_RECSYNCSITE | |
935 | and/or UBIK_RECFOUNDDB bits are still set. (Note that the existing | |
936 | comment which claims that it is impossible for UBIK_RECFOUNDDB not | |
937 | to be set at this point is true only if the database is not | |
938 | released since checking or updating that bit in step three). In | |
939 | addition, the address lock must be held while setting up the call | |
940 | to be used for DISK_GetFile(). | |
941 | ||
942 | + Step five, again after verifying that this is still the sync site | |
943 | and that the UBIK_RECHAVEDB bit is set, is to distribute the new | |
944 | database to any servers which do not already have it. Before this | |
945 | can be done, if the database was newly-initialized (with label | |
946 | epoch 1), it is relabeled with version 2.1 before it is distributed | |
947 | to other sites. This allows readany transactions to run on such | |
948 | a database (though probably not successfully, since the database is | |
949 | empty); it is deferred until this point in recovery to insure that | |
950 | if any server in the quorum contains a valid database, that version | |
951 | is used rather than the empty one. | |
952 | ||
953 | For this step, the database lock should be held starting at the | |
954 | point at which the UBIK_RECSYNCSITE and/or UBIK_RECHAVEDB bits are | |
955 | tested. In the event the DBWRITING flag is set, the lock must be | |
956 | released while waiting for the active write transaction to end. In | |
957 | this case, once the wait is completed, it is necessary to again | |
958 | check urecovery_state to determine that this is still the sync site | |
959 | and have an up-to-date database before proceeding. | |
960 | ||
961 | It is probably simplest for urecovery_Interact() to acquire the | |
962 | database lock at the start of step 2 described above, release it for | |
963 | the duration of the version-collection part of step 3 and while | |
964 | waiting for write transactions to terminate in step 5, and otherwise | |
965 | hold it until the end of the cycle. If no work is needed, this means | |
966 | the lock will be held relatively briefly; if it is necessary to fetch | |
967 | and or send full database versions, these operations themselves must | |
968 | be done with the lock held, and there is little point in releasing it | |
969 | between them. However, it is acceptable to release the lock between | |
970 | any steps described above, provided that each time the database lock | |
971 | is released and reacquired, 'urecovery_state' is checked to verify | |
972 | that no regression has occurred. Note that outside of the recovery | |
973 | thread, the only changes that may occur to 'urecovery_state' are that | |
974 | it may be completely cleared by urecovery_ResetState(), and that the | |
975 | UBIK_RECLABELDB bit may be set by udisk_commit(), on the sync site. | |
976 | ||
977 | The function urecovery_AllBetter() is called under the database lock | |
978 | by routines which need to know whether there is a usable database. | |
979 | As noted in the discussion of ubeacon_AmSyncSite() above, callers | |
980 | must then continue to hold the database lock until any database | |
981 | operation is complete. Further, callers which intend to write to the | |
982 | database currently follow a call to urecovery_AllBetter() with a call | |
983 | to ubeacon_AmSyncSite(), to determine whether a write is permissible. | |
984 | Unfortunately, this is not safe, because it is possible (though | |
985 | admittedly unlikely) for urecovery_AllBetter() to succeed on the | |
986 | basis of being a non-sync-site with an up-to-date database, and then | |
987 | for a subsequent call to ubeacon_AmSyncSite() to return a true result | |
988 | (because the latter may block on acquiring the beacon lock, during | |
989 | which time this may become the sync site). To insure that no write | |
990 | operation is performed without an up-to-date, writeable database, | |
991 | urecovery_AllBetter() should be modified to indicate whether a write | |
992 | operation is permitted, which is the case only when UBIK_RECHAVEDB is | |
993 | tested and found to be set. | |
994 | ||
995 | The functions urecovery_AbortAll(), urecovery_CheckTid(), and | |
996 | urecovery_ResetState() all expect to be called with the database lock | |
997 | held. | |
998 | ||
999 | The function urecovery_LostServer() need not be called with any | |
1000 | locks; it is merely a wrapper for ubeacon_ReinitServer(). | |
1001 | ||
1002 | The RECOVERY package is also responsible at startup for replaying the | |
1003 | transaction log and initializing an empty database. These tasks are | |
1004 | handled by urecovery_Initialize() and its helpers, ReplayLog() and | |
1005 | InitializeDB(). Because the helpers use PHYS package functions and | |
1006 | manipulate database structure fields, urecovery_Initialize() should | |
1007 | acquire the database lock before calling them. | |
1008 | ||
1009 | ||
1010 | REMOTE - remote database I/O (DISK) RPCs | |
1011 | ||
1012 | This package provides the DISK RPC package, which mediates remote | |
1013 | access to the local database copy, when another server is sync site. | |
1014 | It also includes DISK_UpdateInterfaceAddr(), which is used during | |
1015 | initialization to verify consistency of configuration across servers, | |
1016 | and DISK_Probe(), which is used by the RECOVERY module to determine | |
1017 | whether the server is up. | |
1018 | ||
1019 | As the functions in this package are RPC implementations, they may be | |
1020 | called at any time and are entirely responsible for their own | |
1021 | locking. There can be at most one active remote write transaction at | |
1022 | a time; this rule is enforced by this package, which maintains the | |
1023 | global 'ubik_currentTrans' as a pointer to the active transaction. | |
1024 | ||
1025 | For the most part, the RPCs in this module use a common prologue, | |
1026 | which includes acquiring the database lock, and do not release that | |
1027 | lock until just before returning. However, the common prologue | |
1028 | examines both 'ubik_currentTrans' and the transaction flags without | |
1029 | benefit of the database lock. This can be corrected fairly easily by | |
1030 | acquiring the database lock sooner; however, this requires referring | |
1031 | to the database via the ubik_dbase global rather than via the 'dbase' | |
1032 | pointer in 'ubik_currentTrans'. | |
1033 | ||
1034 | Note that SDISK_GetVersion() and SDISK_SetVersion() contain a sanity | |
1035 | check to ensure they are not running on the sync site, since these | |
1036 | are called only from that portion of the recovery loop which runs | |
1037 | only on the sync site. These calls to ubeacon_AmSyncSite() must be | |
1038 | made with the database lock held, for the results to be valid. The | |
1039 | same holds true for a commented-out check in SDISK_GetFile() (which | |
1040 | really should remain that way -- there's no reason not to allow any | |
1041 | authorized caller to do this at any time). | |
1042 | ||
1043 | Additionally, note that SDISK_Commit() acquires a write lock on the | |
1044 | application cache, to insure that no transactions are running which | |
1045 | depend on the application cache, page cache, or underlying database | |
1046 | to change. The locking order requires that this lock be acquired | |
1047 | before taking the database lock; thus, both locks must be acquired | |
1048 | earlier. | |
1049 | ||
1050 | SDISK_SendFile() is really quite bad in its handling of the database | |
1051 | lock under error conditions. Presently, an error condition may cause | |
1052 | it to release the lock too early nor not at all. | |
1053 | ||
1054 | SDISK_UpdateInterfaceAddr() is called by peer servers as part of | |
1055 | their startup process, to trade lists of additional addresses and | |
1056 | insure sufficient agreement on configuration to allow the new server | |
1057 | to start. It will need to acquire the lock protecting the addr[] | |
1058 | array attached to each host. | |
1059 | ||
1060 | ||
1061 | UBIK (high-level API) | |
1062 | ||
1063 | This package contains the high-level APIs exported by Ubik to | |
1064 | applications, including initialization, the APIs for creating and | |
1065 | using transactions, and various utilities. It also contains a number | |
1066 | of internal utilities. | |
1067 | ||
1068 | Initialization | |
1069 | ||
1070 | Initialization of Ubik is handled in ubik_ServerInitCommon(), | |
1071 | which is called by two API functions, ubik_ServerInit() and | |
1072 | ubik_ServerInitByInfo(), which provide different interfaces for | |
1073 | passing configuration information. This code handles | |
1074 | initialization of the database, locks, and various globals, calls | |
1075 | the initialization routines for each subsystem, makes sure Rx has | |
1076 | been initialized, and creates Ubik's Rx services and threads. | |
1077 | ||
1078 | Most of this is done before any Ubik-specific threads are created | |
1079 | and before Ubik's RPCs can be called, and so takes place without | |
1080 | holding any lock. However, since Rx may have been initialized and | |
1081 | started before Ubik, it is possible that some Rx worker threads | |
1082 | may have already been created. In practice, locks internal to Rx | |
1083 | should insure that any memory writes made before an Rx service is | |
1084 | created become visible to any worker running on behalf of that | |
1085 | service; nonetheless, this should not be depended on, and proper | |
1086 | locking should be used during all phases of initialization. | |
1087 | ||
1088 | Unfortunately, the current code actually starts Rx services before | |
1089 | initialization is complete, resulting in a number of possible | |
1090 | races. The general THREAD SAFETY section below contains | |
1091 | recommendations for reordering the startup process to eliminate | |
1092 | this problem, while still avoiding potential deadlock if multiple | |
1093 | servers start at once. | |
1094 | ||
1095 | ContactQuorum_* | |
1096 | ||
1097 | Nearly a third of ubik.c consists of various ContactQuorum_* | |
1098 | functions, which are used to make a particular RPC to every server | |
1099 | in the quorum. These used to be a single ContactQuorum() | |
1100 | function, but have been split into multiple functions to regain | |
1101 | type-safety. These retain a common prologue and epilogue, some | |
1102 | small part of which has been abstracted out, but most of which | |
1103 | remains duplicated in each of these functions. Indeed, the only | |
1104 | part unique to each function is the actual RPC call and, in the | |
1105 | case of ContactQuorum_DISK_WriteV(), code to fall back to multiple | |
1106 | calls to DISK_Write() on a per-server basis. | |
1107 | ||
1108 | The code common to all ContactQuorum_* functions will need to | |
1109 | acquire the beacon lock when checking whether a server is up (no | |
1110 | calls are made to servers which are not marked up), and again when | |
1111 | marking a server down. Note that when a server is marked down, | |
1112 | the 'up' and 'beaconSinceDown' flags must be cleared at the same | |
1113 | time, without releasing the beacon lock. In addition, the helper | |
1114 | function Quorum_StartIO() must obtain the server address lock | |
1115 | while it obtains a reference on the Rx connection that will be | |
1116 | used. | |
1117 | ||
1118 | All of the ContactQuorum_* functions must be called with the | |
1119 | database lock held; this is necessary to protect their access to | |
1120 | each server's 'version' and 'currentDB' fields and to the local | |
1121 | database version. However, these functions release the lock while | |
1122 | actually making RPCs. This behavior is new -- ContactQuorum() in | |
1123 | OpenAFS 1.4 did not release the database lock during RPCs -- but | |
1124 | it is safe, because no caller of ContactQuorum_*() depends on | |
1125 | state read under the database lock before the call to remain valid | |
1126 | after. | |
1127 | ||
1128 | Transactions | |
1129 | ||
1130 | The main Ubik transaction API consists of ubik_BeginTrans(), | |
1131 | ubik_BeginTransReadAny(), ubik_BeginTransReadAnyWrite(), | |
1132 | ubik_AbortTrans(), ubik_EndTrans(), ubik_Read(), ubik_Write(), | |
1133 | ubik_Flush(), ubik_Seek(), ubik_Tell(), ubik_Truncate(), and | |
1134 | ubik_SetLock(). The ubik_Begin* functions require a database | |
1135 | pointer; all others require an active transaction pointer. These | |
1136 | functions are called directly by the application without any | |
1137 | locks, and are responsible for their own locking. Note that many | |
1138 | of these examine the transaction type before acquiring any locks; | |
1139 | this is permissible because the transaction type is immutable once | |
1140 | the transaction is created (but it requires that the call be made | |
1141 | from the thread that created the transaction, or that the | |
1142 | application have used some mechanism to insure an appropriate | |
1143 | memory barrier exists; the same mechanism used to protect the | |
1144 | transaction pointer should suffice). | |
1145 | ||
1146 | BeginTrans() contains the common code responsible for starting | |
1147 | a new transaction; this is made available to applications as | |
1148 | ubik_BeginTrans(), ubik_BeginTransReadAny(), and | |
1149 | ubik_BeginTransReadAnyWrite(). If a write transaction is in | |
1150 | progress, this function waits for it to complete. However, since | |
1151 | doing so releases the database lock, it is necessary to call | |
1152 | urecovery_AllBetter() again once the lock has been reacquired. | |
1153 | Further, in the UBIK_PAUSE case, the DBVOTING flag may have been | |
1154 | set, so this also must be retested (see the BEACON package section | |
1155 | for more on the difficulties of UBIK_PAUSE and DBVOTING). | |
1156 | ||
1157 | Both ubik_Flush() and ubik_Write() examine and manipulate the | |
1158 | transaction I/O vector without benefit of any lock. These fields | |
1159 | ('iovec_info' and 'iovec_data') should be protected by the | |
1160 | database lock. In the case of ubik_Write(), this may require | |
1161 | releasing the lock before calling ubik_Flush(), or the | |
1162 | introduction of a private ubik_Flush_r() which assumes the lock is | |
1163 | already held (but note that in either case, the lock is released, | |
1164 | because ubik_Flush() calls ContactQuorum_* functions). | |
1165 | ||
1166 | Utilities | |
1167 | ||
1168 | The functions ubik_GetVersion() and ubik_WaitVersion() provide the | |
1169 | application with a way to discover the current database version | |
1170 | and to wait for it to change. These interfaces are not currently | |
1171 | used in OpenAFS. ubik_GetVersion() needs to acquire the database | |
1172 | lock while copying the database version. | |
1173 | ||
1174 | The internal function ubikGetPrimaryInterfaceAddr() is used by | |
1175 | Ubik RPCs to determine a peer server's primary address, given the | |
1176 | IP address from which a call arrived. This needs to hold the | |
1177 | server address lock, as described in the section on struct | |
1178 | ubik_server. | |
1179 | ||
1180 | Application Cache | |
1181 | ||
1182 | The section on struct ubik_dbase describes the operation of the | |
1183 | application cache. The function ubik_CheckCache() is provided to | |
1184 | allow an application to insure the cache is up to date and obtain | |
1185 | a read lock on the cache which lasts for the duration of the | |
1186 | transaction. | |
1187 | ||
1188 | Note that ubik_AbortTrans() currently always invalidates the | |
1189 | application cache by clearing ubik_dbase->cachedVersion, for which | |
1190 | it requires a write lock on the cache_lock. This means that | |
1191 | aborted transactions block waiting for completion of any | |
1192 | transactions which hold the cache_lock, which may well include all | |
1193 | outstanding transactions. In the case of read transactions, which | |
1194 | cannot have modified the database, this is unnecessary and may | |
1195 | significantly reduce concurrency. | |
1196 | ||
1197 | ||
1198 | THREAD SAFETY | |
1199 | ||
1200 | This section discusses changes needed to insure thread safety. | |
1201 | ||
1202 | The initialization sequence in ubik_ServerInitCommon() should be cleaned | |
1203 | up, to ensure that every subsystem is fully initialized before starting | |
1204 | any background threads or accepting RPCs. The correct initialization | |
1205 | sequence should look something like this: | |
1206 | ||
1207 | + Initialize error tables | |
1208 | + Allocate and initialize the database structure | |
1209 | + Initialize Rx | |
1210 | + Initialize the various subsystems: | |
1211 | + udisk_Init() (formerly disk.c:DInit) | |
1212 | + ulock_Init() (new, pulled out of ulock_getLock) | |
1213 | + uvote_Init() | |
1214 | + urecovery_Initialize() | |
1215 | + ubeacon_InitServerList* | |
1216 | + Create Rx services | |
1217 | + Start an Rx server thread | |
1218 | + updateUbikNetworkAddress() (rename and export from beacon.c) | |
1219 | + Start the beacon and recovery threads | |
1220 | ||
1221 | Initialization functions may need to be added for some subsystems which | |
1222 | do not currently have them: | |
1223 | ||
1224 | + src/ubik/disk.c:DInit() should be exported and renamed | |
1225 | udisk_Init(); it can then be called from ubik_ServerInitCommon() | |
1226 | instead of from udisk_begin(). | |
1227 | ||
1228 | + The lock initialization currently done in ulock_getLock() should | |
1229 | instead be done in a separate ulock_Init(), which should be called | |
1230 | from ubik_ServerInitCommon() | |
1231 | ||
1232 | A new lock is needed to protect state belonging to the VOTE package, | |
1233 | including the globals 'ubik_dbVersion' and 'ubik_dbTid', which represent | |
1234 | the sync site's database state. See the VOTE package section for | |
1235 | details on the use of the vote lock. | |
1236 | ||
1237 | A new lock is needed to protect state belonging to the BEACON package, | |
1238 | including the globals 'ubik_amSyncSite' and 'syncSiteUntil' and several | |
1239 | fields in the ubik_server structure. This lock also protects the | |
1240 | per-server 'up' flag, which is shared among several modules. In | |
1241 | addition, some beacon-specific fields are accessed directly by other | |
1242 | modules. Thus, this lock must be visible outside the BEACON module, or | |
1243 | else accessors must be provided for these items. See the BEACON package | |
1244 | section for details on the use of the beacon lock. | |
1245 | ||
1246 | A new lock is needed to provide additional protection for the database | |
1247 | version, transaction counter, and flags, so that these can safely be | |
1248 | examined (but not updated) by the beacon thread without the need to | |
1249 | acquire the database lock. See the BEACON package section for details | |
1250 | on the use of the version lock. | |
1251 | ||
1252 | A new lock is needed to protect per-server lists of non-primary | |
1253 | addresses, connections to the VOTE and DISK services on each server, and | |
1254 | the client-mode security class objects used to set up these connections. | |
1255 | See the section on struct ubik_server for details on the use of the | |
1256 | server address lock. | |
1257 | ||
1258 | The required locking order is described by the following list. Any of | |
1259 | these locks may be acquired singly; when acquiring multiple locks, they | |
1260 | should be acquired in the listed order: | |
1261 | ||
1262 | + application cache lock (dbase->cache_lock) | |
1263 | + database lock (DBHOLD/DBRELE) | |
1264 | + beacon lock | |
1265 | + vote lock | |
1266 | + version lock | |
1267 | + server address lock | |
1268 | ||
1269 | Some cleanup is required in various places where existing locking | |
1270 | conventions are not properly obeyed: | |
1271 | ||
1272 | + SVOTE_Beacon() needs to hold the database lock when calling | |
1273 | urecovery_CheckTid() | |
1274 | ||
1275 | + Management of the database lock in SDISK_SendFile() needs to be | |
1276 | cleaned up so that it is always released at the proper time. | |
1277 | ||
1278 | + urecovery_Interact() is not consistent about holding the database | |
1279 | lock when it examines and updates 'urecovery_state'. It also | |
1280 | examines the local database version at least once without benefit | |
1281 | of this lock. See the RECOVERY package section for details on | |
1282 | which locks are needed when by urecovery_Interact(). | |
1283 | ||
1284 | + ubik_Flush() and ubik_Write() need to acquire the database lock | |
1285 | before accessing the 'iovec_info' and 'iovec_data' fields. | |
1286 | ||
1287 | + ubik_GetVersion() needs to acquire the database lock before copying | |
1288 | the database version. | |
1289 | ||
1290 | The various debugging RPCs in the VOTE packages, and the data collection | |
1291 | routines they call in other packages, generally operate without benefit | |
1292 | of any locks. This is probably desirable, as it avoids any possibility | |
1293 | of debugging operations blocking on or interfering with the rest of the | |
1294 | system, and allows debugging data to be collected even in the event of | |
1295 | some unforeseen deadlock. It is also "safe", in that it does not risk | |
1296 | damage to data structures or access to uninitialized memory, provided | |
1297 | that data structure initialization is complete before these functions | |
1298 | are called. However, it does mean that these RPCs may return data that | |
1299 | is not entirely self-consistent. | |
1300 | ||
1301 | + Correctness requires various locks be held during parts of | |
1302 | SVOTE_Debug() and SVOTE_DebugOld(); see the VOTE package section. | |
1303 | ||
1304 | + SVOTE_Debug() also examines 'ubik_currentTrans' without benefit of | |
1305 | the database lock. This usage is not "safe", as that transaction | |
1306 | may be freed by another thread while SVOTE_Debug() is examining it. | |
1307 | ||
1308 | + SVOTE_XSDebug() should hold the server address lock while copying | |
1309 | out server addresses, and other locks while copying per-server | |
1310 | state. | |
1311 | ||
1312 | + udisk_Debug() should hold the database lock while examining the | |
1313 | database version and collecting page cache statistics. | |
1314 | ||
1315 | + ulock_Debug should use LOCK_LOCK/LOCK_UNLOCK when examining the | |
1316 | internals of struct Lock. Ideally, knowledge of these internals | |
1317 | should be abstracted to src/lwp/lock.h. | |
1318 | ||
1319 | + ubeacon_Debug() should hold the beacon lock while accessing the | |
1320 | value of 'syncSiteUntil' and, if the access is moved here from | |
1321 | SVOTE_Debug/SVOTE_DebugOld, 'ubik_amSyncSite'. | |
1322 | ||
1323 | CONCURRENCY THOUGHTS | |
1324 | ||
1325 | This section collections information discussed previously about changes | |
1326 | that may be required or desirable in order to improve concurrency. Note | |
1327 | that none of these changes are needed to insure thread safety, provided | |
1328 | the changes discussed in the previous section are made. | |
1329 | ||
1330 | + PHYS could be made more concurrent by maintaining a separate lock for | |
1331 | the fdcache, avoiding the need to DBHOLD while calling these. | |
1332 | However, higher-layer consistency probably requires that certain calls | |
1333 | or groups of calls to this package be atomic and isolated. | |
1334 | ||
1335 | + Could DISK be made more concurrent by maintaining a separate lock for | |
1336 | the page cache and or free truncation record list? | |
1337 | ||
1338 | + Could concurrency be improved by maintaining separate per-transaction | |
1339 | locks? If so, what is the locking hierarchy with respect to database | |
1340 | locks and the DISK and PHYS package locks? | |
1341 | ||
1342 | + Could concurrency be improved by maintaining more reader/writer state | |
1343 | and/or separate write locks to insure consistency of database file | |
1344 | access without requiring so much DBHOLD? | |
1345 | ||
1346 | + Alternately, could concurrency be improved by requiring that an active | |
1347 | transaction be accessed only by the thread that created it? | |
1348 | ||
1349 | ||
1350 | MULTI-DATABASE SUPPORT | |
1351 | ||
1352 | There isn't any. Some of the Ubik interfaces and even the organization | |
1353 | of some internal data structures makes it appear as though multiple | |
1354 | databases can be supported. However, this is not the case, for several | |
1355 | reasons: | |
1356 | ||
1357 | + Only a single set of peer servers is tracked, via a global linked list | |
1358 | of server structures. Each server structure includes state such as | |
1359 | the server's database version and whether it is up-to-date, so it is | |
1360 | not possible to track multiple databases even for a common set of | |
1361 | servers. | |
1362 | + The physical I/O package (phys.c) tracks cached open file descriptors | |
1363 | without regard to what database a file is part of; thus, it cannot | |
1364 | handle accessing multiple databases at the same time. | |
1365 | + There are a variety of globals containing state which needs to be | |
1366 | tracked on a per-database basis. | |
1367 | + There are a variety of globals which are effectively protected by | |
1368 | a per-database lock, or by no lock at all. | |
1369 | + The various RPCs used to communicate between servers assume a single | |
1370 | database and do not include a database identifier. Thus, in the | |
1371 | presence of multiple databases, it would be impossible to determine, | |
1372 | for example, for which database a DISK RPC was intended. | |
1373 | ||
1374 | The upshot of this is that considerable work would be needed to get Ubik | |
1375 | into a condition which would allow a single process to maintain multiple | |
1376 | independent Ubik databases at the same time. Note that this is | |
1377 | orthogonal to the question of whether a single multi-file database is | |
1378 | possible. | |
1379 | ||
1380 | ||
1381 | OTHER DATA STRUCTURES | |
1382 | ||
1383 | Some other data structures are mentioned in ubik.p.h or elsewhere, but | |
1384 | are of relatively little interest from the point of view of threading. | |
1385 | They are described here primarily as an alternative to deleting | |
1386 | descriptive text that was already written. | |
1387 | ||
1388 | struct ubik_hdr - on-disk database file header | |
1389 | ||
1390 | This structure represents the on-disk format of the Ubik database | |
1391 | file header (label). It is used only for local variables in | |
1392 | uphys_getlabel() and uphys_setlabel(), which read and write this | |
1393 | header, respectively. | |
1394 | ||
1395 | struct ubik_stat - database file status | |
1396 | ||
1397 | This structure is used to convey the size and modification timestamp | |
1398 | of a database file. It is used as the type of an output parameter of | |
1399 | uphys_stat(), and for local variables in callers of that function. | |
1400 | ||
1401 | struct ubik_stats - global statistics | |
1402 | ||
1403 | This structure is used to contain "miscellaneous" global statistics. | |
1404 | Currently that consists of a single counter which is incremented in | |
1405 | one place (an exceptional condition in ubik_EndTrans) and is never | |
1406 | read. The one increment is performed only when holding the database | |
1407 | lock; thus, this structure may be treated the same as other globals | |
1408 | protected by non-global locks. | |
1409 | ||
1410 | OTHER NOTES | |
1411 | ||
1412 | There appears to be nothing to insure that calls to disk.c:DRead() | |
1413 | during a write transaction won't find and use a "clean" cached page when | |
1414 | there is already a dirty page in the cache resulting from an earlier | |
1415 | write as part of the same transaction. This means that if a transaction | |
1416 | reads a page after writing it, it may read back the original data | |
1417 | instead of the new, and if a transaction writes a page more than once, | |
1418 | the later writes may corrupt the database. It seems likely that we are | |
1419 | merely lucky in this regard, and this problem should be fixed. | |
1420 | ||
1421 | src/ubik/beacon.c:ubeacon_ReinitServer() assumes 'ubik_CRXSecurityRock' | |
1422 | is in fact an AFS configuration directory pointer, suitable as an | |
1423 | argument to afsconf_UpToDate(). This is inappropriate; | |
1424 | 'ubik_CRXSecurityRock' is an opaque argument to | |
1425 | (*ubik_CRXSecurityProc)(), Ubik must not make assumptions about what it | |
1426 | is. | |
1427 | ||
1428 | It may be worth examining whether it is appropriate for SVOTE_Beacon() | |
1429 | to call urecovery_CheckTid() only when voting yes, and not whenever | |
1430 | a beacon is received from a server claiming to be sync site. | |
1431 | ||
1432 | Recently, code has been introduced which attempts to insure that the | |
1433 | recovery process does not truncate an old database file before a new one | |
1434 | has been fully transferred, since this could, depending on the timing of | |
1435 | a server failure, deprive the new quorum of a valid database. This | |
1436 | so-called "new recovery" code violates the abstraction provided by the | |
1437 | PHYS package, encoding specific knowledge of the underlying file store | |
1438 | into the REMOTE and RECOVERY packages. This means that the backend | |
1439 | provided by phys.c cannot be replaced with an alternative, since in that | |
1440 | case recovery would do the wrong thing with the transferred database. | |
1441 | ||
1442 | The DBWRITING loop in urecovery_Interact() is not portable, as it | |
1443 | depends the contents of the timeout after a call to select(2). This was | |
1444 | fine for IOMGR_Select(), which has defined behavior, but not for | |
1445 | select(2). | |
1446 |