Import Upstream version 1.8.5
[hcoop/debian/openafs.git] / doc / txt / ubik.txt
CommitLineData
805e021f
CE
1This file contains a threading analysis of Ubik prepared by
2Jeffrey Hutzelman and sent to the openafs-devel@openafs.org
3mailing list in February 2011, archived at:
4https://lists.openafs.org/pipermail/openafs-devel/2011-February/018329.html
5
6A while ago, Jeff Altman asked me to do an analysis of the Ubik code with
7repsect to threading, to try to determine how ready the code is to be moved
8from an LWP-based environment to one using preemptive POSIX threads, and
9what it will take to get it the rest of the way there. Now that the work
10is complete, I'm posting the complete analysis and my recommendations for
11discussion and as a backdrop for changes I expect to see proposed in the
12near future.
13
14This work was funded by Your File System, Inc.
15
16-- Jeff
17
18INTRODUCTION
19
20 This document describes the structure of Ubik, with an eye toward
21 thread-safety and concurrency. It begins by discussing the elements,
22 usage, and locking considerations of major shared data structures. This
23 is followed by a discussion of each major subsystem. The emphasis is on
24 deficiencies in thread-safety and the changes needed to correct them.
25
26 Most of this document focuses on the user-mode Ubik server code, which
27 comprises the bulk of the Ubik library code and also of the complexity.
28 A separate section near the end is dedicated to Ubik client code, which
29 itself is intended primarily for use in user-mode applications. The
30 OpenAFS cache manager includes its own mechanisms for accessing
31 replicated services, including the Ubik-managed VLDB, and for tracking
32 the servers that provide them.
33
34 Occasionally, issues related to concurrency, performance, and modularity
35 (particularly as regards support for supporting multiple independent
36 databases in a single server process) are also discussed; however, these
37 discussions are peripheral and are included only to record data obtained
38 while examining the question of thread-safety and as a basis for future
39 work. Similarly, throughout this document are detailed discussions of
40 some parts of the code and certain data structures which, while not
41 always directly relevant to thread-safety, were produced in the course
42 of analyzing the Ubik library code. These are included to provide
43 additional background and as a basis for future work in documentation of
44 Ubik library internals and interfaces.
45
46
47MAJOR DATA STRUCTURES
48
49 This section calls out the major shared data structures used in the Ubik
50 server code and details what they are used for and how shared access to
51 them is protected. Much of this text is probably suitable as a starting
52 point for expanding on the documentation found in comments in ubik.p.h.
53
54 struct ubik_dbase - database
55
56 This structure contains nearly everything Ubik knows about an open
57 database (for an explanation of "nearly", see the section below on
58 globals), including database-wide parameters and state and a list of
59 active transactions.
60
61 The following structure elements are set only when the database
62 structure is being initialized, after which they are never modified:
63
64 + pathName - The base path used to construct database filenames
65 + physical later method pointers (read, write, truncate, sync,
66 stat, open, setlabel, getlabel, getnfiles), used to manipulate
67 the physical database files on disk. These currently always
68 point to the methods defined in phys.c, and in fact other
69 modules contain knowledge of the internal implementation of that
70 module.
71
72 The primary synchronization mechanism used to protect this structure
73 is the database lock (versionLock), which is manipulated via the
74 DBHOLD() and DBRELE() macros. This lock protects the following
75 structure elements, as well as a number of globals as described
76 elsewhere in this document:
77
78 + activeTrans - list of active transactions
79 + version - database version (*)
80 + tidCounter - transaction ID of latest transaction (*)
81 + writeTidCounter - transaction ID of latest write transaction (*)
82 + flags (*)
83 + readers
84
85 (*) These fields are currently protected by the database lock, except
86 that the beacon thread does not hold that lock when examining them,
87 or the global 'ubik_epochTime', which holds the corresponding epoch.
88 The recommendation is for these fields, along with ubik_epochTime, to
89 be additionally protected by a second lock, allowing the beacon
90 thread to examine them without holding the database lock; see the
91 section on the BEACON package for details.
92
93 The condition variable 'version_cond' is used to signal to that the
94 database version may have changed; it is broadcast in udisk_commit(),
95 in SDISK_SendFile(), and from the recovery thread; it is monitored by
96 ubik_WaitVersion(), which can be called by an application to wait for
97 a database version change (this is not currently used in OpenAFS).
98 This CV is associated with the database lock. When LWP is used, this
99 condition is signalled on &ubik_dbase->version.
100
101 The condition variable 'flags_cond' is used by udisk_end() to signal
102 that DBWRITING flag has been cleared. This wakes threads waiting in
103 ubik.c:BeginTrans() to begin a new transaction. This CV is
104 associated with the database lock. When LWP is used, this condition
105 is signalled on &ubik_dbase->flags.
106
107 When LWP is used, signals are also sometimes sent on &urecovery_state
108 and on &ubik_amSyncSite. However, nothing ever waits on these, and
109 no corresponding condition variables exist when pthreads is used.
110
111 The 'cachedVersion' field holds the database version represented by
112 the application's cache; this is compared against the current
113 database version to discover if the cache is up to date. Both
114 cachedVersion and the application cache itself are protected by
115 cache_lock, an AFS reader/writer lock.
116
117 When the application wishes to use the cache, it calls
118 ubik_CacheUpdate(), which verifies that the cache is current,
119 updating it if necessary (under a write lock), and then returns with
120 the application cache read-locked. The application can then rely on
121 the cache's contents not to change until it ends the transaction by
122 calling ubik_EndTrans() or ubik_AbortTrans(). When one of these
123 functions is called, the read lock is dropped, and a write lock may
124 be acquired temporarily to update both cachedVersion and the cache
125 (in the case of committing a write transaction) or to clear
126 cachedVersion, indicating the cache is invalid (in the case of
127 aborting a transaction).
128
129 In the case of ubik_EndTrans() on a write transaction, the cache is
130 held write-locked until udisk_commit() has completed; this is also
131 done in SDISK_Commit(), which insures that the contents of the
132 on-disk database and page cache cannot change while an active
133 transaction holds the cache lock. Note that both the recovery thread
134 and SDISK_SendFile() may replace the database contents wholesale;
135 however, these operations first abort all outstanding transactions.
136
137 Prior to the introduction of ubik_BeginTransReadAnyWrite(), an
138 application could choose to manage its own cache, updating it during
139 write transactions and when, at the beginning of a read transaction,
140 it was discovered to be out of date (due to a write done by another
141 server). Under this model, the application uses Ubik transaction
142 locks to protect not only the database but also its cache. However,
143 the use of ubik_BeginTransReadAnyWrite() allows some transactions to
144 proceed and read the database without acquiring any locks, which
145 makes it unsafe to depend on Ubik transaction locks to protect the
146 application cache. Applications which make use of this mode must set
147 ubik_SyncWriterCacheProc to point to a procedure to be called to
148 update the cache during ubik_EndTrans(), after udisk_commit() has
149 succeeded and before the cache_lock is released. These applications
150 must then refrain from ever updating the cache other than during that
151 procedure or during the callback passed to ubik_CheckCache().
152
153
154 struct ubik_trans - active transaction
155
156 This structure represents the state of an active transaction,
157 including transaction metadata and state and a list of all changes
158 made as part of the transaction.
159
160 Transactions are organized into a linked list referenced by the
161 corresponding database structure, protected by the database lock.
162 Each transaction contains an upward pointer to the database,
163 a transaction type, and a transaction ID, all of which are immutable
164 for as long as the transaction exists.
165
166 A transaction created via the REMOTE interface exists until ended by
167 a call to SDISK_Abort() or SDISK_ReleaseLocks() or aborted in
168 urecovery_CheckTid() due to a transaction ID mismatch or creation of
169 a new transaction (note that in this last case, cleanup will be
170 deferred if SDISK_Lock() is blocked on this transaction). In these
171 cases, ubik_currentTrans is cleared at the same time, under the
172 database lock, and further REMOTE calls will fail gracefully until
173 a new transaction is started. Thus, REMOTE transactions may be
174 considered valid only for as long as the database lock is held.
175
176 Transactions created via ubik_BeginTrans(), ubik_BeginTransReadAny(),
177 or ubik_BeginTransReadAnyWrite() exist until ended by a call to
178 ubik_AbortTrans() or ubik_EndTrans(). They are never destroyed from
179 within the UBIK package. Code which discovers a transaction by
180 traversing the database's active transaction list may access the
181 transaction only as long as the database lock is held, since once it
182 is released, an application thread may end the transaction.
183
184 Each transaction contains the following fields, which are protected
185 by the database lock:
186
187 + locktype - type of lock held by this transaction
188 + minCommitTime - unused
189 + seekFile - file number of database file pointer
190 + seekPos - position of database file pointer
191 + flags
192
193 File writes are represented by 'iovec_info', an XDR vector structure
194 (type iovec_wrt, a vector of struct ubik_iovec) containing the file,
195 position, and length of each write, and 'iovec_data', an XDR opaque
196 containing the corresponding data. These vectors and their contents
197 are protected by the database lock.
198
199 File truncations are represented by a linked list of struct
200 ubik_trunc, each of which contains a file number and the new size of
201 that file. This list, and the records contained in it, are protected
202 by the database lock.
203
204 While transactions are currently protected by the database lock, it
205 may be desirable in the future to introduce a per-transaction lock,
206 in order to facilitate increased concurrency.
207
208
209 struct ubik_server - server status
210
211 This structure is used to track the set of peer servers, including
212 state related to elections and recovery. The global 'ubik_servers'
213 points to a linked list of these structures, which is constructed
214 during initialization. Once the list has been constructed, no
215 entries are ever added or removed; in addition, the following fields
216 are not changed after startup:
217
218 + addr[0] - server's primary address
219 + magic - true if this is the magic server
220 + isClone - true if this server is a clone
221
222 The following fields are used by the BEACON package to track beacons
223 to this server and their results. Currently, they are sometimes
224 accessed under the database lock, and sometimes without benefit of
225 any lock. They should probably be protected by the same lock as
226 globals private to the BEACON package.
227
228 + lastVoteTime - last time this server voted yes for us
229 + lastBeaconSent - last time a beacon was sent to this server
230 + lastVote - true if its late vote response was yes
231 + up - true if this server is believed to be up
232 + beaconSinceDown - true if a beacon has successfully been sent to
233 this server since the last time it was down
234
235 The following fields are used by the RECOVERY package to track the
236 state of each server. They are also examined and modified by the
237 various ContactQuorum_* functions. Currently, they are sometimes
238 accessed under the database lock, and sometimes without benefit of
239 any lock. They should be fully protected by the database lock.
240
241 + version - server's database version
242 + currentDB - true if server's database is up-to-date
243
244 The addr[] array is used to track all known addresses for the remote
245 server; this list is used when attempting to contact a down server
246 via an alternate address, and to identify the source of incoming
247 server-to-server RPCs. The first element of this array is always the
248 primary address and does not change after initialization; other
249 elements are updated during initialization and when a remote server
250 calls DISK_UpdateInterfaceAddr().
251
252 The system normally maintains an active Rx connection to the VOTE and
253 DISK services on each server; pointers to these are kept in the
254 'vote_rxcid' and 'disk_rxcid' fields.
255
256 Currently, the addr[] array, 'vote_rxcid', and 'disk_rxcid' are not
257 protected by any lock. Because these are shared among a number of
258 subsystems and it is usually desirable to avoid blocking on the
259 database lock when making RPCs, it is recommended to introduce a new
260 lock (the "server address lock") to protect these fields and the
261 globals 'ubikSecClass' and 'ubikSecIndex', which contain the security
262 classes used for setting up such connections. The new lock would
263 need to be held in the following cases:
264 + Before beginning any RPC using 'vote_rxcid' or 'disk_rxcid', for
265 the time it takes to call rx_GetConnection() or rx_NewCall()
266 + For the duration of ubikGetPrimaryInterfaceAddr()
267 + In ubeacon_Interact(), when building the set of connections to be
268 used for multi_VOTE_Beacon().
269 + In updateUbikNetworkAddress(), first while constructing the set of
270 connections to be used for multi_DISK_UpdateInterfaceAddr(), and
271 then again each time a server's list of addresses is updated.
272 However, the lock must NOT be held while waiting for the RPC to
273 complete, to avoid deadlock when multiple servers are starting at
274 the same time.
275 + For the duration of SDISK_UpdateInterfaceAddr()
276 + For the duration of ubeacon_ReinitServer(), with the optional
277 exception of the call to afsconf_UpToDate().
278 + In src/ubik/recovery.c:DoProbe(), first while constructing the set
279 of connections to be used for multi_DISK_Probe() and, if the probe
280 is successful, again while destroying the old connections for the
281 server under test and replacing them with new ones. See the
282 section on the RECOVERY package for additional notes.
283
284
285GLOBAL VARIABLES
286
287 The library has an unfortunate number of global variables which are not
288 part of any data structure. Nonetheless, some of these are
289 database-specific; see MULTI-DATABASE SUPPORT, below.
290
291 Globals private to a particular subsystem are discussed in the
292 description of that subsystem. This section discusses global variables
293 which are shared and/or part of external interfaces.
294
295 The following globals are used to convey configuration parameters from
296 the application to the Ubik library, and are intended to be set before
297 the first call to ubik_ServerInit(). They are not modified by Ubik, and
298 are not protected by locks:
299
300 + ubik_debugFlag - print debug messages (unused)
301 + ubikPrimaryAddrOnly - use only primary interface
302 + ubik_nBuffers - number of DISK package page cache buffers
303 + ubik_SRXSecurityProc - callback to create server security class
304 + ubik_SRXSecurityRock - argument for ubik_SRXSecurityProc
305 + ubik_CRXSecurityProc - callback to create client security class
306 + ubik_CRXSecurityRock - argument for ubik_CRXSecurityProc
307 + ubik_SyncWriterCacheProc - callback to update application cache
308 + ubik_CheckRXSecurityProc - callback to check if caller is OK
309 + ubik_CheckRXSecurityRock - argument for ubik_CheckRXSecurityProc
310
311 The following globals contain values which are computed during Ubik
312 initialization and not modified thereafter. They are not protected by
313 any lock, which is safe so long as suitable memory barriers exist across
314 creation of threads and Rx services (see the notes on initialization in
315 the UBIK package section). Note that some of these values are
316 database-specific and properly belong in the ubik_dbase structure:
317
318 + ubik_servers - linked list of servers (see above)
319 + amIClone - true if this server is a clone
320 + ubik_callPortal - UDP port used for server-to-server RPCs
321 + ubik_quorum - number of servers to make a quorum
322 + ubik_dbase - pointer to the (only) database (see above)
323 + ubik_host[] - array of local addresses
324 + ubik_sc[] - security classes for Ubik RPC services
325
326 The following globals are currently protected by the database lock,
327 though in most cases there are existing references which do not hold the
328 lock. Later sections of this document contain recommendations in each
329 case to establish new locks to protect these or to correct cases where
330 they are accessed without the correct lock.
331
332 + ubik_currentTrans - pointer to REMOTE's current transaction
333 + ubik_epochTime - epoch for transaction IDs
334 + urecovery_state - recovery state bits
335 + ubik_dbVersion - sync site's database version
336 + ubik_dbTid - sync site's transaction ID
337 + ubikSecIndex - security class index for making Ubik RPCs
338 + ubikSecClass - security class for making Ubik RPCs
339
340 The global 'ubik_amSyncSite' belongs to the BEACON package, and should
341 be protected by the same lock as that package's private globals. In
342 fact, it should probably be made private; the only references outside
343 BEACON are in SVOTE_Debug() and SVOTE_DebugOld() and can easily be moved
344 into ubeacon_Debug().
345
346
347MAJOR SUBSYSTEMS
348
349 This section describes each of the major Ubik server subsystems. For
350 each subsystem, a brief overview is provided, along with a description
351 of that subsystem's handling of synchronization (or lack thereof).
352 Subsystems are described beginning with the lowest layer and working
353 toward the highest.
354
355 PHYS - Physical I/O
356
357 This package is responsible for handling I/O to database files. It
358 maintains a private, global array of MAXFDCACHE(4) fdcache
359 structures, each representing an open file descriptor on a database
360 file, either presently in use or idle and available for use. Entries
361 are indexed by database file ID, and each contain a file descriptor
362 number and a refcount. There is also a global 'inited' flag, and
363 a buffer used to construct pathnames when opening database files.
364
365 The static function uphys_open() takes a database pointer and fileid
366 and attempts to open the corresponding file. On success, it returns
367 a file descriptor; on failure, it returns the same as open(2) (with
368 errno set). This prefers to reuse an already-open fd whose cache
369 refcount is 0; failing that, it opens a new fd and attempts to enter
370 it in the cache. It prefers first to use an unused cache entry, then
371 to reclaim an idle one whose refcount is 0; if neither is possible,
372 the newly-opened fd is returned without caching. Note that the
373 refcount on an active entry is always exactly 1; the same entry
374 cannot be referenced more than once.
375
376 The function uphys_close() releases a file descriptor opened by
377 uphys_open(). If the file descriptor was cached, the cache entry is
378 dereferenced but the file is left open for future use; 0 is returned.
379 If the file descriptor is not found in the cache, then the return
380 value is that of an immediate call to close(2). This function also
381 handles cleanup when a cached fd is closed after the file to which it
382 refers is invalidated by uphys_invalidate(). This function is not
383 static, but may as well be -- it is used only by other functions in
384 the PHYS package.
385
386 Functions in this package are called via method pointers in the
387 database structure, which are set during initialization and not
388 changed thereafter. These functions must be called with the database
389 lock held; this protects the file descriptor cache and other globals
390 mentioned above, and prevents conflicting access to database files.
391
392 The requirement that functions in this package be called with the
393 database lock held means that concurrent access to database files is
394 not possible. This situation could be improved by the addition of
395 a new global lock, which would be held for most of the duration of
396 uphys_open(), uphys_close(), and uphys_invalidate(), but need _not_
397 be held during actual file accesses. This would eliminate the
398 requirement that calls to this package be made with the database lock
399 held, allowing multiple calls to be active concurrently. However, it
400 would still be necessary for higher layers to employ appropriate
401 synchronization as needed to maintain database consistency.
402
403
404 DISK - Logical Database I/O and Transaction Management
405
406 This package is responsible for logical database file I/O, logging,
407 and transaction management. It maintains a private cache of database
408 file page buffers. This entire cache, including buffer space, is
409 allocated and initialized the first time a transaction is started;
410 the global initd flag keeps track of whether or not this has been
411 done. The size of this cache defaults to 20 pages, but may be
412 overridden by setting the global 'ubik_nBuffers' (all OpenAFS Ubik
413 services do this); this global is not protected by any lock, but is
414 read exactly once, when the cache is initialized.
415
416 Except for udisk_Debug() (see below), all udisk_* interfaces require
417 a pointer either to the database or to an active transaction, and
418 must be called with the database lock held. This protects the DISK
419 package's accesses of the active transaction list and associated
420 state variables as well as its calls into the physical access layer,
421 including insuring that operations requiring multiple calls into that
422 layer are performed atomically.
423
424 In addition, the database lock protects the page cache hash table and
425 LRU queue, the buffer control structures, buffer contents, the
426 truncation record free list (freeTruncList), and global statistics.
427 However, it may be desirable in the future to introduce a separate
428 lock to protect these structures, in order to facilitate multiple
429 database support.
430
431 udisk_LogOpcode, udisk_LogEnd, udisk_LogTruncate, udisk_LogWriteData
432
433 These are internal interfaces for writing to the transaction log.
434 They each construct a log record and then directly call the
435 physical layer to append it to the log; the log file is _not_
436 cached by the page buffer cache, and these functions do not touch
437 the cache. Each of these must be atomic with respect to other
438 writes to the log file, which is accomplished by insuring they are
439 called only with the database locked.
440
441 DInit, DRead, DTrunc, DRelease, DFlush, DAbort, DSync, DNew
442
443 These are internal interfaces for accessing the page buffer cache.
444 They must be called with the database lock held.
445
446 DRead() and DNew() return a pointer to the page data buffer, not
447 to the buffer control structure; this pointer is later passed to
448 DRelease() to release the buffer. This mechanism depends on the
449 fact that buffer control structures are allocated in the same
450 order as the actual page data buffers.
451
452 DRead() guarantees that a read transaction will always see a clean
453 copy of the requested page, without any modifications made by an
454 in-progress write transaction. However, this is true only if the
455 next layer insures that once DRead() is called with write intent
456 on a page, that no further calls to DRead() are made for that page
457 until the buffer is written and DRelease() is called with the
458 dirty flag. Presently, this is the case because udisk_read() and
459 udisk_write(), which are the only callers of DRead(), are both
460 called with the database lock held.
461
462 udisk_read, udisk_truncate, udisk_write, udisk_begin, udisk_commit,
463 udisk_abort, udisk_end, udisk_Invalidate
464
465 These are external interfaces provided by the DISK package to
466 higher layers. Presently, the locking requirements of these
467 functions and the lower-layer functions they call are satisfied by
468 the simple rule that these must always be called with the database
469 lock. In the future, support for multiple databases and/or
470 a greater level of concurrency may require relaxing this rule
471 and/or acquiring additional locks within these functions.
472
473 In udisk_commit(), when committing the first write transaction
474 after becoming sync site, the database is relabelled. In this
475 case, the UBIK_RECLABELDB recovery state bit should be set before
476 propagating the label change to other servers, rather than after.
477 This is because because ContactQuorum_DISK_Setversion() will
478 release the database lock, at which point the recovery state may
479 be cleared by urecovery_ResetState() running in another thread.
480 It is important that a relabelling which occurs before recovery
481 state is cleared not result in the UBIK_RECLABELDB recovery state
482 bit being set after; otherwise, the server may fail to correctly
483 relabel the database the next time it becomes sync site.
484
485 udisk_Debug
486
487 The udisk_Debug() interface is called as part of preparing the
488 response to Ubik debugging requests. Unlike other udisk_* APIs,
489 this interface is not called on a specific database and does not
490 require the database lock to be held. It reports the version data
491 of the single database identified by the global 'ubik_dbase' and
492 traverses the buffer cache collecting statistics, both without
493 benefit of holding any locks. This may produce inconsistent data
494 but does not normally risk damaging data structures or accessing
495 invalid or uninitialized memory (but see below), and avoiding
496 locking may be of greater benefit than guaranteeing the return of
497 consistent data.
498
499 The global 'ubik_dbase' pointer is set when the first (only)
500 database is initialized, and the buffer cache data structures are
501 allocated the first time a transaction is started, either locally
502 or via the REMOTE package. Once set, the pointers used by
503 udisk_Debug() are never changed, and the data structures they
504 point to remain valid and are never freed. The buffer cache is
505 traversed by iterating over the array of buffer cache elements
506 (this is done in other places as well), not by following a linked
507 list or other data structure that may reordered, and so should not
508 be dangerous.
509
510 At a minimum, then, it should be arranged that udisk_Debug()
511 cannot be called before the database structure and page cache have
512 been initialized. This means that both of these actions must be
513 taken before any incoming RPCs are accepted. Practically, this
514 means that DInit() should be renamed udisk_init(), and called from
515 ubik_ServerInitCommon() during initialization.
516
517 Full correctness demands that udisk_Debug actually hold the
518 database lock while copying the database version and examining the
519 page cache. However, it may be desirable to sacrifice this
520 correctness in order to reduce interaction between the debugging
521 interfaces and the rest of the system.
522
523 Helper functions:
524 + unthread - remove transaction from activeTrans
525 + Dlru, Dmru - move page to front/end of LRUQ
526 + MatchBuffer - check if a buffer's dbase/file/page match
527 + FixupBucket - move page to the right bucket
528 + newslot - find and allocate an unused buffer
529 + DedupBuffer - throw away stale copies after page is flushed
530 + GetTrunc, PutTrunc - trunc record allocation
531 + FindTrunc - find active trunc record for file
532 + DoTruncs - apply truncations
533
534
535 LOCK - Database Transaction Locking
536
537 This package is responsible for managing database locks held as part
538 of a transaction. It supports a single read/write lock, which is
539 represented by the global variable 'rwlock', a struct Lock. The
540 global 'rwlockinit' records whether Lock_Init() has been called on
541 this lock, and is protected by the database lock. Properly, this
542 lock should be treated as belonging to the database.
543
544 ulock_getLock() expects to be called with the database lock held; it
545 depends on this to protect access to fields within the transaction
546 structure as well as to the global 'rwlockinit'. However, note that
547 the database lock must be temporarily released while obtaining the
548 read/write lock, and so ulock_getLock() should not be called in the
549 middle of a critical section. The 'await' parameter may in theory be
550 set to 0 to effect a non-blocking lock; however, this depends on
551 (incorrect) knowledge of the internals of struct Lock. Fortunately,
552 existing Ubik code always sets await=1. This feature should be fixed
553 or removed.
554
555 ulock_relLock() releases the read/write lock. It expects to be
556 called with the database lock held. Unlock ulock_getLock(), this
557 function does not release the database lock.
558
559 ulock_Debug() reports the state of the read/write lock, for debugging
560 purposes. It depends on knowledge of the internals of struct Lock,
561 and accesses both the Lock structure and the global rwlockinit
562 without benefit of holding any lock.
563
564 It would probably be best to eliminate 'rwlockinit' in favor of
565 always initializing the global 'rwlock' during startup; ulock_Debug()
566 would then have no need to examine the flag. Further, ulock_Debug()
567 should at a minimum use the appropriate locks when looking inside
568 struct Lock; ideally, knowledge of that structure's internals should
569 be abstracted into appropriate interfaces in src/lwp/lock.[ch].
570
571
572 VOTE - Voting Logic and VOTE_* RPCs
573
574 This package is responsible for keeping track of who is currently
575 sync site and deciding how to vote and whether this server should try
576 to become sync site. It provides the VOTE RPC package, which is
577 primarily responsible for processing vote requests from peer servers
578 (VOTE_Beacon), but also provides debugging interfaces and an
579 interface (VOTE_GetSyncSite) to allow clients to discover the current
580 sync site.
581
582 Voting decisions are made using state tracked in a set of global
583 variables, listed below. Currently there is no locking protecting
584 these variables; LWP-based Ubik depends on the cooperative nature of
585 LWP and on the fact that routines which examine and manipulate them
586 do not block. Introduction of a new "vote lock" is recommended to
587 protect the following globals:
588
589 + ubik_lastYesTime - last time VOTE_Beacon voted "yes"
590 + lastYesHost - host voted for at ubik_lastYesTime
591 + lastYesClaim - time lastYesHost started its voting cycle
592 + lastYesState - whether lastYesHost claimed to be sync site
593 + lowestHost - lowest host seen
594 + lowestTime - last time lowestHost was updated/heard from
595 + syncHost - sync host, or 0 if none known
596 + syncTime - last time syncHost was updated
597 + ubik_dbVersion - sync site's database version
598 + ubik_dbTid - sync site's transaction ID
599
600 The vote lock should be initialized in uvote_Init() and held for the
601 remainder of that function and also during of uvote_ShouldIRun(),
602 uvote_GetSyncSite(). It should also be held for the bulk of
603 SVOTE_Beacon() -- specifically, it should be acquired just before
604 beginning the lowest-host calculation, and released just before
605 calling urecovery_CheckTid(), in the event of a yes vote or else just
606 before returning.
607
608 In addition, the vote lock would protect the globals 'ubik_dbVersion'
609 and 'ubik_dbTid', which represent this server's knowledge of the
610 state of the database on the sync site. This variable is set in two
611 places in the REMOTE package, which change the local and sync site
612 database versions at the same time. It is also examined in one place
613 in the REMOTE package, and compared to the database version in the
614 RECOVERY package. It is imperative that when the local and remote
615 versions are changed at the same time, these changes become visible
616 to the RECOVERY package at the same time. Thus, both the changes and
617 the comparison must be done under both locks. For this purpose,
618 three new functions should be provided by the VOTE package:
619
620 + A function to safely set 'ubik_dbVersion'
621 + A function to compare 'ubik_dbVersion' to another value
622 + A function to check whether this is sync site and compare
623 'ubik_dbVersion' to another value, without releasing the vote lock
624 in between (to be used in recovery).
625
626 Existing accesses to 'ubik_dbVersion' in REMOTE and RECOVERY, all of
627 which are already made under the database lock, would be replaced
628 with calls to the new accessors.
629
630 The result of these changes will require in several places that the
631 vote lock be acquired while the database lock is already held.
632 Therefore, the vote lock MUST NOT be held while acquiring the
633 database lock.
634
635 SVOTE_Beacon() currently calls urecovery_CheckTid() without benefit
636 of the database lock, which must be held when calling that function.
637 This is fairly simple to fix; however, the vote lock must be released
638 first. One simple approach would be to move the call to
639 urecovery_CheckTid() to after the point where the vote lock will be
640 released, wrapping it in DBHOLD/DBRELE, and conditionalizing the
641 whole thing on 'vote' being nonzero.
642
643 Separately, it may be worth examining whether it is appropriate to
644 call urecovery_CheckTid() only when voting yes, and not whenever
645 a beacon is received from a server claiming to be sync site.
646
647 As with the debug functions in other packages, SVOTE_Debug() and
648 SVOTE_DebugOld() access global variables belong to this and other
649 modules without benefit of the appropriate locks. There is also very
650 little abstraction here. It is probably desirable to introduce Debug
651 interfaces for the VOTE and RECOVERY packages, to be called from
652 these functions. Full correctness requires...
653
654 + Holding the vote lock while accessing the VOTE package globals
655 described above.
656 + Holding the database lock while accessing the database flags and
657 tidCounter, ubik_epochTime, ubik_currentTrans, and urecovery_state
658
659
660 BEACON - elections, tracking whether this is sync site
661
662 This package is responsible for running elections, collecting votes,
663 and deciding whether and for how long this is the sync site. It also
664 contains routines responsible for setting up the list of servers and
665 acquiring alternate addresses for other servers during startup.
666
667 For this purpose, it maintains a number of global variables and
668 ubik_server structure fields. Currently, these variables and fields
669 are sometimes accessed under the database lock, and sometimes without
670 benefit of any lock. In order to allow this package to operate with
671 a minimum of disruption due to database accesses, it is recommended
672 that a new lock be established to protect these. It is particularly
673 desirable to prevent loss of sync due to the potentially long-lived
674 RPCs the RECOVERY package must make under the database lock when
675 transferring full database copies between servers; for this reason,
676 it is important that both the BEACON thread and SVOTE_Beacon() be
677 able to operate without acquiring the database lock. The global
678 variables and ubik_server structure fields to be protected by the new
679 lock are the following:
680
681 + ubik_amSyncSite - true if this is the sync site
682 + syncSiteUntil - time until which this is the sync site
683 + ts->lastVoteTime - last time this server voted yes for us
684 + ts->lastBeaconSent - last time a beacon was sent to this server
685 + ts->lastVote - true if its late vote response was yes
686 + ts->up - true if this server is believed to be up
687 + ts->beaconSinceDown - true if a beacon has successfully been sent
688 to this server since the last time it was down
689
690 The new lock would need to be held in the following cases:
691 + In ubeacon_AmSyncSite(), from the first time 'ubik_amSyncSite' is
692 examined until after the potential call to urecovery_ResetState().
693 + In ubeacon_Interact(), when building the list of servers to which
694 to send beacons (to protect access to the 'up' flag). Note that
695 the server address lock must also be held during this loop, and
696 must be acquired _after_ the beacon lock.
697 + In ubeacon_Interact(), when updating status associated with
698 a server after receiving its beacon response.
699 + In ubeacon_Interact(), when updating globals after an election.
700 However, it must be released before calling urecovery_ResetState().
701 + In updateUbikNetworkAddress(), when marking a server down.
702 + In the various ContactQuorum_* functions, when checking whether
703 a server is up before calling it, and again when marking a server
704 down after a call fails.
705 + In ubik_EndTrans, when examining a server's 'beaconSinceDown' flag
706 and 'lastBeaconSent' timestamp to determine whether it has recently
707 gone down, requiring a brief hold until it either comes back up or
708 times out.
709
710 The following global variables are set during initialization and do
711 not change thereafter:
712
713 + nServers - number of voting servers
714 + amIMagic - true if this server gets the extra half vote
715 + ubik_singleServer - true if this is the only server
716
717 Currently, all initialization of this module is done _after_ the Rx
718 services have been created and an Rx server thread started. This is
719 done to avoid a situation in which all servers are blocked in
720 updateUbikNetworkAddress(), which is responsible for exchanging
721 addresses with peer servers, while none of them is prepared to
722 handled the DISK_UpdateInterfaceAddr call made by that function.
723
724 When using LWP, this early initialization of Rx is not a problem,
725 because the various initialization routines don't actually block until
726 updateUbikNetworkAddress() waits for RPCs to complete. However, when
727 using pthreads, it is a problem, because it means that incoming RPCs
728 may be processed when initialization is not complete.
729
730 To address this problem and insure orderly initialization, it is
731 recommended to change the initialization sequence so that most work,
732 including the bulk of BEACON package initialization, happens before
733 Ubik's Rx services have been created. The only work that should be
734 deferred until after that point is updateUbikNetworkAddress(),
735 followed by starting the long-running threads belonging to the BEACON
736 and RECOVERY packages. This requires updateUbikNetworkAddress() to
737 be exported (and perhaps to undergo a name change).
738
739 The major part of the work of this package is done in the function
740 ubeacon_Interact(), which is the body of a long-running thread known
741 as the beacon thread. This thread is responsible for periodically
742 sending "beacons" to other servers to request votes (when the sending
743 server is the sync site, these also include how long it will remain
744 so, the database version, and the active transaction ID). Beacons
745 are sent only from servers that are already sync site or believe they
746 are the best candidate, as determined by uvote_ShouldIRun().
747
748 On the sync site, the beacon thread is responsible for insuring that
749 new votes are collected frequently enough to avoid loss of quorum.
750 This means it must not block for an extended time; therefore, it must
751 not acquire the database lock, which in other threads may sometimes
752 be held during long-running operations (most notably, database file
753 transfers done under this lock by the recovery thread). Instead,
754 a new lock is proposed (the "version lock"), which is used to allow
755 the beacon thread to examine (but not modify) certain fields without
756 holding the database lock.
757
758 The version lock should be acquired _in addition to_ the database
759 lock when modifying 'ubik_epochTime' or the database 'version',
760 'flags', 'tidCounter', and 'writeTidCounter' fields; such
761 modifications occur at the following locations:
762 + in ubik.c:BeginTrans(), when beginning a new transaction
763 + in udisk_begin(), when setting DBWRITING
764 + in udisk_commit(), when updating the database version at the end of
765 a write transaction (note this includes the case of relabelling the
766 database on the first write transaction after becoming sync site)
767 + in udisk_end(), when clearing DBWRITING
768 + in recovery.c:InitializeDB(), when setting the initial version of
769 an empty database
770 + in urecovery_Interact(), when labelling a new database the first
771 time quorum is established
772 + in urecovery_Interact(), after fetching the latest database from
773 another server (whether successful or not; there are two cases)
774 + in SDISK_SendFile(), when updating the database version both before
775 and after receiving a new database from the sync site. Note the
776 lock must _not_ be held during the transfer.
777 + in SDISK_SetVersion()
778
779 The version lock should be acquired by the beacon thread while
780 examining these fields, shortly before calling multi_VOTE_Beacon().
781 Since it must release the lock before making that call, the database
782 version will need to be explicitly copied into a local variable, as
783 is already done with the current transaction ID.
784
785 Note that if UBIK_PAUSE is defined, the DBVOTING flag poses an
786 additional problem, as it must be modified by the beacon thread. If
787 it is desirable to continue to support this variant, then it becomes
788 necessary for all accesses to the database flags to be protected by
789 the version lock. Other UBIK_PAUSE code may not have been reviewed
790 in depth and may pose additional problems.
791
792 The function ubeacon_AmSyncSite() is called both from the beacon
793 thread and from various other points to determine whether they are
794 running on the sync site, and thus whether various operations are
795 safe and appropriate. Calls to this function from outside the beacon
796 thread (often, via urecovery_AllBetter()) must be made with the
797 database lock held, and for operations modifying the database or
798 relying on its contents to remain constant, this lock must not then
799 be released until the operation is complete. The does not guarantee
800 that this will remain sync site for the duration of the operation,
801 but does insure that any operations done after or as a result of
802 losing sync will in fact occur after the operation holding the lock.
803
804 Of course, ubeacon_Interact() itself cannot call ubeacon_AmSyncSite()
805 with the database lock held, since it must not block on that lock
806 when running on the sync site (otherwise sync could be lost, as
807 described above). So, that portion of ubeacon_AmSyncSite() other
808 than the call to urecovery_ResetState() should be abstracted out into
809 a new function, which presumably returns separate flags indicating
810 both whether it is in fact running on the sync site (which becomes
811 the return value of ubeacon_AmSyncSite()) and whether a call to
812 urecovery_ResetState() is indicated. When the latter is set, the
813 beacon thread can acquire the database lock (since, if this is not
814 the sync site, blocking on this lock is not a problem) and call
815 urecovery_ResetState().
816
817 This difference in behavior is acceptable because the required
818 invariant is that when 'ubik_amSyncSite' is cleared, it cannot become
819 true again until urecovery_state has also cleared. This insures that
820 once a server discovers it is not sync site, urecovery_AllBetter()
821 cannot succeed on the basis of being sync site until sync is regained
822 and recovery has run, insuring the database is up to date. Since
823 only the beacon thread ever sets 'ubik_amSyncSite' true, that thread
824 can meet the invariant by calling urecovery_ResetState() before doing
825 anything else, while other threads must continue to hold the beacon
826 lock to prevent the beacon thread from setting 'ubik_amSyncSite'.
827
828 ubeacon_Debug() reports the value of syncSiteUntil, without benefit
829 of obtaining any locks. It should also report the value of
830 'ubik_amSyncSite', allowing that global to be made private to the
831 BEACON package. Full correctness requires that the beacon lock be
832 held while accessing these globals.
833
834
835 RECOVERY - Consistency, Down Server Recovery, Propagation, Log Replay
836
837 This package is responsible for tracking whether the local database
838 is current and usable and for discovering when a server comes back up
839 after having been marked down. On the sync site, it is also
840 responsible for tracking the database versions on remote servers and
841 for recovering from inconsistencies.
842
843 To carry out these tasks, the RECOVERY package maintains a single
844 global, 'urecovery_state'. This is not currently effectively
845 protected by any lock; however, it should be protected by the
846 database lock. This is particularly important because high-level
847 routines that manipulate the database depend on a check against this
848 field to determine that the database is up-to-date and safe to
849 modify, and that it will remain so until the operation is complete.
850
851 The main work of this package is done by urecovery_Interact(), which
852 is the body of a long-running thread known as the recovery thread.
853 This thread wakes up periodically to discover if any down servers
854 have come back up and, on the sync site, to locate the latest
855 available database, fetch it if necessary, and make sure it has been
856 distributed to all servers. This work is done in several steps,
857 taken on in sequence every few seconds:
858
859 + The first part of urecovery_Interact() is to identify servers which
860 are believed to be down, and determine whether any of them have
861 come back up. This is done every 30 seconds, even when this is not
862 the sync site. Currently, the database lock is held for the
863 duration of this loop, released only in DoProbe() while actually
864 waiting for the DISK_Probe() RPC to complete. This is mostly
865 unnecessary; the server probe loop needs this lock only to examine
866 the database 'currentDB' flag and to clear the UBIK_RECFOUNDDB bit,
867 and DoProbe() doesn't need it at all. However, the beacon lock is
868 needed to examine and update the database 'up' flag, and DoProbe()
869 needs to hold the server address lock when examining server
870 addresses and, if the server comes back up, to replace connections.
871 See the section on struct ubik_server for more details on this
872 lock.
873
874 + The next step is to determine whether this is the sync site, and
875 update the UBIK_RECSYNCSITE bit appropriately. The database lock
876 should be held for the duration of this step, as
877 ubeacon_AmSyncSite() may end up calling urecovery_ResetState(),
878 which requires that lock be held. If this is not the sync site,
879 the cycle ends here.
880
881 + Step three is to determine the location of the latest database.
882 This is done by making DISK_GetVersion() calls to all active
883 servers, keeping track of the newest version found and the server
884 which reported it. Once this is done, the best version is compared
885 to the local version, update the UBIK_RECHAVEDB bit (so that it is
886 true if and only if the local version is the latest) set the
887 UBIK_RECFOUNDDB bit to indicate the latest database has been
888 located, and clear the UBIK_RECSENTDB bit to force any servers with
889 out-of-date databases to be updated.
890
891 In this section, it is not necessary to hold any locks while
892 collecting versions on remote servers, though as before, it is
893 necessary to hold the beacon lock while examining the server 'up'
894 flag, and the address lock to obtain a reference on each connection
895 as it is used (but this lock should not be held during the RPC!).
896 Once all remote versions have been fetched, the database lock is
897 needed while examining the local version and updating
898 'urecovery_state'. Note that write transactions may happen while
899 versions are collected, resulting in some servers having newer
900 versions than were detected. However, if this happens, the result
901 is always consistent and ends with the sync site having the latest,
902 canonical version:
903
904 + If the actual latest version X.N before the write was in an epoch
905 created by the current sync site, then that version must
906 necessarily already be present on the sync site, since writes are
907 always committed first on the sync site. Thus, when the sync
908 site updates the version counter, it produces a new version X.N+1
909 which no other server can believe it already has; thus, there is
910 no possibility of inconsistency.
911
912 + If the actual latest version X.N before the write was in an epoch
913 not created by the sync site, then it is possible that after
914 quorum was established but before any writes happened, a new
915 server came online which had a newer version X.N+1 (that was
916 committed to less than a quorum of sites before a crash). If
917 a write transaction begins before recovery detects this fact and
918 fetches the version X.N+1 database, then the new write will start
919 with the version X.N database, essentially creating a version
920 fork. However, when this happens, the resulting database will be
921 version Y.1 (with Y>X), because the first write on a new sync
922 site always results in a new database. Thus, the version X.N+1
923 database will no longer be latest (because the local version,
924 which is examined last, is Y.1), and will end up being discarded.
925 This is no worse than if the server with the version X.N+1
926 database had not come up until completely after the transaction
927 resulting in version Y.1.
928
929 + Step four is to fetch the latest database, if the sync site does
930 not already have it. Since this step replaces the database
931 wholesale, there can be no active transactions while it runs, and
932 all other database activity must be locked out for the duration of
933 the transfer. This means the database lock must be held for this
934 entire step, starting from the check that the UBIK_RECSYNCSITE
935 and/or UBIK_RECFOUNDDB bits are still set. (Note that the existing
936 comment which claims that it is impossible for UBIK_RECFOUNDDB not
937 to be set at this point is true only if the database is not
938 released since checking or updating that bit in step three). In
939 addition, the address lock must be held while setting up the call
940 to be used for DISK_GetFile().
941
942 + Step five, again after verifying that this is still the sync site
943 and that the UBIK_RECHAVEDB bit is set, is to distribute the new
944 database to any servers which do not already have it. Before this
945 can be done, if the database was newly-initialized (with label
946 epoch 1), it is relabeled with version 2.1 before it is distributed
947 to other sites. This allows readany transactions to run on such
948 a database (though probably not successfully, since the database is
949 empty); it is deferred until this point in recovery to insure that
950 if any server in the quorum contains a valid database, that version
951 is used rather than the empty one.
952
953 For this step, the database lock should be held starting at the
954 point at which the UBIK_RECSYNCSITE and/or UBIK_RECHAVEDB bits are
955 tested. In the event the DBWRITING flag is set, the lock must be
956 released while waiting for the active write transaction to end. In
957 this case, once the wait is completed, it is necessary to again
958 check urecovery_state to determine that this is still the sync site
959 and have an up-to-date database before proceeding.
960
961 It is probably simplest for urecovery_Interact() to acquire the
962 database lock at the start of step 2 described above, release it for
963 the duration of the version-collection part of step 3 and while
964 waiting for write transactions to terminate in step 5, and otherwise
965 hold it until the end of the cycle. If no work is needed, this means
966 the lock will be held relatively briefly; if it is necessary to fetch
967 and or send full database versions, these operations themselves must
968 be done with the lock held, and there is little point in releasing it
969 between them. However, it is acceptable to release the lock between
970 any steps described above, provided that each time the database lock
971 is released and reacquired, 'urecovery_state' is checked to verify
972 that no regression has occurred. Note that outside of the recovery
973 thread, the only changes that may occur to 'urecovery_state' are that
974 it may be completely cleared by urecovery_ResetState(), and that the
975 UBIK_RECLABELDB bit may be set by udisk_commit(), on the sync site.
976
977 The function urecovery_AllBetter() is called under the database lock
978 by routines which need to know whether there is a usable database.
979 As noted in the discussion of ubeacon_AmSyncSite() above, callers
980 must then continue to hold the database lock until any database
981 operation is complete. Further, callers which intend to write to the
982 database currently follow a call to urecovery_AllBetter() with a call
983 to ubeacon_AmSyncSite(), to determine whether a write is permissible.
984 Unfortunately, this is not safe, because it is possible (though
985 admittedly unlikely) for urecovery_AllBetter() to succeed on the
986 basis of being a non-sync-site with an up-to-date database, and then
987 for a subsequent call to ubeacon_AmSyncSite() to return a true result
988 (because the latter may block on acquiring the beacon lock, during
989 which time this may become the sync site). To insure that no write
990 operation is performed without an up-to-date, writeable database,
991 urecovery_AllBetter() should be modified to indicate whether a write
992 operation is permitted, which is the case only when UBIK_RECHAVEDB is
993 tested and found to be set.
994
995 The functions urecovery_AbortAll(), urecovery_CheckTid(), and
996 urecovery_ResetState() all expect to be called with the database lock
997 held.
998
999 The function urecovery_LostServer() need not be called with any
1000 locks; it is merely a wrapper for ubeacon_ReinitServer().
1001
1002 The RECOVERY package is also responsible at startup for replaying the
1003 transaction log and initializing an empty database. These tasks are
1004 handled by urecovery_Initialize() and its helpers, ReplayLog() and
1005 InitializeDB(). Because the helpers use PHYS package functions and
1006 manipulate database structure fields, urecovery_Initialize() should
1007 acquire the database lock before calling them.
1008
1009
1010 REMOTE - remote database I/O (DISK) RPCs
1011
1012 This package provides the DISK RPC package, which mediates remote
1013 access to the local database copy, when another server is sync site.
1014 It also includes DISK_UpdateInterfaceAddr(), which is used during
1015 initialization to verify consistency of configuration across servers,
1016 and DISK_Probe(), which is used by the RECOVERY module to determine
1017 whether the server is up.
1018
1019 As the functions in this package are RPC implementations, they may be
1020 called at any time and are entirely responsible for their own
1021 locking. There can be at most one active remote write transaction at
1022 a time; this rule is enforced by this package, which maintains the
1023 global 'ubik_currentTrans' as a pointer to the active transaction.
1024
1025 For the most part, the RPCs in this module use a common prologue,
1026 which includes acquiring the database lock, and do not release that
1027 lock until just before returning. However, the common prologue
1028 examines both 'ubik_currentTrans' and the transaction flags without
1029 benefit of the database lock. This can be corrected fairly easily by
1030 acquiring the database lock sooner; however, this requires referring
1031 to the database via the ubik_dbase global rather than via the 'dbase'
1032 pointer in 'ubik_currentTrans'.
1033
1034 Note that SDISK_GetVersion() and SDISK_SetVersion() contain a sanity
1035 check to ensure they are not running on the sync site, since these
1036 are called only from that portion of the recovery loop which runs
1037 only on the sync site. These calls to ubeacon_AmSyncSite() must be
1038 made with the database lock held, for the results to be valid. The
1039 same holds true for a commented-out check in SDISK_GetFile() (which
1040 really should remain that way -- there's no reason not to allow any
1041 authorized caller to do this at any time).
1042
1043 Additionally, note that SDISK_Commit() acquires a write lock on the
1044 application cache, to insure that no transactions are running which
1045 depend on the application cache, page cache, or underlying database
1046 to change. The locking order requires that this lock be acquired
1047 before taking the database lock; thus, both locks must be acquired
1048 earlier.
1049
1050 SDISK_SendFile() is really quite bad in its handling of the database
1051 lock under error conditions. Presently, an error condition may cause
1052 it to release the lock too early nor not at all.
1053
1054 SDISK_UpdateInterfaceAddr() is called by peer servers as part of
1055 their startup process, to trade lists of additional addresses and
1056 insure sufficient agreement on configuration to allow the new server
1057 to start. It will need to acquire the lock protecting the addr[]
1058 array attached to each host.
1059
1060
1061 UBIK (high-level API)
1062
1063 This package contains the high-level APIs exported by Ubik to
1064 applications, including initialization, the APIs for creating and
1065 using transactions, and various utilities. It also contains a number
1066 of internal utilities.
1067
1068 Initialization
1069
1070 Initialization of Ubik is handled in ubik_ServerInitCommon(),
1071 which is called by two API functions, ubik_ServerInit() and
1072 ubik_ServerInitByInfo(), which provide different interfaces for
1073 passing configuration information. This code handles
1074 initialization of the database, locks, and various globals, calls
1075 the initialization routines for each subsystem, makes sure Rx has
1076 been initialized, and creates Ubik's Rx services and threads.
1077
1078 Most of this is done before any Ubik-specific threads are created
1079 and before Ubik's RPCs can be called, and so takes place without
1080 holding any lock. However, since Rx may have been initialized and
1081 started before Ubik, it is possible that some Rx worker threads
1082 may have already been created. In practice, locks internal to Rx
1083 should insure that any memory writes made before an Rx service is
1084 created become visible to any worker running on behalf of that
1085 service; nonetheless, this should not be depended on, and proper
1086 locking should be used during all phases of initialization.
1087
1088 Unfortunately, the current code actually starts Rx services before
1089 initialization is complete, resulting in a number of possible
1090 races. The general THREAD SAFETY section below contains
1091 recommendations for reordering the startup process to eliminate
1092 this problem, while still avoiding potential deadlock if multiple
1093 servers start at once.
1094
1095 ContactQuorum_*
1096
1097 Nearly a third of ubik.c consists of various ContactQuorum_*
1098 functions, which are used to make a particular RPC to every server
1099 in the quorum. These used to be a single ContactQuorum()
1100 function, but have been split into multiple functions to regain
1101 type-safety. These retain a common prologue and epilogue, some
1102 small part of which has been abstracted out, but most of which
1103 remains duplicated in each of these functions. Indeed, the only
1104 part unique to each function is the actual RPC call and, in the
1105 case of ContactQuorum_DISK_WriteV(), code to fall back to multiple
1106 calls to DISK_Write() on a per-server basis.
1107
1108 The code common to all ContactQuorum_* functions will need to
1109 acquire the beacon lock when checking whether a server is up (no
1110 calls are made to servers which are not marked up), and again when
1111 marking a server down. Note that when a server is marked down,
1112 the 'up' and 'beaconSinceDown' flags must be cleared at the same
1113 time, without releasing the beacon lock. In addition, the helper
1114 function Quorum_StartIO() must obtain the server address lock
1115 while it obtains a reference on the Rx connection that will be
1116 used.
1117
1118 All of the ContactQuorum_* functions must be called with the
1119 database lock held; this is necessary to protect their access to
1120 each server's 'version' and 'currentDB' fields and to the local
1121 database version. However, these functions release the lock while
1122 actually making RPCs. This behavior is new -- ContactQuorum() in
1123 OpenAFS 1.4 did not release the database lock during RPCs -- but
1124 it is safe, because no caller of ContactQuorum_*() depends on
1125 state read under the database lock before the call to remain valid
1126 after.
1127
1128 Transactions
1129
1130 The main Ubik transaction API consists of ubik_BeginTrans(),
1131 ubik_BeginTransReadAny(), ubik_BeginTransReadAnyWrite(),
1132 ubik_AbortTrans(), ubik_EndTrans(), ubik_Read(), ubik_Write(),
1133 ubik_Flush(), ubik_Seek(), ubik_Tell(), ubik_Truncate(), and
1134 ubik_SetLock(). The ubik_Begin* functions require a database
1135 pointer; all others require an active transaction pointer. These
1136 functions are called directly by the application without any
1137 locks, and are responsible for their own locking. Note that many
1138 of these examine the transaction type before acquiring any locks;
1139 this is permissible because the transaction type is immutable once
1140 the transaction is created (but it requires that the call be made
1141 from the thread that created the transaction, or that the
1142 application have used some mechanism to insure an appropriate
1143 memory barrier exists; the same mechanism used to protect the
1144 transaction pointer should suffice).
1145
1146 BeginTrans() contains the common code responsible for starting
1147 a new transaction; this is made available to applications as
1148 ubik_BeginTrans(), ubik_BeginTransReadAny(), and
1149 ubik_BeginTransReadAnyWrite(). If a write transaction is in
1150 progress, this function waits for it to complete. However, since
1151 doing so releases the database lock, it is necessary to call
1152 urecovery_AllBetter() again once the lock has been reacquired.
1153 Further, in the UBIK_PAUSE case, the DBVOTING flag may have been
1154 set, so this also must be retested (see the BEACON package section
1155 for more on the difficulties of UBIK_PAUSE and DBVOTING).
1156
1157 Both ubik_Flush() and ubik_Write() examine and manipulate the
1158 transaction I/O vector without benefit of any lock. These fields
1159 ('iovec_info' and 'iovec_data') should be protected by the
1160 database lock. In the case of ubik_Write(), this may require
1161 releasing the lock before calling ubik_Flush(), or the
1162 introduction of a private ubik_Flush_r() which assumes the lock is
1163 already held (but note that in either case, the lock is released,
1164 because ubik_Flush() calls ContactQuorum_* functions).
1165
1166 Utilities
1167
1168 The functions ubik_GetVersion() and ubik_WaitVersion() provide the
1169 application with a way to discover the current database version
1170 and to wait for it to change. These interfaces are not currently
1171 used in OpenAFS. ubik_GetVersion() needs to acquire the database
1172 lock while copying the database version.
1173
1174 The internal function ubikGetPrimaryInterfaceAddr() is used by
1175 Ubik RPCs to determine a peer server's primary address, given the
1176 IP address from which a call arrived. This needs to hold the
1177 server address lock, as described in the section on struct
1178 ubik_server.
1179
1180 Application Cache
1181
1182 The section on struct ubik_dbase describes the operation of the
1183 application cache. The function ubik_CheckCache() is provided to
1184 allow an application to insure the cache is up to date and obtain
1185 a read lock on the cache which lasts for the duration of the
1186 transaction.
1187
1188 Note that ubik_AbortTrans() currently always invalidates the
1189 application cache by clearing ubik_dbase->cachedVersion, for which
1190 it requires a write lock on the cache_lock. This means that
1191 aborted transactions block waiting for completion of any
1192 transactions which hold the cache_lock, which may well include all
1193 outstanding transactions. In the case of read transactions, which
1194 cannot have modified the database, this is unnecessary and may
1195 significantly reduce concurrency.
1196
1197
1198THREAD SAFETY
1199
1200 This section discusses changes needed to insure thread safety.
1201
1202 The initialization sequence in ubik_ServerInitCommon() should be cleaned
1203 up, to ensure that every subsystem is fully initialized before starting
1204 any background threads or accepting RPCs. The correct initialization
1205 sequence should look something like this:
1206
1207 + Initialize error tables
1208 + Allocate and initialize the database structure
1209 + Initialize Rx
1210 + Initialize the various subsystems:
1211 + udisk_Init() (formerly disk.c:DInit)
1212 + ulock_Init() (new, pulled out of ulock_getLock)
1213 + uvote_Init()
1214 + urecovery_Initialize()
1215 + ubeacon_InitServerList*
1216 + Create Rx services
1217 + Start an Rx server thread
1218 + updateUbikNetworkAddress() (rename and export from beacon.c)
1219 + Start the beacon and recovery threads
1220
1221 Initialization functions may need to be added for some subsystems which
1222 do not currently have them:
1223
1224 + src/ubik/disk.c:DInit() should be exported and renamed
1225 udisk_Init(); it can then be called from ubik_ServerInitCommon()
1226 instead of from udisk_begin().
1227
1228 + The lock initialization currently done in ulock_getLock() should
1229 instead be done in a separate ulock_Init(), which should be called
1230 from ubik_ServerInitCommon()
1231
1232 A new lock is needed to protect state belonging to the VOTE package,
1233 including the globals 'ubik_dbVersion' and 'ubik_dbTid', which represent
1234 the sync site's database state. See the VOTE package section for
1235 details on the use of the vote lock.
1236
1237 A new lock is needed to protect state belonging to the BEACON package,
1238 including the globals 'ubik_amSyncSite' and 'syncSiteUntil' and several
1239 fields in the ubik_server structure. This lock also protects the
1240 per-server 'up' flag, which is shared among several modules. In
1241 addition, some beacon-specific fields are accessed directly by other
1242 modules. Thus, this lock must be visible outside the BEACON module, or
1243 else accessors must be provided for these items. See the BEACON package
1244 section for details on the use of the beacon lock.
1245
1246 A new lock is needed to provide additional protection for the database
1247 version, transaction counter, and flags, so that these can safely be
1248 examined (but not updated) by the beacon thread without the need to
1249 acquire the database lock. See the BEACON package section for details
1250 on the use of the version lock.
1251
1252 A new lock is needed to protect per-server lists of non-primary
1253 addresses, connections to the VOTE and DISK services on each server, and
1254 the client-mode security class objects used to set up these connections.
1255 See the section on struct ubik_server for details on the use of the
1256 server address lock.
1257
1258 The required locking order is described by the following list. Any of
1259 these locks may be acquired singly; when acquiring multiple locks, they
1260 should be acquired in the listed order:
1261
1262 + application cache lock (dbase->cache_lock)
1263 + database lock (DBHOLD/DBRELE)
1264 + beacon lock
1265 + vote lock
1266 + version lock
1267 + server address lock
1268
1269 Some cleanup is required in various places where existing locking
1270 conventions are not properly obeyed:
1271
1272 + SVOTE_Beacon() needs to hold the database lock when calling
1273 urecovery_CheckTid()
1274
1275 + Management of the database lock in SDISK_SendFile() needs to be
1276 cleaned up so that it is always released at the proper time.
1277
1278 + urecovery_Interact() is not consistent about holding the database
1279 lock when it examines and updates 'urecovery_state'. It also
1280 examines the local database version at least once without benefit
1281 of this lock. See the RECOVERY package section for details on
1282 which locks are needed when by urecovery_Interact().
1283
1284 + ubik_Flush() and ubik_Write() need to acquire the database lock
1285 before accessing the 'iovec_info' and 'iovec_data' fields.
1286
1287 + ubik_GetVersion() needs to acquire the database lock before copying
1288 the database version.
1289
1290 The various debugging RPCs in the VOTE packages, and the data collection
1291 routines they call in other packages, generally operate without benefit
1292 of any locks. This is probably desirable, as it avoids any possibility
1293 of debugging operations blocking on or interfering with the rest of the
1294 system, and allows debugging data to be collected even in the event of
1295 some unforeseen deadlock. It is also "safe", in that it does not risk
1296 damage to data structures or access to uninitialized memory, provided
1297 that data structure initialization is complete before these functions
1298 are called. However, it does mean that these RPCs may return data that
1299 is not entirely self-consistent.
1300
1301 + Correctness requires various locks be held during parts of
1302 SVOTE_Debug() and SVOTE_DebugOld(); see the VOTE package section.
1303
1304 + SVOTE_Debug() also examines 'ubik_currentTrans' without benefit of
1305 the database lock. This usage is not "safe", as that transaction
1306 may be freed by another thread while SVOTE_Debug() is examining it.
1307
1308 + SVOTE_XSDebug() should hold the server address lock while copying
1309 out server addresses, and other locks while copying per-server
1310 state.
1311
1312 + udisk_Debug() should hold the database lock while examining the
1313 database version and collecting page cache statistics.
1314
1315 + ulock_Debug should use LOCK_LOCK/LOCK_UNLOCK when examining the
1316 internals of struct Lock. Ideally, knowledge of these internals
1317 should be abstracted to src/lwp/lock.h.
1318
1319 + ubeacon_Debug() should hold the beacon lock while accessing the
1320 value of 'syncSiteUntil' and, if the access is moved here from
1321 SVOTE_Debug/SVOTE_DebugOld, 'ubik_amSyncSite'.
1322
1323CONCURRENCY THOUGHTS
1324
1325 This section collections information discussed previously about changes
1326 that may be required or desirable in order to improve concurrency. Note
1327 that none of these changes are needed to insure thread safety, provided
1328 the changes discussed in the previous section are made.
1329
1330 + PHYS could be made more concurrent by maintaining a separate lock for
1331 the fdcache, avoiding the need to DBHOLD while calling these.
1332 However, higher-layer consistency probably requires that certain calls
1333 or groups of calls to this package be atomic and isolated.
1334
1335 + Could DISK be made more concurrent by maintaining a separate lock for
1336 the page cache and or free truncation record list?
1337
1338 + Could concurrency be improved by maintaining separate per-transaction
1339 locks? If so, what is the locking hierarchy with respect to database
1340 locks and the DISK and PHYS package locks?
1341
1342 + Could concurrency be improved by maintaining more reader/writer state
1343 and/or separate write locks to insure consistency of database file
1344 access without requiring so much DBHOLD?
1345
1346 + Alternately, could concurrency be improved by requiring that an active
1347 transaction be accessed only by the thread that created it?
1348
1349
1350MULTI-DATABASE SUPPORT
1351
1352 There isn't any. Some of the Ubik interfaces and even the organization
1353 of some internal data structures makes it appear as though multiple
1354 databases can be supported. However, this is not the case, for several
1355 reasons:
1356
1357 + Only a single set of peer servers is tracked, via a global linked list
1358 of server structures. Each server structure includes state such as
1359 the server's database version and whether it is up-to-date, so it is
1360 not possible to track multiple databases even for a common set of
1361 servers.
1362 + The physical I/O package (phys.c) tracks cached open file descriptors
1363 without regard to what database a file is part of; thus, it cannot
1364 handle accessing multiple databases at the same time.
1365 + There are a variety of globals containing state which needs to be
1366 tracked on a per-database basis.
1367 + There are a variety of globals which are effectively protected by
1368 a per-database lock, or by no lock at all.
1369 + The various RPCs used to communicate between servers assume a single
1370 database and do not include a database identifier. Thus, in the
1371 presence of multiple databases, it would be impossible to determine,
1372 for example, for which database a DISK RPC was intended.
1373
1374 The upshot of this is that considerable work would be needed to get Ubik
1375 into a condition which would allow a single process to maintain multiple
1376 independent Ubik databases at the same time. Note that this is
1377 orthogonal to the question of whether a single multi-file database is
1378 possible.
1379
1380
1381OTHER DATA STRUCTURES
1382
1383 Some other data structures are mentioned in ubik.p.h or elsewhere, but
1384 are of relatively little interest from the point of view of threading.
1385 They are described here primarily as an alternative to deleting
1386 descriptive text that was already written.
1387
1388 struct ubik_hdr - on-disk database file header
1389
1390 This structure represents the on-disk format of the Ubik database
1391 file header (label). It is used only for local variables in
1392 uphys_getlabel() and uphys_setlabel(), which read and write this
1393 header, respectively.
1394
1395 struct ubik_stat - database file status
1396
1397 This structure is used to convey the size and modification timestamp
1398 of a database file. It is used as the type of an output parameter of
1399 uphys_stat(), and for local variables in callers of that function.
1400
1401 struct ubik_stats - global statistics
1402
1403 This structure is used to contain "miscellaneous" global statistics.
1404 Currently that consists of a single counter which is incremented in
1405 one place (an exceptional condition in ubik_EndTrans) and is never
1406 read. The one increment is performed only when holding the database
1407 lock; thus, this structure may be treated the same as other globals
1408 protected by non-global locks.
1409
1410OTHER NOTES
1411
1412 There appears to be nothing to insure that calls to disk.c:DRead()
1413 during a write transaction won't find and use a "clean" cached page when
1414 there is already a dirty page in the cache resulting from an earlier
1415 write as part of the same transaction. This means that if a transaction
1416 reads a page after writing it, it may read back the original data
1417 instead of the new, and if a transaction writes a page more than once,
1418 the later writes may corrupt the database. It seems likely that we are
1419 merely lucky in this regard, and this problem should be fixed.
1420
1421 src/ubik/beacon.c:ubeacon_ReinitServer() assumes 'ubik_CRXSecurityRock'
1422 is in fact an AFS configuration directory pointer, suitable as an
1423 argument to afsconf_UpToDate(). This is inappropriate;
1424 'ubik_CRXSecurityRock' is an opaque argument to
1425 (*ubik_CRXSecurityProc)(), Ubik must not make assumptions about what it
1426 is.
1427
1428 It may be worth examining whether it is appropriate for SVOTE_Beacon()
1429 to call urecovery_CheckTid() only when voting yes, and not whenever
1430 a beacon is received from a server claiming to be sync site.
1431
1432 Recently, code has been introduced which attempts to insure that the
1433 recovery process does not truncate an old database file before a new one
1434 has been fully transferred, since this could, depending on the timing of
1435 a server failure, deprive the new quorum of a valid database. This
1436 so-called "new recovery" code violates the abstraction provided by the
1437 PHYS package, encoding specific knowledge of the underlying file store
1438 into the REMOTE and RECOVERY packages. This means that the backend
1439 provided by phys.c cannot be replaced with an alternative, since in that
1440 case recovery would do the wrong thing with the transferred database.
1441
1442 The DBWRITING loop in urecovery_Interact() is not portable, as it
1443 depends the contents of the timeout after a call to select(2). This was
1444 fine for IOMGR_Select(), which has defined behavior, but not for
1445 select(2).
1446