1 #pragma section-numbers on
2
3 This page was meant to organize a discussion and is not the canonical reference on our organizational decisions. It may often be out of date.
4
5 [[TableOfContents(2)]]
6
7 = Terminology =
8 To save space below, we'll use the following working names for the different pieces of hardware involved:
9
10 * '''Main''' is the machine hosting most services.
11 * '''Dynamic''' is the machine hosting member dynamic web sites and other services where we run arbitrary code written by members.
12 * '''Shell''' is the "most anything goes" shell server.
13 = The Big List of Scary Things =
14 These are the issues that we're dealing with for the first time in our new set-up, meaning that we should pay special attention to them.
15
16 * Multiple servers and coordinating their interaction
17 * A shared file system
18 * Increasing member base and corresponding system load
19 * Different public/private networks, thanks to some switch magic
20 * Serious automated remote back-up service
21 * Centralized system logins
22 = The Big Questions =
23 == What Debian version do we run on each server? ==
24 AdamChlipala suggests stable on Main and testing on Dynamic and Shell because:
25
26 * We want our primary services to be as reliable as possible.
27 * Members will want to use some cutting-edge stuff for running their dynamic web sites and custom daemons, and stable doesn't keep up very well with the cutting edge. On the other hand, unstable just seems too risky.
28 * If Shell is used as a testing environment for services later pushed to Dynamic, then it should have the same software versions as Dynamic.
29
30 '''Update''': We're currently planning stable on Main and Dynamic, since testing too often has catastrophic upgrade failures in practice.
31
32 == What resource limits are imposed on the different servers? ==
33 === Decisions that we've agreed on ===
34 * We don't need explicit limits on usage of Main's local resources, because only admins will be able to control them.
35 === Questions to be resolved ===
36 1. Do we impose ulimits and related stuff on Dynamic?
37 AdamChlipala says:
38 . We need some measures in place to prevent runaway processes from crashing everyone's dynamic web sites. The question is, do we use automated measures or do we just monitor closely and intervene manually when needed? A bad runaway process can take the server down quickly, so I think it's necessary to use ulimits and their ilk.
39
40 1. How do we control resource usage on Shell?
41 AdamChlipala says:
42 . I think I'm in favor of no ulimits or similar on Shell, relying on monitoring and manual intervention to deal with runaway processes and other horrors. We've already had some folks unable to use some implementations of non-mainstream programming languages because these implementations aren't able to deal with our resource limits... and, if you know me, you can probably guess that that Just Breaks My Heart!
43
44 1. Where we do decide to use monitoring and manual intervention, what monitoring tools can best help us do it?
45 DavorOcelic says:
46 . I've talked about this multiple times before, and I'm still interested in doing something real in this area. First of all, there's a log parser I've written, which is very similar to Logsurfer (or Logsurfer+ for that matter) but resolves some of their crucial limitations; we'd definitely turn the Main machine into a common loghost, so that would be a good place to deploy it. The second tool I have in mind is Nagios, a ping/service/anything monitoring tool. The third is the excellent Puppet (a kind of new-generation cfengine), which we can script to test and fix things on our systems.
47 == Who can log into which servers? ==
48 === Decisions that we've agreed on ===
49 * Only admins can log into Main
50 * Everyone can log into Shell
51 DavorOcelic says:
52 . This is a good general rule. For any exceptions, both the usual Unix auth mechanism and LDAP allow great flexibility (per-user list of allowed machines and also per-machine list of allowed users).
53 === Questions to be resolved ===
54 1. Can everyone log into Dynamic, too?
55 AdamChlipala says:
56 . I think it is important to allow this. My mental model has Shell made deliberately unstable because we don't know how to impose automatic limits that allow all of the stuff that people want to do. I know that a lot of the people involved in this planning aren't particularly interested in using non-mainstream programming languages and other things that conventional hosting providers are never going to support, but for me and several other members this is one of the defining aspects of HCoop. That means that we need to be able to go crazy with Shell, while committing to keeping Dynamic up all the time. If Shell is down, members need to be able to use Dynamic to configure their services. That doesn't mean that they can't use the development-production split model when Shell is up, logging in only there.
57 == How are we going to handle the basic logistics of a shared filesystem and logins? ==
58 === Decisions that we've agreed on ===
59 * We're going to use the AFS filesystem and Kerberos (AFS mandates the use of Kerberos).
60 * We're going to use LDAP for logins (it works together with AFS and Kerberos, no worries).
61 === Questions to be resolved ===
62 Everything else!
63
64 == How are we going to charge (monetarily or just to have a sense of who is using what) members accurately for their disk usage? ==
65 There are a lot of issues here. We provide a number of shared services whose default models create files on the behalf of members but that are (by default) owned by a single UNIX user. Examples include PostgreSQL and MySQL databases, virtual mailboxes, Mailman mailing lists, and domtool configuration files. Any of these can grow so large as to use up all disk space on a volume, through either malicious action or accidental runaway processes.
66
67 Right now we use this gimpy scheme of group quotas on /home, storing all of these files on that partition with group ownership telling which member is responsible for them. I think AFS provides a nicer way of doing this. With the way we do it now, we are constantly fighting the behavior of the out-of-the-box Debian packages to set permissions differently than how we need them to be. With AFS, I think we can separate permissions from locations.
68
69 = Daemons shared by members =
70 == Off-site file back-up services ==
71 === Questions to be resolved ===
72 * Use [http://rsync.net/ rsync.net]?
73 == DNS ==
74 === Decisions that we've agreed on ===
75
76 * Running djbdns on Main
77
78 '''Update''': Scrap that! We're using BIND on Main and Dynamic, since it's so much better supported throughout the 'net, makes master/slave configurations easier, etc. In the future, we want to expand to include a tertiary DNS server in a different geographic location and on an entirely different network.
79
80 === Questions to be resolved ===
81 1. How do we arrange redundant DNS infrastructure?
82 JustinLeitgeb says:
83
84 . For now, I think we can just put our backup DNS server on either the shell or web machine at Peer 1, depending on how we finally set things up. We will have to configure this with domtool or its replacement. I don't really see any other options here. Am I missing something?
85 === References to how we do things now ===
86 DnsConfiguration, DomainRegistration
87
88 == FTP ==
89 === Decisions that we've agreed on ===
90 * Run an FTP daemon on Main
91 * Only allow encrypted authentication methods
92 * Only allow users on a white-list to use FTP; they should be using SCP if possible
93 === References to how we do things now ===
94 FtpConfiguration, FileTransfer
95
96 == HTTP ==
97 === Decisions that we've agreed on ===
98 * Using Apache 2
99 * Running all official/administrative HCoop web sites on Main
100 * Running all member dynamic web sites on Dynamic
101 === Questions to be resolved ===
102 1. Do we completely separate administrative web sites from the rest, or do we allow any member static web site to be served by Main?
103 DavorOcelic says:
104 . Well, I don't think we have enough administrative web sites (nor are the ones we have used heavily enough) to justify complete separation. It should be OK to run static web sites from Main, I believe. We could create default web spaces for users, like ~/public_html/ served from Dynamic, and ~/static_html/ served from Main, or something like that. (Please give more input on this).
105 * I think it would be better to have a domtool directive that chooses which machine the site is served on (e.g. ServedOn static|dynamic) and then let members choose how to lay out their own directories. -- ClintonEbadi
106 === References to how we do things now ===
107 UserWebsites, DynamicWebSites, VirtualHostConfiguration
108
109 == IMAP/POP ==
110 === Decisions that we've agreed on ===
111 * Running the primary IMAP/POP daemons on Main
112 * Running both SSL and normal versions, where the normal versions can only be used over the local network
113 === Questions to be resolved ===
114 1. Do we keep using Courier IMAP or do we switch to something like Cyrus?
115 === References to how we do things now ===
116 UsingEmail, EmailConfiguration
117
118 == Jabber ==
119 === Decisions that we've agreed on ===
120 * Run the same thing we're running now, on Main
121 === Questions to be resolved ===
122 * Should we add a tool similar to webpasswd to let members enable their Jabber accounts without manual intervention? Doing this by hand is easy now, but when we have hundreds of members it would make much more sense to automate the process.
123 * Alternatively, we could let members log in using their normal passwords (which is fairly secure as long as SSL is forced to be enabled). Ejabberd can use LDAP for authentication, so it would be easy to automatically give every HCoop member an account.
124 * Should a tool be added to enable members to set up their own virtual Jabber hosts (e.g. member at unknownlamer dot org )? I (ClintonEbadi) could write one in Perl.
125 * If we did this, should we allow members to add as many accounts as they wish, or only have one account per virtual server for the member? Jabber doesn't use much bandwidth (it's all text), and it would be nice to be able to give friends or family Jabber accounts, and then eliminate dependence on other more evil IM services.
126 === References to how we do things now ===
127 JabberServer
128
129 == Mailing lists ==
130 === Decisions that we've agreed on ===
131 * Using the Mailman software
132 * Running the daemon on Main
133 === Questions to be resolved ===
134 1. How/where do we store mailing list data so that it is appropriately charged towards a member's storage quota?
135 === References to how we do things now ===
136 MailingListConfiguration
137
138 == Relational database servers ==
139 === Decisions that we've agreed on ===
140 * Running PostgreSQL and MySQL servers on Main
141 === Questions to be resolved ===
142 1. Are we satisfied with the latest versions from Debian stable, or do we want to do something special?
143 1. Do we handle remote PostgreSQL authentication (from Dynamic, etc.) via the ident method? DavorOcelic thinks it's OK.
144 === References to how we do things now ===
145 UsingDatabases
146
147 == SMTP ==
148 === Decisions that we've agreed on ===
149 * Using Exim 4
150 * Running the primary SMTP daemon on Main
151 === Questions to be resolved ===
152 1. Run secondary MX on Dynamic or elsewhere?
153 === References to how we do things now ===
154 UsingEmail, EmailConfiguration
155
156 == Spam detection ==
157 === Decisions that we've agreed on ===
158 * Running the SpamAssassin spamd daemon on Main
159 * Running it via the spamc client on all mail to opted-in addresses, but leaving filtering based on the added headers up to the individual recipients
160 * Keeping a shared Bayes filtering database that can be trained by members by depositing misclassified messages into shared folders
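As a rough illustration of recipient-side filtering on the headers that SpamAssassin adds, here is a minimal Python sketch. It assumes SpamAssassin's default {{{X-Spam-Flag: YES}}} header is left enabled; the folder names and sample messages are made up for the example and are not part of our set-up.

```python
from email import message_from_string
from email.message import Message


def is_flagged_spam(msg: Message) -> bool:
    """Return True when the message carries SpamAssassin's spam flag."""
    # Assumption: the default "X-Spam-Flag: YES" header has not been disabled.
    return msg.get("X-Spam-Flag", "").strip().upper() == "YES"


def choose_folder(raw: str) -> str:
    """Pick a delivery folder; "Junk" and "INBOX" are hypothetical names."""
    msg = message_from_string(raw)
    return "Junk" if is_flagged_spam(msg) else "INBOX"


# Made-up sample messages for the demonstration.
SAMPLE_SPAM = (
    "From: someone@example.com\n"
    "Subject: You have won\n"
    "X-Spam-Flag: YES\n"
    "X-Spam-Status: Yes, score=12.3 required=5.0\n"
    "\n"
    "Click here.\n"
)
SAMPLE_HAM = "From: friend@example.com\nSubject: Lunch?\n\nNoon works.\n"

if __name__ == "__main__":
    print(choose_folder(SAMPLE_SPAM))
    print(choose_folder(SAMPLE_HAM))
```

Members would more likely express the same test as a rule in their own mail filter; the point is only that the decision stays with the recipient.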
161 === References to how we do things now ===
162 UsingEmail, SpamAssassin, FeedingSpamAssassin, SpamAssassinAdmin
163
164 == SSH ==
165 === Decisions that we've agreed on ===
166 * Use the standard SSH daemon in Debian
167 * Run it on all of our servers, with varying access permissions based on the shared user list
168 DavorOcelic says:
169 . Do we need ssh on Main too, if we've got a serial console?
170 === References to how we do things now ===
171 SshConfiguration
172
173 == SIP Redirection ==
174 * Do we also want to add the service of SIP redirection? I think this would go along very well with Clinton's suggestion of allowing people to have Jabber accounts with their own domain. This way someone could have their email, Jabber and SIP addresses all consolidated. A SIP redirection server would use next to no bandwidth. All it would do, when a call comes in, is return another address at which the user can be found. For example, when someone tries to call user1@userdomain.com , the server would return a user-defined address such as a gizmo or fwd account name, and the call would continue on to that address seamlessly. - ShaunEmpie
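To make the redirection idea concrete, here is a small Python sketch of just the lookup step. It is not a SIP server: the table contents and the {{{provider.example}}} target are hypothetical, and only the 302 (Moved Temporarily) and 404 (Not Found) status codes come from the SIP protocol itself.

```python
from typing import Dict, Optional, Tuple

# Hypothetical per-member table: public SIP address -> where the call should go.
REDIRECTS: Dict[str, str] = {
    "sip:user1@userdomain.com": "sip:user1@provider.example",
}


def redirect(request_uri: str, table: Dict[str, str]) -> Tuple[int, Optional[str]]:
    """Return a (status code, Contact target) pair for an incoming call."""
    target = table.get(request_uri.strip())
    if target is None:
        return 404, None  # Not Found: nobody registered this address
    return 302, target    # Moved Temporarily: the caller retries at target


if __name__ == "__main__":
    print(redirect("sip:user1@userdomain.com", REDIRECTS))
    print(redirect("sip:nobody@userdomain.com", REDIRECTS))
```

Because the server only answers with a new address and never carries the call itself, the bandwidth cost stays tiny, which matches the claim above.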
175 = Services run on top of these daemons =
176 == Domtool ==
177 Everyone's favorite spiffy system for letting legions of users manage the same daemons securely.
178
179 AdamChlipala says:
180
181 . I would like to rewrite this completely, for reasons including: From a software engineering perspective, the implementation is not so nice. There is no support for configuring multiple machines from the same configuration file source. Scalability with the increasing amount of configuration is not so hot. The current configuration scheme encourages copying-and-pasting, which makes it hard to make sweeping changes to our suggested configuration base.
182
183 JustinLeitgeb says:
184
185 . If we're doing this, let's think about storing configuration information in a database. It seems that it should scale better, and it would certainly be easier to write programs for users to configure domains via a web interface. I'm also thinking about writing a tool to set up a host with a dynamic IP on the internet (like what dyndns.org provides). For this to occur, we basically need to factor in the ability for fairly frequent, small changes to DNS zones without completely reloading the server. Also we need to be able to configure the TTL on host records (this may already be possible in domtool, I haven't checked). If the new domtool is written in Perl, I will be able to make software contributions, otherwise I probably won't have time to learn a new language in the next few years.
186
187 AdamChlipala says:
188 . My conception of the optimal configuration tool makes every configuration file a program, with textual structure that maps very poorly to a relational database, so I am still strongly against the idea of SQL-based configuration.
189 . domtool already supports everything needed for dynamic DNS, including setting TTL, as someone already requested support for doing that himself.
190 . I won't be involved with any Perl development.
191
192 JustinLeitgeb says:
193 . OK, I understand where you're coming from if you want the configuration files to be programs. I agree that it will be a stronger system that way.
194
195 === References to how we do things now ===
196 DomainTool
197
198 == Portal ==
199 === Decisions that we've agreed on ===
200 * Keep doing the same as now, running on Main
201 === References to how we do things now ===
202 [https://members.hcoop.net/ The portal]
203
204 == Web e-mail client ==
205 === Decisions that we've agreed on ===
206 * Keep using SquirrelMail, running on Main
207 === References to how we do things now ===
208 [http://mail.hcoop.net/ SquirrelMail]
209
210 == Webmin/Usermin ==
211 === Decisions that we've agreed on ===
212 * Keep doing the same as now, running on Main
213 === References to how we do things now ===
214 [https://members.hcoop.net/usermin/ Usermin]
215
216 == Wiki ==
217 === Decisions that we've agreed on ===
218 * Start from the same data as our current wiki
219 * Host the wiki on Main
220 * Keep using MoinMoin
221 === Questions to be resolved ===
222 * Do we upgrade the wiki to the latest release, even if there is no Debian package for it?
223
224 MichaelOlson says:
225
226 . I want to upgrade the Moin software to the latest release. The main reason for this is that the UserPreferences page is broken in the current version, in that it has no '''Mail me my account data''' button, in spite of the instructions on that page. This seems to be fixed on the official Moin wiki, so it is most likely fixed in the latest release.
227 . The idea is for me to start by upgrading my LUG's wiki instance. If no unsolvable problems are encountered, then upgrade HCoop's wiki instance as well. If no up-to-date Debian package is found (and there wasn't one, last time I checked), I could either:
228 . (a) make a Debian package, using the Debian patches against their {{{moin}}} package as a reference, or
229 . (b) back up the Debian additions (site-wide wiki farm settings), remove the {{{moin}}} Debian package, and install it from source.
230
231 === References to how we do things now ===
232 [http://wiki.hcoop.net/ This wiki]
233
234 = Security =
235 Here are the security issues we need to worry about, sorted by resource category at varying levels of abstraction. What we mostly deal with here is avoiding negative consequences of actions by members with legitimate access to our servers.
236
237 == CPU time ==
238 We haven't really encountered any trouble with this literal resource yet. However, potential problems come in when we're talking about user dynamic web site programs called by a shared Apache daemon. Apache allocates a fixed set of child processes, and each pending dynamic web site program takes up one child process for the duration of its life. Enough infinite-looping or slow CGI scripts can bring Apache down for everyone.
239
240 === Current remedies ===
241 As per ResourceLimits, we use patched {{{suexec}}} programs to limit dynamic page generation programs to 10 seconds of running time. We also have a time-out for {{{mod_proxy}}} accesses, which we provide to allow members to implement dynamic web sites through their own daemons that the main Apache proxies.
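The following Python sketch shows the general idea of putting a deadline on a page-generation program so that a stuck script cannot hold an Apache child forever. It is not how our patched {{{suexec}}} works; it swaps in the wall-clock {{{timeout}}} argument of Python's {{{subprocess.run}}}, and the function name and the example commands are assumptions made for illustration.

```python
import subprocess
import sys
from typing import List, Optional


def run_with_deadline(cmd: List[str], seconds: float) -> Optional[str]:
    """Run cmd and return its stdout, or None if it overran the deadline."""
    try:
        done = subprocess.run(cmd, capture_output=True, text=True, timeout=seconds)
    except subprocess.TimeoutExpired:
        # subprocess.run kills and reaps the child before re-raising,
        # so the slot it occupied is released.
        return None
    return done.stdout


if __name__ == "__main__":
    quick = [sys.executable, "-c", "print('hello')"]
    stuck = [sys.executable, "-c", "while True: pass"]
    print(run_with_deadline(quick, 10))  # output of the well-behaved script
    print(run_with_deadline(stuck, 1))   # None: the looping script was killed
```

A wall-clock deadline also catches scripts that are merely slow (sleeping or waiting on I/O), which a pure CPU-time limit would not.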
242
243 == Disk usage ==
244 We can't let one person use up all of the disk space, now can we?
245
246 === Current remedies ===
247 We use group quotas so that members can be charged for files that they don't own. This is still hackish and allows some unintended behaviors. DaemonFileSecurity has more detail.
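To show what charging by group ownership amounts to, here is a small Python sketch that totals file sizes under a directory by owning group id. It is only an approximation made for illustration: real quotas are enforced by the kernel and count disk blocks rather than apparent file sizes, and the function name is hypothetical.

```python
import os
import stat
import tempfile
from collections import defaultdict
from typing import Dict


def usage_by_group(root: str) -> Dict[int, int]:
    """Sum the sizes of regular files under root, keyed by owning group id."""
    totals: Dict[int, int] = defaultdict(int)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            info = os.lstat(os.path.join(dirpath, name))  # do not follow symlinks
            if stat.S_ISREG(info.st_mode):
                totals[info.st_gid] += info.st_size
    return dict(totals)


if __name__ == "__main__":
    # Build a throwaway tree so the example runs anywhere.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "mailbox"), "wb") as handle:
            handle.write(b"x" * 4096)
        for gid, size in sorted(usage_by_group(root).items()):
            print(f"gid {gid}: {size} bytes")
```

Note that the file's user owner never enters the calculation, which is why a daemon-owned file can still be billed to a member through its group.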
248
249 == Network bandwidth ==
250 We don't do a thing to limit this now, since our current host provides significantly more bandwidth than we need.
251
252 === Questions to be resolved ===
253 1. Should we start doing anything beyond monitoring?
254 == Network connection privileges ==
255 It's good to follow least privilege in who is allowed to connect to/listen on which ports.
256
257 === Current remedies ===
258 We have a firewall system in place now. It uses a custom tool documented partially on FirewallRules.
259
260 == Number of processes ==
261 Fork bombs are no fun, and many resource limiting schemes are per-process and so require a limit on process creation to be effective.
262
263 === Current remedies ===
264 As per ResourceLimits, we use the {{{nproc}}} ulimit.
265
266 == RAM ==
267 This is probably the most surprising thing for novices to the hosting co-op planning biz. If you would classify yourself as such, then I bet you would leave RAM off your list of resources that need to be protected with explicit security measures!
268
269 Nonetheless, it may just be the most critical resource to control. In our experiences back when everything ran on Abulafia, the most common cause of system outage was some user running an out-of-control process that allocated all available memory, causing other processes to drop dead left and right as memory allocation calls failed. We're letting people run their own daemons 24/7, so this just can't be ignored.
270
271 === Current remedies ===
272 As per ResourceLimits, we use the {{{as}}} ulimit to put a cap on how much virtual memory a process can allocate.
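The effect of an address-space cap can be seen with a short Python sketch that uses the standard {{{resource}}} module. It assumes a Linux host (where {{{RLIMIT_AS}}} is enforced); the 512 MiB cap, the function name, and the greedy child program are illustrative choices, not our actual settings.

```python
import resource
import subprocess
import sys
from typing import List

MIB = 1024 * 1024


def run_with_memory_cap(cmd: List[str], cap_bytes: int) -> subprocess.CompletedProcess:
    """Run cmd with its address space (the `as` ulimit) capped at cap_bytes."""

    def apply_cap() -> None:
        # Runs in the child between fork and exec. Only the soft limit is
        # lowered, and it is never set above an existing hard limit.
        _soft, hard = resource.getrlimit(resource.RLIMIT_AS)
        soft = cap_bytes if hard == resource.RLIM_INFINITY else min(cap_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(cmd, preexec_fn=apply_cap, capture_output=True, text=True)


# Child program that tries to grab 1 GiB in a single allocation.
GREEDY = (
    "try:\n"
    "    block = bytearray(1024 ** 3)\n"
    "    print('allocated')\n"
    "except MemoryError:\n"
    "    print('refused')\n"
)

if __name__ == "__main__":
    result = run_with_memory_cap([sys.executable, "-c", GREEDY], 512 * MIB)
    print(result.stdout.strip())
```

With the cap in place the oversized allocation fails inside the offending process alone, instead of exhausting memory and making allocation calls fail in everyone else's processes.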