ToleratingHangsAndLeaks
Contact
Chris McDonough (chrism@zope.com)
Problem
Like many dynamic web publishing solutions (and unlike traditional CGI applications fronted by forking servers), Zope is implemented as a long-running process. While this gives Zope major advantages (speed, connection pooling, shared memory space, etc.), it also places burdens on system and application programmers to write "clean" code which causes neither memory "leakage" (the appropriation of system memory without return) nor "hangs" (the appropriation of CPU, IO, or thread lock resources without return).
While it would be quite nice to always work with and produce "clean" code, it's often infeasible to do so. Web projects are often "rush" projects which undergo less structured system testing than traditional application development projects. Additionally, the experience level of programmers (whether they code in DTML, Python, or C) varies widely from project to project. The same holds true of the programmers who produce Zope Products. Unfortunately, memory leaks have also occasionally been found in the Zope core itself and in system libraries created by relational database vendors and others. In short, people make mistakes.
Thus, though it's desirable, it is not always possible to put a completely "clean" solution into production. Solutions will sometimes crash and burn after a few days in production for no immediately discernible reason. Diagnosing and repairing this sort of intermittent problem (finding out who made the mistake and where they made it) is always difficult and time-consuming. When the site is "falling over" every few hours or days, it is often impractical to debug the production solution in situ to rid it of a memory leak or hang. Political, emotional, management and marketing pressures often hamper or override structured debugging efforts.
Under these circumstances, temporary "workaround" solutions are often desirable. A successful workaround solution will quickly take the immediate pressure off the development and sysadmin staffs until such time as they obtain the time or otherwise find it necessary to fix the problem "for good" by careful forensic analysis and remediatory action. A workaround solution, by definition, will not be nearly as elegant or simple as an actual fix for the problem, but it's "good enough" under many circumstances.
Zope is particularly unforgiving if you create a memory leak or a situation in which code hangs, because there is no automated way to detect and restart servers under pathological conditions. It would be useful to canonize a "workaround state" into which a Zope system could be placed while the bug is tracked down. In this state, a Zope system would use some heuristics to restart itself when pathological situations are detected.
Another example of a reasonable workaround in almost every circumstance is to simply restart the systems every so often on a schedule, or after so many requests have been served.
Note
Before the purists come down on me for encouraging or abetting bad coding practices, please note that this type of failure tolerance is "baked in" to the design of other similar dynamic systems such as mod_perl. Its developers have accepted that they need to tolerate these sorts of anomalous situations, and they've provided for it within the environment in which mod_perl runs. See http://perl.apache.org/guide/performance.html (e.g. Apache::SizeLimit, MaxRequestsPerChild)
Proposed Solutions
Create a "failure detection and tolerance" mode which is activated using an envvar or a command-line switch at startup time. Under this mode, Zope would restart itself iff:
- The asyncore select loop times out and medusa's socket_map is "large" (where "large" can be determined by an envvar but by default can be perhaps 100). This indicates a "hung" server that has leaked all of its database connections and is sitting calmly around queuing up requests that will never be serviced. Note that a prototype of this behavior which co-opts the asyncore poll function has already been developed ("AutoLance").
- Heuristics about process size in relation to available real system memory over some period of time can be used during the asyncore poll function to decide that it would be a good idea to restart the process. For Linux systems, the linuxproc module could be used to supply the raw data on which the heuristics could be based.
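The two triggers above could be expressed as small predicates evaluated from the poll function. This is only a sketch under stated assumptions: the names `looks_hung` and `memory_pressure`, the "at least the limit" comparison, and the 0.8 memory fraction are hypothetical and not part of the proposal; only the default of 100 comes from the text.

```python
DEFAULT_SOCKET_MAP_LIMIT = 100  # the proposal's suggested default for "large"


def looks_hung(select_timed_out, socket_map, limit=DEFAULT_SOCKET_MAP_LIMIT):
    """True when select timed out while the socket map is "large".

    Assumption: "large" means at least `limit` entries.
    """
    return bool(select_timed_out) and len(socket_map) >= limit


def memory_pressure(rss_samples_kb, total_kb, fraction=0.8):
    """True when every recent process-size sample exceeds a share of real memory.

    Assumption: the caller gathers the samples over some period
    (for example from /proc on Linux); `fraction` is a made-up threshold.
    """
    if not rss_samples_kb:
        return False
    return all(sample > total_kb * fraction for sample in rss_samples_kb)
```

Requiring every sample in the window to exceed the threshold is one way to avoid restarting on a single transient spike; other smoothing choices would work as well.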
The failure tolerance system could be configurable via the control panel iff the envvar or command-line option is detected (like the profiling system).
When a failure results in a system restart, the restart should be logged and perhaps reported via email to a configurable set of addresses.
As an option, the failure tolerance system could restart Zope every "n" hours (also configurable).
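A scheduled restart of this kind might be tracked with a small policy object that the poll loop consults. The sketch below is an assumption, not a design from the proposal: the class name `RestartSchedule` and its parameters are hypothetical, and the request-count limit reflects the "after so many requests" idea mentioned in the Problem section.

```python
import time


class RestartSchedule:
    """Decides when a periodic restart is due (hypothetical helper)."""

    def __init__(self, max_hours=None, max_requests=None, clock=time.monotonic):
        self.max_seconds = None if max_hours is None else max_hours * 3600
        self.max_requests = max_requests
        self.clock = clock            # injectable so the policy can be tested
        self.started = clock()
        self.requests = 0

    def note_request(self):
        """Call once per served request."""
        self.requests += 1

    def due(self):
        """True once either configured limit has been reached."""
        if self.max_seconds is not None:
            if self.clock() - self.started >= self.max_seconds:
                return True
        if self.max_requests is not None and self.requests >= self.max_requests:
            return True
        return False
```

Leaving both limits as `None` means the schedule never fires, which matches the idea that this behavior is optional.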
Risk Factors
- Failure tolerance is too expensive to be used in production.
- Accommodating the possibility of a failure tolerance mode makes normal production code slower.
- Failure tolerance does not detect a number of common real-world failure scenarios.
- The method by which we do fault tolerance (restarting the entire Zope process) may not be granular enough to service some problems. It often takes a long time to start Zope up. It can take an even longer time for commonly-requested data structures to be put into cache.
Scope
None
Deliverables
Zope Product implementing the behaviors described above.
Changes to the control panel.
Asyncore "hook" by which to have a function called during the poll loop.
Changes to the admin guide and Zope Book.
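The poll-loop "hook" deliverable could be as simple as a wrapper around the existing poll function. The following is a sketch only: `with_poll_hooks` is a hypothetical name, and it assumes a poll function shaped like `poll(timeout, map)`, which is how asyncore's poll functions are called.

```python
def with_poll_hooks(poll, hooks):
    """Return a poll function that runs each hook after every poll pass.

    `poll` is any callable taking (timeout, map); `hooks` is a list of
    callables that receive the same map argument (which may be None if
    the caller relies on a default socket map).
    """
    def hooked_poll(timeout=0.0, map=None):
        result = poll(timeout, map)
        for hook in hooks:
            hook(map)
        return result
    return hooked_poll
```

Keeping the hooks outside the wrapped function means a server that registers no hooks pays only for one extra function call per poll pass.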
- chrism (Sep 29, 2001 5:56 pm; Comment #1)
- Note that on some systems, resource consumption could perhaps be controlled by setrlimit soft. If a signal was sent to the process when an rlimit value was reached by the process (I'm not sure which UNIX platforms this is true on), the resource-limiting part of this proposal could be implemented in conjunction with CleanSignalHandling.
- chrism (Sep 30, 2001 3:12 am; Comment #2)
- Setting a process size rlimit is much easier and much more effective than I had imagined. There is now a Zope CVS branch named chrism-setrlimit-branch which provides the capability to have Zope automatically restart when a configurable memory usage threshold is reached. This might be all we need for working around memleaks.
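For illustration, a process-size cap can be set from Python with the standard `resource` module on UNIX-like systems. This is a minimal sketch and not the code on chrism-setrlimit-branch: the helper names are hypothetical, and the choice of `RLIMIT_AS` (address space) is an assumption. With this limit, allocations beyond the cap fail, which Python surfaces as `MemoryError`; something else would still have to turn that failure into a restart.

```python
import resource


def capped_limits(current, cap_bytes):
    """Compute a new (soft, hard) pair with the soft limit lowered to the cap.

    The soft limit can never exceed the hard limit, so the cap is clamped.
    """
    _soft, hard = current
    if hard != resource.RLIM_INFINITY:
        cap_bytes = min(cap_bytes, hard)
    return (cap_bytes, hard)


def apply_memory_cap(cap_bytes):
    """Lower this process's soft address-space limit (hypothetical helper)."""
    current = resource.getrlimit(resource.RLIMIT_AS)
    resource.setrlimit(resource.RLIMIT_AS, capped_limits(current, cap_bytes))
```

Only the soft limit is changed, so the process keeps the option of raising it again up to the hard limit.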
- htrd (Oct 1, 2001 8:07 am; Comment #3)
- Re: control panel configuration: this would seem to be an area where different nodes of a ZEO cluster would need different settings... Imagine you discover that one page causes a memory leak, so you configure your front-end proxy to send all requests for that page to a dedicated ZEO node and reboot only that node frequently. I'm not sure the conventional control panel is ideal for this.
- htrd (Oct 1, 2001 8:30 am; Comment #4)
- It's not clear from the description whether the proposed solution is flexible enough to cover a use case that I have in mind:
Squid is growing some impressive new features for use as an accelerator in its rproxy CVS branch... One of these is the ability to monitor the backend (Zope) servers and remove dead ones from the pool. It detects dead servers using HTTP, and Squid can be configured to specify exactly what responses indicate that a server is dead. It would be nice to have (or be able to write) a manage_isDead method that could be called by Squid, to allow it to work around the dead server without it being shut down (shutdown destroys a lot of potentially useful post-mortem information). Essentially: I would like the ability to enable (and access) Zope's death-checker, without enabling the reboot-if-dead trigger.
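A death-checker that can be queried without triggering a restart might look roughly like the sketch below. Everything here is an assumption for illustration: `run_health_checks` is a hypothetical function, the check names are made up, and the 200/503 status codes are simply one convention a proxy could be configured to recognize.

```python
def run_health_checks(checks):
    """Run named checks and return (http_status, body).

    `checks` maps a name to a zero-argument callable that returns a true
    value when healthy. A check that raises counts as a failure.
    """
    failures = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception as exc:
            healthy = False
            name = "%s (%s)" % (name, exc.__class__.__name__)
        if not healthy:
            failures.append(name)
    if failures:
        return 503, "DEAD: " + ", ".join(failures)
    return 200, "OK"
```

Because the function only reports, the decision to restart (or merely to pull the node from the pool) stays with whatever calls it.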
- chrism (Oct 1, 2001 11:41 am; Comment #5)
> Re: control panel configuration:
> this would seem to be an area where different nodes of a zeo cluster
> would need different settings... Imagine you discover that one page
Yeah, you're right... this should not be set in the control panel, but instead externally. If the scope of the proposal were trimmed down to only restarting if a memory size limit was reached, this limit could be set via a z2.py command switch or an envvar. Likewise, if the scope was extended to include "hangs", the restart size of medusa's socket_map could be set similarly.
> Essentially: I would like the ability to enable (and access)
> Zope's death-checker,
> without enabling the reboot-if-dead trigger.
If it's dead, how would you get to it thru HTTP?
- anthony (Oct 2, 2001 11:39 pm; Comment #6)
- We have something similar in our production system (apaches->loadbalancer->ZEOclients->ZEOserver) where we test each component in turn from the ZEOserver to the front end and kick them in the head if they've failed. This requires a test be put into the application that checks for each thing "that matters" to make sure it's alive (Oracle connections, that sort of thing).
> If it's dead, how would you get to it thru HTTP?
In our case, we try first HTTP (if appropriate) followed by grabbing the pids on the machine directly and kicking them in the head with a kill -9. With just the HTTP checker, we found too many possibilities where a ZEO client would go bonkers and just not be recoverable.
- paul (Oct 5, 2001 9:01 am; Comment #7)
- Some quick comments...
Even without the issue of bad programming, it's still easy to consume database connections. For instance, we recently saw that the mail server on zope.org was taking 30-45 seconds to respond to SMTP socket connections. Good or bad programming has little to do with this, short of changing to a different design.
As you mention in the risks, long startup times can make this proposal less attractive. Also, having a load balancer that knows not to send requests to that process during restart is important to the success of this proposal.
The restarting process should (optionally) gracefully handle current requests before shutting down, similar to the various signals you can send Apache.
- htrd (Jan 22, 2002 6:02 am; Comment #8) Editor Remark Requested
> If it's dead, how would you get to it thru HTTP?
We may well extend the checking to include other systems that do not affect the HTTP or sockets layer, for example if an RDBMS is disconnected, or a shared network filesystem becomes unmounted.
Also, I would like to include a health check before making ICP responses. See http://www.zope.org/Members/htrd/icp/intro