
News

Posted almost 15 years ago
It's a common question and a worthy topic for an extended article. Here are the steps I usually follow when diagnosing such issues.

Is the cluster allowed to start services?

Check the quorum status with:

    crm_mon --one-shot

Quorum is a property of the cluster which is attained when more than half the number of known nodes is online. Unlike Heartbeat, OpenAIS-based clusters don't pretend to have quorum when only one of a possible two nodes is available. In such situations, the cluster's default behavior is to ensure data integrity by stopping all services.

Check the current value of the no-quorum-policy option:

    crm_attribute -n no-quorum-policy -G

If you don't have quorum, you can tell the cluster to ignore the loss of quorum and start resources anyway:

    crm_attribute -n no-quorum-policy -v ignore

Be careful to ensure STONITH is correctly configured before using the ignore option.

Check if the cluster is managing services

Check the global default:

    crm_attribute --type rsc_defaults -n is-managed -G

Check the per-resource values:

    cibadmin --query --xpath '//nvpair[@name="is-managed"]'

Check the old location for the global default:

    crm_attribute -n is-managed-default -G

Look for any results indicating a value of false.

Check target-role

The target-role setting controls what state the resource can achieve. The possible states are:

- Stopped
- Started
- Slave
- Master

Look out for any places indicating a value of Stopped. In the case of master/slave resources that aren't being promoted, a value of Started can also be problematic.

Check the global default:

    crm_attribute --type rsc_defaults -n target-role -G

Check the per-resource values:

    cibadmin --query --xpath '//nvpair[@name="target-role"]'

Look for failures

You can see the list of failures in the crm_mon output:

    crm_mon --one-shot --operations

Another good source of information is ptest, which can simulate what the cluster would try to do:

    ptest --live-check -VVV

Look for anything unusual in the output, such as:

    WARN: unpack_rsc_op: Processing failed op drbd0:1_start_0 on nagios-clu2: unknown error

Check the logs

    ssh -l root nagios-clu2 -- grep drbd0:1 /var/log/messages

Cleaning up after failures

If you identified any failures above, you can instruct the cluster to "forget" about them:

    crm_resource --cleanup --node nagios-clu2

This results in the resource history being erased on nagios-clu2. The cluster will then attempt to start any services that were not already active.

NOTE: This will have little or no benefit if the underlying issue, the one that caused the resource to fail in the first place, has not been fixed. If the problem persists, the resource will simply return to a failed state and the cluster will still refuse to start it.

In a later article, I'll explain how the cluster can recover from transient failures automatically by timing them out after a certain interval.
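If you find yourself running through these checks often, they chain together naturally. What follows is a minimal sketch, not part of the original post: it simply strings the commands above into one script. The script itself and the default node name are illustrative, and it assumes the Pacemaker command-line tools are installed and on the PATH.

    #!/bin/sh
    # Rough diagnostic sweep over the checks described above.
    # Usage: ./cluster-check.sh [node]   (node defaults to nagios-clu2)
    NODE=${1:-nagios-clu2}

    echo "== Cluster status and recent operations =="
    crm_mon --one-shot --operations

    echo "== Quorum policy =="
    crm_attribute -n no-quorum-policy -G

    echo "== is-managed (global default, then per-resource) =="
    crm_attribute --type rsc_defaults -n is-managed -G
    cibadmin --query --xpath '//nvpair[@name="is-managed"]'

    echo "== target-role (global default, then per-resource) =="
    crm_attribute --type rsc_defaults -n target-role -G
    cibadmin --query --xpath '//nvpair[@name="target-role"]'

    echo "== Simulated cluster response =="
    ptest --live-check -VVV

Some of these commands exit non-zero when an attribute simply isn't set, so the script deliberately avoids aborting on errors; read the output for values of false or Stopped as described above.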
Posted almost 15 years ago
This tumbl/blog/thingy exists because I've finally accepted that "If we build it, they will come" is a fallacy. The internet is a big place and if you don't speak up, you'll get lost in the noise of those that do.

So, I'm going to try and use this place to raise awareness of a project that's very important to me - Pacemaker - an incredibly advanced open source, high availability cluster resource manager.

For those not already certified cluster ninjas, a resource manager is the part of a cluster stack that decides who holds cluster services and what to do when a failure is detected.

Pacemaker's key features, which I will explore in greater depth over the coming days/weeks, are:

- Recovery from node failures (obviously)
- Built-in detection of resource failures (no need for mon)
- Support for OpenAIS, an industry standard cluster stack
- Support for Heartbeat, a popular alternative to OpenAIS
- A powerful dependency model for accurately mapping your environment
- Support for as many nodes as the cluster messaging layer will allow
- Proven technology - ships as part of SLE 10 and the SLE 11 High Availability Extension

If you're interested in open source clustering, check us out at http://clusterlabs.org or irc://irc.freenode.net#linux-cluster
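To make the "resource manager" idea a little more concrete, here is a hypothetical example, not from the post, using the crm configuration shell that ships alongside Pacemaker. The resource name and IP address are made up:

    # Define a floating IP address as a cluster resource and have
    # the cluster check its health every 30 seconds.
    crm configure primitive failover-ip ocf:heartbeat:IPaddr2 \
        params ip=192.168.100.50 \
        op monitor interval=30s

Once something like this is committed, Pacemaker decides which node should hold failover-ip, watches it via the monitor operation, and restarts or relocates it when a failure is detected.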
Posted almost 15 years ago
Nothing to see here yet.  Just taking the software for a spin.