How does Storm leverage ZooKeeper for resilience?

According to its description, Storm is based on ZooKeeper, and whenever a worker node dies it can be recovered and get its state back from ZooKeeper.

Does anyone know how that is done? Specifically:

  1. How does a failed worker node get recovered?
  2. How does ZooKeeper keep its state? AFAIK, each znode can only store a small amount of data.


Are you talking about workers or supervisors? Each Storm worker node runs a Storm "supervisor" daemon, which manages the worker processes on that node.

  1. You need to set up process supervision (something like daemontools or supervisord, which is unrelated to Storm supervisors) to monitor and restart the nimbus and supervisor daemons if they crash. Both nimbus and the supervisors are fail-fast and stateless. ZooKeeper is used for coordination between nimbus and the supervisors, and all state information is kept either in ZooKeeper or on disk, so it is not lost when a daemon dies.
  2. The state data isn't large, and ZooKeeper should be run under supervision too.
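To make the "fail fast plus external supervision" idea in point 1 concrete, here is a minimal Python sketch of what a process supervisor does: start a command, and start it again whenever it exits with an error. This is only an illustration, not how daemontools or supervisord are implemented. The `supervise` function, its parameters, and the crashing child process are all hypothetical stand-ins; in a real deployment the command would be whatever launches your nimbus or supervisor daemon.

```python
import subprocess
import sys
import time


def supervise(command, max_restarts=3, backoff_seconds=0.0):
    """Run `command`, restarting it after each failed exit.

    Hypothetical helper for illustration only. Returns the list of
    exit codes observed, one per run.
    """
    exit_codes = []
    for _ in range(max_restarts + 1):
        # A fail-fast daemon just exits on error; the supervisor starts it again.
        result = subprocess.run(command)
        exit_codes.append(result.returncode)
        if result.returncode == 0:
            break  # clean shutdown, nothing to restart
        time.sleep(backoff_seconds)
    return exit_codes


if __name__ == "__main__":
    # Stand-in for a crashing daemon: a child process that exits with status 1.
    crashing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(supervise(crashing, max_restarts=2))
```

Because the restarted daemon is stateless, it does not need anything from the supervising process itself; whatever it needs to resume is read back from ZooKeeper or local disk. A real supervisor would normally restart indefinitely rather than stop after a fixed number of attempts.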

Check this for more fault tolerance details.
