Understanding the Storm Architecture -
i have been trying understand storm architecture, not sure if got right. i'll try explain possible believe case. please explain - if - got wrong , right.
preliminary thoughts: workers
http://storm.apache.org/releases/2.0.0-snapshot/understanding-the-parallelism-of-a-storm-topology.html suggests worker process http://storm.apache.org/releases/2.0.0-snapshot/concepts.html "worker processes. each worker process physical jvm", http://storm.apache.org/releases/1.0.1/setting-up-a-storm-cluster.html states worker node "nimbus , worker machines". website http://www.michael-noll.com/tutorials/running-multi-node-storm-cluster/ mentions "master node" , "worker nodes". what: worker process or physical node (or node process)? think there 2 things: worker nodes , worker processes.
what believe true
entities in play
- master node = management server
- worker node = slave node
- nimbus jvm process, running on master node
- zookeeper jvm processes, running on zookeeper nodes
- supervisor jvm process, running on worker nodes
- worker process (jvm), running on worker nodes
- executor thread, run worker process
- task (instances of bolts , spouts), executed executor
how things work
the nimbus jvm process, running on physical master node, receives program (storm topology) takes bolts , spouts , generates tasks them. if bolt supposed parallelized 3 times, nimbus generates 3 tasks it. nimbus asks zookeeper jvm processes configurations of cluster, e.g. run tasks , zookeeper jvm processes tells nimbus. in order zookeeper communicates supervisors (what comes later). nimbus distributes tasks workers nodes, physical nodes. worker nodes managed supervisors, jvm processes - 1 supervisor 1 worker node. supervisors manage (start, stop, etc.) worker processes, jvm processes run on worker nodes. each worker node can have multiple worker processes running. worker processes jvm processes. run 1 or multiple threads called executors. each executor thread runs 1 or multiple tasks, meaning 1 or multiple instances of bolt or spout, but have of same bolt.
if true, begs questions:
- what point of having multiple worker processes run on 1 worker node - after process can use multiple processor cores, right?
- what point of having multiple tasks run 1 executor thread if have of same bolt/spout? thread run on 1 processor core, multiple bolt/spout have run after each other , can not parallelized. running indefinitely in storm topology - point having 2 instances of same bolt/spout in exectuor thread?
edit: additional resource: http://www.tutorialspoint.com/apache_storm/apache_storm_cluster_architecture.htm
first few clarification (basically got right).
- the term "node" , "process" not used consistently (unfortunately). got "entities in play" correct.
- about parallelization: number of tasks determined
.setnumtasks()
-- if not specified, sameparallelism_hint
(which sets number of executors). - about deployment: nimbus gets cluster configuration zk, zk not decides run task -- nimbus has scheduler component makes decision (based on given topology, topology configuration, , cluster configuration).
to answer question:
- a single worker process executes task single topology. thus, if want run multiple topologies need multiple worker jvms. reason design fault-tolerance, ie, isolation of topologies. if 1 topology fails (maybe due bad user code), crashing jvm not affect other running topologies.
- tasks allow dynamic re-balancing , changing parallelism @ runtime. without tasks, need stop , redeploy topology if want change parallelism spout/bolt. thus, number of tasks defines maximum parallelism spout/bolt. see https://storm.apache.org/releases/1.0.1/understanding-the-parallelism-of-a-storm-topology.html
Comments
Post a Comment