hadoop - Does a single Hive application (job) spawn multiple YARN applications?


When a single Hive query gets submitted, does YARN create multiple jobs (i.e. multiple YARN applications)? Here, treat 'job' and 'application' as the same thing in the YARN context.

From what I understand, YARN creates one ApplicationMaster (AM) per 'application', and a single Hive query can be treated as an 'application'. So the ResourceManager creates a container on some node and starts the AM in that container. The AM in turn may create multiple 'tasks' (not applications), i.e. mappers and reducers, within other containers it has reserved (on the same or different nodes, which is immaterial here). This collection of containers, coordinated by the AM, is engaged in solving the single Hive query submitted to YARN; in fact, that's why we say there is one AM per application. Since I submitted one Hive query, from YARN's point of view there is one application, and when I fire the command below, it should show me one application running:

yarn application -list 

Is my understanding correct? Or, if several mappers and reducers are spawned for one Hive query, are multiple YARN applications invoked?

You are correct at the beginning:

  • A YARN application created for MapReduce is called a job. So application = job. Correct.
  • There is one ApplicationMaster per job. Correct.

From there, things you said got a bit mixed up. A Hive query is not an application. A Hive query is translated by Hive into chained MapReduce jobs. When you execute a complex Hive query, Hive will submit the necessary MapReduce jobs (which are YARN applications), running one after the other to obtain the final result.

For example, let's take the following SQL query:

select sum(total) as sumtotal, city from donations group by city order by sumtotal;

If you wanted to solve this manually with MapReduce, you would need to create 2 jobs (a HiveQL sketch of this decomposition follows the list):

  • Job 1 - Aggregation: map the input to (city, total) pairs and reduce to sum the values per city
  • Job 2 - Sorting: map the results of job 1 to inversed (total, city) pairs and let the shuffle/reduce phase sort them
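Here is a minimal HiveQL sketch of that same two-step decomposition, using a hypothetical intermediate table (the table name city_totals is an assumption for illustration; donations, city and total come from the query above):

-- step 1 (job 1): aggregation into an intermediate table
create table city_totals as
select city, sum(total) as sumtotal
from donations
group by city;

-- step 2 (job 2): global sort of the aggregated result
select sumtotal, city
from city_totals
order by sumtotal;

Each of these statements would translate to (at least) one MapReduce job on its own, which is exactly the chaining Hive's planner does for you when you write it as a single query.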

A detailed explanation and illustration of how to solve this with MR jobs can be found here.
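If you want to inspect the plan Hive itself generates, without running anything, the explain command prints the stage graph it intends to execute:

explain
select sum(total) as sumtotal, city
from donations
group by city
order by sumtotal;

The output begins with a STAGE DEPENDENCIES section; for a query like this on the MapReduce engine it lists two MapReduce stages, the second depending on the first (plus a final fetch stage). Exact contents vary by Hive version.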

If you run that query in Hive, the output looks like this:

INFO  : number of splits:3
INFO  : Submitting tokens for job: job_1454508485700_0039
INFO  : The url to track the job: http://ubuntu0:8088/proxy/application_1454508485700_0039/
INFO  : Starting Job = job_1454508485700_0039, Tracking URL = http://ubuntu0:8088/proxy/application_1454508485700_0039/
INFO  : Kill Command = /home/hduser/hadoop/bin/hadoop job  -kill job_1454508485700_0039
INFO  : Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 3
INFO  : 2016-02-10 22:21:15,773 Stage-1 map = 0%,  reduce = 0%
INFO  : 2016-02-10 22:22:08,421 Stage-1 map = 11%,  reduce = 0%, Cumulative CPU 99.2 sec
INFO  : 2016-02-10 22:22:17,019 Stage-1 map = 44%,  reduce = 0%, Cumulative CPU 127.32 sec
INFO  : 2016-02-10 22:22:20,694 Stage-1 map = 67%,  reduce = 0%, Cumulative CPU 134.32 sec
INFO  : 2016-02-10 22:22:21,906 Stage-1 map = 78%,  reduce = 0%, Cumulative CPU 135.2 sec
INFO  : 2016-02-10 22:22:32,877 Stage-1 map = 89%,  reduce = 0%, Cumulative CPU 147.49 sec
INFO  : 2016-02-10 22:22:35,379 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 149.85 sec
INFO  : 2016-02-10 22:22:39,108 Stage-1 map = 100%,  reduce = 44%, Cumulative CPU 160.65 sec
INFO  : 2016-02-10 22:22:41,578 Stage-1 map = 100%,  reduce = 56%, Cumulative CPU 170.0 sec
INFO  : 2016-02-10 22:22:42,792 Stage-1 map = 100%,  reduce = 60%, Cumulative CPU 171.87 sec
INFO  : 2016-02-10 22:22:44,022 Stage-1 map = 100%,  reduce = 89%, Cumulative CPU 183.23 sec
INFO  : 2016-02-10 22:22:46,540 Stage-1 map = 100%,  reduce = 100%, Cumulative CPU 183.23 sec
INFO  : Ended Job = job_1454508485700_0039
INFO  : number of splits:2
INFO  : Submitting tokens for job: job_1454508485700_0040
INFO  : The url to track the job: http://ubuntu0:8088/proxy/application_1454508485700_0040/
INFO  : Starting Job = job_1454508485700_0040, Tracking URL = http://ubuntu0:8088/proxy/application_1454508485700_0040/
INFO  : Kill Command = /home/hduser/hadoop/bin/hadoop job  -kill job_1454508485700_0040
INFO  : Hadoop job information for Stage-2: number of mappers: 2; number of reducers: 1
INFO  : 2016-02-10 22:23:16,180 Stage-2 map = 0%,  reduce = 0%
INFO  : 2016-02-10 22:23:46,453 Stage-2 map = 50%,  reduce = 0%, Cumulative CPU 13.39 sec
INFO  : 2016-02-10 22:23:47,715 Stage-2 map = 67%,  reduce = 0%, Cumulative CPU 14.73 sec
INFO  : 2016-02-10 22:23:48,945 Stage-2 map = 100%,  reduce = 0%, Cumulative CPU 17.38 sec
INFO  : 2016-02-10 22:24:10,960 Stage-2 map = 100%,  reduce = 71%, Cumulative CPU 25.33 sec
INFO  : 2016-02-10 22:24:13,383 Stage-2 map = 100%,  reduce = 98%, Cumulative CPU 31.32 sec
INFO  : 2016-02-10 22:24:14,616 Stage-2 map = 100%,  reduce = 100%, Cumulative CPU 32.61 sec
INFO  : MapReduce Total cumulative CPU time: 32 seconds 610 msec
INFO  : Ended Job = job_1454508485700_0040
INFO  : Moving data to: /user/hduser/donors/hive_output_part2 from hdfs://ubuntu0:9000/user/hive/warehouse/.hive-staging_hive_2016-02-10_22-20-50_281_4971139345555329337-4/-ext-10001
INFO  : Table default.hive_output_part2 stats: [numFiles=0, numRows=14966, totalSize=0, rawDataSize=321343]
No rows affected (207.86 seconds)

You can see that Hive also created 2 jobs, one after the other. "Starting Job" is logged twice, and new job tracking URLs are generated twice; the job IDs map one-to-one to YARN application IDs (job_1454508485700_0039 is tracked at application_1454508485700_0039).

Hive refers to these jobs as "stages", but they are normal MapReduce jobs (i.e. YARN applications).
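To tie this back to the question: if you ran the asker's command while that query was executing, you would see one MAPREDUCE-type application at a time, first for Stage-1 and then for Stage-2, since the stages run sequentially. An illustrative, abridged view (real output has more columns and the layout varies by Hadoop version):

yarn application -list
Application-Id                   Application-Type    State      Progress
application_1454508485700_0039   MAPREDUCE           RUNNING    65%

So there is not "one YARN application per Hive query"; there is one YARN application per stage, and the list shows whichever stage is currently running.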

You can find comparisons and benchmarks I did of Hive versus hand-written MR jobs here. Hive used the same number of jobs and had the same execution times as the hand-coded MR jobs. So in the end it is still chained MapReduce jobs that get executed, except they are generated for you and you don't need to write the code.

Everything I said is valid for the default execution engine, MapReduce. Using Tez or Spark as the execution engine is a different story.
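For completeness, the engine is selected by the hive.execution.engine property, which you can switch per session (the chosen engine must of course be installed and configured on the cluster):

set hive.execution.engine=mr;    -- the default discussed above: one yarn application per stage
set hive.execution.engine=tez;   -- the chained stages become one tez dag

With Tez (and similarly Spark), the stages are expressed as a single DAG, so a multi-stage query typically no longer maps to multiple YARN applications the way it does with MapReduce.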

