Output file is getting generated on slave machine in Apache Spark
I am facing an issue while running a Spark Java program that reads a file, does some manipulation, and generates an output file at a given path. Everything works fine when the master and slaves are on the same machine, i.e. in standalone-cluster mode. The problem started when I deployed the same program in a multi-machine, multi-node cluster setup: the master runs at x.x.x.102 and the slave runs at x.x.x.104. Both master and slave have shared SSH keys and are reachable from each other.
Initially the slave was not able to read the input file, and I came to know that I need to call sc.addFile() before sc.textFile(). That solved the issue. But now I see the output being generated on the slave machine in a _temporary folder under the output path, i.e. /tmp/emi/_temporary/0/task-xxxx/part-00000, whereas in local cluster mode it works fine and generates the output file in /tmp/emi/part-00000.
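For reference, this is a minimal sketch of the pattern I am using; the class name and the input path are placeholders:

    import org.apache.spark.SparkConf;
    import org.apache.spark.SparkFiles;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class ReadSharedInput {
        public static void main(String[] args) {
            JavaSparkContext sc =
                new JavaSparkContext(new SparkConf().setAppName("emi"));

            // Ship the input file from the driver to every node in the cluster.
            sc.addFile("/path/on/driver/input.txt");

            // Read the node-local copy that addFile() distributed.
            JavaRDD<String> lines = sc.textFile(SparkFiles.get("input.txt"));
            System.out.println("lines: " + lines.count());

            sc.stop();
        }
    }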
I came to know that I need to use SparkFiles.get(), but I am not able to understand how and where to use this method.
Till now I am using:

    DataFrame dataObj = ...;
    dataObj.javaRDD().coalesce(1).saveAsTextFile("file:/tmp/emi");
Can someone please let me know how to call SparkFiles.get()?
In short, how can I tell the slave to create the output file on the machine where the driver is running?
Please help. Thanks a lot in advance.
There is nothing unexpected here. Each worker writes its own part of the data separately. Using the file scheme means that data is written to a file in a file system that is local from the worker's perspective.
Regarding SparkFiles, it is not applicable in this particular case. SparkFiles can be used to distribute common files to the worker machines, but it does not deal with results.
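To illustrate what SparkFiles is actually for, here is a sketch where a side file is distributed once and each task resolves its own executor-local copy; sc and lines stand for an existing context and RDD, and the lookup file is hypothetical:

    import org.apache.spark.SparkFiles;
    import org.apache.spark.api.java.JavaRDD;

    // Distribute a common side file (a hypothetical lookup table) to all nodes.
    sc.addFile("/path/on/driver/lookup.txt");

    JavaRDD<String> tagged = lines.map(line -> {
        // Each task resolves the copy local to its own executor,
        // not a path on the driver.
        String localLookup = SparkFiles.get("lookup.txt");
        return line + "\t" + localLookup;
    });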
If for some reason you want to perform writes on the machine used to run the driver code, you'll have to fetch the data to the driver machine first (either with collect, which requires enough memory to fit all the data, or toLocalIterator, which collects one partition at a time and requires multiple jobs) and then use standard tools to write the results to the local file system. In general though, writing to the driver is not good practice and most of the time it is simply useless.
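A hedged sketch of that approach using toLocalIterator, assuming dataObj is the DataFrame from the question and that the /tmp/emi directory already exists on the driver machine:

    import java.io.IOException;
    import java.io.PrintWriter;
    import java.util.Iterator;
    import org.apache.spark.sql.Row;

    // toLocalIterator() pulls one partition at a time, so only a single
    // partition has to fit in driver memory (collect() would need them all).
    try (PrintWriter out = new PrintWriter("/tmp/emi/part-00000")) {
        Iterator<Row> rows = dataObj.javaRDD().toLocalIterator();
        while (rows.hasNext()) {
            // The mkString() formatting is an assumption; adjust as needed.
            out.println(rows.next().mkString(","));
        }
    } catch (IOException e) {
        e.printStackTrace();
    }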