java - How to load external models when processing Spark DataFrames?


I am trying to process large volumes of text. As part of the process I want to do things such as tokenization and stemming. Some of these steps require loading an external model (for example, OpenNLP tokenizers). I am currently trying the following approach:

    SparkConf sparkConf = new SparkConf().setAppName("spark tokenizer");
    JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
    SQLContext sqlContext = new SQLContext(sparkContext);
    DataFrame corpus = sqlContext.read().text("/home/zezke/document.nl");

    // Create pipeline components
    Tokenizer tokenizer = new Tokenizer()
            .setInputCol("value")
            .setOutputCol("tokens");
    DataFrame tokenizedCorpus = tokenizer.transform(corpus);

    // Save the output
    tokenizedCorpus.write().mode(SaveMode.Overwrite).json("/home/zezke/experimentoutput");

The approach I am currently trying uses a UnaryTransformer.

    public class Tokenizer extends UnaryTransformer<String, List<String>, Tokenizer> implements Serializable {

        private final static String uid = Tokenizer.class.getSimpleName() + "_" + UUID.randomUUID().toString();

        private static Map<String, String> stringReplaceMap;

        @Override
        public void validateInputType(DataType inputType) {
            assert (inputType.equals(DataTypes.StringType)) :
                    String.format("Input type must be %s, but got %s", DataTypes.StringType.simpleString(), inputType.simpleString());
        }

        // Returns the function that is applied to every value of the input column
        public Function1<String, List<String>> createTransformFunc() {
            Function1<String, List<String>> f = new TokenizerFunction();
            return f;
        }

        public DataType outputDataType() {
            return DataTypes.createArrayType(DataTypes.StringType, true);
        }

        public String uid() {
            return uid;
        }

        private class TokenizerFunction extends AbstractFunction1<String, List<String>> implements Serializable {
            public List<String> apply(String sentence) {
                ... code goes here
            }
        }

    }

Now my questions are:

  1. What is the best time to load the model? I don't want to load the model multiple times.
  2. How do I distribute the model to the various nodes?

Thanks in advance. Spark is a bit daunting to get into, but it looks promising.

You can load the model in the driver code and store it as an attribute in the Tokenizer object. The model will then be serialized and transported to the worker nodes automatically. This method requires the model to fit inside the driver's memory.
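To illustrate the driver-side approach without a cluster, here is a minimal sketch in plain Java. It assumes the function object travels to the workers via standard Java serialization (Spark's default for closures) and simulates that with a byte-array round trip. `WhitespaceModel`, `EagerTokenizerFunction`, and `shipToWorker` are hypothetical names made up for this example; `WhitespaceModel` only stands in for a real model such as an OpenNLP one, and the sketch assumes the real model class is itself serializable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for an external model (e.g. a tokenizer model).
// It has to be Serializable for the driver-side approach to work.
class WhitespaceModel implements Serializable {
    private static final long serialVersionUID = 1L;
    private final String delimiterRegex;

    WhitespaceModel(String delimiterRegex) {
        this.delimiterRegex = delimiterRegex;
    }

    List<String> tokenize(String sentence) {
        return Arrays.asList(sentence.trim().split(delimiterRegex));
    }
}

// Plays the role of TokenizerFunction: the model is an ordinary field,
// so it is serialized together with the function object.
class EagerTokenizerFunction implements Serializable {
    private static final long serialVersionUID = 1L;
    private final WhitespaceModel model;

    EagerTokenizerFunction(WhitespaceModel model) {
        this.model = model;
    }

    List<String> apply(String sentence) {
        return model.tokenize(sentence);
    }
}

class DriverSideLoadingDemo {
    // Simulates shipping the function from the driver to a worker:
    // serialize it to bytes, then deserialize a fresh copy.
    static EagerTokenizerFunction shipToWorker(EagerTokenizerFunction function) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(function);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (EagerTokenizerFunction) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "Driver": build the model once and hand it to the function.
        EagerTokenizerFunction onDriver =
            new EagerTokenizerFunction(new WhitespaceModel("\\s+"));
        // "Worker": receives a deserialized copy that already contains the model.
        EagerTokenizerFunction onWorker = shipToWorker(onDriver);
        System.out.println(onWorker.apply("dit is een zin"));
    }
}
```

The trade-off is that the full model is part of the serialized function, so a large model makes every shipped copy correspondingly large.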

Otherwise, you can load the model inside createTransformFunc() and store it inside the TokenizerFunction object as an attribute. I believe that this way each worker node will load the model on its own, though I'm not 100% sure.
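Because it is uncertain where createTransformFunc() actually runs, a variant that does not depend on it is to keep the model out of the serialized function entirely and load it lazily in a static holder, so that each JVM loads it at most once on first use. This is a swapped-in technique (the initialization-on-demand holder idiom), not something stated in the answer above. The sketch below is plain Java; `SharedModel`, `LazyTokenizerFunction`, and `roundTrip` are hypothetical names, and `loadCount` exists only to make the number of loads observable.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a model that is expensive to load and
// does not need to be Serializable, because it is never shipped.
class SharedModel {
    // Counts how often the model was loaded in this JVM (for demonstration only).
    static final AtomicInteger loadCount = new AtomicInteger();

    private final String delimiterRegex;

    private SharedModel(String delimiterRegex) {
        this.delimiterRegex = delimiterRegex;
    }

    // The JVM initializes Holder exactly once, on first access, in a thread-safe way.
    private static class Holder {
        static final SharedModel INSTANCE = load();
    }

    private static SharedModel load() {
        loadCount.incrementAndGet();
        // A real implementation would read the model from a file or the classpath here.
        return new SharedModel("\\s+");
    }

    static SharedModel get() {
        return Holder.INSTANCE;
    }

    List<String> tokenize(String sentence) {
        return Arrays.asList(sentence.trim().split(delimiterRegex));
    }
}

// Plays the role of TokenizerFunction: it has no model field,
// so only this tiny object is serialized.
class LazyTokenizerFunction implements Serializable {
    private static final long serialVersionUID = 1L;

    List<String> apply(String sentence) {
        return SharedModel.get().tokenize(sentence);
    }
}

class LazyLoadingDemo {
    // Simulates shipping the function to a worker via Java serialization.
    static LazyTokenizerFunction roundTrip(LazyTokenizerFunction function) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(function);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (LazyTokenizerFunction) in.readObject();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Two deserialized copies, as if two tasks ran in the same JVM.
        LazyTokenizerFunction first = roundTrip(new LazyTokenizerFunction());
        LazyTokenizerFunction second = roundTrip(new LazyTokenizerFunction());
        System.out.println(first.apply("dit is een zin"));
        System.out.println(second.apply("nog een zin"));
        System.out.println("model loads in this JVM: " + SharedModel.loadCount.get());
    }
}
```

With this layout the model file has to be readable from every worker, since each JVM loads its own copy instead of receiving one from the driver.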

