java - How to load external models when processing Spark DataFrames?
I am trying to process large volumes of text. As part of the process I want to do things such as tokenization and stemming. Some of these steps require loading an external model (for example the OpenNLP tokenizers). I am trying the following approach:
SparkConf sparkConf = new SparkConf().setAppName("Spark Tokenizer");
JavaSparkContext sparkContext = new JavaSparkContext(sparkConf);
SQLContext sqlContext = new SQLContext(sparkContext);

DataFrame corpus = sqlContext.read().text("/home/zezke/document.nl");

// Create pipeline components
Tokenizer tokenizer = new Tokenizer()
        .setInputCol("value")
        .setOutputCol("tokens");
DataFrame tokenizedCorpus = tokenizer.transform(corpus);

// Save the output
tokenizedCorpus.write().mode(SaveMode.Overwrite).json("/home/zezke/experimentOutput");
The current approach I am trying uses a UnaryTransformer:
public class Tokenizer extends UnaryTransformer<String, List<String>, Tokenizer> implements Serializable {

    private final static String UID = Tokenizer.class.getSimpleName() + "_" + UUID.randomUUID().toString();
    private static Map<String, String> stringReplaceMap;

    @Override
    public void validateInputType(DataType inputType) {
        assert (inputType.equals(DataTypes.StringType)) :
                String.format("Input type must be %s, but got %s",
                        DataTypes.StringType.simpleString(), inputType.simpleString());
    }

    public Function1<String, List<String>> createTransformFunc() {
        Function1<String, List<String>> f = new TokenizerFunction();
        return f;
    }

    public DataType outputDataType() {
        return DataTypes.createArrayType(DataTypes.StringType, true);
    }

    public String uid() {
        return UID;
    }

    private class TokenizerFunction extends AbstractFunction1<String, List<String>> implements Serializable {
        public List<String> apply(String sentence) {
            // ... code goes here
        }
    }
}
Now my questions are:

- What is the best time to load the model? I don't want to load the model multiple times.
- How do I distribute the model to the various nodes?

Thanks in advance, Spark is a bit daunting to get into, but it looks promising.
You can load the model in the driver code and store it as an attribute in the Tokenizer object. The model will then be serialized and transported to the worker nodes automatically. This method requires that the model fits inside the driver's memory.
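A minimal sketch of that first option, assuming an OpenNLP tokenizer model. Since OpenNLP's TokenizerModel is not guaranteed to be java.io.Serializable, the sketch ships the raw model bytes (a plain byte[], which is) and rebuilds the TokenizerME lazily in each worker JVM; the class name and model path are illustrative, not from the original post:

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.Serializable;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

import scala.runtime.AbstractFunction1;

// Hypothetical transform function: the model file is read once in the
// driver, and its bytes travel with the serialized closure.
public class DriverLoadedTokenizerFunction extends AbstractFunction1<String, List<String>>
        implements Serializable {

    private final byte[] modelBytes;         // serializable, shipped to the workers
    private transient TokenizerME tokenizer; // rebuilt per JVM, never serialized

    public DriverLoadedTokenizerFunction(String modelPath) throws IOException {
        // Runs in the driver: read the model file once.
        this.modelBytes = Files.readAllBytes(Paths.get(modelPath));
    }

    @Override
    public List<String> apply(String sentence) {
        // Runs on the workers: build the tokenizer on first use in each JVM.
        if (tokenizer == null) {
            try {
                tokenizer = new TokenizerME(new TokenizerModel(new ByteArrayInputStream(modelBytes)));
            } catch (IOException e) {
                throw new RuntimeException("Could not deserialize the OpenNLP model", e);
            }
        }
        return Arrays.asList(tokenizer.tokenize(sentence));
    }
}

Your createTransformFunc() would then just return new DriverLoadedTokenizerFunction("/path/to/en-token.bin") (path hypothetical).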
Otherwise, you can load the model inside createTransformFunc() and store it inside the TokenizerFunction object as an attribute. I believe that way each worker node will load the model on its own, though I'm not 100% sure.
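One caveat with that second option: createTransformFunc() itself still runs in the driver, so to be sure the load really happens on the workers you can ship only the model path and defer the actual load to the first apply() call. A sketch of that variant, assuming the model file is readable on every node (e.g. a shared filesystem, or distributed with --files); names and path are again hypothetical:

import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;

import scala.runtime.AbstractFunction1;

// Hypothetical variant: only the path string travels with the closure;
// each worker JVM opens and loads the model file itself on first use.
public class WorkerLoadedTokenizerFunction extends AbstractFunction1<String, List<String>>
        implements Serializable {

    private final String modelPath;          // must exist on every worker node
    private transient TokenizerME tokenizer; // loaded lazily, never serialized

    public WorkerLoadedTokenizerFunction(String modelPath) {
        this.modelPath = modelPath;
    }

    @Override
    public List<String> apply(String sentence) {
        if (tokenizer == null) {
            try (InputStream in = new FileInputStream(modelPath)) {
                tokenizer = new TokenizerME(new TokenizerModel(in));
            } catch (IOException e) {
                throw new RuntimeException("Could not load the OpenNLP model on the worker", e);
            }
        }
        return Arrays.asList(tokenizer.tokenize(sentence));
    }
}

This trades driver memory for one file read per worker JVM, which is useful when the model is too large to serialize comfortably with the closure.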