python - Fill Pyspark dataframe column null values with average value from same column -


with dataframe this,

rdd_2 = sc.parallelize([(0,10,223,"201601"), (0,10,83,"2016032"),(1,20,none,"201602"),(1,20,3003,"201601"), (1,20,none,"201603"), (2,40, 2321,"201601"), (2,30, 10,"201602"),(2,61, none,"201601")])  df_data = sqlcontext.createdataframe(rdd_2, ["id", "type", "cost", "date"]) df_data.show()  +---+----+----+-------+ | id|type|cost|   date| +---+----+----+-------+ |  0|  10| 223| 201601| |  0|  10|  83|2016032| |  1|  20|null| 201602| |  1|  20|3003| 201601| |  1|  20|null| 201603| |  2|  40|2321| 201601| |  2|  30|  10| 201602| |  2|  61|null| 201601| +---+----+----+-------+ 

i need fill null values average of existing values, expected result being

+---+----+----+-------+ | id|type|cost|   date| +---+----+----+-------+ |  0|  10| 223| 201601| |  0|  10|  83|2016032| |  1|  20|1128| 201602| |  1|  20|3003| 201601| |  1|  20|1128| 201603| |  2|  40|2321| 201601| |  2|  30|  10| 201602| |  2|  61|1128| 201601| +---+----+----+-------+ 

where 1128 average of existing values. need several columns.

my current approach use na.fill:

fill_values = {column: df_data.agg({column:"mean"}).flatmap(list).collect()[0] column in df_data.columns if column not in ['date','id']} df_data = df_data.na.fill(fill_values)  +---+----+----+-------+ | id|type|cost|   date| +---+----+----+-------+ |  0|  10| 223| 201601| |  0|  10|  83|2016032| |  1|  20|1128| 201602| |  1|  20|3003| 201601| |  1|  20|1128| 201603| |  2|  40|2321| 201601| |  2|  30|  10| 201602| |  2|  61|1128| 201601| +---+----+----+-------+ 

but cumbersome. ideas?

well, 1 way or have to:

  • compute statistics
  • fill blanks

it pretty limits can improve here not mention question better fit codereview. still:

  • replace flatmap(list).collect()[0] first()[0] or structure unpacking
  • compute stats single action
  • use built-in row methods extract dictionary

the final result this:

def fill_with_mean(df, exclude=set()):      stats = df.agg(*(         avg(c).alias(c) c in df.columns if c not in exclude     ))     return df.na.fill(stats.first().asdict())  fill_with_mean(df_data, ["id", "date"]) 

in spark 2.2 or later can use imputer. see replace missing values mean - spark dataframe.


Comments

Popular posts from this blog

sequelize.js - Sequelize group by with association includes id -

java - Android raising EPERM (Operation not permitted) when attempting to send UDP packet after network connection -

c++ - Migration from QScriptEngine to QJSEngine -