python - Fill PySpark dataframe column null values with average value from same column
With this dataframe,
    rdd_2 = sc.parallelize([
        (0, 10, 223, "201601"), (0, 10, 83, "2016032"),
        (1, 20, None, "201602"), (1, 20, 3003, "201601"),
        (1, 20, None, "201603"), (2, 40, 2321, "201601"),
        (2, 30, 10, "201602"), (2, 61, None, "201601")])
    df_data = sqlContext.createDataFrame(rdd_2, ["id", "type", "cost", "date"])
    df_data.show()

    +---+----+----+-------+
    | id|type|cost|   date|
    +---+----+----+-------+
    |  0|  10| 223| 201601|
    |  0|  10|  83|2016032|
    |  1|  20|null| 201602|
    |  1|  20|3003| 201601|
    |  1|  20|null| 201603|
    |  2|  40|2321| 201601|
    |  2|  30|  10| 201602|
    |  2|  61|null| 201601|
    +---+----+----+-------+
I need to fill the null values with the average of the existing values. The expected result is:
    +---+----+----+-------+
    | id|type|cost|   date|
    +---+----+----+-------+
    |  0|  10| 223| 201601|
    |  0|  10|  83|2016032|
    |  1|  20|1128| 201602|
    |  1|  20|3003| 201601|
    |  1|  20|1128| 201603|
    |  2|  40|2321| 201601|
    |  2|  30|  10| 201602|
    |  2|  61|1128| 201601|
    +---+----+----+-------+
where 1128 is the average of the existing cost values. I need to do this for several columns.
My current approach uses na.fill:
    fill_values = {column: df_data.agg({column: "mean"}).flatMap(list).collect()[0]
                   for column in df_data.columns if column not in ['date', 'id']}
    df_data = df_data.na.fill(fill_values)

    +---+----+----+-------+
    | id|type|cost|   date|
    +---+----+----+-------+
    |  0|  10| 223| 201601|
    |  0|  10|  83|2016032|
    |  1|  20|1128| 201602|
    |  1|  20|3003| 201601|
    |  1|  20|1128| 201603|
    |  2|  40|2321| 201601|
    |  2|  30|  10| 201602|
    |  2|  61|1128| 201601|
    +---+----+----+-------+
But this is cumbersome. Any ideas?
Well, one way or another you have to:
- compute the statistics
- fill the blanks
That pretty much limits what can be improved here, not to mention that this question might be a better fit for Code Review. Still, a few things:
- replace flatMap(list).collect()[0] with first()[0] or structure unpacking (sketched below)
- compute all the statistics in a single action
- use the built-in Row methods to extract a dictionary
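For instance, a minimal sketch of those points against the question's df_data (the values in the comments follow from the sample data):

    from pyspark.sql.functions import avg

    # One aggregate over all columns of interest -> a single Row, one action
    row = df_data.agg(avg("cost").alias("cost"), avg("type").alias("type")).first()

    row[0]                      # positional access, replaces flatMap(list).collect()[0]
    cost_mean, type_mean = row  # structure unpacking; Row is a tuple subclass
    row.asDict()                # built-in Row method -> {'cost': 1128.0, 'type': 26.375}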
The final result looks like this:
    from pyspark.sql.functions import avg

    def fill_with_mean(df, exclude=set()):
        # Compute the mean of every non-excluded column in a single action
        stats = df.agg(*(
            avg(c).alias(c) for c in df.columns if c not in exclude
        ))
        # Row.asDict() gives a {column: mean} mapping suitable for na.fill
        return df.na.fill(stats.first().asDict())

    fill_with_mean(df_data, ["id", "date"])
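Aggregating all the columns at once means a single Spark job instead of one agg/collect round trip per column, which is what made the original dictionary-comprehension approach cumbersome.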
In Spark 2.2 or later you can also use Imputer. See Replace missing values with mean - Spark Dataframe.
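A rough sketch of that route, assuming the long cost column is cast to double first (Imputer expects float/double inputs), with cost_imputed as an arbitrary output column name of my own choosing:

    from pyspark.ml.feature import Imputer
    from pyspark.sql.functions import col

    # Imputer treats nulls as missing and fills them with the column mean
    df_double = df_data.withColumn("cost", col("cost").cast("double"))

    imputer = Imputer(strategy="mean", inputCols=["cost"], outputCols=["cost_imputed"])
    imputer.fit(df_double).transform(df_double).show()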