apache spark - Filtering two RDDs in PySpark


I want to do a simple filter here. The RDDs are:

rdd1 = [96, 104, 112]
rdd2 = [112, 30, 384, 96, 288, 352, 104, 368]

The resulting RDD should contain the elements of rdd2 that are not in rdd1:

rdd3 = [30, 384, 288, 352, 368]

How should I achieve this?

I tried:

 rdd3 = rdd1.map(lambda r: r != r in rdd2) 

but it is not working. How can I solve this?

Thanks in advance.

You can use the subtract method, which will:

Return each value in self that is not contained in other.

rdd1 = sc.parallelize([96, 104, 112])
rdd2 = sc.parallelize([112, 30, 384, 96, 288, 352, 104, 368])

rdd2.subtract(rdd1).collect()
## [384, 352, 368, 288, 30]
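If rdd1 is small enough to collect to the driver, a minimal alternative sketch (an assumption on my part, not something the question states) is to broadcast its values as a set and filter rdd2 against it. This is effectively a working version of the filter the original map attempt was going for, and the names exclude and rdd3 below are just illustrative:

# Assumes rdd1 fits in driver memory: collect it into a set and broadcast it
exclude = sc.broadcast(set(rdd1.collect()))
# Keep only the elements of rdd2 that are not in the broadcast set
rdd3 = rdd2.filter(lambda x: x not in exclude.value)
rdd3.collect()
## [30, 384, 288, 352, 368]

Unlike subtract, which involves a shuffle, this approach preserves the original order of rdd2's elements.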
