apache spark - filtering two RDDs in pyspark
I want a simple filtration here. The RDDs are:

rdd1 = [96, 104, 112]
rdd2 = [112, 30, 384, 96, 288, 352, 104, 368]

The result should be an RDD containing the elements of rdd2 that are not in rdd1, so:

rdd3 = [30, 384, 288, 352, 368]

How should I achieve this?
I tried this:

rdd3 = rdd1.map(lambda r: r != r in rdd2)

but it is not working. How do I solve this?

Thanks in advance.
You can use the subtract method, which will:

Return each value in self that is not contained in other.
rdd1 = sc.parallelize([96, 104, 112])
rdd2 = sc.parallelize([112, 30, 384, 96, 288, 352, 104, 368])

rdd2.subtract(rdd1).collect()
## [384, 352, 368, 288, 30]
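Note that subtract involves a shuffle, so the order of the result is not guaranteed. If rdd1 is small enough to collect to the driver, a broadcast variable plus filter is an alternative sketch that keeps the element order of rdd2 (the variable names here are illustrative):

exclude = set(rdd1.collect())        # assumes rdd1 fits in driver memory
exclude_bc = sc.broadcast(exclude)   # ship the set to every executor once

rdd3 = rdd2.filter(lambda x: x not in exclude_bc.value)
rdd3.collect()
## [30, 384, 288, 352, 368]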