Connecting Spark Streaming to StreamSets input


I am wondering if it is possible to provide input to Spark Streaming from StreamSets. I noticed that Spark Streaming is not listed among the supported destination connectors at https://streamsets.com/connectors/ .

I am exploring whether there are other ways to connect them for a sample POC.

The best way to process data coming into StreamSets Data Collector (SDC) in Apache Spark Streaming is to write the data out to a Kafka topic and read the data from there. This allows you to separate Spark Streaming from SDC, so both can proceed at their own rate of processing.
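For the POC, the Spark Streaming side can then consume that topic with the Kafka direct stream API. Below is a minimal sketch; the broker address, topic name (sdc-output), group id, and batch interval are placeholder values for illustration, not anything dictated by SDC:

    import org.apache.kafka.common.serialization.StringDeserializer
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

    object SdcKafkaSparkPoc {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("sdc-kafka-spark-poc")
        // Spark Streaming micro-batches are time-based: one batch every 10 seconds here
        val ssc = new StreamingContext(conf, Seconds(10))

        val kafkaParams = Map[String, Object](
          "bootstrap.servers" -> "localhost:9092",           // placeholder broker address
          "key.deserializer" -> classOf[StringDeserializer],
          "value.deserializer" -> classOf[StringDeserializer],
          "group.id" -> "spark-streaming-poc",
          "auto.offset.reset" -> "latest",
          "enable.auto.commit" -> (false: java.lang.Boolean) // offsets handled by the job
        )

        // "sdc-output" is a placeholder topic name; the SDC pipeline's Kafka Producer
        // destination would be configured to write to this same topic
        val stream = KafkaUtils.createDirectStream[String, String](
          ssc,
          PreferConsistent,
          Subscribe[String, String](Array("sdc-output"), kafkaParams)
        )

        // Placeholder processing: just count the records in each Spark micro-batch
        stream.map(record => record.value)
          .foreachRDD(rdd => println(s"Records in this batch: ${rdd.count()}"))

        ssc.start()
        ssc.awaitTermination()
      }
    }

On the SDC side, the pipeline would simply use the Kafka Producer destination pointed at the same brokers and topic.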

SDC micro-batches are defined by record count, while Spark Streaming micro-batches are dictated by time. This means that each SDC batch may not (and most likely will not) correspond to a Spark Streaming batch (most likely a Spark Streaming batch will contain data from several SDC batches). SDC "commits" each batch once it is sent to the destination - having a batch written directly to Spark Streaming would mean that each SDC batch would need to correspond to a Spark Streaming batch to avoid data loss. It is also possible that Spark Streaming "re-processes" already committed batches due to processing or node failures. SDC cannot re-process committed batches - so to recover from a situation like this, you'd really have to write to something like Kafka that allows you to re-process batches. A direct connector that writes from SDC to Spark Streaming would therefore be complex and likely have data loss issues.
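To make the re-processing point concrete: with the Kafka direct stream, the Spark job can commit its offsets back to Kafka only after its own output has succeeded, so after a processing or node failure it re-reads the affected offset range from Kafka rather than needing SDC to replay anything. A minimal sketch, reusing the stream value from the example above:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    // `stream` is the direct Kafka stream created in the previous sketch.
    // Offsets are committed back to Kafka only after the batch's results have been
    // written out, so a failed batch is re-read from Kafka instead of being lost.
    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ... process the batch and write the results to a durable destination ...

      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }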

In short, your best option is SDC -> Kafka -> Spark Streaming.

