cql - Cassandra: time series with dynamic columns/sources -


there lot of questions storing time series cassandra, no 1 fits our question, because assume fixed data source , known column names.

about our problem: we're developing stream data engine, can connect different data sources, engine receives data continuous streams. so, have example 2 data sources called energy , weather. each incoming stream (or data source) has own unique key , own schema, e.g.:

source id 1 schema energy may has stream:

timestamp | volts | amps | watts | state 1467795743173 | 210.4 | 2.3 | 290  | "up" 1467795744173 | 212.1 | 2.1 | 287  | "up" 1467795745173 | 213.1 | 2.2 | 242  | "up" ... 

source id 2 schema weather may has stream:

ts | condition | temp 1467795740632 | "cloudy" | 33.1 1467795741381 | "cloudy" | 33.4 ... 

now want give possibility store streams cassandra, can used later "replay" recorded stream, fetch historic results (e.g. analytics) , enrich/join incoming streams specific stored data value (e.g. show/compare current energy value 1 recorded week ago). in sum need these things:

  • read sequentially certian source id
  • grab ranges source id (e.g. fetch device id 2 2016-07-06 between 10:00 , 11:00)
  • save arbitrary columns (in addition fixed timestamp-column), because don't know schema in advance.

since we're new cassandra, we're have no idea best way model tables , columns.

most answers similar questions don't face possibility have unknown schemas (they assume there timestamp, deviceid , double value) , facing problem primary/partion keys.

we read 2 options:

  1. should have 1 single table, e.g. datapoints, contains data source-id+day partition keys? how handle dynamic columns here? have serialize entire tuple single common column, e.g. putting energydata weather data in 1 column called value.

so have table:

create table datapoints (  sourceid bigint,  date text,   time timestamp,   value text,   primary key ((sourceid, date), time)  ) 

obviously, cannot use aggreate or other functions on raw values (e.g. watts, amps or temp).

the other possibility create table per data source, e.g. using day partition key:

create table energy_1 (     date text,   time timestamp,   volts double,  amps double,  watts double,  state text,   primary key (date, time)  )  create table weather_2 (  date text,   time timestamp,   condition text,  temp double,   primary key (date, time)  ) 

since data partioned using date, possible fetch e.g. week or not possible? although there more 1 data source same schema (e.g. 2 energy data sources), don't know , seldom. using device-id partioning key not make sense, because there 1 device-key per schema.

but second solution looks not suitable.

we hope solved similar problem , has advice?!

remark: don't want use other time series db :)

consider using map data values:

create table datapoints (   sourceid bigint,   date text,    time timestamp,    values map<text, text>,   primary key ((sourceid, date), time)  ) 

you use maps distinct data types:

create table datapoints (   sourceid bigint,   date text,    time timestamp,    strvalues map<text, text>,   intvalues map<text, int>,   decvalues map<text, decimal>,   primary key ((sourceid, date), time)  ) 

Comments

Popular posts from this blog

sequelize.js - Sequelize group by with association includes id -

android - Robolectric "INTERNET permission is required" -

java - Android raising EPERM (Operation not permitted) when attempting to send UDP packet after network connection -