cql - Cassandra: time series with dynamic columns/sources -
there lot of questions storing time series cassandra, no 1 fits our question, because assume fixed data source , known column names.
about our problem: we're developing stream data engine, can connect different data sources, engine receives data continuous streams. so, have example 2 data sources called energy
, weather
. each incoming stream (or data source) has own unique key , own schema, e.g.:
source id 1
schema energy
may has stream:
timestamp | volts | amps | watts | state 1467795743173 | 210.4 | 2.3 | 290 | "up" 1467795744173 | 212.1 | 2.1 | 287 | "up" 1467795745173 | 213.1 | 2.2 | 242 | "up" ...
source id 2
schema weather
may has stream:
ts | condition | temp 1467795740632 | "cloudy" | 33.1 1467795741381 | "cloudy" | 33.4 ...
now want give possibility store streams cassandra, can used later "replay" recorded stream, fetch historic results (e.g. analytics) , enrich/join incoming streams specific stored data value (e.g. show/compare current energy value 1 recorded week ago). in sum need these things:
- read sequentially certian source id
- grab ranges source id (e.g. fetch device id 2 2016-07-06 between 10:00 , 11:00)
- save arbitrary columns (in addition fixed timestamp-column), because don't know schema in advance.
since we're new cassandra, we're have no idea best way model tables , columns.
most answers similar questions don't face possibility have unknown schemas (they assume there timestamp, deviceid , double value) , facing problem primary/partion keys.
we read 2 options:
- should have 1 single table, e.g.
datapoints
, contains data source-id+day partition keys? how handle dynamic columns here? have serialize entire tuple single common column, e.g. putting energydata weather data in 1 column calledvalue
.
so have table:
create table datapoints ( sourceid bigint, date text, time timestamp, value text, primary key ((sourceid, date), time) )
obviously, cannot use aggreate or other functions on raw values (e.g. watts, amps or temp).
the other possibility create table per data source, e.g. using day partition key:
create table energy_1 ( date text, time timestamp, volts double, amps double, watts double, state text, primary key (date, time) ) create table weather_2 ( date text, time timestamp, condition text, temp double, primary key (date, time) )
since data partioned using date, possible fetch e.g. week or not possible? although there more 1 data source same schema (e.g. 2 energy data sources), don't know , seldom. using device-id partioning key not make sense, because there 1 device-key per schema.
but second solution looks not suitable.
we hope solved similar problem , has advice?!
remark: don't want use other time series db :)
consider using map data values:
create table datapoints ( sourceid bigint, date text, time timestamp, values map<text, text>, primary key ((sourceid, date), time) )
you use maps distinct data types:
create table datapoints ( sourceid bigint, date text, time timestamp, strvalues map<text, text>, intvalues map<text, int>, decvalues map<text, decimal>, primary key ((sourceid, date), time) )
Comments
Post a Comment