Machine learning - Ground-truth and feature extraction for predictive modelling


I have a dataset of users. Each user has daily activity information (numerical values representing measurements of physical activities).

In addition, each user has, for each day, a boolean value indicating whether he/she took a particular action.

The dataset looks as follows:

+------+----------+----------+----------+-------+
|userid|      date| activity1| activity2| action|
+------+----------+----------+----------+-------+
| user1|2016-06-05|       5.3|         6|  false|
| user1|2016-06-04|       3.1|         8|   true|
| user1|2016-06-03|       2.0|        13|  false|
| user1|2016-06-02|       4.7|         1|  false|
| user1|2016-06-01|       1.3|         9|  false|
| user1|   ...etc.|       ...|       ...|    ...|
| user2|2016-06-05|       0.6|         5|   true|
| user2|2016-06-04|       3.0|         5|  false|
| user2|2016-06-03|       0.0|         0|  false|
| user2|2016-06-02|       2.1|         3|  false|
| user2|2016-06-01|       6.3|         9|  false|
| user2|   ...etc.|       ...|       ...|    ...|
| user3|2016-06-05|       5.3|         0|  false|
| user3|2016-06-04|       5.3|        11|  false|
| user3|2016-06-03|       6.8|         5|  false|
| user3|2016-06-02|       4.9|         2|  false|
| user3|   ...etc.|       ...|       ...|    ...|
+------+----------+----------+----------+-------+

Note that the dataset is not fixed: one new row is added for each user on every new day. The number of columns is fixed.

Goal

Build a model that predicts whether a user will take the action in the near future (e.g. in any of the next 7 days).

Approach

My approach is to build feature vectors representing the activity values of each user over a period of time, and to use the action column as the source of ground truth. I then feed the ground truth and the feature vectors to a binary classification training algorithm (e.g. SVM or random forest) in order to generate a model able to predict whether a user will take the action or not.

Problem

I started with the positive examples, i.e. users who took the action. To extract the feature vector of a positive example, I combined the activity values in the x (30 or 7 or 1) days preceding the action (the day of taking the action included).

When I moved on to the negative examples, things got less obvious: I am not sure how to select negative examples or how to extract features from them. That has led me to re-question whether my way of selecting positive examples and building feature vectors is correct.

Questions

  1. How do I build the ground truth of positive (users who did take the action) and negative (users who didn't take the action) examples?
  2. What is a negative example in this case? Is it a user who didn't take the action in a fixed period of time? What if they didn't take the action in the fixed period, but took it right after?
  3. What are possible approaches to selecting the ranges of dates to extract feature vectors from?

A more general question

Are there more suitable approaches (other than classification) to solve this kind of problem?

You're off to a good start with the representation you have. If you look at the last x days of activity for a user before they take the action, you have m time series, one for each activity. In your example m=2, but in practice, I gather, you'd have many more. You can concatenate the m time series to obtain an m*x dimensional feature vector.

For example, if we take m=2 and x=5, we'd have, for user 1, starting at 2016-06-05 and going back, one time series for activity 1, [1.3 4.7 2.0 3.1 5.3], and one time series for activity 2, [9 1 13 8 6] (both listed oldest day first). We can concatenate them to obtain the feature vector [1.3 4.7 2.0 3.1 5.3 9 1 13 8 6 action=false].

Build loads of these, feed them to a binary classifier, and you've got the basis for something neat.
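
A minimal sketch of that construction in plain Python, using the user1 rows from the question's table. The helper names `window_features` and `all_windows` are made up for illustration, and the code assumes the rows are sorted by date with no missing days.

```python
from datetime import date

# Rows for user1 from the question's table: (date, activity1, activity2, action).
rows = [
    (date(2016, 6, 1), 1.3, 9, False),
    (date(2016, 6, 2), 4.7, 1, False),
    (date(2016, 6, 3), 2.0, 13, False),
    (date(2016, 6, 4), 3.1, 8, True),
    (date(2016, 6, 5), 5.3, 6, False),
]


def window_features(rows, end, x):
    """Concatenate the m activity series over the x days ending at index `end`."""
    window = rows[end - x + 1 : end + 1]
    n_activities = len(window[0]) - 2  # everything except the date and action columns
    features = []
    for k in range(n_activities):
        features.extend(r[1 + k] for r in window)
    label = window[-1][-1]  # action on the last day of the window
    return features, label


def all_windows(rows, x):
    """Slide the window one day at a time over a date-sorted list of rows."""
    return [window_features(rows, end, x) for end in range(x - 1, len(rows))]


features, label = window_features(rows, end=4, x=5)
print(features, label)
# → [1.3, 4.7, 2.0, 3.1, 5.3, 9, 1, 13, 8, 6] False
```

Calling `all_windows(rows, x)` per user and pooling the results gives the "loads of these" to feed to the classifier.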

Things depend a little bit on what the action is and how rarely it occurs:

- If the action is big, non-reversible and rare, such as "has signed up for our premium product" or "has had a heart attack", then you're safe in looking at the data as prescribed above.
- If the action occurs more often, and can occur multiple times for a user, such as "has shared running status on Facebook from our app today", then you need to filter the negatives more aggressively, and perhaps look at a smaller window, or only at users who never take the action, etc.

In general, try the simple thing first and see what performance you obtain on an independent test set. If it's good, perhaps there's no need for further engineering. If it's bad, start tweaking things in the ML pipeline, starting with the feature extraction and going down to the parameters of the model or training algorithm.
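
One way to get an independent test set, sketched here as an assumption rather than something the answer prescribes, is to split by user, so that overlapping windows from the same user never land in both sets. The function `split_by_user` and the `(user_id, features, label)` tuple layout are hypothetical.

```python
import random


def split_by_user(examples, test_fraction=0.2, seed=0):
    """Split (user_id, features, label) tuples so no user is in both sets."""
    users = sorted({user_id for user_id, _, _ in examples})
    rng = random.Random(seed)
    rng.shuffle(users)
    n_test = max(1, int(len(users) * test_fraction))
    test_users = set(users[:n_test])
    train = [e for e in examples if e[0] not in test_users]
    test = [e for e in examples if e[0] in test_users]
    return train, test


# Placeholder examples: five users with two windows each.
examples = [(f"user{i}", [float(i), float(j)], False) for i in range(1, 6) for j in range(2)]
train, test = split_by_user(examples)
print(len(train), len(test))
```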

As a modeling choice, if each activity produces a relatively continuous signal over the x days, rather than being spiky (many days of inactivity followed by one of activity), I'd go the route of using a neural network, or at least an SVM with signal-aware kernels, once you have a beefier feature extraction setup. Random forests are not going to be great with signals in this case.

You might also pose the problem as one of anomaly detection, if it's hard to build one class (negatives or positives) but not the other. In this setup you model the distribution of one class, and consider anything that has low probability under that distribution to be an anomaly or outlier. The Coursera ML course is a good starting point for anomaly detection. I believe they build a multivariate Gaussian, which can be improved upon. The kNN suggestion from the comments is also good, though it's going to be more computationally complex. The problem is one of density estimation in its first form, so the whole toolset is available (parametric methods like mixtures of Gaussians, random fields etc., or non-parametric methods like kNN or Gaussian processes etc.).
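
A rough sketch of the density-based idea, simplified to one independent Gaussian per feature instead of a full multivariate Gaussian. The function names, the toy vectors and the threshold of -10 are all illustrative assumptions; in practice the threshold would be tuned on held-out data.

```python
import math


def fit_gaussian(vectors):
    """Estimate a per-feature mean and variance from the well-sampled class."""
    n = len(vectors)
    dims = len(vectors[0])
    means = [sum(v[j] for v in vectors) / n for j in range(dims)]
    variances = [
        max(sum((v[j] - means[j]) ** 2 for v in vectors) / n, 1e-9)  # avoid zero variance
        for j in range(dims)
    ]
    return means, variances


def log_density(vector, means, variances):
    """Log-density under independent Gaussians, one per feature."""
    return sum(
        -0.5 * math.log(2 * math.pi * var) - (value - mu) ** 2 / (2 * var)
        for value, mu, var in zip(vector, means, variances)
    )


def is_anomaly(vector, means, variances, threshold):
    """Flag vectors whose log-density falls below the chosen threshold."""
    return log_density(vector, means, variances) < threshold


# Made-up feature vectors standing in for the class that is easy to collect.
normal = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]
means, variances = fit_gaussian(normal)
print(is_anomaly([1.0, 2.0], means, variances, threshold=-10.0))
print(is_anomaly([5.0, 9.0], means, variances, threshold=-10.0))
```

The first vector sits at the estimated mean and is not flagged; the second is far from the training data and is.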

For question 2, don't worry too much about positive and negative. You're dealing with imperfect information. Whatever system you build is going to have false positives and false negatives. Say you have a user who for 10 years doesn't take the action, but on the 3651st day does. Does that mean the previous 10 years' worth of data is invalid? Not really - those are still examples of a user who doesn't sign up, as opposed to one who does. You do have to take care not to have a bad negative setup - one where, say, more than half of the x days are positive days, but the whole series ends in a negative. That's a meta-parameter you can tweak in order to get good results.
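
The "bad negative" check could look something like the sketch below. It assumes the label of a window is the action on its last day; `keep_example` and `max_positive_fraction` are hypothetical names, with the fraction being the tweakable meta-parameter.

```python
def keep_example(actions, max_positive_fraction=0.5):
    """Decide whether a window is usable as a training example.

    `actions` holds the boolean action values for the x days of the window,
    oldest first; the last entry is taken as the window's label.
    """
    label = actions[-1]
    if label:
        return True  # positives are always kept
    positive_fraction = sum(actions) / len(actions)
    # Drop negatives whose window is mostly made of positive days.
    return positive_fraction <= max_positive_fraction


print(keep_example([False, False, False, False, False]))  # clean negative
print(keep_example([True, True, True, False, False]))     # negative, but 3 of 5 days positive
```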

Similarly for question 3, x is a meta-parameter. It controls the whole process, rather than one model or another. One approach to its selection is going with gut feeling or "domain knowledge": x=1 might be too small and x=365 might be too big, but x=14 or x=30 seem reasonable. If the number of parameters and their domains aren't that large, you can do a grid search - try every combination in turn, and choose the one which gives the pipeline with the best results. The problem is one of combinatorial optimization, and grid search is the most basic algorithm for solving it, but you can go wild with this sub-problem as well.
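
A bare-bones grid search might be sketched as follows. The `toy_evaluate` objective is a placeholder that only exists to make the example runnable; a real one would rebuild the windows for the given x, train the model and return a validation score. The parameter name `n_trees` is likewise an assumption.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Try every combination in `grid` and return the best (params, score)."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        score = evaluate(**params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Placeholder objective, NOT a real model evaluation: it simply peaks at
# x=14 and n_trees=100 so the search has something to find.
def toy_evaluate(x, n_trees):
    return -abs(x - 14) - abs(n_trees - 100) / 100


best_params, best_score = grid_search(
    toy_evaluate, {"x": [1, 7, 14, 30, 365], "n_trees": [50, 100, 200]}
)
print(best_params)
# → {'x': 14, 'n_trees': 100}
```

The cost grows with the product of the grid sizes (15 evaluations here), which is why this only works when the number of parameters and their domains stay small.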

Definitely check out the chapters on proper algorithm performance evaluation and bias-variance tradeoffs in the Coursera course mentioned above, since, with limited data, you might be backing yourself into a pipeline which is very specialised to the training data, but does not generalize well.

