How to remove duplicate columns (by content) in an R data.table?
How can I remove duplicate columns from a data.table, keeping one of them?
I know there are other questions about duplicate columns, but they check for duplicate column names, not content.
What I want is to find columns that have different names but the same content.
Regards
This is a common task in feature engineering. The following code chunk was developed by myself and the community on Kaggle for this purpose:
##### Removing identical features
features_pair <- combn(names(train), 2, simplify = FALSE) # list of column pairs
toremove <- c() # init a vector to store the duplicates
for (pair in features_pair) { # put the pair into temp objects for testing
  f1 <- pair[1]
  f2 <- pair[2]
  if (!(f1 %in% toremove) && !(f2 %in% toremove)) {
    # test for duplicates; isTRUE() avoids an error when the comparison yields NA
    if (isTRUE(all(train[[f1]] == train[[f2]]))) {
      cat(f1, "and", f2, "are equal.\n")
      toremove <- c(toremove, f2) # build the list of duplicates
    }
  }
}
Then you can drop whichever copy of each duplicate you want. By default I drop the version stored in the temporary object f2, and remove them like this:
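train <- train[, setdiff(names(train), toremove), with = FALSE] # keep every column not flagged; with = FALSE is data.table syntax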
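As a shorter alternative to the pairwise loop (not part of the original Kaggle snippet), base R's duplicated() can be applied to the table's columns as a list. The sketch below assumes the data.table package is installed and uses a made-up toy table. Note that this comparison is stricter than ==: an integer column and a double column holding the same numbers are not treated as duplicates, while NA values in matching positions are.

```r
library(data.table)

# Toy table (hypothetical data): b duplicates a, d duplicates c.
train <- data.table(
  a = c(1L, 2L, 3L),
  b = c(1L, 2L, 3L),
  c = c("x", "y", NA),
  d = c("x", "y", NA),
  e = c(3L, 2L, 1L)
)

# duplicated() on a list flags each element that matches an earlier one,
# so the first column of every duplicate group is kept.
dup <- duplicated(as.list(train))
toremove <- names(train)[dup]

# Select the surviving columns by name; with = FALSE treats j as column names.
deduped <- train[, names(train)[!dup], with = FALSE]

print(toremove)
print(names(deduped))
```

Because the whole table is checked in one call, this avoids building every column pair with combn(), which can matter when the table has many columns.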