python - Parsing CSV to pandas DataFrames (one-to-many unmunge)
I have a CSV file imported into a pandas DataFrame. It came from a database export that combined a one-to-many parent and detail table. The format of the CSV file is as follows:
header1, header2, header3, header4, header5, header6
sample1, property1,,,average1,average2
,,detail1,detail2,,
,,detail1,detail2,,
,,detail1,detail2,,
sample2, ...
,,detail1,detail2,,
,,detail1,detail2,,
...
(i.e. line 0 is the header, line 1 is record 1, lines 2 through n are its details, line n+1 is record 2, and so on...)
What is the best way to extricate (renormalize?) the details into separate DataFrames that can be referenced using the values in the sample# records? The number of detail rows is different for each sample.
I can use:
samplelist = df.header2[pd.notnull(df.header2)]
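to get the starting index of each sample, so that I can grab the rows from samplelist.index[0] to samplelist.index[1] and put them in a smaller DataFrame. The detail records have no reference to the sample they came from, so that has to be inferred from the order of the CSV file (notice that there is no intersection of filled and empty fields between the two row types in my example).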
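The index-based slicing described above can be sketched as follows. This is only an illustration under assumptions: the CSV text, the numeric detail values, and the names `csv_text`, `starts`, and `details` are made up for the example, and the column names are written without the spaces after the commas so that they can be used directly as labels.

```python
import io
import pandas as pd

# Hypothetical data mirroring the layout described in the question.
csv_text = """header1,header2,header3,header4,header5,header6
sample1,property1,,,1.5,2.5
,,10,20,,
,,11,21,,
,,12,22,,
sample2,property2,,,3.5,4.5
,,30,40,,
,,31,41,,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Parent rows are the ones where header2 is filled in.
samplelist = df.header2[pd.notnull(df.header2)]

# Row positions where each sample starts, plus a sentinel past the last row.
starts = list(samplelist.index) + [len(df)]

details = {}
for begin, end in zip(starts[:-1], starts[1:]):
    name = df.loc[begin, "header1"]
    # Rows strictly between two parent rows are that sample's details.
    block = df.iloc[begin + 1:end]
    details[name] = block[["header3", "header4"]].reset_index(drop=True)
```

This relies on the default integer index, so that the labels in `samplelist.index` are also row positions.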
Should I make a list of DataFrames, a dict of DataFrames, or a Panel of DataFrames?
Can I somehow create variables from the sample1 record fields and attach them to each DataFrame that holds the detail records (like a collection of objects that each have several scalar members and one DataFrame)?
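One possible shape for such a collection is a dict of small objects, each carrying the scalar sample fields and one detail DataFrame. The sketch below is an assumption-laden illustration, not an established pandas pattern: the `Sample` class, its field names, and the values are all hypothetical. A dict is used rather than a Panel because Panel has since been removed from pandas.

```python
from dataclasses import dataclass

import pandas as pd


@dataclass
class Sample:
    """Hypothetical container: scalar parent fields plus one detail table."""
    name: str
    prop: str
    average1: float
    average2: float
    details: pd.DataFrame


samples = {}

# Made-up values standing in for one parent record and its detail rows.
s = Sample(
    name="sample1",
    prop="property1",
    average1=1.5,
    average2=2.5,
    details=pd.DataFrame({"detail1": [10, 11, 12], "detail2": [20, 21, 22]}),
)
samples[s.name] = s

# Scalar members and the detail table are reachable through the sample name.
mean_detail1 = samples["sample1"].details["detail1"].mean()
```

Keeping the parent fields in their own DataFrame indexed by sample name is an alternative that stays closer to plain pandas.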
Eventually I will create statistics on the data for each grouping of detail records and plot them against values in the sample records (e.g. sampletype, day or date, etc. vs. mystatistic). I will also create intermediate Series attached to each sample grouping, such as a kernel density estimation PDF or a histogram.
Thanks.
You can use the fact that the first column seems to be empty unless it's a new sample record: apply .fillna(method='ffill') and then .groupby('header1') to get all the separate groups. On these, you can calculate statistics right away or store each one as a separate DataFrame. A high-level sketch follows:
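# forward-fill the sample name down over its detail rows
# (recent pandas versions spell this df.header1.ffill())
df.header1 = df.header1.fillna(method='ffill')
for sample, data in df.groupby('header1'):
    print(sample)  # access sample name
    data = ...  # process sample records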
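A more complete version of this forward-fill and groupby idea might look like the sketch below. It is an illustration under assumptions rather than a definitive implementation: the CSV text and numeric values are invented, and the names `is_parent`, `parents`, `detail_rows`, `details`, `stats`, and `summary` are hypothetical. It uses `.ffill()`, the equivalent spelling in recent pandas versions, where passing `method=` to `fillna` is deprecated.

```python
import io
import pandas as pd

# Hypothetical data mirroring the layout described in the question.
csv_text = """header1,header2,header3,header4,header5,header6
sample1,property1,,,1.5,2.5
,,10,20,,
,,11,21,,
,,12,22,,
sample2,property2,,,3.5,4.5
,,30,40,,
,,31,41,,
"""
df = pd.read_csv(io.StringIO(csv_text))

# Mark parent rows first: header2 is only set on sample records.
is_parent = df["header2"].notna()

# Propagate each sample name down over its detail rows.
df["header1"] = df["header1"].ffill()

# Parent table: one row per sample, indexed by sample name.
parents = df.loc[is_parent, ["header1", "header2", "header5", "header6"]]
parents = parents.set_index("header1")

# Detail table: the remaining rows, still tagged with their sample name.
detail_rows = df.loc[~is_parent, ["header1", "header3", "header4"]]

# One DataFrame per sample, keyed by sample name.
details = {
    name: grp.drop(columns="header1").reset_index(drop=True)
    for name, grp in detail_rows.groupby("header1")
}

# Per-sample statistics, joined back onto the parent fields for plotting.
stats = detail_rows.groupby("header1")["header3"].agg(["mean", "std", "count"])
summary = parents.join(stats)
```

Here `summary` holds one row per sample with both the parent fields and the detail statistics, so a statistic can be plotted against any sample-level column, while `details["sample1"]` gives the detail rows for one sample.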