python - How to efficiently compare rows in a pandas DataFrame?


I have a pandas DataFrame containing a record of lightning strikes, with timestamps and global positions, in the following format:

       index      date              time       lat       lon  fix?
    0      1  20160101  00:00:00.9962692   -7.1961  -60.7604     1
    1      2  20160101  00:00:01.0646207   -7.0518  -60.6911     1
    2      3  20160101  00:00:01.1102066  -25.3913  -57.2922     1
    3      4  20160101  00:00:01.2018573   -7.4842  -60.5129     1
    4      5  20160101  00:00:01.2942750   -7.3939  -60.4992     1
    5      6  20160101  00:00:01.4431493   -9.6386  -62.8448     1
    6      8  20160101  00:00:01.5226157  -23.7089  -58.8888     1
    7      9  20160101  00:00:01.5932412   -6.3513  -55.6545     1
    8     10  20160101  00:00:01.6736350  -23.8019  -58.9382     1
    9     11  20160101  00:00:01.6957858  -24.5724  -57.7229     1

The actual DataFrame contains millions of rows. I wish to separate out events that happened far away in space and time from other events, and store them in a new DataFrame, isolated_fixes. I have written code to calculate the separation of any two events, as follows:

    import math as m

    def are_strikes_space_close(strike1, strike2, defclose=100, latpos=3, lonpos=4):
        # Uses the haversine formula to calculate the distance between two points,
        # returning a tuple with a boolean closeness statement and the numerical distance.
        # strike1 and strike2 are (index, row) tuples as yielded by DataFrame.iterrows().
        radlat1 = m.radians(strike1[1][latpos])
        radlon1 = m.radians(strike1[1][lonpos])
        radlat2 = m.radians(strike2[1][latpos])
        radlon2 = m.radians(strike2[1][lonpos])

        a = (m.sin((radlat1 - radlat2) / 2)**2) + m.cos(radlat1) * m.cos(radlat2) * (m.sin((radlon1 - radlon2) / 2)**2)
        c = 2 * m.atan2(a**0.5, (1 - a)**0.5)
        r = 6371  # earth radius in km
        d = r * c  # distance between points in km
        if d <= defclose:
            return (True, d)
        else:
            return (False, d)

And for time:

    import datetime as dt

    def getdatetime(series, timelabel=2, datelabel=1, timeformat="%X.%f", dateformat="%Y%m%d"):
        # series is an (index, row) tuple; the time string is cut to 15 characters
        # because %f accepts at most 6 fractional digits
        time = dt.datetime.strptime(series[1][timelabel][:15], timeformat)
        date = dt.datetime.strptime(str(series[1][datelabel]), dateformat)
        datetime = dt.datetime.combine(date.date(), time.time())
        return datetime

    def are_strikes_time_close(strike1, strike2, defclose=dt.timedelta(0, 7200, 0)):
        # the default window is 7200 seconds, i.e. two hours
        dt1 = getdatetime(strike1)
        dt2 = getdatetime(strike2)
        timediff = abs(dt1 - dt2)
        if timediff <= defclose:
            return (True, timediff)
        else:
            return (False, timediff)

The real problem is how to efficiently compare all events to all other events, to determine how many of them are space_close and time_close.

Note that not all events need to be checked: as they are ordered with respect to datetime, if there were a way to check events 'middle out' and then stop when events are no longer close in time, that would save a lot of operations, but I don't know how to do this.

At the moment, my (nonfunctional) attempt looks like this:

    def extrisolfixes(data, filtereddata, defisol=4):
        for strike1 in data.iterrows():
            near_strikes = -1  # -1 to account for self counting once on each loop
            for strike2 in data.iterrows():
                if are_strikes_space_close(strike1, strike2)[0] == True and are_strikes_time_close(strike1, strike2)[0] == True:
                    near_strikes += 1
            if near_strikes <= defisol:
                filtereddata = filtereddata.append(strike1)

Thanks for any help! I am happy to provide clarification if needed.

This answer might not be very efficient. I'm facing a very similar problem and am currently looking for something more efficient than what I do, because it still takes one hour to compute on my DataFrame (600k rows).

First, I suggest you don't even think about using for loops like you do. You might not be able to avoid one of them (which I do using apply), but the second one can (and must) be vectorized.

The idea of this technique is to create a new column in the DataFrame that stores whether there is another strike nearby (temporally and spatially).

First, let's create a function that calculates (with the numpy package) the distances between one strike (the reference) and all the others:

    import numpy as np

    def get_distance(reference, other_strikes):
        # reference is [lat, lon] in degrees; other_strikes has 'lat' and 'lon' columns
        radius = 6371.00085  # radius of the earth in km
        # Lats and longs in radians, then compute the deltas:
        lat1 = np.radians(other_strikes.lat)
        lat2 = np.radians(reference[0])
        dlat = lat2 - lat1
        dlon = np.radians(reference[1]) - np.radians(other_strikes.lon)
        # And compute the distance (in km) with the haversine formula
        a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
        return 2 * np.arcsin(np.minimum(1, np.sqrt(a))) * radius

Then create a function that checks whether, for one given strike, there is at least one other nearby:

    def is_there_a_strike_nearby(date_ref, lat_ref, long_ref, delta_t, delta_d, other_strikes):
        # other_strikes must be sorted by its datetime column 'date';
        # delta_t is a whole number of days, delta_d a distance in km
        dmin = date_ref - np.timedelta64(delta_t, 'D')
        dmax = date_ref + np.timedelta64(delta_t, 'D')

        # Let's first find all the strikes within the temporal range
        ind = other_strikes.date.searchsorted([dmin, dmax])
        nearby_strikes = other_strikes.iloc[ind[0]:ind[1]].copy()

        if len(nearby_strikes) == 0:
            return False

        # Let's compute the spatial distance now:
        nearby_strikes['distance'] = get_distance([lat_ref, long_ref], nearby_strikes[['lat', 'lon']])

        nearby_strikes = nearby_strikes[nearby_strikes['distance'] <= delta_d]

        # When other_strikes contains the reference strike, it matches itself
        # (distance 0), so more than one match is needed for a real neighbour.
        return len(nearby_strikes) > 1
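The function above relies on data having a date column that holds real datetimes sorted in ascending order, whereas the question stores the date and the time in two separate columns. Here is a minimal sketch of one way to build such a column, assuming the column names date and time from the question and cutting the fractional seconds to six digits as the question's getdatetime does. The two sample rows are copied from the question and deliberately put out of order:

```python
import pandas as pd

# Two rows in the question's layout, deliberately out of time order.
data = pd.DataFrame({
    "date": [20160101, 20160101],
    "time": ["00:00:01.0646207", "00:00:00.9962692"],
    "lat": [-7.0518, -7.1961],
    "lon": [-60.6911, -60.7604],
})

# Join date and time into one string, keeping 6 fractional digits.
stamp = data["date"].astype(str) + " " + data["time"].str[:15]
data["date"] = pd.to_datetime(stamp, format="%Y%m%d %H:%M:%S.%f")

# searchsorted needs ascending order; resetting the index keeps
# positions and labels identical after the sort.
data = data.drop(columns="time").sort_values("date").reset_index(drop=True)
```

Sorting once up front is what makes the searchsorted lookup valid for every row afterwards.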

Now that all your functions are ready, you can use apply on your DataFrame:

    # delta_t (days) and delta_d (km) must be defined beforehand
    data['presence of nearby strike'] = data[['date', 'lat', 'lon']].apply(
        lambda x: is_there_a_strike_nearby(x['date'], x['lat'], x['lon'], delta_t, delta_d, data),
        axis=1)

And that's it: you have now created a new column in your DataFrame that indicates whether your strike is isolated (False) or not (True), so creating your new DataFrame from this is easy.
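As a minimal sketch of that last step, assuming the boolean column is named 'presence of nearby strike' as above (the flags in this small frame are made up for illustration), the isolated strikes can be pulled out with a boolean mask:

```python
import pandas as pd

data = pd.DataFrame({
    "lat": [-7.1961, -25.3913, -7.0518],
    "lon": [-60.7604, -57.2922, -60.6911],
    "presence of nearby strike": [True, False, True],  # made-up flags
})

# Keep the rows whose flag is False, i.e. the isolated strikes.
isolated_fixes = data[~data["presence of nearby strike"]].copy()
```

The .copy() gives an independent DataFrame, so later changes to isolated_fixes do not trigger pandas' chained-assignment warnings.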

The problem with this method is that it still takes a long time to run. There are ways to make it faster, for instance by changing is_there_a_strike_nearby to take as additional arguments copies of data sorted by lat and long, and using further searchsorted calls to filter on lat and long before computing the distance (for instance, if you want the strikes within a range of 10 km, you can filter with a delta_lat of 0.09).
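The latitude pre-filter could look roughly like the sketch below. It uses a plain boolean mask rather than searchsorted on a lat-sorted copy, which is simpler but scans every candidate row. The helper name lat_prefilter and the constant KM_PER_DEG_LAT are hypothetical, introduced only for this illustration:

```python
import numpy as np
import pandas as pd

# One degree of latitude on a sphere of radius 6371 km (about 111.2 km).
KM_PER_DEG_LAT = 6371.0 * np.pi / 180.0

def lat_prefilter(lat_ref, delta_d, strikes):
    """Return the rows of strikes whose latitude lies within delta_d km of lat_ref."""
    delta_lat = delta_d / KM_PER_DEG_LAT  # km converted to degrees of latitude
    mask = (strikes["lat"] - lat_ref).abs() <= delta_lat
    return strikes[mask]

strikes = pd.DataFrame({
    "lat": [-7.1961, -7.0518, -25.3913],
    "lon": [-60.7604, -60.6911, -57.2922],
})

# Only these candidates need the more expensive haversine distance.
candidates = lat_prefilter(-7.1961, 100, strikes)
```

For a 10 km range this gives a delta_lat of about 0.09 degrees, matching the figure above. A longitude band would need widening by a factor of 1/cos(latitude), because degrees of longitude shrink away from the equator.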

Any feedback on this method is more than welcome!

