How to efficiently compare rows in a pandas DataFrame?
I have a pandas DataFrame containing records of lightning strikes with timestamps and global positions, in the following format:
       index      date               time       lat      lon  fix?
    0      1  20160101  00:00:00.9962692   -7.1961 -60.7604     1
    1      2  20160101  00:00:01.0646207   -7.0518 -60.6911     1
    2      3  20160101  00:00:01.1102066  -25.3913 -57.2922     1
    3      4  20160101  00:00:01.2018573   -7.4842 -60.5129     1
    4      5  20160101  00:00:01.2942750   -7.3939 -60.4992     1
    5      6  20160101  00:00:01.4431493   -9.6386 -62.8448     1
    6      8  20160101  00:00:01.5226157  -23.7089 -58.8888     1
    7      9  20160101  00:00:01.5932412   -6.3513 -55.6545     1
    8     10  20160101  00:00:01.6736350  -23.8019 -58.9382     1
    9     11  20160101  00:00:01.6957858  -24.5724 -57.7229     1

The actual DataFrame contains millions of rows. I wish to separate out events that happened far away in space and time from other events, and store them in a new DataFrame called isolated_fixes. I have written code to calculate the separation of two events as follows:
    import math as m

    def are_strikes_space_close(strike1, strike2, defclose=100, latpos=3, lonpos=4):
        # Uses the haversine formula to calculate the distance between two points,
        # returning a tuple of (boolean closeness statement, numerical distance).
        radlat1 = m.radians(strike1[1][latpos])
        radlon1 = m.radians(strike1[1][lonpos])
        radlat2 = m.radians(strike2[1][latpos])
        radlon2 = m.radians(strike2[1][lonpos])
        a = (m.sin((radlat1 - radlat2) / 2)**2) + m.cos(radlat1) * m.cos(radlat2) * (m.sin((radlon1 - radlon2) / 2)**2)
        c = 2 * m.atan2((a**0.5), ((1 - a)**0.5))
        r = 6371  # Earth radius in km
        d = r * c  # distance between the points in km
        if d <= defclose:
            return (True, d)
        else:
            return (False, d)

and for time:
    import datetime as dt

    def getdatetime(series, timelabel=2, datelabel=1, timeformat="%X.%f", dateformat="%Y%m%d"):
        time = dt.datetime.strptime(series[1][timelabel][:15], timeformat)
        date = dt.datetime.strptime(str(series[1][datelabel]), dateformat)
        datetime = dt.datetime.combine(date.date(), time.time())
        return datetime

    def are_strikes_time_close(strike1, strike2, defclose=dt.timedelta(0, 7200, 0)):
        dt1 = getdatetime(strike1)
        dt2 = getdatetime(strike2)
        timediff = abs(dt1 - dt2)
        if timediff <= defclose:
            return (True, timediff)
        else:
            return (False, timediff)

The real problem is how to efficiently compare all events to all other events to determine how many of them are space_close and time_close.
Note that not all events need to be checked against all others, as they are ordered with respect to datetime. If there were a way to check events 'middle out' and stop when events are no longer close in time, that would save a lot of operations, but I don't know how to do this.
At the moment, my (nonfunctional) attempt looks like this:
    def extrisolfixes(data, filtereddata, defisol=4):
        for strike1 in data.iterrows():
            near_strikes = -1  # -1 to account for counting itself once on each loop
            for strike2 in data.iterrows():
                if are_strikes_space_close(strike1, strike2)[0] == True and are_strikes_time_close(strike1, strike2)[0] == True:
                    near_strikes += 1
            if near_strikes <= defisol:
                filtereddata = filtereddata.append(strike1)

Thanks for the help! I'm happy to provide clarification if needed.
This answer might not be the most efficient. I'm facing a similar problem and am looking for something more efficient, because this still takes about 1 hour to compute on my DataFrame (600k rows).
I would first suggest that you don't think about doing this with for loops: you might not be able to avoid one of them (which is what apply uses), but the second one can (and must) be vectorized.
The idea of this technique is to create a new column in the DataFrame storing whether there is another strike nearby (temporally and spatially).
First, let's create a function that calculates (with the numpy package) the distances between one strike (the reference) and all the others:
    import numpy as np

    def get_distance(reference, other_strikes):
        radius = 6371.00085  # radius of the Earth in km
        # Lats and lons in radians, then compute the deltas:
        lat1 = np.radians(other_strikes.lat)
        lat2 = np.radians(reference[0])
        dlat = lat2 - lat1
        dlon = np.radians(reference[1]) - np.radians(other_strikes.lon)
        # And compute the distance (in km)
        a = np.sin(dlat / 2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0)**2
        return 2 * np.arcsin(np.minimum(1, np.sqrt(a))) * radius
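On its own, this function can be called like this (a minimal usage sketch; the reference coordinates below are made-up values):

    # Vectorized distances (in km) from one reference point to every strike
    distances = get_distance([-7.2, -60.8], data[['lat', 'lon']])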
Then let's create a function to check whether, for one given strike, there is at least one other strike nearby:

    def is_there_a_strike_nearby(date_ref, lat_ref, long_ref, delta_t, delta_d, other_strikes):
        # Assumes other_strikes is sorted by its 'date' column (datetime dtype)
        dmin = date_ref - np.timedelta64(delta_t, 'D')
        dmax = date_ref + np.timedelta64(delta_t, 'D')
        # Let's first find the strikes within the temporal range
        ind = other_strikes.date.searchsorted([dmin, dmax])
        nearby_strikes = other_strikes.loc[ind[0]:ind[1] - 1].copy()
        if len(nearby_strikes) == 0:
            return False
        # Let's compute the spatial distance now:
        nearby_strikes['distance'] = get_distance([lat_ref, long_ref], nearby_strikes[['lat', 'lon']])
        nearby_strikes = nearby_strikes[nearby_strikes['distance'] <= delta_d]
        # > 1 because the reference strike itself is always among the candidates
        return len(nearby_strikes) > 1

Now that the functions are ready, we can use apply on the DataFrame:
    data['presence of nearby strike'] = data[['date', 'lat', 'lon']].apply(
        lambda x: is_there_a_strike_nearby(x['date'], x['lat'], x['lon'],
                                           delta_t, delta_d, data),
        axis=1)

And that's it: you have created a new column in your DataFrame that indicates whether each strike is isolated (False) or not (True), and creating your new DataFrame from it is easy.
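For instance (a minimal sketch, reusing the isolated_fixes name from the question):

    # Keep only the strikes that have no other strike nearby in space and time
    isolated_fixes = data[~data['presence of nearby strike']].copy()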
The problem with this method is that it still takes a long time to run. There are ways to make it faster, for instance changing is_there_a_strike_nearby to take as extra arguments the data sorted by lat and lon, and using other searchsorted calls to filter on lat and lon before computing the distance (for instance, if you want strikes within a range of 10 km, you can filter with a delta_lat of 0.09); a sketch of that idea follows.
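Something along these lines could work as the latitude pre-filter (a minimal sketch, not part of the original answer; filter_by_lat is a hypothetical helper, and 0.09 degrees corresponds to roughly 10 km of latitude):

    # Hypothetical pre-filter: restrict the candidates to a latitude band
    # before computing haversine distances.
    def filter_by_lat(lat_ref, delta_lat, strikes_sorted_by_lat):
        # strikes_sorted_by_lat must be sorted by its 'lat' column
        lo, hi = strikes_sorted_by_lat.lat.searchsorted([lat_ref - delta_lat,
                                                         lat_ref + delta_lat])
        return strikes_sorted_by_lat.iloc[lo:hi]

    candidates = filter_by_lat(-7.2, 0.09, data.sort_values('lat'))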
Any feedback on this method is more than welcome!