python - How to efficiently compare rows in a pandas DataFrame?
I have a pandas DataFrame containing records of lightning strikes with timestamps and global positions, in the following format:
```
   index      date              time      lat      lon  fix?
0      1  20160101  00:00:00.9962692  -7.1961 -60.7604     1
1      2  20160101  00:00:01.0646207  -7.0518 -60.6911     1
2      3  20160101  00:00:01.1102066 -25.3913 -57.2922     1
3      4  20160101  00:00:01.2018573  -7.4842 -60.5129     1
4      5  20160101  00:00:01.2942750  -7.3939 -60.4992     1
5      6  20160101  00:00:01.4431493  -9.6386 -62.8448     1
6      8  20160101  00:00:01.5226157 -23.7089 -58.8888     1
7      9  20160101  00:00:01.5932412  -6.3513 -55.6545     1
8     10  20160101  00:00:01.6736350 -23.8019 -58.9382     1
9     11  20160101  00:00:01.6957858 -24.5724 -57.7229     1
```
The actual DataFrame contains millions of rows. I wish to separate out the events that happened far away in space and time from all other events, and store them in a new DataFrame, isolated_fixes. I have written code to calculate the separation of two events as follows:
```python
import math as m

def are_strikes_space_close(strike1, strike2, defclose=100, latpos=3, lonpos=4):
    # Uses the haversine formula to calculate the distance between two points,
    # returning a tuple of a boolean closeness statement and the numerical distance.
    radlat1 = m.radians(strike1[1][latpos])
    radlon1 = m.radians(strike1[1][lonpos])
    radlat2 = m.radians(strike2[1][latpos])
    radlon2 = m.radians(strike2[1][lonpos])
    a = (m.sin((radlat1 - radlat2) / 2) ** 2) \
        + m.cos(radlat1) * m.cos(radlat2) * (m.sin((radlon1 - radlon2) / 2) ** 2)
    c = 2 * m.atan2(a ** 0.5, (1 - a) ** 0.5)
    r = 6371  # Earth radius in km
    d = r * c  # distance between the points in km
    if d <= defclose:
        return (True, d)
    else:
        return (False, d)
```
and similarly for time:
```python
import datetime as dt

def getdatetime(series, timelabel=2, datelabel=1,
                timeformat="%X.%f", dateformat="%Y%m%d"):
    time = dt.datetime.strptime(series[1][timelabel][:15], timeformat)
    date = dt.datetime.strptime(str(series[1][datelabel]), dateformat)
    return dt.datetime.combine(date.date(), time.time())

def are_strikes_time_close(strike1, strike2, defclose=dt.timedelta(0, 7200, 0)):
    dt1 = getdatetime(strike1)
    dt2 = getdatetime(strike2)
    timediff = abs(dt1 - dt2)
    if timediff <= defclose:
        return (True, timediff)
    else:
        return (False, timediff)
```
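(As an aside, the per-row date+time parsing above can also be done in one vectorized shot with pd.to_datetime. A minimal sketch, using a toy frame with the question's column layout; the truncation to 15 characters mirrors strptime's microsecond limit:)

```python
import pandas as pd

# Hypothetical two-row stand-in for the real millions-of-rows frame.
df = pd.DataFrame({
    "date": [20160101, 20160101],
    "time": ["00:00:00.9962692", "00:00:01.0646207"],
})

# Truncate the time strings to microsecond precision (%f accepts at most
# six fractional digits) and parse date and time together in one call.
df["datetime"] = pd.to_datetime(
    df["date"].astype(str) + " " + df["time"].str[:15],
    format="%Y%m%d %H:%M:%S.%f",
)
```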
The real problem is how to efficiently compare all events to all other events to determine how many of them are space_close and time_close.
Note that not all events need to be checked, as they are ordered with respect to datetime. If there were a way to check events 'middle out' and stop when events are no longer close in time, that would save a lot of operations, but I don't know how to do this.
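(One way to get that 'middle out' behaviour is a binary search on the sorted timestamps: for each row, searchsorted gives the index range of rows within the time window, and only those need a distance check. A minimal sketch, assuming a hypothetical sorted datetime Series named ts:)

```python
import pandas as pd

# Hypothetical sorted timestamps standing in for the real data.
ts = pd.Series(pd.to_datetime([
    "2016-01-01 00:00:00.996", "2016-01-01 00:00:01.064",
    "2016-01-01 00:00:01.110", "2016-01-01 00:05:00.000",
]))

window = pd.Timedelta(seconds=2)

def time_window(i):
    """Return (lo, hi) such that rows lo..hi-1 are the only rows whose
    timestamps lie within `window` of row i; rows outside this slice can
    be skipped entirely."""
    lo = ts.searchsorted(ts[i] - window)
    hi = ts.searchsorted(ts[i] + window, side="right")
    return int(lo), int(hi)

lo, hi = time_window(0)  # only rows lo..hi-1 need a spatial check
```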
At the moment, my (nonfunctional) attempt looks like this:
```python
def extrisolfixes(data, filtereddata, defisol=4):
    for strike1 in data.iterrows():
        near_strikes = -1  # -1 to account for self-counting once on each loop
        for strike2 in data.iterrows():
            if (are_strikes_space_close(strike1, strike2)[0]
                    and are_strikes_time_close(strike1, strike2)[0]):
                near_strikes += 1
        if near_strikes <= defisol:
            filtereddata = filtereddata.append(strike1)
```
Thanks for the help! I'm happy to provide clarification if needed.
This answer might not be efficient. I'm facing a similar problem and am looking for something more efficient, because it still takes 1 hour to compute on my DataFrame (600k rows).
I would first suggest that you stop thinking about using for loops to do this. You might not be able to avoid one of them (by using apply), but the second one can (and must) be vectorized.
The idea of this technique is to create a new column in the DataFrame storing whether there is another strike nearby (temporally and spatially).
First let's create a function that calculates (with the numpy package) the distances between one strike (reference) and all the others:
```python
import numpy as np

def get_distance(reference, other_strikes):
    radius = 6371.00085  # radius of the Earth in km
    # Convert lats and longs to radians and compute the deltas:
    lat1 = np.radians(other_strikes.lat)
    lat2 = np.radians(reference[0])
    dlat = lat2 - lat1
    dlon = np.radians(reference[1]) - np.radians(other_strikes.lon)
    # Then compute the distance (in km) with the haversine formula:
    a = np.sin(dlat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2.0) ** 2
    return 2 * np.arcsin(np.minimum(1, np.sqrt(a))) * radius
```
Then create a function that checks whether, for one given strike, there is at least one other strike nearby:
```python
def is_there_a_strike_nearby(date_ref, lat_ref, long_ref, delta_t, delta_d, other_strikes):
    dmin = date_ref - np.timedelta64(delta_t, 'D')
    dmax = date_ref + np.timedelta64(delta_t, 'D')
    # First find the strikes within the temporal range:
    ind = other_strikes.date.searchsorted([dmin, dmax])
    nearby_strikes = other_strikes.loc[ind[0]:ind[1] - 1].copy()
    if len(nearby_strikes) == 0:
        return False
    # Now compute the spatial distance:
    nearby_strikes['distance'] = get_distance([lat_ref, long_ref],
                                              nearby_strikes[['lat', 'lon']])
    nearby_strikes = nearby_strikes[nearby_strikes['distance'] <= delta_d]
    return len(nearby_strikes) > 0
```
Now that both functions are ready, you can use apply on your DataFrame:

```python
data['presence of nearby strike'] = data[['date', 'lat', 'lon']].apply(
    lambda x: is_there_a_strike_nearby(x['date'], x['lat'], x['lon'],
                                       delta_t, delta_d, data),
    axis=1)
```
And that's it: you have created a new column in your DataFrame that indicates whether a strike is isolated (False) or not (True), and creating your new DataFrame from it is easy.
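(That last step is a single boolean mask. A minimal sketch, with a hypothetical toy frame standing in for the real data:)

```python
import pandas as pd

# Toy stand-in: the 'presence of nearby strike' column is what the
# apply step above would have produced.
data = pd.DataFrame({
    "lat": [-7.19, -25.39, -7.05],
    "presence of nearby strike": [True, False, True],
})

# Isolated strikes are the rows where no nearby strike was found.
isolated_fixes = data[~data["presence of nearby strike"]].copy()
```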
The problem with this method is that it still takes a long time to run. There are ways to make it faster; for instance, you could change is_there_a_strike_nearby to take as extra arguments copies of data sorted by lat and lon, and use other searchsorted calls to filter on lat and lon before computing the distance (for instance, if you want strikes within a range of 10 km, you can filter with a delta_lat of 0.09).
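(That latitude pre-filter can be sketched as follows: since one degree of latitude is roughly 111 km, a binary search on a lat-sorted copy of the frame narrows the candidates before any haversine work. The frame and helper names here are hypothetical:)

```python
import pandas as pd

# Hypothetical strikes, pre-sorted by latitude.
data_by_lat = pd.DataFrame({
    "lat": [-25.39, -24.57, -9.64, -7.48, -7.20],
    "lon": [-57.29, -57.72, -62.84, -60.51, -60.76],
})

def lat_candidates(lat_ref, delta_d_km):
    """Return only the rows whose latitude could possibly be within
    delta_d_km of lat_ref (one degree of latitude is ~111 km, hence
    10 km is roughly 0.09 degrees)."""
    delta_lat = delta_d_km / 111.0
    lo = data_by_lat["lat"].searchsorted(lat_ref - delta_lat)
    hi = data_by_lat["lat"].searchsorted(lat_ref + delta_lat, side="right")
    return data_by_lat.iloc[int(lo):int(hi)]

cand = lat_candidates(-7.3, 20)  # only these rows need a distance check
```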
Any feedback on this method is more than welcome!