java - Efficiently merging and re-sorting sorted lists -


This isn't the classic "merging 2 sorted lists" question, which is fairly trivial to do in linear time.

What I'm trying to do is merge two lists of (key, value) pairs, already sorted by value, where there are objects with the same key in both lists: such objects should have their values merged (added), which may change their sort order. I'm primarily interested in how the sort can be performed efficiently using information from the already sorted lists, since the sort is the slowest part of this algorithm.

Let's take a concrete example. Imagine a List of Student objects:

class Student {
  final String name;
  final int score;
  ...
}

Given as input two List<Student> sorted by score, I'd like to create a new merged list of students, where any student (identified by Student.name) appearing in both lists appears once in the final list, with a score equal to the sum of their scores in both lists. The original lists should be left unmodified.

E.g.,

List 1: {"bob", 20} {"john", 15} {"mark", 14}
List 2: {"bill", 11} {"mark", 9} {"john", 1}
Result: {"mark", 23} {"bob", 20} {"john", 16} {"bill", 11}

The merging itself (identifying students that appear in both lists) can be done in expected O(1) time per student using any O(1) lookup/insert structure such as a HashMap. What I'm most interested in is the sort step (although I don't exclude solutions that do the merging and sorting at the same time).
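For reference, here is a minimal sketch of the naive baseline that any cleverer approach would have to beat: sum the scores by name in a hash map, then sort the result from scratch. The class name NaiveMerge and the method mergeAndSort are hypothetical names chosen for illustration, and the sketch assumes that a name appears at most once within each input list.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class NaiveMerge {
    static final class Student {
        final String name;
        final int score;
        Student(String name, int score) { this.name = name; this.score = score; }
    }

    // Sums scores by name, then sorts by descending score.
    // The input lists are only read, never modified.
    static List<Student> mergeAndSort(List<Student> l1, List<Student> l2) {
        Map<String, Integer> totals = new LinkedHashMap<>();
        for (Student s : l1) totals.merge(s.name, s.score, Integer::sum);
        for (Student s : l2) totals.merge(s.name, s.score, Integer::sum);

        List<Student> result = new ArrayList<>(totals.size());
        for (Map.Entry<String, Integer> e : totals.entrySet()) {
            result.add(new Student(e.getKey(), e.getValue()));
        }
        // Full re-sort: this is the step the question hopes to speed up.
        result.sort(Comparator.comparingInt((Student s) -> s.score).reversed());
        return result;
    }
}
```

The hash-map pass is linear in the combined list size; the final sort ignores the fact that both inputs were already ordered.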

The question, though, is how to efficiently re-sort such a list. The ordering of the existing lists clearly puts some constraints on the final position of elements in the merged list. For example, if a student is at position i in the first list and j in the second, they must appear among the first i + j students in the merged list, by a simple argument analyzing the maximum number of students that could have a higher score. It's not immediately clear whether this information would be useful in sorting the list, however.

You can assume that in many cases students who score highly in one list score highly in the other. The algorithm should still work when that is not the case, but it gives you some additional information about the distribution that may be useful, in addition to the fact that the lists are already sorted.

It seems like this type of operation would be common for any type of distributed query + sorting implementation. For example, imagine a "select state, count(*) group by state" type of query issued against a distributed system (to count the number of records in each state) - naturally you'd get a sorted list of (state, count) objects back from each node, and then you'd want to merge and re-sort those during the reduce operation. It seems silly to throw away all the work already done on the distributed nodes.

Quantitative notes

I'm interested in the case where the lists to be merged and re-sorted are small: usually around 256 entries. The range of scores varies, from 0 to 100 in some cases, up to 0 - 10,000,000 in others. Of course, given the small number of elements, each operation will be fast in absolute time, even with naive algorithms - but performed billions of times, it adds up.

In fact, one of the answers below has proven that you can't, in general, do better than a plain sort for increasing list sizes (i.e., taking n to be the combined list size) - but I'm actually more interested in doing this many times, for fixed-size lists, with good empirical performance.

It sounds like you need to use an adaptive sort algorithm.

"A sorting algorithm falls into the adaptive sort family if it takes advantage of existing order in its input. It benefits from the presortedness in the input sequence – or a limited amount of disorder for various definitions of measures of disorder – and sorts faster. Adaptive sorting is usually performed by modifying existing sorting algorithms." - from the Wikipedia article linked above.

Examples include insertion sort and Timsort; see the article above for more. Note that in Java 8, the Arrays.sort(Object[]) library method uses a modified Timsort.
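To make "adaptive" concrete, here is a small sketch of insertion sort on plain int scores, sorted in descending order to match the example. The class name InsertionSortDemo and the returned shift count are illustrative additions, not something from the quoted article. The amount of work is proportional to the list length plus the number of out-of-order pairs, so a nearly sorted input costs close to a single pass.

```java
class InsertionSortDemo {
    // Sorts the array in place in descending order and returns the number
    // of element shifts performed, which equals the number of inversions
    // (out-of-order pairs) in the input.
    static long sortDescending(int[] a) {
        long shifts = 0;
        for (int i = 1; i < a.length; i++) {
            int x = a[i];
            int j = i - 1;
            // Move smaller elements one slot to the right to make room for x.
            while (j >= 0 && a[j] < x) {
                a[j + 1] = a[j];
                j--;
                shifts++;
            }
            a[j + 1] = x;
        }
        return shifts;
    }
}
```

On an already sorted array the inner loop never runs; whether this beats the library sort at around 256 elements is something only a benchmark on real data can tell.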


I am not aware of any published algorithm that deals with the specific requirements of your example, but here is an idea:

  1. Perform a classic merge on the two input lists L1 and L2:

    • When you merge a pair of objects and it changes the keys used to determine the ordering, put the merged object into temporary list A.
    • Otherwise put the objects into temporary list B ... which will remain ordered.
  2. Sort the temporary list A.

  3. Merge lists A and B.

Assuming that:

  • the lengths of the original lists L1 & L2 are M & N respectively, and
  • the number of merged objects whose keys changed is R (which is less than max(M, N)),

then the overall complexity is O(M + N + RlogR). If R is small relative to M + N, this should be an improvement.
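The steps above could be sketched roughly as follows. This is one possible reading, not a definitive implementation: pairs are found with a hash lookup (an assumption, since matching names are not adjacent in score order), every matched student goes into list A, and list B is built by a classic two-way merge of the unmatched students. The names SplitMerge, merge and mergeSorted are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SplitMerge {
    static final class Student {
        final String name;
        final int score;
        Student(String name, int score) { this.name = name; this.score = score; }
    }

    static List<Student> merge(List<Student> l1, List<Student> l2) {
        Map<String, Integer> scores1 = new HashMap<>();
        for (Student s : l1) scores1.put(s.name, s.score);
        Map<String, Integer> scores2 = new HashMap<>();
        for (Student s : l2) scores2.put(s.name, s.score);

        List<Student> a = new ArrayList<>();      // list A: summed scores, order unknown
        List<Student> only1 = new ArrayList<>();  // unmatched, still sorted
        List<Student> only2 = new ArrayList<>();  // unmatched, still sorted
        for (Student s : l1) {
            Integer other = scores2.get(s.name);
            if (other == null) only1.add(s);
            else a.add(new Student(s.name, s.score + other));
        }
        for (Student s : l2) {
            if (!scores1.containsKey(s.name)) only2.add(s);
        }

        List<Student> b = mergeSorted(only1, only2);  // step 1: list B stays ordered
        a.sort(Comparator.comparingInt((Student s) -> s.score).reversed());  // step 2: O(R log R)
        return mergeSorted(a, b);                     // step 3
    }

    // Classic linear merge of two lists sorted by descending score.
    static List<Student> mergeSorted(List<Student> x, List<Student> y) {
        List<Student> out = new ArrayList<>(x.size() + y.size());
        int i = 0, j = 0;
        while (i < x.size() && j < y.size()) {
            out.add(x.get(i).score >= y.get(j).score ? x.get(i++) : y.get(j++));
        }
        while (i < x.size()) out.add(x.get(i++));
        while (j < y.size()) out.add(y.get(j++));
        return out;
    }
}
```

Only list A pays the R log R sorting cost; everything else is linear passes. When most students appear in both lists, A holds nearly everything and the sketch degenerates into a plain sort with extra bookkeeping.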


In your example, every case where there is a match between elements in the input lists is likely to move the element in the order. If it moves the element, it will move it later in the order (and never earlier). So another idea is to do a three-way merge between the original two lists and a priority queue. When you get a match, you merge the counts and add the result to the priority queue.

The complexity will be similar to the previous one, but you avoid the extra pass to merge the lists. Also, the RlogR becomes RlogA, where A is the average size of the priority queue.
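A rough sketch of that three-way merge is below, under two stated assumptions. First, scores are non-negative, so a summed score is never smaller than either of its parts. Second, because the example lists are sorted by descending score, the sketch walks both lists from their tails (lowest scores first) so that a merged student really does move "later" in the processing order; the output is reversed at the end. The names ThreeWayMerge and merge are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

class ThreeWayMerge {
    static final class Student {
        final String name;
        final int score;
        Student(String name, int score) { this.name = name; this.score = score; }
    }

    static List<Student> merge(List<Student> l1, List<Student> l2) {
        Map<String, Integer> scores1 = new HashMap<>();
        for (Student s : l1) scores1.put(s.name, s.score);
        Map<String, Integer> scores2 = new HashMap<>();
        for (Student s : l2) scores2.put(s.name, s.score);

        // Min-heap of merged students waiting for their turn.
        PriorityQueue<Student> pq =
                new PriorityQueue<>(Comparator.comparingInt((Student s) -> s.score));
        List<Student> out = new ArrayList<>(l1.size() + l2.size());
        int i = l1.size() - 1, j = l2.size() - 1;  // tails hold the lowest scores

        while (i >= 0 || j >= 0 || !pq.isEmpty()) {
            int h1 = i >= 0 ? l1.get(i).score : Integer.MAX_VALUE;
            int h2 = j >= 0 ? l2.get(j).score : Integer.MAX_VALUE;
            if (!pq.isEmpty() && pq.peek().score <= h1 && pq.peek().score <= h2) {
                out.add(pq.poll());                // queued total is the smallest remaining
            } else if (i >= 0 && (j < 0 || h1 <= h2)) {
                Student s = l1.get(i--);
                Integer other = scores2.get(s.name);
                if (other == null) out.add(s);     // unmatched: emit directly
                else pq.add(new Student(s.name, s.score + other));  // matched: queue the sum
            } else {
                Student s = l2.get(j--);
                // Matched students are queued when met in l1, so skip them here.
                if (!scores1.containsKey(s.name)) out.add(s);
            }
        }
        Collections.reverse(out);                  // back to descending order
        return out;
    }
}
```

The queue only ever holds merged students that have been met in the first list but whose total is still larger than the current list heads, so its size A can be much smaller than R when high scorers in one list also score highly in the other.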


Keep in mind that I'm especially interested in the case where R is approximately equal to max(M,N), and also M == N.

(You didn't state that in your question! And, in fact, it doesn't make any sense for R to be > min(M,N)!)

In that case, maybe just use the priority queue as an incremental sorter. Throw all merged records and all records that cannot be merged into the queue, and pull out records if / when they have a key / score that is less than the current heads of the two lists. Assuming that M and N are the list lengths, and A is the average priority queue size, the complexity is O(max(M,N) * log A). Whether this is an improvement on a simple re-sort will depend on whether the average A is significantly (in big O terms) less than max(M,N). It will depend on the inputs ... and the merging function.


The number (N) varies, but 256 to 1,000 is typical. Perhaps as many as 10,000.

For lists of that typical size, you are down at a level where the complexity analysis is not going to be helpful. But also, you are down at a level where optimization becomes pointless ... unless you are doing the operation many, many times, or on a tight "time budget".


This is all very approximate, and my maths is "sketchy" at best.

A proper investigation would entail hundreds of hours to research, code, test, benchmark, and analyze the various alternatives ... and we'd probably still get the answer that it depends on the input data set size and distribution.

