Python speeding up the search for a value in a dictionary of ranges -


i have file column of values use compare dictionary contains 2 values form range.

for instance: file a:

chr1   200    .... chr3   300     

file b:

chr1    200    300    ... chr2    300    350    ... 

for created dictionary of values file b:

for line in fileb:         lineb = line.strip('\n').split('\t')         ranges[chr].append(lineb) 

for comparison:

for line in methylationfile:         line = line.strip("\n")         info = line.split("\t")         chr = info[0]         location = int(info[1])         annotation = ""         i, r in enumerate(ranges[chr]):             n = + 1             while (n < len(ranges[chr])):                     if (int(ranges[chr][i][1]) <= location <= int(ranges[chr][i][2])):                         annotation = '\t'.join(ranges[chr][i][4:])                     n +=1             outfile.write(line + '\t' + annotation  + '\n') 

if leave while loop program not seem run (or running slow results) since have on 7,000 values in each dictionary. if change while loop if loop program runs @ incredibly slow pace.

i'm looking way make program faster , more efficient

dictionaries great when want key exact match. in particular, hash of lookup key has same hash of stored key.

if ranges consistent, fake writing hash function returns same value range, , every value within range. if they're not, hash function have keep track of of known ranges, takes same problem you're starting with.

in case, right data structure here kind of sorted collection. if need build collection, , use many times without ever modifying it, sorting list , using bisect module you. if need modify collection after creation, you'll want built around binary tree or b-tree variant of kind, blist or bintrees.

this reduce time find range n/2 log2(n). so, if you've got 10000 ranges, instead of 5000 comparisons, you'll 14.

while we're @ it, convert range start , stop values ints once, instead of doing each time. also, if want use stdlib bisect, unfortunately can't pass key functions, let's reorganize ranges comparable order too. so:

for line in fileb:     lineb = line.strip('\n').split('\t')     ranges[chr].append(int(lineb[1]), int(lineb[2]), [lineb[0])  r in ranges:     r.sort() 

now, instead of loop:

for i, r in enumerate(ranges[chr]):     # ... 

do this:

i = bisect.bisect(ranges[chr], (location, location, none)) if i:     r = ranges[chr][i-1]     if r[0] <= location < r[1]:         # whatever wanted r     else:         # there no range includes location else:     # location before ranges 

you have careful thinking bisect, , it's possible i've got wrong on first attempt, so… read docs on does, , experiment data (printing out results of bisect function), before trusting this.


if ranges can overlap, , want able find ranges contain value rather one, you'll need bit more keep things efficient. there's no way fully-order overlapping ranges, bisect won't cut it.

if you're expecting more log n matches per average lookup, can 2 sorted lists , bisect.

but otherwise, need more complex data structure, , more complex code. example, if can spare n^2 space, can keep time @ log n having, each range in first list, second list, sorted end, of values matching start.

and @ point, think it's getting complex enough want library you.


however, might want consider different solution.

if use numpy or database instead of pure python, can't cut algorithmic complexity n log n… but can cut constant overhead factor of 10 or so, may enough. in fact, if you're doing tons of searches on medium-small list, may better.

plus, looks lot simpler, , once used array operations or sql, may more readable. so:

rangearrays = [np.array(a[:2] in value) value in ranges] 

… or, if ranges dict mapping strings values, instead of list:

rangearrays = {key: np.array(a[:2] in value) key, value in ranges.items()} 

then, instead of this:

for i, r in enumerate(ranges[chr]):     # ... 

do:

comparisons = location < rangearrays[chr] matches = comparisons[:,0] < comparisons[:,1] indices = matches.nonzero()[0] index in indices:     r = ranges[indices[0]]     # stuff r 

(you can of course make things more concise, it's worth doing way , printing out of intermediate steps see why works.)

or, using database:

cur = db.execute('''select start, stop, chr ranges                      start <= ? , stop > ?''', (location, location)) (start, stop, chr) in cur:     # stuff 

Comments

Popular posts from this blog

php - cannot display multiple markers in google maps v3 from traceroute result -

c# - DetailsView in ASP.Net - How to add another column on the side/add a control in each row? -

javascript - firefox memory leak -