Python speeding up the search for a value in a dictionary of ranges -
i have file column of values use compare dictionary contains 2 values form range.
for instance: file a:
chr1 200 .... chr3 300
file b:
chr1 200 300 ... chr2 300 350 ...
for created dictionary of values file b:
for line in fileb: lineb = line.strip('\n').split('\t') ranges[chr].append(lineb)
for comparison:
for line in methylationfile: line = line.strip("\n") info = line.split("\t") chr = info[0] location = int(info[1]) annotation = "" i, r in enumerate(ranges[chr]): n = + 1 while (n < len(ranges[chr])): if (int(ranges[chr][i][1]) <= location <= int(ranges[chr][i][2])): annotation = '\t'.join(ranges[chr][i][4:]) n +=1 outfile.write(line + '\t' + annotation + '\n')
if leave while loop program not seem run (or running slow results) since have on 7,000 values in each dictionary. if change while loop if loop program runs @ incredibly slow pace.
i'm looking way make program faster , more efficient
dictionaries great when want key exact match. in particular, hash of lookup key has same hash of stored key.
if ranges consistent, fake writing hash function returns same value range, , every value within range. if they're not, hash function have keep track of of known ranges, takes same problem you're starting with.
in case, right data structure here kind of sorted collection. if need build collection, , use many times without ever modifying it, sort
ing list , using bisect
module you. if need modify collection after creation, you'll want built around binary tree or b-tree variant of kind, blist
or bintrees
.
this reduce time find range n/2 log2(n). so, if you've got 10000 ranges, instead of 5000 comparisons, you'll 14.
while we're @ it, convert range start , stop values ints once, instead of doing each time. also, if want use stdlib bisect
, unfortunately can't pass key
functions, let's reorganize ranges comparable order too. so:
for line in fileb: lineb = line.strip('\n').split('\t') ranges[chr].append(int(lineb[1]), int(lineb[2]), [lineb[0]) r in ranges: r.sort()
now, instead of loop:
for i, r in enumerate(ranges[chr]): # ...
do this:
i = bisect.bisect(ranges[chr], (location, location, none)) if i: r = ranges[chr][i-1] if r[0] <= location < r[1]: # whatever wanted r else: # there no range includes location else: # location before ranges
you have careful thinking bisect
, , it's possible i've got wrong on first attempt, so… read docs on does, , experiment data (printing out results of bisect
function), before trusting this.
if ranges can overlap, , want able find ranges contain value rather one, you'll need bit more keep things efficient. there's no way fully-order overlapping ranges, bisect
won't cut it.
if you're expecting more log n matches per average lookup, can 2 sorted lists , bisect
.
but otherwise, need more complex data structure, , more complex code. example, if can spare n^2 space, can keep time @ log n having, each range in first list, second list, sorted end, of values matching start.
and @ point, think it's getting complex enough want library you.
however, might want consider different solution.
if use numpy
or database instead of pure python, can't cut algorithmic complexity n log n… but can cut constant overhead factor of 10 or so, may enough. in fact, if you're doing tons of searches on medium-small list, may better.
plus, looks lot simpler, , once used array operations or sql, may more readable. so:
rangearrays = [np.array(a[:2] in value) value in ranges]
… or, if ranges
dict mapping strings values, instead of list:
rangearrays = {key: np.array(a[:2] in value) key, value in ranges.items()}
then, instead of this:
for i, r in enumerate(ranges[chr]): # ...
do:
comparisons = location < rangearrays[chr] matches = comparisons[:,0] < comparisons[:,1] indices = matches.nonzero()[0] index in indices: r = ranges[indices[0]] # stuff r
(you can of course make things more concise, it's worth doing way , printing out of intermediate steps see why works.)
or, using database:
cur = db.execute('''select start, stop, chr ranges start <= ? , stop > ?''', (location, location)) (start, stop, chr) in cur: # stuff
Comments
Post a Comment