algorithm - Python mmap regex: searching for common entries in two files


I have 2 huge XML files. One is around 40 GB, the other is around 2 GB. Assume the XML format is something like this:

    <xml>
        ...
        <page>
            <id> 123 </id>
            <title> abc </title>
            <text>
                .....
                .....
                .....
            </text>
        </page>
        ...
    </xml>

I have created an index file for both file 1 and file 2 using mmap.
Each of the index files complies with this format:

id  <page>_byte_position    </page>_byte_position    

So, given an id and the index files, I know where the <page> tag for that id starts and where it ends, i.e. the <page> and </page> byte positions.

Now, what I need is:

- for each id in the smaller index file (for the 2 GB file), to figure out whether that id exists in the larger index file
- if the id exists, to get the <page>_byte_pos and </page>_byte_pos for that id from the larger index file (for the 40 GB file)

My current code is awfully slow. I guess I am doing an O(m*n) algorithm, where m is the size of the larger file and n is the size of the smaller file.

    with open(smaller_idx_file, "r+b") as f_small_idx:
        for line in f_small_idx.readlines():
            split = line.split(" ")
            # the larger index is reopened and fully re-read for every line of the smaller one
            with open(larger_idx_file, "r+b") as f_large_idx:
                for line2 in f_large_idx.readlines():
                    split2 = line2.split(" ")
                    if split[0] in split2:
                        print split[0]
                        print split2[1] + "  " + split2[2]

This is awfully slow!
Are there any better suggestions?

Basically, given 2 huge files, how do I search whether each word in a particular column of the smaller file exists in the huge file and, if it does, extract the other relevant fields as well?

Any suggestions would be appreciated! :)

I don't have time for an elaborate answer right now, but this should work (assuming the temporary dict fits in memory):

  1. Iterate over the smaller file and put the words of the relevant column in a dict (a lookup in a dict has an average-case performance of O(1)).
  2. Iterate over the larger file and look up each word in the dict, storing the relevant information either directly in the dict entries or elsewhere.
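The two steps above can be sketched roughly as follows. This is only a minimal sketch, assuming each index line is whitespace-separated as `id start end` and that all ids from the smaller index fit in memory. The function name `match_index_lines` is made up for illustration, and it accepts any iterable of lines so that open file objects can be passed in directly.

```python
def match_index_lines(small_lines, large_lines):
    """Return {id: (start, end)} taken from the larger index,
    restricted to ids that also occur in the smaller index."""
    # Step 1: collect the ids of the smaller index in a dict (O(1) average lookup).
    wanted = {}
    for line in small_lines:
        fields = line.split()
        if fields:  # skip blank lines
            wanted[fields[0]] = None

    # Step 2: stream the larger index exactly once and fill in the byte positions.
    for line in large_lines:
        fields = line.split()
        if len(fields) >= 3 and fields[0] in wanted:
            wanted[fields[0]] = (fields[1], fields[2])

    # Keep only the ids that were actually found in the larger index.
    return {key: pos for key, pos in wanted.items() if pos is not None}


def match_index_files(smaller_idx_file, larger_idx_file):
    """Same as above, reading both index files line by line."""
    with open(smaller_idx_file, "r") as f_small_idx:
        with open(larger_idx_file, "r") as f_large_idx:
            return match_index_lines(f_small_idx, f_large_idx)
```

Each file is read once, so the work is roughly O(m + n) instead of O(m*n). Iterating over the file object directly (rather than calling `readlines()`) also avoids loading a whole index into memory at once.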

If that does not work, I would suggest sorting (or filtering) the files first so that chunks can then be processed independently (i.e. only compare the entries that start with b...).
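One possible way to get independently processable chunks is sketched below. Note that it swaps in hash partitioning rather than sorting or filtering by leading character: both index files are split into the same number of bucket files by a hash of the id, so matching ids always land in the same bucket pair. The names `partition_index` and `match_partitioned`, the bucket count, and the `id start end` line layout are assumptions made for illustration.

```python
import os
import tempfile
import zlib


def partition_index(path, out_dir, prefix, n_buckets):
    """Split an index file into n_buckets files, keyed by a hash of the id."""
    names = [os.path.join(out_dir, "%s_%d.idx" % (prefix, i))
             for i in range(n_buckets)]
    outs = [open(name, "w") for name in names]
    try:
        with open(path, "r") as f:
            for line in f:
                fields = line.split()
                if fields:  # skip blank lines
                    bucket = zlib.crc32(fields[0].encode("utf-8")) % n_buckets
                    outs[bucket].write(line)
    finally:
        for out in outs:
            out.close()
    return names


def match_partitioned(small_path, large_path, n_buckets=16):
    """Yield (id, start, end) from the larger index for ids in both indexes."""
    with tempfile.TemporaryDirectory() as tmp:
        small_parts = partition_index(small_path, tmp, "small", n_buckets)
        large_parts = partition_index(large_path, tmp, "large", n_buckets)
        # Equal ids hash to the same bucket, so each pair is handled on its own.
        for small_part, large_part in zip(small_parts, large_parts):
            with open(small_part, "r") as f:
                wanted = {line.split()[0] for line in f}
            with open(large_part, "r") as f:
                for line in f:
                    fields = line.split()
                    if len(fields) >= 3 and fields[0] in wanted:
                        yield fields[0], fields[1], fields[2]
```

Only the ids of one bucket of the smaller index are held in memory at a time, at the cost of writing both indexes to disk once more. Raising `n_buckets` shrinks the per-bucket memory use further.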

