regex - Write partly tab-delimited data to MySQL database
I have a MySQL database with 7 columns (chr, pos, num, ia, ib, ic, id) and a file that contains 40 million lines, each containing a dataset. Each line has 4 tab-delimited columns: the first 3 columns contain data, and the fourth column can contain 3 different key=value pairs separated by semicolons:
chr  pos    num  info
1    10203  3    ia=0.34;ib=nerv;ic=45;id=dskf12586
1    10203  4    ia=0.44;ic=45;id=dsf12586;ib=nerv
1    10203  5
1    10213  1    ib=nerv;ic=49;ia=0.14;id=dskf12586
1    10213  2    ia=0.34;ib=nerv;id=cap1486
1    10225  1    id=dscf12586
The key=value pairs in the info column have no specific order. I'm not sure whether a key can occur twice (I hope not).
I'd like to write the data to the database. The first 3 columns are no problem, but extracting the values from the info column puzzles me, since the key=value pairs are unordered and not every key has to be present in a line. For a similar dataset (with an ordered info column) I used a Java program in connection with regular expressions, which allowed me to (1) check and (2) extract the data, but now I'm stranded.
How can I resolve this task, preferably with a bash script or directly in MySQL?
You did not mention how you want to write the data. The awk example below shows how you can get each individual key and value in each line. Instead of the printf, you can use your own logic to write the data.
[[bash_prompt$]]$ cat test.sh; echo "###########"; awk -f test.sh log
{
  if(length($4)) {
    split($4,array,";");
    print "in " $1, $2, $3;
    for(element in array) {
      key=substr(array[element],0,index(array[element],"="));
      value=substr(array[element],index(array[element],"=")+1);
      printf("found %s key , %s value %d line %s\n",key,value,NR,array[element]);
    }
  }
}
###########
in 1 10203 3
found id= key , dskf12586 value 1 line id=dskf12586
found ia= key , 0.34 value 1 line ia=0.34
found ib= key , nerv value 1 line ib=nerv
found ic= key , 45 value 1 line ic=45
in 1 10203 4
found ib= key , nerv value 2 line ib=nerv
found ia= key , 0.44 value 2 line ia=0.44
found ic= key , 45 value 2 line ic=45
found id= key , dsf12586 value 2 line id=dsf12586
in 1 10213 1
found id= key , dskf12586 value 4 line id=dskf12586
found ib= key , nerv value 4 line ib=nerv
found ic= key , 49 value 4 line ic=49
found ia= key , 0.14 value 4 line ia=0.14
in 1 10213 2
found ia= key , 0.34 value 5 line ia=0.34
found ib= key , nerv value 5 line ib=nerv
found id= key , cap1486 value 5 line id=cap1486
in 1 10225 1
found id= key , dscf12586 value 6 line id=dscf12586
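If the goal is a bulk load rather than one INSERT per line, one possible variation is to have awk rewrite each line into seven tab-separated columns in the table's order. The following is only a sketch under some assumptions: the function name flatten_info is made up for illustration, the only keys are ia, ib, ic and id, the input really is tab-delimited, and an optional header line starts with "chr". Missing keys are written as \N, which MySQL's LOAD DATA reads as NULL with its default field escaping. A repeated key is reported on stderr and the last value is kept.

```shell
#!/bin/sh
# Sketch: flatten "chr pos num info" lines into seven tab-separated columns
# (chr, pos, num, ia, ib, ic, id). flatten_info is a hypothetical name.
flatten_info() {
  awk -F'\t' -v OFS='\t' '
    # Assumed key names, listed in the same order as the table columns.
    BEGIN { nkeys = split("ia ib ic id", keys, " ") }

    # Skip a header line if the file has one (assumption: it starts with "chr").
    NR == 1 && $1 == "chr" { next }

    {
      split("", val)                         # portable way to empty the array
      n = split($4, pairs, ";")              # n is 0 when the info column is empty
      for (i = 1; i <= n; i++) {
        eq = index(pairs[i], "=")
        if (eq == 0) continue                # ignore fragments without "="
        k = substr(pairs[i], 1, eq - 1)      # key without the "="
        if (k in val)
          print "line " NR ": duplicate key " k > "/dev/stderr"
        val[k] = substr(pairs[i], eq + 1)    # last occurrence wins
      }
      row = $1 OFS $2 OFS $3
      for (j = 1; j <= nkeys; j++)
        row = row OFS ((keys[j] in val) ? val[keys[j]] : "\\N")
      print row
    }
  ' "$@"
}
```

Called as flatten_info log > log_flat.tsv, this would produce a file that LOAD DATA INFILE can read with its default tab field separator, which is likely to be much faster for 40 million lines than individual inserts. Because the keys are looked up by name, the order of the pairs in the info column does not matter.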