hadoop - Will MapReduce use the node where most of the gzip'd file is?
I have an HDFS cluster with big gzip'd files.
I made sure that all blocks of each gzip'd file are on the same datanode, by writing them to HDFS from that data node.
for i in {1..10}; do scp file$i.gz datanode$i: ; ssh datanode$i hadoop fs -put file$i.gz /data/ ; done
Now I want to run a MapReduce task on all these files.
I expect the JobTracker to put the job processing file1 on datanode1, where all of its blocks are.
Indeed, if a datanode dies I will lose locality, but will it work until it dies?
If it doesn't work like that, can I write a FileInputFormat that will do that?
GZip isn't a splittable compression format (well, it is if you stack gzip files end to end), so firstly make sure the block size of your gzip files is the same as / bigger than the actual file sizes.
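To make the block-size advice concrete, here is a rough sketch of picking a per-file block size at upload time. The helper names (block_size_for, put_single_block) are made up for illustration. It assumes a Hadoop version where the property is called dfs.blocksize (older releases used dfs.block.size), that the block size has to be a multiple of the 512-byte checksum chunk, and that the cluster enforces a 1 MiB minimum block size; check those against your own configuration.

```shell
#!/usr/bin/env bash

# Smallest block size (in bytes) that holds the whole file in one block.
# Assumptions: block size must be a multiple of 512 bytes, and the cluster
# rejects block sizes below 1 MiB. Adjust both values for your setup.
block_size_for() {
  local size_bytes=$1
  local unit=512
  local floor=1048576
  local rounded=$(( (size_bytes + unit - 1) / unit * unit ))
  echo $(( rounded < floor ? floor : rounded ))
}

# Hypothetical wrapper: upload one file so that it occupies a single block.
# Not called here, because it needs a working hadoop client on the PATH.
put_single_block() {
  local src=$1 dest=$2
  local size
  size=$(wc -c < "$src")
  hadoop fs -D dfs.blocksize="$(block_size_for "$size")" -put "$src" "$dest"
}
```

With this, something like `put_single_block file1.gz /data/` run on datanode1 would replace the plain `hadoop fs -put` from the question. Very large single blocks have their own costs (re-replicating one takes longer, for example), so treat this as a starting point rather than a recommendation.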
As a gzip file isn't splittable, if you have a 1G gzip file with a 256M block size, there is a chance that not all of the blocks of that file reside on the same datanode (even if you upload from one of the datanodes, there is no guarantee that over time, with failures and so on, the blocks will not be moved around to other nodes). In this case the job tracker will never report a local map task if any of the blocks are non-resident on the node where the task is running.
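The arithmetic behind the 1G / 256M example, plus a way to check where the blocks actually live, might look like the sketch below. num_blocks and show_block_locations are hypothetical helper names; the fsck call assumes the older `hadoop fsck` entry point (newer releases spell it `hdfs fsck`).

```shell
#!/usr/bin/env bash

# Number of HDFS blocks a file of the given size occupies (ceiling division).
num_blocks() {
  local size_bytes=$1 block_bytes=$2
  echo $(( (size_bytes + block_bytes - 1) / block_bytes ))
}

# 1G file, 256M blocks: four blocks that all have to stay on one node
# for the single map task reading the file to be fully local.
num_blocks $((1024 * 1024 * 1024)) $((256 * 1024 * 1024))   # prints 4

# Hypothetical helper: list each block of a file and the datanodes holding it.
# Not called here, because it needs a running cluster.
show_block_locations() {
  hadoop fsck "$1" -files -blocks -locations
}
```

Running `show_block_locations /data/file1.gz` after the upload, and again after a node failure or a rebalance, shows whether all blocks of the file still share a datanode.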
As for task assignments - if you have, say, 4 map slots on datanode1 and 100 files to process, the job tracker is not going to run all 100 tasks on datanode1. It will try to run a task on datanode1 if there is a free task slot on that node and a map task has split locations on that node, but once those 4 slots are in use, the JT will instruct other task trackers to run the tasks (if they have free slots) rather than wait to run all 100 tasks on the same node.
Yes, if a datanode dies you'll lose data locality if the block size is smaller than the files (for the reason mentioned in the first sentence), but if the block size is the same as or bigger than the file, you'll have data locality on any data node that has a replica of that block.