public class InterleavedFastqInputFormat
extends org.apache.hadoop.mapreduce.lib.input.FileInputFormat<Void,org.apache.hadoop.io.Text>
This class is a Hadoop reader for "interleaved FASTQ": that is,
FASTQ with paired reads in the same file, interleaved, rather than
in two separate files. This makes it much easier to split a single
file across Hadoop tasks and feed the slices into an aligner.
The format is the same as FASTQ, but records are expected to alternate
between read names ending in /1 and /2. As a precondition, we assume that the interleaved
FASTQ files are always uncompressed; if the files are compressed, they
cannot be split, and thus there is no reason to use the interleaved
format.
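As an illustration of the expected layout, the sketch below checks that a list of lines alternates between /1 and /2 records with matching read names. It is not part of this class: the InterleavedPairCheck class and its method names are hypothetical, and it assumes that every record occupies exactly four lines and that the /1 or /2 suffix ends the header line.

```java
import java.util.List;

// Hypothetical helper, for illustration only; not part of InterleavedFastqInputFormat.
class InterleavedPairCheck {

    /**
     * Returns the read name with the leading '@' and the trailing "/1" or "/2"
     * removed, or null if the header does not start with '@' or does not end
     * with the expected mate suffix.
     */
    static String baseName(String header, char mate) {
        if (header == null || !header.startsWith("@")) {
            return null;
        }
        String suffix = "/" + mate;
        if (!header.endsWith(suffix)) {
            return null;
        }
        return header.substring(1, header.length() - suffix.length());
    }

    /**
     * Assumes four-line FASTQ records, so each read pair spans eight lines.
     * Returns true only if every pair is a /1 record followed by a /2 record
     * carrying the same read name. An empty list is treated as valid.
     */
    static boolean isInterleaved(List<String> lines) {
        if (lines.size() % 8 != 0) {
            return false;
        }
        for (int i = 0; i < lines.size(); i += 8) {
            String first = baseName(lines.get(i), '1');
            String second = baseName(lines.get(i + 4), '2');
            if (first == null || !first.equals(second)) {
                return false;
            }
        }
        return true;
    }
}
```

Under these assumptions, a file whose header lines read @r1/1, @r1/2, @r2/1, @r2/2 passes the check, while a file containing two consecutive /1 records does not.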
This reader is based on the FastqInputFormat that's part of Hadoop-BAM,
found at https://github.com/HadoopGenomics/Hadoop-BAM/blob/master/src/main/java/org/seqdoop/hadoop_bam/FastqInputFormat.java
- Author:
- Jeremy Elson (jelson@microsoft.com)