Fastq and Fasta

Fastq and Fasta files are text-based files containing biological sequence information, usually nucleotide sequences.

The key difference between the two formats is that Fastq files store a quality score for every base, encoded as ASCII characters, while Fasta files do not contain this information.

File extensions are not standardized, but the following are common for Fastq files:


.fastq

.fq

And some examples of Fasta file extensions are:


.fasta

.fa

.fna

.faa

Note: the .fna extension indicates a Fasta file of nucleotide sequences. Similarly, .faa files contain amino acid sequences.

Fastq

Each entry in a Fastq file consists of four lines:

  1. sequence identifier
  2. sequence
  3. quality score identifier starting with a ‘+’
  4. quality scores, one ASCII character per base of the sequence

The first line is the sequence identifier, which describes the sequencing run that produced the read. For Illumina instruments it has the format:

@<instrument>:<run number>:<flowcell ID>:<lane>:<tile>:<x-pos>:<y-pos> <read>:<is filtered>:<control number>:<index sequence>
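The identifier format above can be split into named fields with plain string operations. The following is a minimal sketch, assuming the identifier follows exactly this layout (seven colon-separated fields, a space, then four more). The names `IlluminaId` and `parse_identifier` are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class IlluminaId(NamedTuple):
    """Fields of an identifier line in the format shown above."""
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read: int
    is_filtered: str
    control_number: int
    index_sequence: str


def parse_identifier(line: str) -> IlluminaId:
    """Split an identifier line into its named fields."""
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    # The two groups of fields are separated by a single space.
    left, right = line[1:].strip().split(" ", 1)
    instrument, run, flowcell, lane, tile, x, y = left.split(":")
    read, is_filtered, control, index = right.split(":")
    return IlluminaId(instrument, int(run), flowcell, int(lane), int(tile),
                      int(x), int(y), int(read), is_filtered, int(control), index)


fields = parse_identifier("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG")
print(fields.flowcell_id, fields.lane, fields.index_sequence)
```

Identifiers written by other instruments or pipelines may not match this layout, in which case the unpacking raises a `ValueError` rather than guessing.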

An example of a valid entry:

 

@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
ATGAGTAGCGCCTAGAGATCGAGATCGATAGCGCATA
+
BBAACC?<<<<AAAAA????@@@@@@@BBBAACCCCC
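To make the four-line structure concrete, here is a rough Python sketch that reads entries one at a time and turns the quality characters into integer scores. It assumes every entry occupies exactly four lines and that the scores use an ASCII offset of 33; the offset is an assumption, since this post does not say which encoding a given file uses. The function names and the toy record `read1` are made up for the example.

```python
import io
from typing import Iterator, List, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (identifier, sequence, quality) for each four-line entry."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        sequence = handle.readline().rstrip("\n")
        plus = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError(f"malformed entry starting at {header!r}")
        # Each base needs exactly one quality character.
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


def phred_scores(quality: str, offset: int = 33) -> List[int]:
    """Convert quality characters to integers (offset 33 is an assumption)."""
    return [ord(ch) - offset for ch in quality]


example = io.StringIO("@read1\nACGT\n+\nII5!\n")
for name, seq, qual in read_fastq(example):
    print(name, seq, phred_scores(qual))
```

The length check is the part most worth keeping: an entry whose quality line is shorter or longer than its sequence is not valid Fastq.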

Fasta

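A Fasta entry is essentially a Fastq entry without the quality scores: a header line starting with '>' followed by the sequence, which may be wrapped across several lines.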

The following is an excerpt from one of the two complete genomes of Bordetella hinzii available for download from the NCBI database.

>NZ_CP012076.1 Bordetella hinzii strain F582, complete genome
ATGGATTACCCCCGCGAATTTGACGTCATCGTCGTCGGTGGCGGCCACGCCGGCACCGAGGCGGCGTTGGCAGCTGCCCG
CACAGGCGCCCAGACCCTGCTGCTGACGCACAATATCGAGACCCTAGGCCAGATGTCCTGCAACCCCTCCATTGGGGGGA
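Because a Fasta sequence can span several lines, as in the excerpt above, a reader has to join the lines that follow each header. A minimal sketch of that idea in Python, assuming the whole file fits in memory; `read_fasta` and the toy records are hypothetical names used only for illustration.

```python
import io
from typing import Dict, List, Optional, TextIO


def read_fasta(handle: TextIO) -> Dict[str, str]:
    """Return a mapping from header text (without '>') to the joined sequence."""
    records: Dict[str, str] = {}
    name: Optional[str] = None
    chunks: List[str] = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                records[name] = "".join(chunks)
            name = line[1:]
            chunks = []
        else:
            if name is None:
                raise ValueError("sequence data found before the first header")
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


example = io.StringIO(">seq1 toy record\nACGT\nTTGA\n>seq2\nGG\n")
print(read_fasta(example))
```

For a complete genome it may be preferable to yield records one at a time instead of building a dictionary, but the line-joining logic stays the same.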

From .fastq to .fasta

To convert a Fastq file to a Fasta file, run the following in a terminal:

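 cat input.fastq | paste - - - - | awk -F'\t' '{print ">" substr($1, 2) "\n" $2}' > output.fasta

To explain the code a little:

paste - - - - joins each group of 4 lines from input.fastq into a single tab-separated line. awk then splits that line on tabs and writes the identifier (line 1, with '>' in place of the leading '@') and the sequence (line 2) to output.fasta. The '+' line and the quality scores are dropped.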
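The same conversion can also be sketched in Python, which may be easier to extend than a one-liner. This version assumes each Fastq entry occupies exactly four lines, keeps the identifier and sequence lines, and replaces the leading '@' with '>'. The function name `fastq_to_fasta` and the toy reads are hypothetical.

```python
import io
from typing import TextIO


def fastq_to_fasta(fastq: TextIO, fasta: TextIO) -> int:
    """Copy identifier and sequence of each Fastq entry to Fasta; return the entry count."""
    count = 0
    for i, line in enumerate(fastq):
        position = i % 4  # 0: identifier, 1: sequence, 2: '+', 3: quality
        if position == 0:
            # Swap the leading '@' for '>'.
            fasta.write(">" + line[1:].rstrip("\n") + "\n")
            count += 1
        elif position == 1:
            fasta.write(line.rstrip("\n") + "\n")
        # The '+' line and the quality line are dropped.
    return count


src = io.StringIO("@read1\nACGT\n+\nIIII\n@read2 extra\nGGCA\n+\n!!!!\n")
dst = io.StringIO()
converted = fastq_to_fasta(src, dst)
print(converted)
print(dst.getvalue(), end="")
```

To use it on real files, pass handles from `open("input.fastq")` and `open("output.fasta", "w")` in place of the in-memory `StringIO` objects.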

This conversion comes up because many sequencing technologies, such as Illumina, return Fastq files, while some downstream tools expect nucleotide sequences in Fasta format. The Burrows-Wheeler Aligner (BWA), for example, takes the reference genome it aligns reads against as a Fasta file.
