Understanding FASTQ, FASTA, GTF, GFF, BAM and SAM Files
1. FASTQ Files
2. FASTA Files
3. GTF Files
4. GFF Files
5. BAM Files
6. SAM Files
1. FASTQ (fast-Q)
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
⑴ Stores the sequencing reads of a sample together with a quality score for each base.
⑵ Line 1: SEQ_ID, which is @ + sequence identifier + optional description.
① Example 1: @HWUSI-EAS100R:6:73:941:1973#0/1
Table. 1. Example of SEQ_ID (ref)
② Example 2: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
Table. 2. Example of SEQ_ID (ref)
⑶ Line 2: raw sequence
⑷ Line 3: “+”, optionally followed by the same sequence identifier as in Line 1.
⑸ Line 4: Quality scores for the sequence in Line 2.
① Quality score Q = -10 × log10(P), where P is the probability that the base call is wrong. For example, Q = 30 corresponds to P = 0.001.
② Each score is written as a single ASCII character, so Line 4 has the same length as the raw sequence in Line 2.
③ Type 1: PHRED 33 encoding (ASCII code = Q + 33): Currently the most widely used format.
Table. 3. PHRED 33 encoding
④ Type 2: PHRED 64 encoding (ASCII code = Q + 64): Used by older Illumina pipelines.
Table. 4. PHRED 64 encoding
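As a minimal sketch of the four-line layout and the PHRED 33 encoding described above, the following Python code parses the example record and converts the quality string into scores and error probabilities. The function names parse_fastq_record, decode_phred and error_probability are made up for illustration, and the code assumes exactly four lines per record with no line wrapping.

```python
def parse_fastq_record(lines):
    """Split one 4-line FASTQ record into (identifier, sequence, quality string)."""
    header, seq, plus, qual = [ln.rstrip("\n") for ln in lines]
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], seq, qual


def decode_phred(qual, offset=33):
    """Convert quality characters to integer scores (offset 33 or 64)."""
    return [ord(ch) - offset for ch in qual]


def error_probability(q):
    """Invert Q = -10 * log10(P) to recover the error probability P."""
    return 10 ** (-q / 10)


# The example record shown at the top of this section
record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
name, seq, qual = parse_fastq_record(record)
scores = decode_phred(qual)
print(name, len(seq), scores[:5])
print(error_probability(30))
```

For a file in PHRED 64 encoding, the same function would be called as decode_phred(qual, offset=64).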
2. FASTA (fast-A)
⑴ Stores sequence information of a reference.
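⑵ Example: FASTA file for GFP. The header line starts with “>” and is followed by the sequence, which is usually wrapped over several lines.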
>L29345.1 Aequorea victoria green-fluorescent protein (GFP) mRNA, complete cds
TACACACGAATAAAAGATAACAAAGATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTT
GTTGAATTAGATGGCGATGTTAATGGGCAAAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACAT
ACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCATGGCCAACACTTGTCAC
TACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGCATGACTTTTTCAAG
AGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTACAAAGATGACGGGAACTACAAGACAC
GTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATCGAGTTAAAAGGTATTGATTTTAAAGA
AGATGGAAACATTCTTGGACACAAAATGGAATACAACTATAACTCACATAATGTATACATCATGGCAGAC
AAACCAAAGAATGGAATCAAAGTTAACTTCAAAATTAGACACAACATTAAAGATGGAAGCGTTCAATTAG
CAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTC
CACACAATCTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGATCCTTCTTGAGTTTGTAACA
GCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAAATGTCCAGACTTCCAATTGACACTAAAG
TGTCCGAACAATTACTAAATTCTCAGGGTTCCTGGTTAAATTCAGGCTGAGACTTTATTTATATATTTAT
AGATTCATTAAAATTTTATGAATAATTTATTGATGTTATTAATAGGGGCTATTTTCTTATTAAATAGGCT
ACTGGAGTGTAT
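The header-plus-wrapped-sequence layout above can be read with a few lines of Python. This is only a sketch: read_fasta is a hypothetical helper name, and the toy input is the first 25 bases of the GFP record re-wrapped over two lines for brevity.

```python
import io


def read_fasta(handle):
    """Return {identifier: sequence} from a FASTA file-like object."""
    records = {}
    name = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # store the previous record before starting a new one
            if name is not None:
                records[name] = "".join(chunks)
            # identifier = first word after '>'; the rest is the description
            parts = line[1:].split()
            name = parts[0] if parts else ""
            chunks = []
        else:
            chunks.append(line)
    if name is not None:
        records[name] = "".join(chunks)
    return records


# Toy input: the start of the GFP record, wrapped differently for illustration
toy = io.StringIO(
    ">L29345.1 Aequorea victoria green-fluorescent protein (GFP) mRNA, complete cds\n"
    "TACACACGAATAAAAG\n"
    "ATAACAAAG\n"
)
records = read_fasta(toy)
print(records)
```

With a real file, the handle would come from open(path) instead of io.StringIO.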
3. GTF (gene transfer format)
⑴ Stores annotation information of a reference.
⑵ Example: Contents of a GTF file for the MUC1 gene and one transcript.
NC_000001.11 BestRefSeq gene 155185824 155192915 . - . gene_id "MUC1"; transcript_id ""; db_xref "GeneID:4582"; db_xref "HGNC:HGNC:7508"; db_xref "MIM:158340"; description "mucin 1, cell surface associated"; gbkey "Gene"; gene "MUC1"; gene_biotype "protein_coding"; gene_synonym "ADMCKD"; gene_synonym "ADMCKD1"; gene_synonym "ADTKD2"; gene_synonym "CA 15-3"; gene_synonym "Ca15-3"; gene_synonym "CD227"; gene_synonym "EMA"; gene_synonym "H23AG"; gene_synonym "KL-6"; gene_synonym "MAM6"; gene_synonym "MCD"; gene_synonym "MCKD"; gene_synonym "MCKD1"; gene_synonym "MUC-1"; gene_synonym "MUC-1/SEC"; gene_synonym "MUC-1/X"; gene_synonym "MUC1/ZD"; gene_synonym "PEM"; gene_synonym "PEMT"; gene_synonym "PUM";
NC_000001.11 BestRefSeq transcript 155185824 155192915 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gbkey "mRNA"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA";
NC_000001.11 BestRefSeq exon 155192786 155192915 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "1";
NC_000001.11 BestRefSeq exon 155192183 155192310 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "2";
NC_000001.11 BestRefSeq exon 155188008 155188063 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "3";
NC_000001.11 BestRefSeq exon 155187722 155187858 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "4";
NC_000001.11 BestRefSeq exon 155187455 155187576 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "5";
NC_000001.11 BestRefSeq exon 155187225 155187374 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "6";
NC_000001.11 BestRefSeq exon 155185824 155186209 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "7";
NC_000001.11 BestRefSeq CDS 155192786 155192843 . - 0 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "1";
NC_000001.11 BestRefSeq CDS 155192183 155192310 . - 2 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "2";
NC_000001.11 BestRefSeq CDS 155188008 155188063 . - 0 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "3";
NC_000001.11 BestRefSeq CDS 155187722 155187858 . - 1 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "4";
NC_000001.11 BestRefSeq CDS 155187455 155187576 . - 2 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "5";
NC_000001.11 BestRefSeq CDS 155187225 155187374 . - 0 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "6";
NC_000001.11 BestRefSeq CDS 155186138 155186209 . - 0 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "7";
NC_000001.11 BestRefSeq start_codon 155192841 155192843 . - 0 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "1";
NC_000001.11 BestRefSeq stop_codon 155186135 155186137 . - 0 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "7";
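① NC_000001.11: Reference sequence accession number. “NC_000001” refers to human chromosome 1, and “.11” is the 11th version.
② BestRefSeq, RefSeq, Gnomon, HAVANA, etc.: The source of the annotation, i.e., the database or program that produced the feature.
③ The feature type in the third column can be gene, transcript, exon, CDS, domain, group, start_codon, stop_codon, etc.
④ The numbers 155185824 and 155192915 in the first row indicate that the MUC1 gene spans from the 155185824th base to the 155192915th base (1-based, both ends inclusive) of the reference FASTA sequence.
⑤ +, -: (+) indicates the gene is on the forward (= plus, sense) strand, (-) indicates it’s on the reverse (= minus, antisense) strand.
⑥ 0, 1, 2: The frame column of CDS features. 0 means the feature starts with the first base of a codon; 1 and 2 mean that one or two bases come before the first complete codon. For example, the first CDS above is 58 bases long (19 codons plus 1 base), so the second CDS has frame 2.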
⑦ One gene can have multiple transcripts: Gene and transcript are linked through gene_id, gene, etc.
⑧ Each transcript can have multiple exon features: Transcript and exon are linked through transcript_id.
⑨ CDS (protein coding sequence) is usually a subset of exons: Some exons are identical to CDS.
⑩ Some genes may lack start_codon or stop_codon (e.g., LOC102724389).
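The column layout and the attribute column described above can be sketched in Python as follows. The names GTF_COLUMNS, parse_gtf_line and feature_length are illustrative, the code assumes tab-separated columns and quoted attribute values (as in the RefSeq rows above), and the input line is an abridged copy of the first exon row.

```python
import re

GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attributes"]
# matches pairs of the form: key "value";
ATTRIBUTE_RE = re.compile(r'(\S+)\s+"([^"]*)"\s*;')


def parse_gtf_line(line):
    """Parse one tab-separated GTF line into a dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("expected 9 tab-separated columns")
    row = dict(zip(GTF_COLUMNS, fields))
    row["start"] = int(row["start"])
    row["end"] = int(row["end"])
    attributes = {}
    for key, value in ATTRIBUTE_RE.findall(row["attributes"]):
        # keys such as db_xref or gene_synonym can repeat, so keep lists
        attributes.setdefault(key, []).append(value)
    row["attributes"] = attributes
    return row


def feature_length(row):
    """GTF coordinates are 1-based and inclusive at both ends."""
    return row["end"] - row["start"] + 1


# Abridged version of the first exon row above
line = "\t".join([
    "NC_000001.11", "BestRefSeq", "exon", "155192786", "155192915",
    ".", "-", ".",
    'gene_id "MUC1"; transcript_id "NM_001204291.1"; '
    'db_xref "GeneID:4582"; exon_number "1";',
])
exon = parse_gtf_line(line)
print(exon["attributes"]["transcript_id"][0], exon["strand"], feature_length(exon))
```

Grouping rows by attributes["transcript_id"] would then collect the exon and CDS features that belong to each transcript.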
4. GFF (general feature format)
⑴ Overview
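① Stores annotation information of a reference. GFF has slight formatting differences from GTF: the last column of GFF3 uses key=value pairs (e.g., ID=...;Parent=...), whereas GTF uses key "value"; pairs.
② GFF has no gene_id and transcript_id attributes, so the gene-transcript hierarchy of GTF is not carried over as such. GFF3 expresses the hierarchy through the ID and Parent attributes instead, which is what the two conversion scripts below (GTF to GFF, then GFF to GTF) rely on.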
# GTF -> GFF: rewrite the last column as ID=<transcript_id>;GID=<gene_id>
import sys

with open(sys.argv[1], 'r') as inFile:
    for line in inFile:
        # skip blank lines and comment lines that start with the '#' character
        if not line.strip() or line[0] == '#':
            continue
        # split line into columns by tab
        data = line.strip().split('\t')
        # parse the transcript/gene ID. I suck at using regex, so I usually just do a series of splits.
        # (for gene rows, transcript_id "" yields an empty ID)
        transcriptID = data[-1].split('transcript_id')[-1].split(';')[0].strip()[1:-1]
        geneID = data[-1].split('gene_id')[-1].split(';')[0].strip()[1:-1]
        # replace the last column with a GFF formatted attributes column
        # I added a GID attribute just to conserve all the GTF data
        data[-1] = "ID=" + transcriptID + ";GID=" + geneID
        # print out this new GFF line
        print('\t'.join(data))
# GFF -> GTF: rewrite the last column as gene_id "<ID>"; transcript_id "<ID>";
import sys

with open(sys.argv[1], 'r') as inFile:
    for line in inFile:
        # skip blank lines and comment lines that start with the '#' character
        if not line.strip() or line[0] == '#':
            continue
        # split line into columns by tab
        data = line.strip().split('\t')
        ID = ''
        # if the feature is a gene
        if data[2] == "gene":
            # get the id
            ID = data[-1].split('ID=')[-1].split(';')[0]
        # if the feature is anything else
        else:
            # get the parent as the ID
            ID = data[-1].split('Parent=')[-1].split(';')[0]
        # modify the last column (the same ID is used for both attributes)
        data[-1] = 'gene_id "' + ID + '"; transcript_id "' + ID + '";'
        # print out this new GTF line
        print('\t'.join(data))
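The second script above depends on the ID and Parent attributes of GFF3. As a rough sketch of how those attributes encode the feature hierarchy, the code below builds a parent-to-children map. The helper names are hypothetical, and the three-line GFF3 fragment (gene1, tx1, exon1) is invented purely for illustration.

```python
from collections import defaultdict


def parse_gff3_attributes(column):
    """Turn 'ID=tx1;Parent=gene1' into {'ID': 'tx1', 'Parent': 'gene1'}."""
    attributes = {}
    for pair in column.strip().split(";"):
        if "=" in pair:
            key, value = pair.split("=", 1)
            attributes[key.strip()] = value.strip()
    return attributes


def children_by_parent(lines):
    """Map each Parent ID to the IDs of its child features."""
    children = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        attributes = parse_gff3_attributes(fields[8])
        # a Parent value may list several parents separated by commas
        for parent in attributes.get("Parent", "").split(","):
            if parent:
                # fall back to the feature type when a child has no ID
                children[parent].append(attributes.get("ID", fields[2]))
    return dict(children)


# Invented GFF3 fragment, for illustration only
toy = [
    "chr1\tdemo\tgene\t100\t900\t.\t+\t.\tID=gene1",
    "chr1\tdemo\tmRNA\t100\t900\t.\t+\t.\tID=tx1;Parent=gene1",
    "chr1\tdemo\texon\t100\t300\t.\t+\t.\tID=exon1;Parent=tx1",
]
hierarchy = children_by_parent(toy)
print(hierarchy)
```

Walking this map from a gene ID downwards recovers the gene, transcript, exon relationship that GTF stores in gene_id and transcript_id.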
5. BAM (Binary Alignment Map)
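⑴ A file that stores the results of mapping FASTQ reads to a reference sequence (e.g., a FASTA genome; a GTF annotation can additionally be supplied for splice-aware alignment). BAM is the compressed binary version of SAM.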
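As the name Binary Alignment Map suggests, a BAM file cannot be read as plain text. The sketch below relies on two properties of the format: BAM is stored in gzip-compatible BGZF blocks, and its decompressed content begins with the four magic bytes "BAM" followed by the byte 1. The helper name looks_like_bam and the toy files are illustrative only; real analyses would normally go through SAMtools or a dedicated library.

```python
import gzip
import os
import tempfile


def looks_like_bam(path):
    """Return True if the decompressed file starts with the BAM magic bytes."""
    try:
        with gzip.open(path, "rb") as handle:
            return handle.read(4) == b"BAM\x01"
    except (OSError, EOFError):
        # not gzip/BGZF-compressed, so it cannot be a BAM file
        return False


# Toy files for demonstration only: the first carries the magic bytes
# but is not a complete BAM file; the second is a plain-text SAM header.
tmpdir = tempfile.mkdtemp()
fake_bam = os.path.join(tmpdir, "toy.bam")
with open(fake_bam, "wb") as out:
    out.write(gzip.compress(b"BAM\x01" + b"\x00" * 8))
plain_text = os.path.join(tmpdir, "toy.sam")
with open(plain_text, "w") as out:
    out.write("@HD\tVN:1.6\n")

print(looks_like_bam(fake_bam), looks_like_bam(plain_text))
```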
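6. SAM (Sequence Alignment/Map format)
⑴ The plain-text counterpart of BAM. It can be generated, for example, by viewing a BAM file with SAMtools view, optionally after sorting or selecting reads.
⑵ Interpretation: After optional header lines that start with @, each alignment is one line of tab-separated fields, as follows.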
A01192:688:HM3FVDMXY:2:1372:11614:10488 16 chr1 3630777 255 1S89M * 0 0 TTTTTTTTTTTTTTTTTTTGGAGTTTAAAACCAAGTATTTAATGTTTTAATTAAATTGTTTCAATACAATTTCAAACAAATGCCAACTGG FF:FFFFF:FFFFFFF:F:FFFF:FFFFFFFFF,FF:FFFFFFFFFFF,FFFF::F:FFFFFF,F:,F,FFF,FFFFFFFFFFFFF:FFF NH:i:1 HI:i:1 AS:i:71 nM:i:8 RG:Z:CR_23_15103_TS_R_SSV_1:0:1:HM3FVDMXY:2 RE:A:N xf:i:0 CR:Z:AACTCGATAAACACGT CY:Z:FFFFFFFFFFFF:FFF CB:Z:AACTCGATAAACACGT-1 UR:Z:ACTAGTACATGT UY:Z:FFFFFFFF:FFF UB:Z:ACTAGTACATGT
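① QNAME (Query template NAME): The read name. In paired-end sequencing, both reads of a pair share the same QNAME.
② FLAG: Bitwise flags describing the alignment. In the example, 16 (0x10) means the read is aligned to the reverse strand.
③ RNAME (Reference sequence NAME): e.g., chr1.
④ POS (1-based leftmost mapping POSition)
⑤ MAPQ (MAPping Quality)
⑥ CIGAR (Concise Idiosyncratic Gapped Alignment Report) string: In the example, 1S89M means 1 soft-clipped base followed by 89 aligned bases. The operations are listed below.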
Op BAM Description
M 0 alignment match (can be a sequence match or mismatch)
I 1 insertion to the reference
D 2 deletion from the reference
N 3 skipped region from the reference
S 4 soft clipping (clipped sequences present in SEQ)
H 5 hard clipping (clipped sequences NOT present in SEQ)
P 6 padding (silent deletion from padded reference)
= 7 sequence match
X 8 sequence mismatch
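⑦ RNEXT: Reference sequence name of the mate (next read); * when unavailable.
⑧ PNEXT: Position of the mate (next read); 0 when unavailable.
⑨ TLEN (observed Template LENgth)
⑩ SEQ (segment SEQuence)
⑪ QUAL (quality): ASCII-encoded base qualities in PHRED 33, as in Line 4 of a FASTQ record.
⑫ NH:i: The number of reported alignments that contain the read.
⑬ HI:i: The hit index, i.e., which of the NH alignments this line is.
⑭ AS:i: Alignment score.
⑮ nM:i: Number of mismatches.
⑯ ts:i: An additional tag, possibly specific to the aligner or analysis pipeline (not present in the example above).
⑰ RG:Z: Read group identifier.
⑱ TX:Z, GX:Z, GN:Z, fx:Z: Tags related to the gene, transcript, or feature that the read is assigned to (not present in the example above).
⑲ xf:i: An additional flag used by specific software.
⑳ CR:Z, CY:Z, CB:Z, UR:Z, UY:Z, UB:Z: Fields related to cell barcodes (CR, CY, CB) and unique molecular identifiers (UMIs; UR, UY, UB), which are important in single-cell sequencing technologies.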
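The field list above can be turned into a small parser. The following is a sketch under stated assumptions: fields are tab-separated, the names parse_sam_line, decode_flag and reference_span are invented for illustration, and the input line is a shortened toy record (10 bases instead of 90) modeled on the example above rather than real data.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]
FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse_strand", 0x20: "mate_reverse_strand",
    0x40: "first_in_pair", 0x80: "second_in_pair",
    0x100: "secondary", 0x200: "qc_fail", 0x400: "duplicate",
    0x800: "supplementary",
}
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line):
    """Split an alignment line into the 11 mandatory fields plus optional tags."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, fields[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    tags = {}
    for item in fields[11:]:
        # optional fields look like TAG:TYPE:VALUE; VALUE may contain ':'
        tag, type_code, value = item.split(":", 2)
        tags[tag] = int(value) if type_code == "i" else value
    record["tags"] = tags
    return record


def decode_flag(flag):
    """List the names of the bits that are set in FLAG."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def reference_span(cigar):
    """Reference bases covered: only M, D, N, = and X consume the reference."""
    return sum(int(n) for n, op in CIGAR_RE.findall(cigar) if op in "MDN=X")


# Shortened toy record modeled on the example above (not real data)
toy = "\t".join([
    "read1", "16", "chr1", "3630777", "255", "1S9M", "*", "0", "0",
    "TTTGGAGTTT", "FF:FFFFF:F", "NH:i:1", "CB:Z:AACTCGATAAACACGT-1",
])
rec = parse_sam_line(toy)
end = rec["POS"] + reference_span(rec["CIGAR"]) - 1
print(decode_flag(rec["FLAG"]), rec["POS"], end, rec["tags"])
```

Applied to the CIGAR string 1S89M from the full example, reference_span returns 89, because the soft-clipped base does not consume the reference.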
Input: 2023.08.03 17:05