Korean, Edit

Understanding FASTQ, FASTA, GTF, GFF, BAM and SAM Files

Recommended Post : 【Bioinformatics】 Table of Contents for Bioinformatics Analysis


1. FASTQ Files

2. FASTA Files

3. GTF Files

4. GFF Files

5. BAM files

6. SAM files



1. FASTQ (fast-Q)

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

⑴ Stores sequence information of a sample.

Line 1 : SEQ_ID, which is @ + sequence identifier + optional description.

Example 1: @HWUSI-EAS100R:6:73:941:1973#0/1

Table. 1. Example of SEQ_ID (ref)

Example 2: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG

Table. 2. Example of SEQ_ID (ref)

Line 2 : raw sequence

Line 3 : “+” + ( optional ) sequence identifier

Line 4 : Quality scores for the sequence in Line 2.

① Quality score Q = -log P (where P is the error probability).

② Represented as ASCII characters, the number of characters matches the length of the raw sequence.

Type 1: PHRED 33 encoding : Currently the most widely used format.

Table. 3. PHRED 33 encoding

Type 2: PHRED 64 encoding

Table. 4. PHRED 64 encoding



2. FASTA (fast-A)

⑴ Stores sequence information of a reference.

⑵ Example : FASTA file for GFP.

>L29345.1 Aequorea victoria green-fluorescent protein (GFP) mRNA, complete cds
TACACACGAATAAAAGATAACAAAGATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTT
GTTGAATTAGATGGCGATGTTAATGGGCAAAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACAT
ACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCATGGCCAACACTTGTCAC
TACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGCATGACTTTTTCAAG
AGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTACAAAGATGACGGGAACTACAAGACAC
GTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATCGAGTTAAAAGGTATTGATTTTAAAGA
AGATGGAAACATTCTTGGACACAAAATGGAATACAACTATAACTCACATAATGTATACATCATGGCAGAC
AAACCAAAGAATGGAATCAAAGTTAACTTCAAAATTAGACACAACATTAAAGATGGAAGCGTTCAATTAG
CAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTC
CACACAATCTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGATCCTTCTTGAGTTTGTAACA
GCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAAATGTCCAGACTTCCAATTGACACTAAAG
TGTCCGAACAATTACTAAATTCTCAGGGTTCCTGGTTAAATTCAGGCTGAGACTTTATTTATATATTTAT
AGATTCATTAAAATTTTATGAATAATTTATTGATGTTATTAATAGGGGCTATTTTCTTATTAAATAGGCT
ACTGGAGTGTAT



3. GTF (gene transfer format)

⑴ Stores annotation information of a reference.

⑵ Example : Contents of a GTF file for the MUC1 gene and one transcript.

NC_000001.11	BestRefSeq	gene	155185824	155192915	.	-	.	gene_id "MUC1"; transcript_id ""; db_xref "GeneID:4582"; db_xref "HGNC:HGNC:7508"; db_xref "MIM:158340"; description "mucin 1, cell surface associated"; gbkey "Gene"; gene "MUC1"; gene_biotype "protein_coding"; gene_synonym "ADMCKD"; gene_synonym "ADMCKD1"; gene_synonym "ADTKD2"; gene_synonym "CA 15-3"; gene_synonym "Ca15-3"; gene_synonym "CD227"; gene_synonym "EMA"; gene_synonym "H23AG"; gene_synonym "KL-6"; gene_synonym "MAM6"; gene_synonym "MCD"; gene_synonym "MCKD"; gene_synonym "MCKD1"; gene_synonym "MUC-1"; gene_synonym "MUC-1/SEC"; gene_synonym "MUC-1/X"; gene_synonym "MUC1/ZD"; gene_synonym "PEM"; gene_synonym "PEMT"; gene_synonym "PUM"; 
NC_000001.11	BestRefSeq	transcript	155185824	155192915	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gbkey "mRNA"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; 
NC_000001.11	BestRefSeq	exon	155192786	155192915	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "1"; 
NC_000001.11	BestRefSeq	exon	155192183	155192310	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "2"; 
NC_000001.11	BestRefSeq	exon	155188008	155188063	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "3"; 
NC_000001.11	BestRefSeq	exon	155187722	155187858	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "4"; 
NC_000001.11	BestRefSeq	exon	155187455	155187576	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "5"; 
NC_000001.11	BestRefSeq	exon	155187225	155187374	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "6"; 
NC_000001.11	BestRefSeq	exon	155185824	155186209	.	-	.	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "7"; 
NC_000001.11	BestRefSeq	CDS	155192786	155192843	.	-	0	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "1"; 
NC_000001.11	BestRefSeq	CDS	155192183	155192310	.	-	2	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "2"; 
NC_000001.11	BestRefSeq	CDS	155188008	155188063	.	-	0	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "3"; 
NC_000001.11	BestRefSeq	CDS	155187722	155187858	.	-	1	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "4"; 
NC_000001.11	BestRefSeq	CDS	155187455	155187576	.	-	2	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "5"; 
NC_000001.11	BestRefSeq	CDS	155187225	155187374	.	-	0	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "6"; 
NC_000001.11	BestRefSeq	CDS	155186138	155186209	.	-	0	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "7"; 
NC_000001.11	BestRefSeq	start_codon	155192841	155192843	.	-	0	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "1"; 
NC_000001.11	BestRefSeq	stop_codon	155186135	155186137	.	-	0	gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "7";

① NM_000001.11 : Reference accession number. “NM_000001” refers to chromosome 1, “.11” is the 11th version.

② BestRefSeq, RefSeq, Gnomon, HAVANA, etc. : Types of references.

③ GTF rows include gene, transcript, exon, CDS, domain, group, start_codon, stop_codon, etc.

④ The numbers 155185825 and 155192915 in the first row indicate that the MUC1 gene spans from the 155185825th base to the 155192915th base in the FASTA sequence.

⑤ +, - : (+) indicates the gene is on the forward (= plus, sense) strand, (-) indicates it’s on the reverse (= minus, antisense) strand.

⑥ 0, 1, 2 : In CDS, 0, 1, 2 correspond to the 1st, 2nd, and 3rd bases of the decoding frame.

⑦ One gene can have multiple transcripts : Gene and transcript are linked through gene_id, gene, etc.

⑧ Each transcript can have multiple exon features : Transcript and exon are linked through transcript_id.

⑨ CDS (protein coding sequence) is usually a subset of exons : Some exons are identical to CDS.

⑩ Some genes may lack start_codon or stop_codon (e.g., LOC102724389).



4. GFF (general feature format)

⑴ Overview

① Stores annotation information of a reference. GFF has slight formatting differences from GTF.

② Hierarchical relationship between gene_id and transcript_id is not preserved in GFF.

GTF to GFF Conversion

import sys

inFile = open(sys.argv[1],'r')

for line in inFile:
  #skip comment lines that start with the '#' character
  if line[0] != '#':
    #split line into columns by tab
    data = line.strip().split('\t')

    #parse the transcript/gene ID. I suck at using regex, so I usually just do a series of splits.
    transcriptID = data[-1].split('transcript_id')[-1].split(';')[0].strip()[1:-1]
    geneID = data[-1].split('gene_id')[-1].split(';')[0].strip()[1:-1]

    #replace the last column with a GFF formatted attributes columns
    #I added a GID attribute just to conserve all the GTF data
    data[-1] = "ID=" + transcriptID + ";GID=" + geneID

    #print out this new GFF line
    print '\t'.join(data)

GFF to GTF Conversion

import sys

inFile = open(sys.argv[1],'r')

for line in inFile:
  #skip comment lines that start with the '#' character
  if line[0] != '#':
    #split line into columns by tab
    data = line.strip().split('\t')

    ID = ''

    #if the feature is a gene 
    if data[2] == "gene":
      #get the id
      ID = data[-1].split('ID=')[-1].split(';')[0]

    #if the feature is anything else
    else:
      # get the parent as the ID
      ID = data[-1].split('Parent=')[-1].split(';')[0]

    #modify the last column
    data[-1] = 'gene_id "' + ID + '"; transcript_id "' + ID

    #print out this new GTF line
    print '\t'.join(data)



5. BAM (Binary Alignment Map)

⑴ A file that stores the results of mapping FASTQ files to a reference file (e.g., GTF).



6. SAM (Sequence Alignment/MAP Format)

⑴ A file generated by viewing a BAM file with SAMtools view and performing sorting/selection.

⑵ Interpretation: Consists of continuous lines as follows.

A01192:688:HM3FVDMXY:2:1372:11614:10488 16      chr1    3630777 255     1S89M   *       0       0       TTTTTTTTTTTTTTTTTTTGGAGTTTAAAACCAAGTATTTAATGTTTTAATTAAATTGTTTCAATACAATTTCAA ACAAATGCCAACTGG      FF:FFFFF:FFFFFFF:F:FFFF:FFFFFFFFF,FF:FFFFFFFFFFF,FFFF::F:FFFFFF,F:,F,FFF,FFFFFFFFFFFFF:FFF      NH:i:1  HI:i:1  AS:i:71 nM:i:8  RG:Z:CR_23_15103_TS_R_SSV_1:0: 1:HM3FVDMXY:2     RE:A:N  xf:i:0  CR:Z:AACTCGATAAACACGT   CY:Z:FFFFFFFFFFFF:FFF   CB:Z:AACTCGATAAACACGT-1 UR:Z:ACTAGTACATGT       UY:Z:FFFFFFFF:FFF       UB:Z:ACTAGTACATGT

① QNAME (Query template NAME)

② FLAG

③ RNAME (Reference sequence NAME)

④ POS (1-based leftmost mapping Position)

⑤ MAPQ (Mapping Quality)

⑥ CIGAR (Concise Idiosyncratic Gapped Alignment Report) string


Op BAM Description
M 0 alignment match (can be a sequence match or mismatch)
I 1 insertion to the reference
D 2 deletion from the reference
N 3 skipped region from the reference
S 4 soft clipping (clipped sequences present in SEQ)
H 5 hard clipping (clipped sequences NOT present in SEQ)
P 6 padding (silent deletion from padded reference)
= 7 sequence match
X 8 sequence mismatch


⑦ RNEXT

⑧ PNEXT

⑨ TLEN (observed Template LENgth)

⑩ SEQ (segment SEQuence)

⑪ QUAL (quality)

⑫ NH:i : The number of reported alignments that contain the read.

⑬ HI:i : The hit index.

⑭ AS:i : Alignment score.

⑮ nM:i : Number of mismatches.

⑯ ts:i : An additional tag, possibly specific to the aligner or analysis pipeline.

⑰ RG:Z : Read group identifier.

⑱ TX:Z, GX:Z, GN:Z, fx:Z : Tags related to the gene or transcript the read is aligned to your query.

⑲ xf:i : An additional flag used by specific software.

⑳ CR:Z, CY:Z, UR:Z, UY:Z, UB:Z : Fields related to cell barcodes and unique molecular identifiers (UMIs), which are important in single-cell sequencing technologies.



Input: 2023.08.03 17:05

results matching ""

    No results matching ""