Understanding Data Format
Recommended Post: 【Bioinformatics】 Table of Contents for Bioinformatics Analysis
1. FASTQ Files
2. FASTA Files
3. GFF Files
4. GTF Files
5. BAM Files
6. SAM Files
7. BED Files
8. Loom Files
9. VCF Files
10. Other File Types
1. FASTQ (fast-Q)
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
⑴ Stores sequence information of a sample.
⑵ Line 1: SEQ_ID, which is @ + sequence identifier + optional description.
① Example 1: @HWUSI-EAS100R:6:73:941:1973#0/1
HWUSI-EAS100R | the unique instrument name |
---|---|
6 | flowcell lane |
73 | tile number within the flowcell lane |
941 | ‘x’-coordinate of the cluster within the tile |
1973 | ‘y’-coordinate of the cluster within the tile |
#0 | index number for a multiplexed sample (0 for no indexing) |
/1 | the member of a pair, /1 or /2 (paired-end or mate-pair reads only) |
Table 1. Example of SEQ_ID (ref)
② Example 2: @EAS139:136:FC706VJ:2:2104:15343:197393 1:Y:18:ATCACG
EAS139 | the unique instrument name |
---|---|
136 | the run id |
FC706VJ | the flowcell id |
2 | flowcell lane |
2104 | tile number within the flowcell lane |
15343 | ‘x’-coordinate of the cluster within the tile |
197393 | ‘y’-coordinate of the cluster within the tile |
1 | the member of a pair, 1 or 2 (paired-end or mate-pair reads only) |
Y | Y if the read is filtered (did not pass), N otherwise |
18 | 0 when none of the control bits are on, otherwise it is an even number |
ATCACG | index sequence |
Table 2. Example of SEQ_ID (ref)
⑶ Line 2: raw sequence
⑷ Line 3: “+” + ( optional ) sequence identifier
⑸ Line 4: Quality scores for the sequence in Line 2.
① Phred quality score = Q = Qsanger = -10 log10 P (where P is the base call error probability).
○ Example 1. 1 error in 1000 = Qsanger 30
○ Example 2. 1 error in 10000 = Qsanger 40
② Represented as ASCII characters, the number of characters matches the length of the raw sequence.
③ Type 1: PHRED 33 encoding
Table 3. PHRED 33 encoding
○ Currently the most widely used format.
○ Represented as an ASCII code after adding 33 to the Phred score. That is, map 0-93 to ASCII 33-126.
④ Type 2: PHRED 64 encoding
Table 4. PHRED 64 encoding
2. FASTA (fast-A)
⑴ Overview
① Stores sequence information of the reference
② The header line starts with the “>” symbol.
③ Applicable to DNA, RNA, and protein.
⑵ Example: FASTA file for GFP.
>L29345.1 Aequorea victoria green-fluorescent protein (GFP) mRNA, complete cds
TACACACGAATAAAAGATAACAAAGATGAGTAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTT
GTTGAATTAGATGGCGATGTTAATGGGCAAAAATTCTCTGTCAGTGGAGAGGGTGAAGGTGATGCAACAT
ACGGAAAACTTACCCTTAAATTTATTTGCACTACTGGGAAGCTACCTGTTCCATGGCCAACACTTGTCAC
TACTTTCTCTTATGGTGTTCAATGCTTTTCAAGATACCCAGATCATATGAAACAGCATGACTTTTTCAAG
AGTGCCATGCCCGAAGGTTATGTACAGGAAAGAACTATATTTTACAAAGATGACGGGAACTACAAGACAC
GTGCTGAAGTCAAGTTTGAAGGTGATACCCTTGTTAATAGAATCGAGTTAAAAGGTATTGATTTTAAAGA
AGATGGAAACATTCTTGGACACAAAATGGAATACAACTATAACTCACATAATGTATACATCATGGCAGAC
AAACCAAAGAATGGAATCAAAGTTAACTTCAAAATTAGACACAACATTAAAGATGGAAGCGTTCAATTAG
CAGACCATTATCAACAAAATACTCCAATTGGCGATGGCCCTGTCCTTTTACCAGACAACCATTACCTGTC
CACACAATCTGCCCTTTCCAAAGATCCCAACGAAAAGAGAGATCACATGATCCTTCTTGAGTTTGTAACA
GCTGCTGGGATTACACATGGCATGGATGAACTATACAAATAAATGTCCAGACTTCCAATTGACACTAAAG
TGTCCGAACAATTACTAAATTCTCAGGGTTCCTGGTTAAATTCAGGCTGAGACTTTATTTATATATTTAT
AGATTCATTAAAATTTTATGAATAATTTATTGATGTTATTAATAGGGGCTATTTTCTTATTAAATAGGCT
ACTGGAGTGTAT
3. GFF (general feature format)
⑴ Overview
① Stores annotation information from the reference. There are slight format differences from GTF.
② 1-start, fully closed
○ However, in the case of web browsers such as the UCSC Genome Browser, they use a 0-start, half-open system.
rs782519173 (hg38) | start | end |
---|---|---|
positioned in web browser (1-start, fully-closed) | 133255708 | 133255708 |
stored in table (0-start, half-open) | 133255707 | 133255708 |
Table 5. zero-based vs one-based
③ The first eight GFF fields are seqname(#seqid), source, feature(type), start, end, score, strand, frame(phase), attributes: same as GTF.
○ seqname: Name of the chromosome or scaffold.
○ source: Name of the program that generated this feature, or the data source.
○ feature - Feature type name, e.g. Gene, Variation, Similarity.
○ start: Start position of the feature, with sequence numbering starting at 1.
○ end: End position of the feature (inclusive), with sequence numbering starting at 1.
○ score: A floating point value.
○ strand: Defined as + (forward) or - (reverse).
○ frame - One of ‘0’, ‘1’ or ‘2’. ‘0’ indicates that the first base of the feature is the first base of a codon, ‘1’ that the second base is the first base of a codon, etc.
○ attribute: A semicolon-separated list of tag-value pairs, providing additional information about each feature.
④ Unlike GTF, there are no additional fields: For example, the hierarchical relationship between gene_id and transcript_id is not preserved in GFF.
import sys
inFile = open(sys.argv[1],'r')
for line in inFile:
#skip comment lines that start with the '#' character
if line[0] != '#':
#split line into columns by tab
data = line.strip().split('\t')
#parse the transcript/gene ID. I suck at using regex, so I usually just do a series of splits.
transcriptID = data[-1].split('transcript_id')[-1].split(';')[0].strip()[1:-1]
geneID = data[-1].split('gene_id')[-1].split(';')[0].strip()[1:-1]
#replace the last column with a GFF formatted attributes columns
#I added a GID attribute just to conserve all the GTF data
data[-1] = "ID=" + transcriptID + ";GID=" + geneID
#print out this new GFF line
print '\t'.join(data)
import sys
inFile = open(sys.argv[1],'r')
for line in inFile:
#skip comment lines that start with the '#' character
if line[0] != '#':
#split line into columns by tab
data = line.strip().split('\t')
ID = ''
#if the feature is a gene
if data[2] == "gene":
#get the id
ID = data[-1].split('ID=')[-1].split(';')[0]
#if the feature is anything else
else:
# get the parent as the ID
ID = data[-1].split('Parent=')[-1].split(';')[0]
#modify the last column
data[-1] = 'gene_id "' + ID + '"; transcript_id "' + ID
#print out this new GTF line
print '\t'.join(data)
4. GTF (gene transfer format)
⑴ Overview
① Save the annotation information of the reference
② The first 8 fields are identical to GFF
③ GTF includes 5UTR, 3UTR, inter, inter_CNS, and intron_CNS in the feature column in addition to GFF
④ The group field is a list of attributes: each attribute ends with a semicolon and is separated by exactly one space
⑵ Example: Contents of a GTF file for the MUC1 gene and one transcript.
NC_000001.11 BestRefSeq gene 155185824 155192915 . - . gene_id "MUC1"; transcript_id ""; db_xref "GeneID:4582"; db_xref "HGNC:HGNC:7508"; db_xref "MIM:158340"; description "mucin 1, cell surface associated"; gbkey "Gene"; gene "MUC1"; gene_biotype "protein_coding"; gene_synonym "ADMCKD"; gene_synonym "ADMCKD1"; gene_synonym "ADTKD2"; gene_synonym "CA 15-3"; gene_synonym "Ca15-3"; gene_synonym "CD227"; gene_synonym "EMA"; gene_synonym "H23AG"; gene_synonym "KL-6"; gene_synonym "MAM6"; gene_synonym "MCD"; gene_synonym "MCKD"; gene_synonym "MCKD1"; gene_synonym "MUC-1"; gene_synonym "MUC-1/SEC"; gene_synonym "MUC-1/X"; gene_synonym "MUC1/ZD"; gene_synonym "PEM"; gene_synonym "PEMT"; gene_synonym "PUM";
NC_000001.11 BestRefSeq transcript 155185824 155192915 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gbkey "mRNA"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA";
NC_000001.11 BestRefSeq exon 155192786 155192915 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "1";
NC_000001.11 BestRefSeq exon 155192183 155192310 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "2";
NC_000001.11 BestRefSeq exon 155188008 155188063 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "3";
NC_000001.11 BestRefSeq exon 155187722 155187858 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "4";
NC_000001.11 BestRefSeq exon 155187455 155187576 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "5";
NC_000001.11 BestRefSeq exon 155187225 155187374 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "6";
NC_000001.11 BestRefSeq exon 155185824 155186209 . - . gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "GeneID:4582"; gene "MUC1"; product "mucin 1, cell surface associated, transcript variant 15"; transcript_biotype "mRNA"; exon_number "7";
NC_000001.11 BestRefSeq CDS 155192786 155192843 . - 0 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "1";
NC_000001.11 BestRefSeq CDS 155192183 155192310 . - 2 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "2";
NC_000001.11 BestRefSeq CDS 155188008 155188063 . - 0 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "3";
NC_000001.11 BestRefSeq CDS 155187722 155187858 . - 1 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "4";
NC_000001.11 BestRefSeq CDS 155187455 155187576 . - 2 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "5";
NC_000001.11 BestRefSeq CDS 155187225 155187374 . - 0 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "6";
NC_000001.11 BestRefSeq CDS 155186138 155186209 . - 0 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "7";
NC_000001.11 BestRefSeq start_codon 155192841 155192843 . - 0 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "1";
NC_000001.11 BestRefSeq stop_codon 155186135 155186137 . - 0 gene_id "MUC1"; transcript_id "NM_001204291.1"; db_xref "CCDS:CCDS72934.1"; db_xref "GeneID:4582"; gbkey "CDS"; gene "MUC1"; note "isoform 15 precursor is encoded by transcript variant 15"; product "mucin-1 isoform 15 precursor"; protein_id "NP_001191220.1"; exon_number "7";
① NM_000001.11: Reference accession number. “NM_000001” refers to chromosome 1, “.11” is the 11th version.
② BestRefSeq, RefSeq, Gnomon, HAVANA, etc.: Types of references.
③ GTF rows include gene, transcript, exon, CDS, domain, group, start_codon, stop_codon, etc.
④ The numbers 155185825 and 155192915 in the first row indicate that the MUC1 gene spans from the 155185825th base to the 155192915th base in the FASTA sequence.
⑤ +, -: (+) indicates the gene is on the forward (= plus, sense) strand, (-) indicates it’s on the reverse (= minus, antisense) strand.
⑥ 0, 1, 2: In CDS, 0, 1, 2 correspond to the 1st, 2nd, and 3rd bases of the decoding frame.
⑦ One gene can have multiple transcripts: Gene and transcript are linked through gene_id, gene, etc.
⑧ Each transcript can have multiple exon features: Transcript and exon are linked through transcript_id.
⑨ CDS (protein coding sequence) is usually a subset of exons: Some exons are identical to CDS.
⑩ Some genes may lack start_codon or stop_codon (e.g., LOC102724389).
5. SAM (Sequence Alignment/MAP Format)
⑴ A file that stores the results of mapping FASTQ files to a reference file (e.g., GTF).
⑵ Interpretation: Consists of continuous lines as follows.
A01192:688:HM3FVDMXY:2:1372:11614:10488 16 chr1 3630777 255 1S89M * 0 0 TTTTTTTTTTTTTTTTTTTGGAGTTTAAAACCAAGTATTTAATGTTTTAATTAAATTGTTTCAATACAATTTCAA ACAAATGCCAACTGG FF:FFFFF:FFFFFFF:F:FFFF:FFFFFFFFF,FF:FFFFFFFFFFF,FFFF::F:FFFFFF,F:,F,FFF,FFFFFFFFFFFFF:FFF NH:i:1 HI:i:1 AS:i:71 nM:i:8 RG:Z:CR_23_15103_TS_R_SSV_1:0: 1:HM3FVDMXY:2 RE:A:N xf:i:0 CR:Z:AACTCGATAAACACGT CY:Z:FFFFFFFFFFFF:FFF CB:Z:AACTCGATAAACACGT-1 UR:Z:ACTAGTACATGT UY:Z:FFFFFFFF:FFF UB:Z:ACTAGTACATGT
① QNAME (Query template NAME): Name of query read. In the case of paired-end sequencing, the QNAME of each pair is identical.
② FLAG: Bitwise flag (pairing, strand, etc.)
③ RNAME (Reference sequence NAME): Reference sequence name
④ POS (1-based leftmost mapping Position): 1-based leftmost position of alignment
⑤ MAPQ (Mapping Quality): Phred-scaled
⑥ CIGAR (Concise Idiosyncratic Gapped Alignment Report) string (operation: MIDNSHP)
Op BAM Description
M 0 alignment match (can be a sequence match or mismatch)
I 1 insertion to the reference
D 2 deletion from the reference
N 3 skipped region from the reference
S 4 soft clipping (clipped sequences present in SEQ)
H 5 hard clipping (clipped sequences NOT present in SEQ)
P 6 padding (silent deletion from padded reference)
= 7 sequence match
X 8 sequence mismatch
⑦ RNEXT
⑧ PNEXT
⑨ TLEN (observed Template LENgth)
⑩ SEQ (segment SEQuence)
⑪ QUAL (quality)
⑫ NH:i: The number of reported alignments that contain the read.
⑬ HI:i: The hit index.
⑭ AS:i: Alignment score.
⑮ nM:i: Number of mismatches.
⑯ ts:i: An additional tag, possibly specific to the aligner or analysis pipeline.
⑰ RG:Z: Read group identifier.
⑱ TX:Z, GX:Z, GN:Z, fx:Z: Tags related to the gene or transcript the read is aligned to your query.
⑲ xf:i: An additional flag used by specific software.
⑳ CR:Z, CY:Z, UR:Z, UY:Z, UB:Z: Fields related to cell barcodes and unique molecular identifiers (UMIs), which are important in single-cell sequencing technologies.
㉑ MRNM: Name of mate (Ref name of the mate: * if NA; = if identical)
㉒ MPOS: 1-based leftmost position of mate
㉓ ISIZE: Inferred insert size (leftmost of the upstream read to rightmost of the downstream read)
㉔ SEQQuery: Sequence on the reference (same strand)
㉕ QUAL: Query quality (Phred-scaled)
6. BAM (Binary Alignment Map)
⑴ BAM is the binary version of SAM (not human readable). Uses less space, so is often preferred.
⑵ Often, SAM or BAM files are required to be sorted.
7. BED Files
Figure 1. BED Format
⑴ The minimal format for representing alignment results
⑵ Very useful for representing features of interest such as enhancers, SNPs, ChIP-seq peaks, exons, etc.
⑶ Tab-delimited.
⑷ 3 required fields: chrom, chromStart, chromEnd
① chrom: Chromosome name
② chromStart: 0-offset (starts from ‘0’). Start of feature
③ chromEnd: 1-offset. End of feature
⑸ 9 additional fields: name, score, strand, thickStart, thickEnd, itemRgb, blockCount, blockSizes, blockStarts
① score: A score between 0-1000
② strand: Mapping quality (Phred-scaled)
8. Loom Files
⑴ Gene expression data: Content of the .h5
file
⑵ (Optional) Layer for spliced and unspliced RNA transcripts: In case an RNA velocity-aware tool is used
⑶ (Optional) Layer for cell metadata
⑷ (Optional) Layer for gene metadata
9. VCF Files
⑴ Sample → Raw sequence (FASTA / FASTQ) → Aligned read (BAM / SAM) → Variant call (VCF)
⑵ File Structure
Table 6. Structure of VCF File
① #CHROM: Chromosome identifier. Examples include 7, chr7, X, or chrX.
② POS: Reference position. Sorted in ascending order within each chromosome.
③ ID: Unique identifier separated by semicolons. Spaces are not allowed.
④ REF: Reference base (A, C, G, T). Insertions may be represented by a dot (.).
⑤ ALT: Alternative base separated by semicolons (A, C, G, T). Deletions are represented by a dot (.).
⑥ QUAL: Quality score represented on a log scale. A score of 100 indicates an error probability of 1 in 1010.
⑦ FILTER: Denotes failed filters, separated by semicolons. Can be labeled as PASS or MISSING.
⑧ INFO: Position-level information (excluding sample) in a semicolon-separated name-value format.
○ NS (number of samples): Number of samples in which the variant has been detected.
○ DP (depth): Read depth at this position. DP=14 means a total of 14 sequences were read at this position.
○ AF (allele frequency): Frequency of the allele.
○ AA (ancestral allele): Ancestral allele.
○ DB (dbSNP): Indicates that the variant is registered in dbSNP.
○ H2 (HapMap2): Indicates that the variant is included in the HapMap2 project database.
⑨ FORMAT: Declaration of sample-level field names separated by semicolons.
○ GT (genotype): Represents alleles separated by a slash (/, unphased) or a vertical bar ( , phased).
○ GQ: Genotype quality expressed as a single integer.
○ DP: Read depth expressed as a single integer.
○ HQ: Haploid quality consisting of two integers, separated by a comma.
⑩
: Sample-level field data corresponding to the FORMAT field declaration, separated by semicolons.
⑶ Interpretation
① All variants occur on chromosome 20 in NCBI36 (hg18).
② Five SNP positions are identified (14370, 17330, 1110696, 1230237, 1234567).
③ Three variants have IDs, including two dbSNP records (rs6054257, rs6040355).
④ The first two positions (14370, 17330) are simple single nucleotide polymorphisms.
⑤ The third position contains two alternative alleles (G and T) replacing the reference base (A).
⑥ The fourth position represents a deletion of T with no alternative allele (“.”).
⑦ The fifth entry contains two alternative alleles: the first is a deletion of TC, and the second is an insertion of T.
10. Other File Types
⑴ HDF (Hierarchical Data Format)
⑵ CRAM
⑶ Flat
⑷ AGP
⑸ GB/GBK
⑹ BEDgraph: Related to variant calling.
Figure 2. BEDgraph Format
⑺ Wiggle: Related to variant calling. The format of storing a Wiggle file in binary form is called the bigWig format.
Figure 3. Wiggle Format
⑻ GFA (Graphical Fragment Assembly Format): A file for representing assembly graphs. Not frequently used.
① H (header): No fixed values
② S (segment): Represents a vertex and its complement. The fixed values are segName, segSeq.
③ L (overlap): Represents an edge and its complement. The fixed values are segName1, segOri1, segName2, segOri2, CIGAR.
⑼ FASTG: A file format for representing assembly graphs prior to GFA.
① Terminology is somewhat different: actual vertices are expressed as edges, and edges are expressed as adjacencies.
② Can represent subgraphs using nesting: The lack of algorithms to handle this is pointed out as a major limitation of FASTG.
⑽ PAF
⑾ tagAlign
Input: 2023.08.03 17:05