Linux Programming (bash programming)
Recommended Post: 【Operating System】 Linux
1. Syntax
1. Syntax
⑴ Variable
① Overview: Useful for representing long strings or for bash programming like loops
② Variable Assignment: Use the
$
symbol
i="Hello World!"
echo $i
③ To reduce ambiguity, you can use curly braces
i="a"
i¡="b"
echo $ii (will give b)
echo ${i}i (will give ai)
⑵ for Loop
for varname in list:
do
command1
command2
..
done
2. Bioinformatics Example
⑴ wc -l filename
① Interpretation: Count the number of lines
② Be cautious when using the wildcard (wc) with rm, cp, and mv
⑵ grep -v ^# gencode.v19.long_noncoding_RNAs.gtf | wc -l
① Role of ^ (caret): ^ represents the beginning of a line in regular expressions. ^# means “lines starting with #”
② Role of -v: -v means “invert match.” Without -v, only lines starting with # will be printed
③ Role of (vertical bar): Pass the output of the previous command as input to the next command (wc -l)
④
grep -v ^# gencode.v19.long_noncoding_RNAs.gtf
: Search all lines except those starting with #
⑤
wc -l
: Output the number of lines passed from the previous command
⑥ Interpretation: Output the number of lines excluding those starting with #
⑶ grep -w gene gencode.v19.long_noncoding_RNAs.gtf | wc -l
① Interpretation: Output the number of lines containing the standalone word “gene”
② Without -w, grep gene searches all lines containing “gene” as part of another word: gene, genetics, progene, etc.
⑶ for i in seq 1 3
; do mkdir sample${i}; done
① Role of backtick (`): Executes the command inside the backticks and returns the output
⑷ grep "^chr6" gencode.v19.long_noncoding_RNAs.gtf | perl -lane 'if ($F[3]>=28000000 & ($F[4]<=34000000)){print $_;}' | grep -w "gene"
① -l: Removes the newline character at the end of each line
② -a: Autosplit mode – splits each input line by whitespace and stores it in the @F array
○ The first, second, … elements can be accessed as $F[0], $F[1], …
○ If the file is tab-separated, the command can be written as
perl -F'\t' -lane 'print $F[1]' input.csv
③ -n: Represents each line as $_
④ -e: Allows the perl program to be written on the command line
⑤
grep "^chr6" gencode.v19.long_noncoding_RNAs.gtf
: Search for all lines starting with “chr6”
⑥
perl -lane 'if ($F[3]>=28000000 & ($F[4]<=34000000)){print $_;}'
: Implement Perl programming on the command line and search for lines where the third value (e.g., start) is ≥ 28M and the fourth value (e.g., end) is ≤ 34M
⑦
grep -w "gene"
: Search for “gene” in the lines passed from the previous command
⑧ Interpretation: Search for genes between 28M and 34M on chr6
⑸ grep -w "gene" gencode.v19.long_noncoding_RNAs.gtf | perl -lane 'm/gene_name \"(.*?)\"/; print $1' | less -S
① m/ gene_name "(.*)"/: Pattern matching
② m/gene_name *“(.*)*“ /: Handles quotation marks
③ m/gene_name "(.* ) "/: Captured group can be referenced as $1
○ .: Matches any character except a newline
○ *: Matches the preceding character or pattern zero or more times
○ a*: Matches empty string (“”), a, aa, aaa, etc.
○ .*: Matches any string (zero or more characters)
○ “(.*?)”: Non-greedy match – captures the shortest possible match inside the quotes; without ?, it would match until the end of the line
⑹ j='echo $i | sed 's/.bam//g"
① sed: Represents regular expressions within ‘ ‘
② s/.bam//: Substitution – s/pattern/replacement/
③ s/.bam //: Since the replacement string is empty, .bam is deleted
④ g: If .bam appears multiple times in a line, g removes all occurrences. Without g, only the first occurrence is removed
⑺ samtools view -h $i | perl -lane 'if (($F[0]=~/^@/ | ($F[6] eq"=")) {print $_;}' | samtools view -S -b - > $(j}.new.bam
① samtools
○ -h: Include SAM header
○ -S: Indicates that input is in SAM format (not required in newer samtools versions)
○ -b: Output in BAM format
② $F[0]=~ /^@/: If $F[0] starts with @ (header line)
③ $F[6] is “=”: If aligned to itself
④ Interpretation: Remove reads mapped to different chromosomes in paired mapping
⑻ plink --bfile input_prefix --geno X --mind Y --hwe p --make-bed --out output_prefix
① bfile: Use the prefix of a binary PLINK format file as input
② geno: Filter markers with a missing rate exceeding X%
③ mind: Filter individuals with a missing rate exceeding Y%
④ hwe: Filter markers that fail the Hardy-Weinberg equilibrium
⑤ make-bed: Output in binary PLINK format
⑼ plink --bfile input_prefix --record vcf-iid bgz --threads 3 --out output_prefix
① Convert to VCF format: Combine the
.bed
file and.bim
file.
② Existing
.bim
files follow the order of minor/major alleles, but VCF files follow the order of major/minor alleles.
⑽ plink2 --vcf input.vcf.gz dosage=DS --maf 0.05 --extract-if-info "R2>=0.70" --export vcf-4.2 vcf-dosage=DS-force --out output.r7.dose
① PLINK command for extracting well-imputed markers
② dosage: By default, dosage info is not imported. We can specify the field to be imported as dosage.
③ vcf-dosage: Export dosage in output file.
④ extract-if-info: Removes all markers that fail this comparison on an INFO key.
Input: 2024.01.16 00:28