Linux Programming (bash programming)

Recommended Post: 【Operating System】 Linux

1. Syntax

⑴ Variable

① Overview: Useful for representing long strings or for bash programming like loops

② Variable Assignment: Use the $ symbol

i="Hello World!"
echo $i

③ To reduce ambiguity, you can use curly braces

i="a"
i¡="b"
echo $ii (will give b)
echo ${i}i (will give ai)

⑵ for Loop

for varname in list:
do 
    command1
    command2
    ..
done

2. Bioinformatics Example

⑴ wc -l filename

① Interpretation: Count the number of lines

② Be cautious when using the wildcard (wc) with rm, cp, and mv

⑵ grep -v ^# gencode.v19.long_noncoding_RNAs.gtf | wc -l

① Role of ^ (caret): ^ represents the beginning of a line in regular expressions. ^# means “lines starting with #”

② Role of -v: -v means “invert match.” Without -v, only lines starting with # will be printed

③ Role of (vertical bar): Pass the output of the previous command as input to the next command (wc -l)

④ grep -v ^# gencode.v19.long_noncoding_RNAs.gtf: Search all lines except those starting with #

⑤ wc -l: Output the number of lines passed from the previous command

⑥ Interpretation: Output the number of lines excluding those starting with #

⑶ grep -w gene gencode.v19.long_noncoding_RNAs.gtf | wc -l

① Interpretation: Output the number of lines containing the standalone word “gene”

② Without -w, grep gene searches all lines containing “gene” as part of another word: gene, genetics, progene, etc.

⑶ for i in seq 1 3; do mkdir sample${i}; done

① Role of backtick (`): Executes the command inside the backticks and returns the output

⑷ grep "^chr6" gencode.v19.long_noncoding_RNAs.gtf | perl -lane 'if ($F[3]>=28000000 & ($F[4]<=34000000)){print $_;}' | grep -w "gene"

① -l: Removes the newline character at the end of each line

② -a: Autosplit mode – splits each input line by whitespace and stores it in the @F array

○ The first, second, … elements can be accessed as $F[0], $F[1], …

○ If the file is tab-separated, the command can be written as perl -F'\t' -lane 'print $F[1]' input.csv

③ -n: Represents each line as $_

④ -e: Allows the perl program to be written on the command line

⑤ grep "^chr6" gencode.v19.long_noncoding_RNAs.gtf: Search for all lines starting with “chr6”

⑥ perl -lane 'if ($F[3]>=28000000 & ($F[4]<=34000000)){print $_;}': Implement Perl programming on the command line and search for lines where the third value (e.g., start) is ≥ 28M and the fourth value (e.g., end) is ≤ 34M

⑦ grep -w "gene": Search for “gene” in the lines passed from the previous command

⑧ Interpretation: Search for genes between 28M and 34M on chr6

⑸ grep -w "gene" gencode.v19.long_noncoding_RNAs.gtf | perl -lane 'm/gene_name \"(.*?)\"/; print $1' | less -S

① m/ gene_name "(.*)"/: Pattern matching

② m/gene_name *“(.*)*“ /: Handles quotation marks

③ m/gene_name "(.* ) "/: Captured group can be referenced as $1

○ .: Matches any character except a newline

○ *: Matches the preceding character or pattern zero or more times

○ a*: Matches empty string (“”), a, aa, aaa, etc.

○ .*: Matches any string (zero or more characters)

○ “(.*?)”: Non-greedy match – captures the shortest possible match inside the quotes; without ?, it would match until the end of the line

⑹ j='echo $i | sed 's/.bam//g"

① sed: Represents regular expressions within ‘ ‘

② s/.bam//: Substitution – s/pattern/replacement/

③ s/.bam //: Since the replacement string is empty, .bam is deleted

④ g: If .bam appears multiple times in a line, g removes all occurrences. Without g, only the first occurrence is removed

⑺ samtools view -h $i | perl -lane 'if (($F[0]=~/^@/ | ($F[6] eq"=")) {print $_;}' | samtools view -S -b - > $(j}.new.bam

① samtools

○ -h: Include SAM header

○ -S: Indicates that input is in SAM format (not required in newer samtools versions)

○ -b: Output in BAM format

② $F[0]=~ /^@/: If $F[0] starts with @ (header line)

③ $F[6] is “=”: If aligned to itself

④ Interpretation: Remove reads mapped to different chromosomes in paired mapping

⑻ plink --bfile input_prefix --geno X --mind Y --hwe p --make-bed --out output_prefix

① bfile: Use the prefix of a binary PLINK format file as input

② geno: Filter markers with a missing rate exceeding X%

③ mind: Filter individuals with a missing rate exceeding Y%

④ hwe: Filter markers that fail the Hardy-Weinberg equilibrium

⑤ make-bed: Output in binary PLINK format

⑼ plink --bfile input_prefix --record vcf-iid bgz --threads 3 --out output_prefix

① Convert to VCF format: Combine the .bed file and .bim file.

② Existing .bim files follow the order of minor/major alleles, but VCF files follow the order of major/minor alleles.

⑽ plink2 --vcf input.vcf.gz dosage=DS --maf 0.05 --extract-if-info "R2>=0.70" --export vcf-4.2 vcf-dosage=DS-force --out output.r7.dose

① PLINK command for extracting well-imputed markers

② dosage: By default, dosage info is not imported. We can specify the field to be imported as dosage.

③ vcf-dosage: Export dosage in output file.

④ extract-if-info: Removes all markers that fail this comparison on an INFO key.

Input: 2024.01.16 00:28

2059

Linux Programming (bash programming)

1. Syntax

2. Bioinformatics Example

results matching ""

No results matching ""