Linux 编程(bash 编程)
推荐帖子:【操作系统】Linux
1. 语法
2. 生物信息学示例
1.语法
⑴ 变量
① 概述:对于表示长字符串或循环之类的 bash 编程很有用
② 变量赋值:使用
$符号
受保护_0
③ 为了减少歧义,可以使用花括号
受保护_1
⑵ for 循环
受保护_2
2.生物信息学示例
⑴ 受保护_4
① 解释:统计行数
② 在 rm、cp 和 mv 中使用通配符 (wc) 时要小心
⑵ 受保护_5
①^(插入符号)的作用:^在正则表达式中表示行的开头。 ^# 表示“以 # 开头的行”
② -v 的作用:-v 表示“反向匹配”。如果没有 -v,则仅打印以 # 开头的行
③ 的作用(竖线):将上一个命令的输出作为下一个命令的输入传递(wc -l)
④
grep -v ^# gencode.v19.long_noncoding_RNAs.gtf:搜索除#开头的所有行
⑤
wc -l:输出上一个命令传递过来的行数
⑥ 解释:输出除去#开头的行数
⑶受保护_8
① 解释:输出包含独立单词“gene”的行数
② 如果没有 -w,grepgene 会搜索所有包含“gene”作为另一个单词的一部分的行:gene、Genetics、progene 等。
⑶ 对于 seq 1 3 中的 i;做 mkdir 样本${i};完成
① 反引号的作用 (`): Executes the command inside the backticks and returns the output
⑷ grep "^chr6" gencode.v19.long_noncoding_RNAs.gtf | perl -lane 'if ($F[3]>=28000000 & ($F[4]<=34000000)){print $_;}' | grep -w "gene"
① -l: Removes the newline character at the end of each line
② -a: Autosplit mode – splits each input line by whitespace and stores it in the @F array
○ The first, second, … elements can be accessed as $F[0], $F[1], …
○ If the file is tab-separated, the command can be written as
perl -F'\t' -lane 'print $F[1]' input.csv
③ -n: Represents each line as $_
④ -e: Allows the perl program to be written on the command line
⑤
grep "^chr6" gencode.v19.long_noncoding_RNAs.gtf: Search for all lines starting with “chr6”
⑥
perl -lane 'if ($F[3]>=28000000 & ($F[4]<=34000000)){print $_;}': Implement Perl programming on the command line and search for lines where the third value (e.g., start) is ≥ 28M and the fourth value (e.g., end) is ≤ 34M
⑦
grep -w "gene": Search for “gene” in the lines passed from the previous command
⑧ Interpretation: Search for genes between 28M and 34M on chr6
⑸ grep -w "gene" gencode.v19.long_noncoding_RNAs.gtf | perl -lane 'm/gene_name \"(.*?)\"/; print $1' | less -S
① m/ gene_name "(.*)"/: Pattern matching
② m/gene_name *“(.*)*“ /: Handles quotation marks
③ m/gene_name "(.* ) "/: Captured group can be referenced as $1
○ .: Matches any character except a newline
○ *: Matches the preceding character or pattern zero or more times
○ a*: Matches empty string (“”), a, aa, aaa, etc.
○ .*: Matches any string (zero or more characters)
○ “(.*?)”: Non-greedy match – captures the shortest possible match inside the quotes; without ?, it would match until the end of the line
⑹ j='echo $i | sed 's/.bam//g"
① sed: Represents regular expressions within ‘ ‘
② s/.bam//: Substitution – s/pattern/replacement/
③ s/.bam //: Since the replacement string is empty, .bam is deleted
④ g: If .bam appears multiple times in a line, g removes all occurrences. Without g, only the first occurrence is removed
⑺ samtools view -h $i | perl -lane 'if (($F[0]=~/^@/ | ($F[6] eq"=")) {print $_;}' | samtools view -S -b - > $(j}.new.bam
① samtools
○ -h: Include SAM header
○ -S: Indicates that input is in SAM format (not required in newer samtools versions)
○ -b: Output in BAM format
② $F[0]=~ /^@/: If $F[0] starts with @ (header line)
③ $F[6] is “=”: If aligned to itself
④ Interpretation: Remove reads mapped to different chromosomes in paired mapping
⑻ plink --bfile input_prefix --geno X --mind Y --hwe p --make-bed --out output_prefix
① bfile: Use the prefix of a binary PLINK format file as input
② geno: Filter markers with a missing rate exceeding X%
③ mind: Filter individuals with a missing rate exceeding Y%
④ hwe: Filter markers that fail the Hardy-Weinberg equilibrium
⑤ make-bed: Output in binary PLINK format
⑼ plink --bfile input_prefix --record vcf-iid bgz --threads 3 --out output_prefix
① Convert to VCF format: Combine the
.bedfile and.bimfile.
② Existing
.bimfiles follow the order of minor/major alleles, but VCF files follow the order of major/minor alleles.
⑽ plink2 --vcf input.vcf.gz 剂量=DS --maf 0.05 --extract-if-info "R2>=0.70" --export vcf-4.2 vcf-剂量=DS-force --out 输出.r7.剂量
① 用于提取正确估算标记的 PLINK 命令
② 用量:默认不导入用量信息。我们可以指定要导入的字段作为剂量。
③ vcf-dosage:导出输出文件中的剂量。
④ extract-if-info:删除 INFO 键上未通过此比较的所有标记。
输入:2024.01.16 00:28