SNaQ进行系统发育网构建
2025/12/23大约 4 分钟
SNaQ 是一种用于系统发育网络分析的统计推断方法。该流程包括分枝长度和遗传概率的数值优化和系统发育网络空间的启发式搜索。
物种网络的最大伪似然估计:SNaQ
如果使用 SNaQ,请引用:
Claudia Solís-Lemus and Cécile Ané (2016). Inferring Phylogenetic Networks with Maximum Pseudolikelihood under Incomplete Lineage Sorting. PLoS Genet 12(3):e1005896. doi:10.1371/journal.pgen.1005896
安装
安装相关软件
# 配置SNaQ主要环境
conda install -c conda-forge mamba
conda create -n phylo python=3.8
conda activate phylo
mamba install -c bioconda -c conda-forge raxml julia
## julia install_julia.jl
wget http://pages.stat.wisc.edu/~ane/bucky/v1.4/bucky-1.4.4.tgz
tar -zxvf bucky-1.4.4.tgz
cd bucky-1.4.4/src
make
mv bucky ~/miniconda3/envs/phylo/bin/
mv mbsum ~/miniconda3/envs/phylo/bin/
cd ../..
wget http://research.haifa.ac.il/~ssagi/software/QMCN.tar.gz
tar -zxvf QMCN.tar.gz
mv find-cut-Linux-64 ~/miniconda3/envs/phylo/bin/find-cut-Linux-64
mv genTreeAndQuartets-Linux-64 ~/miniconda3/envs/phylo/bin/genTreeAndQuartets-Linux-64
mv quartet-agreement-Linux-64 ~/miniconda3/envs/phylo/bin/quartet-agreement-Linux-64
wget https://codeload.github.com/nstenz/TICR/zip/refs/heads/master
unzip TICR-master.zip
cd TICR-master/src
make
mv mdl ~/miniconda3/envs/phylo/bin/
cd ../..
conda deactivate
# 配置mrbayes环境
## 安装beagle提高运算效率
wget https://github.com/beagle-dev/beagle-lib/archive/refs/tags/v3.1.2.tar.gz
tar -zxvf beagle-lib-3.1.2.tar.gz
cd beagle-lib-3.1.2
./autogen.sh
./configure --prefix=/usr
make -j4
make install
## 安装mrbayes
wget https://github.com/NBISweden/MrBayes/releases/download/v3.2.7/mrbayes-3.2.7.tar.gz
tar -zxvf mrbayes-3.2.7.tar.gz
cd mrbayes-3.2.7/
./configure --prefix=/usr # mrbayes和beagle安装的位置必须能通过用户.bashrc直接调用,后续SNaQ的脚本批量执行时不能调用conda环境
make -j4
make install安装 PhyloNetworks 包(julia)
在 julia 中运行如下代码:
using Pkg # to use functions that manage packages
Pkg.add("PhyloNetworks") # to download & install package PhyloNetworks
Pkg.add("PhyloPlots")
Pkg.add("RCall") # packaage to call R from within julia
Pkg.add("CSV") # to read from / write to text files, e.g. csv files
Pkg.add("DataFrames") # to create & manipulate data frames
Pkg.add("StatsModels")# for regression formulas
using PhyloNetworks # may take some time: pre-compiles functions in that package
using PhyloPlots运行 TICR 流程
数据准备:
通过 fasta2nexus.R 将 fasta 序列转换成 nexus 格式,并且保证每个序列名为物种名。
fasta2nexus.R
if (!require(ape, quietly = TRUE)) install.packages('ape')
library(ape)
args <- commandArgs(TRUE)
if (length(args) >3 && length(args) <2) {
cat("usage: Rscript fasta2nexus.R dirname postfix filterout\n")
cat("filterout: joined with comma\n")
}else{
setwd(args[1])
if(length(args)==2) arg[3] <- ""
filter <- unlist(strsplit(arg[3],","))
for (i in grep(paste0(args[2], "$"), value = TRUE, list.files())) {
temp <- read.FASTA(i)
names(temp) <- gsub("@.*", "", names(temp))
temp <- temp[setdiff(names(temp),filter)]
write.nexus.data(temp, paste0(i, ".nex"))
}
}TICR 运行脚本
conda activate phylo
Rscript scripts/fasta2nexus.R data/ trimed
cd data/
tar -czvf ../input.tar.gz *.nex ##此处不能带路径名(如 xx/*.nex),否则后续报错
cd ..
scripts/mb.pl input.tar.gz -m scripts/mb-block.txt -o mbout # 批量执行mb构树
scripts/bucky.pl mbout/input.mb.tar -o buckyout # bucky分析一致性因子(CF)
scripts/get-pop-tree.pl buckyout/input.CFs.csv # 参考树
# 也可以直接用Astral构建的溯祖树为参考树
mkdir snaqSNaQ 分析
hmax 即网络中的最大杂交次数,需要逐级向上推导。模型选择工具对于估计杂交数(h)是必要的。在 h 达到最佳值之前,预期会有急剧的改善,之后会有缓慢的线性改善。 目前还没有明确的标准来确定什么是 pseudolikelihood 评分中较大且“显著”的改善。
提示
通常情况下,-log(pseudolikelihood)评分,即 network 对象的 net.loglik 属性, 越低越好。
在 julia 中运行:
using Distributed
addprocs(10) # tells Julia to add 10 threads
@everywhere using PhyloNetworks
buckyCF = readTableCF("buckyout/input.CFs.csv")
tre = readTopology("input.QMC.tre")
net0 = snaq!(tre, buckyCF, hmax=0, filename="snaq/net0_bucky", outgroup="outgroup", seed=123, run=10)
net1 = snaq!(net0, buckyCF, hmax=1, filename="snaq/net1_bucky", outgroup="outgroup", seed=123, run=10)
net2 = snaq!(net1, buckyCF, hmax=2, filename="snaq/net2_bucky", outgroup="outgroup", seed=123, run=10)
net3 = snaq!(net2, buckyCF, hmax=3, filename="snaq/net3_bucky", outgroup="outgroup", seed=123, run=10)
scores = [net0.loglik, net1.loglik, net2.loglik, net3.loglik]
using RCall
net0 = readSnaqNetwork("net0_bucky.out")
rootonedge!(net0, 7)
net1 = readSnaqNetwork("net1_bucky.out")
rootonedge!(net1, 8)
net2 = readSnaqNetwork("net2_bucky.out")
rootonedge!(net2, 8)
net3 = readSnaqNetwork("net3_bucky.out")
rootonedge!(net3, 6)
## 注:应根据network的具体情况调整root edge number
## 系统发育网络图
R"pdf"("plot4-net0123.pdf", width=8, height=8)
R"layout(matrix(1:4,2,2))"
plot(net0, :R);
plot(net1, :R, showGamma=true, style=:majortree);
plot(net2, :R, showGamma=true, style=:majortree);
plot(net3, :R, showGamma=true, style=:majortree);
R"dev.off()";
R"png"("plot4-net0123.png")
R"layout(matrix(1:4,2,2))"
plot(net0, :R);
plot(net1, :R, showGamma=true, style=:majortree);
plot(net2, :R, showGamma=true, style=:majortree);
plot(net3, :R, showGamma=true, style=:majortree);
R"dev.off()";
## 似然值图
scores = [net0.loglik, net1.loglik, net2.loglik, net3.loglik]
R"pdf"("score-vs-h.pdf", width=4, height=4);
R"plot"(x=0:3, y=scores, type="o", xlab="number of hybridizations h",
ylab="network score");
R"dev.off"();
R"png"("score-vs-h.png");
R"plot"(x=0:3, y=scores, type="o", xlab="number of hybridizations h",
ylab="network score");
R"dev.off"();
## 预期实际cf对比
inputCF = readTableCF("../buckyout/mb.CFs.csv")
net3 = readsnaqnetwork("net3.out");
topologymaxQpseudolik!(net3, inputCF);
df_long = fittedquartetCF(inputCF, :long);
### Adding jitter to the points for better visualization
using Distributions, Random, RCall
Random.seed!(123);
jitter = 0.005;
df_jittered = deepcopy(df_long);
df_jittered.obsCF .+= rand(Uniform(-jitter / 2, jitter / 2), nrow(df_long));
df_jittered.expCF .+= rand(Uniform(-jitter / 2, jitter / 2), nrow(df_long));
R"pdf"("expCFs_obsvsfitted.pdf", width=6, height=6);
R"plot"(df_jittered.obsCF, df_jittered.expCF, xlab="quartet CF observed in gene trees",
ylab="quartet CF expected from network", pch=16,
col="#008080",
cex=0.7);
R"abline"(0, 1, col="#008080", lwd=0.5);
R"dev.off"();
# Goodness of fit of the SNaQ networks
using QuartetNetworkGoodnessFit;
res1 = quarnetGoFtest!(net3, inputCF, true; seed=123, nsim=1000);
res1[[1,2,3]] # p-value, uncorrected z, σ