Computing a distance matrix from SNP data

The tool used here is VCF2Dis.

VCF2Dis: a new, simple and efficient program for calculating a p-distance matrix and constructing a tree from Variant Call Format (VCF) files

GitHub page of the tool: https://github.com/BGI-shenzhen/VCF2Dis

I downloaded and installed it by following the help documentation on the software's home page and ran into no problems.

1 Installation


Follow the link below to the repository:

hewm2008/VCF2Dis

Download

Run [make] or [sh make.sh] to compile the software; the resulting binary can be found at [bin/VCF2Dis].
This applies to Linux/Unix and macOS.

tar -zxvf VCF2DisXXX.tar.gz    # if linking fails, try re-installing the [zlib] library:
cd VCF2DisXXX                  # build [zlib] and copy it to the library directory
sh make.sh                     # VCF2Dis-xx/src/include/zlib
./bin/VCF2Dis

Note: if linking fails, try re-installing the *zlib* library.
Note: R with the ape and ggtree packages is recommended.

2 Example: NJ tree without bootstrap


2.1 Parameter description:

Usage: VCF2Dis -InPut  <in.vcf>  -OutPut  <p_dis.mat>

-InPut <str> Input one or muti GATK VCF genotype File
-OutPut <str> OutPut Sample p-Distance matrix

-InList <str> Input GATK muti-chr VCF Path List
-SubPop <str> SubGroup SampleList of VCFFile [ALLsample]
-Rand <float> Probability (0-1] for each site to join Calculation [1]
-KeepMF Keep the Middle File diff & Use matrix

-help Show more help [hewm2008 v1.51s]

2.2 Create the p_distance matrix and construct the NJ tree in Newick format

# 2.1) To create the p_distance matrix and Newick tree for all samples in the VCF, run VCF2Dis directly
./bin/VCF2Dis -InPut in.vcf.gz -OutPut p_dis.mat
# ./bin/VCF2Dis -InPut in.fa.gz -OutPut p_dis.mat -InFormat FA

# 2.2) To create the p_distance matrix and Newick tree for a subgroup of samples, put their sample names into the file sample.list
./bin/VCF2Dis -InPut chr1.vcf.gz chr2.vcf.gz -OutPut p_dis.mat -SubPop sample.list
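
A subgroup list has to use exactly the sample names found in the VCF header. The helper below is a minimal sketch for listing them; `vcf_samples` is a hypothetical name, and the one-name-per-line layout of sample.list is an assumption, so check it against the VCF2Dis documentation.

```shell
# Sketch: print the sample names of a gzip-compressed VCF, one per line.
# Sample names are columns 10 and onward of the "#CHROM" header line.
# vcf_samples is a hypothetical helper, not part of VCF2Dis.
vcf_samples() {
    gzip -dc "$1" | awk -F'\t' '
        /^#CHROM/ { for (i = 10; i <= NF; i++) print $i; exit }
    '
}

# Possible usage (assumes sample.list holds one sample name per line):
# vcf_samples in.vcf.gz | head -n 50 > sample.list
```

Editing the resulting list by hand, rather than typing the names, avoids mismatches between sample.list and the VCF header.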

2.3 Visualization

Visualize the tree with MEGA or iTOL.

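Running VCF2Dis produces the tree file p_dis.nwk. Open this file in MEGA to display the phylogenetic tree; you can then view the neighbor-joining tree and save it as a PDF. If no NWK file was produced, the tree file can be obtained with one of the methods below.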

Method 1: an online tool

Method 2: build the NJ tree with PHYLIPNEW

PHYLIPNEW-3.69.650/bin/fneighbor -datafile p_dis.mat -outfile tree.out1.txt -matrixtype s -treetype n -outtreefile tree.out2.tre
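
Since fneighbor is called with -matrixtype s (square), it can help to confirm that the matrix really is square before building the tree. The check below is a rough sketch; `check_square_matrix` is a hypothetical helper, and it assumes a PHYLIP-style layout with the sample count on the first line, one row per line, and sample names without spaces.

```shell
# Sketch: verify that a PHYLIP-style distance matrix is square.
# Assumptions (not confirmed by the VCF2Dis docs): line 1 holds the number
# of samples N, every following line is "name d1 d2 ... dN", names have no spaces.
check_square_matrix() {
    awk '
    NR == 1 { n = $1; next }                       # sample count from the header line
    NF > 0  { rows++; if (NF != n + 1) bad++ }     # name plus N distances per row
    END {
        if (rows == n && bad == 0) { print "ok"; exit 0 }
        print "mismatch: header says " n ", found " rows " rows, " (bad + 0) " malformed"
        exit 1
    }
    ' "$1"
}

# Possible usage:
# check_square_matrix p_dis.mat
```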

Method 3: the R script shipped with VCF2Dis

Rscript  exemple/vistreecode.r    p_dis.mat

3 Example: NJ tree with bootstrap

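Build the NJ tree multiple times using sampling with replacement. In each run, -Rand 0.25 gives every site a probability of 0.25 of joining the calculation.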

for X in {1..20}; do
    # one resampled distance matrix per replicate
    VCF2Dis -InPut in.vcf.gz -OutPut p_dis_${X}.mat -Rand 0.25
    # one NJ tree per replicate
    PHYLIPNEW-3.69.650/bin/fneighbor -datafile p_dis_${X}.mat -outfile tree.out1_${X}.txt -matrixtype s -treetype n -outtreefile tree.out2_${X}.tre
done

Merge all the resampled NJ trees and construct the bootstrap NJ tree.

cat   tree.out2_*.tre   >  ALLtree_merge.tre

PHYLIPNEW-3.69.650/bin/fconsense -intreefile ALLtree_merge.tre -outfile out -treeprint Y

perl ./bin/percentageboostrapTree.pl ALLtree_merge.treefile NN Final_boostrap.tre
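
Before building the consensus, it is worth confirming that every replicate actually contributed a tree to the merged file. The snippet below is a small sketch of such a check; `count_trees` is a hypothetical helper, and it relies on each Newick tree ending with a single ";" (a quoted label containing ";" would be miscounted).

```shell
# Sketch: count the trees in a merged Newick file.
# Every Newick tree is terminated by one ";" so counting those characters
# gives the number of trees, even when a tree is wrapped over several lines.
count_trees() {
    tr -cd ';' < "$1" | wc -c | tr -d ' '
}

# Possible usage: the count should match the number of loop iterations (20 above).
# count_trees ALLtree_merge.tre
```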

4 Other calculations


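To create a new p_distance matrix, supply a VCF file. For details on the p_distance matrix, see **this website**. The p-distance between individuals is computed from the VCF SNP dataset; the genetic distance between sample i and sample j is calculated with the following formula: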

D_ij=(1/L) * [(sum(d(l)_ij))]

where L is the length of the regions in which SNPs can be identified and, given that the alleles at position l are A/C:

d(l)_ij=0.0 if the genotypes of the two individuals were AA and AA;
d(l)_ij=0.5 if the genotypes of the two individuals were AA and AC;
d(l)_ij=0.0 if the genotypes of the two individuals were AC and AC;
d(l)_ij=1.0 if the genotypes of the two individuals were AA and CC;
d(l)_ij=0.0 if the genotypes of the two individuals were CC and CC;
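
To make the formula concrete, here is a minimal sketch that applies it to a list of genotype pairs. It is an illustration of the table above, not VCF2Dis's actual implementation. `p_distance` is a hypothetical helper; genotypes are assumed to be VCF-style diploid calls (0/0 = AA, 0/1 = AC, 1/1 = CC), sites with a missing call are skipped, and the AC versus CC case, which the table does not list, is assumed to score 0.5 by symmetry with AA versus AC.

```shell
# Sketch: D_ij = (1/L) * sum(d(l)_ij) from lines of the form "GT_i GT_j".
# d(l)_ij is taken as |ALT count of i - ALT count of j| / 2, which reproduces
# the table: AA/AA=0, AA/AC=0.5, AC/AC=0, AA/CC=1, CC/CC=0.
p_distance() {
    awk '
    function alt_count(gt,   a) {            # number of ALT alleles in a diploid call
        split(gt, a, "[/|]")                 # accept unphased (/) and phased (|) calls
        return (a[1] != "0") + (a[2] != "0")
    }
    $1 ~ /\./ || $2 ~ /\./ { next }          # assumption: ignore sites with missing calls
    {
        diff = alt_count($1) - alt_count($2)
        if (diff < 0) diff = -diff
        sum += diff / 2                      # 0, 0.5 or 1 per site
        L++                                  # number of sites compared
    }
    END { if (L > 0) printf "%.4f\n", sum / L }
    '
}

# The five rows of the table as input: (0 + 0.5 + 0 + 1 + 0) / 5
printf '0/0 0/0\n0/0 0/1\n0/1 0/1\n0/0 1/1\n1/1 1/1\n' | p_distance   # prints 0.3000
```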