1. Introduction to Sprai
Sprai (single-pass read accuracy improver) is a tool that corrects sequencing errors in single-pass reads for de novo assembly. It was originally designed for single-molecule DNA sequencing reads, especially Continuous Long Reads (CLRs) generated by PacBio RS sequencers. Sprai does not aim to maximize the accuracy of the error-corrected reads; instead, it aims to maximize the continuity (i.e., N50 contig length) of the contigs assembled after error correction.
Official website: http://zombie.cb.k.u-tokyo.ac.jp/sprai/README.html#introduction
2. Installation
2.1 Software requirements
1. python 2.6 or newer
2. BLAST+ 2.2.27 or newer
3. Celera Assembler ver. 8.1 or newer (if you assemble reads after error-correction)
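Before installing, it can help to see which of these prerequisites are already on your PATH. The following is a minimal sketch; the command names python, blastn and runCA are assumptions about how the tools are exposed on your system, and the helper check_tool is a hypothetical name, not part of Sprai.

```shell
# Report whether a command is available on PATH.
# check_tool is a hypothetical helper, not part of Sprai.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "found: $1"
    else
        echo "missing: $1"
        return 1
    fi
}

# Assumed command names for Python, BLAST+ and Celera Assembler.
for tool in python blastn runCA; do
    check_tool "$tool" || true
done
```

This only checks presence. The versions still need to be compared against the minimums listed above, for example with python --version and blastn -version.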
2.2 Installation steps:
2.2.1 Installing Celera Assembler (CA):
CA download URL: https://sourceforge.net/projects/wgs-assembler/
bzip2 -dc wgs-8.3rc2.tar.bz2 | tar -xf -
cd wgs-8.3rc2
cd kmer
make install
cd ../src
make
cd ../..
2.2.2 Installing List-MoreUtils-0.415.tar.gz:
tar -xzvf List-MoreUtils-0.415.tar.gz
cd List-MoreUtils-0.415/
perl Makefile.PL
make
make install
2.2.3 Installing Exporter-Tiny-0.042.tar.gz (note: install this module first and only then the Statistics-Descriptive-3.0612.tar.gz module below, otherwise the installation fails)
tar -xzvf Exporter-Tiny-0.042.tar.gz
cd Exporter-Tiny-0.042/
perl Makefile.PL
make
make install
2.2.4 Installing Statistics-Descriptive-3.0612.tar.gz
tar -xzvf Statistics-Descriptive-3.0612.tar.gz
cd Statistics-Descriptive-3.0612/
perl Build.PL
./Build
./Build test
./Build install
2.2.5 Installing Sprai:
Sprai download URL: http://zombie.cb.k.u-tokyo.ac.jp/sprai/Download.html
tar -xzvf sprai-0.9.9.17.tar.gz
cd sprai-0.9.9.17/
./waf configure
./waf build
./waf install
3. Usage
3.1 The input must be subreads in FASTQ format. If your data are in .bas.h5 format, convert them with bash5tools.py, which is available from the PacBio GitHub (pbh5tools). Usage:
bash5tools.py --outFilePrefix example_output --readType subreads --outType fastq --minReadScore 0.75 example.bas.h5
If you have multiple subreads files, merge them all into a single FASTQ file and use that as the input. Note that the input FASTQ file must not be compressed.
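The merge step can be sketched as follows. merge_fastq is a hypothetical helper and the file names in the example are placeholders; the sketch assumes any compressed inputs are gzip files ending in .gz and writes a single uncompressed output.

```shell
# Merge several subread FASTQ files into one uncompressed FASTQ file.
# Inputs ending in .gz are decompressed on the fly.
# merge_fastq is a hypothetical helper, not part of Sprai.
merge_fastq() {
    out=$1
    shift
    : > "$out"                       # create or truncate the output file
    for f in "$@"; do
        case "$f" in
            *.gz) gzip -dc "$f" >> "$out" ;;
            *)    cat "$f" >> "$out" ;;
        esac
    done
}

# Example (placeholder file names):
# merge_fastq all.fq run1.subreads.fastq run2.subreads.fastq.gz
```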
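3.2 Create a working directory (mkdir tmp; cd tmp) and copy the pbasm.spec and ec.spec files from the Sprai directory into it.
3.3 Edit the configuration files
(1) ec.spec is the configuration file for Sprai. Edit it to match your data: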
#>- params -<#
input_fastq all.fq
estimated_genome_size 50000
estimated_depth 100
partition 12
evalue 1e-50
trim 42
ca_path /path/to/your/wgs/Linux-amd64/bin/
word_size 18
Parameter descriptions:
input_fastq is your input file name.
estimated_genome_size is the number of nucleotides in your target genome. If you do not know it, set a large number, for example 1e+12.
estimated_depth is the depth of coverage of input_fastq of your target. If you do not know it, set 0.
partition is the number of processors Sprai uses.
evalue is used by blastn.
trim is the number of nucleotides Sprai cuts from both ends of each alignment.
ca_path is the path to the directory where your wgs-assembler (Celera Assembler) binaries are installed.
word_size is used by blastn.
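If the genome size is roughly known, a value for estimated_depth can be approximated as total bases in the input divided by genome size. The sketch below assumes a plain 4-line-per-record FASTQ file (no wrapped sequence lines) and a genome size given as a plain integer; total_bases and estimate_depth are hypothetical helper names.

```shell
# Sum the lengths of all sequence lines (line 2 of every 4-line record).
# Assumes unwrapped, 4-line FASTQ records.
total_bases() {
    awk 'NR % 4 == 2 { sum += length($0) } END { printf "%.0f\n", sum }' "$1"
}

# Integer depth of coverage = total bases / genome size.
# genome_size must be a plain integer (e.g. 50000, not 5e4).
estimate_depth() {
    fastq=$1
    genome_size=$2
    echo $(( $(total_bases "$fastq") / genome_size ))
}

# Example (file name and genome size taken from the ec.spec above):
# estimate_depth all.fq 50000
```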
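(2) pbasm.spec is the configuration file for the assembler, Celera Assembler. If you only want to correct errors, this file is not needed. It sets the parameters used during assembly, including the number of CPUs to use.
3.4 Running Sprai:
(1) Error correction and assembly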
ezez_vx1.pl ec.spec pbasm.spec > log.txt 2>&1 &
(2) Error correction only
ezez_vx1.pl ec.spec -ec_only > log 2>&1 &
or simply:
ezez_vx1.pl ec.spec > log 2>&1 &
(3) Assembly only
ca_ikki_v5.pl pbasm.spec estimated_genome_size \
    -d directory in which fin.idfq.gzs exist \
    -ca_path /path/to/your/wgs/Linux-amd64/bin \
    -sprai_path the path to get_top_20x_fa.pl installed
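3.5 Output files
(1) Step 1, error correction: Sprai writes a directory named result_yyyymmdd_hhmmss; the corrected reads are in the file c01.fin.idfq.gz.
(2) Step 2, assembly: the resulting contig file is ./CA/9-terminator/asm.ctg.fasta
(3) The assembly statistics are in the file CA/do_*_c01.fin.top20x.log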
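When Sprai has been run several times in the same working directory, the outputs can be located along these lines. This is a sketch under the naming conventions described above; latest_result_dir and count_contigs are hypothetical helper names.

```shell
# Print the most recent result_yyyymmdd_hhmmss directory under a run directory.
# The timestamp format sorts chronologically as plain text.
latest_result_dir() {
    ls -d "$1"/result_* 2>/dev/null | sort | tail -n 1
}

# Count the contigs (FASTA header lines) in an assembled contig file.
count_contigs() {
    grep -c '^>' "$1"
}

# Example, run from the working directory:
# ls -l "$(latest_result_dir .)/c01.fin.idfq.gz"
# count_contigs ./CA/9-terminator/asm.ctg.fasta
```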
4. Problems encountered during installation
4.1 The /usr/bin/time command cannot be found
Solution:
a. Edit the Sprai scripts and replace /usr/bin/time with time
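The replacement can be done with sed rather than by hand. The following is a sketch, not the author's exact procedure; patch_time_path is a hypothetical helper, and which Sprai files contain the hard-coded path is something to confirm with grep first.

```shell
# Replace the hard-coded /usr/bin/time with plain "time" in the given files,
# keeping a .bak copy of each original.
# patch_time_path is a hypothetical helper, not part of Sprai.
patch_time_path() {
    for f in "$@"; do
        sed -i.bak 's#/usr/bin/time#time#g' "$f"
    done
}

# List candidate files first (the directory name is the one used in 2.2.5):
# grep -rl '/usr/bin/time' sprai-0.9.9.17/
```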
4.2 Sprai fails at run time with "set Illegal option -o pipefail"
Solution:
Check what sh points to. If it is not /bin/bash, apply the change in step (2).
(1) $ ls -al /bin/sh
(2) Relink /bin/sh directly so that it points to /bin/bash:
$ sudo ln -fs /bin/bash /bin/sh
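Since relinking /bin/sh is a system-wide change, it may be worth confirming first that the current sh really rejects the option. A minimal sketch, with supports_pipefail as a hypothetical helper name:

```shell
# Succeeds if the given shell accepts "set -o pipefail".
# supports_pipefail is a hypothetical helper, not part of Sprai.
supports_pipefail() {
    "$1" -c 'set -o pipefail' >/dev/null 2>&1
}

for sh_path in /bin/sh /bin/bash; do
    if supports_pipefail "$sh_path"; then
        echo "$sh_path: pipefail supported"
    else
        echo "$sh_path: pipefail NOT supported"
    fi
done
```

If /bin/sh is reported as not supporting the option while /bin/bash does, the relinking step above applies.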