A Statistical Approach Designed for Finding Mathematically Defined Repeats in Shotgun Data and Determining the Length Distribution of Clone-Inserts

(Whole-issue priority) Online publication date: 2003-01-11
The large number of repeats, especially high-copy repeats, in the genomes of higher animals and plants makes whole genome assembly (WGA) quite difficult. To address this problem, we tried to identify repeats and mask them prior to assembly, even at the stage of genome survey. Repeats of different copy number are known to have different probabilities of appearance in shotgun data; based on this principle, we constructed a statistical model and inferred criteria for mathematically defined repeats (MDRs) at different shotgun coverages. According to these criteria, we developed the software MDRmasker to identify and mask MDRs in shotgun data. With repeats masked prior to assembly, the speed of assembly was increased with lower error probability. In addition, clone-insert size affects the accuracy of repeat assembly and scaffold construction. We also designed the length distribution of clone-inserts using our model. In our simulated genomes of human and rice, the length distributions of repeats are different, so their optimal length distributions of clone-inserts were not the same. Thus, with an optimal length distribution of clone-inserts, a given genome could be assembled better at lower coverage.
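The idea that repeats of different copy number appear in shotgun data with different probabilities can be illustrated with a minimal sketch. It is not the MDRmasker implementation. It assumes, purely for illustration, that the number of times a unique-sequence word is sampled follows a Poisson distribution whose mean equals the shotgun coverage, and it treats any word seen more often than is plausible under that assumption as a repeat. The function names, the significance level `alpha`, the word length `k`, and the mask character `X` are all hypothetical, and reverse-complement strands are ignored for brevity.

```python
import math
from collections import Counter


def poisson_sf(t, lam):
    """Return P(X >= t) for X ~ Poisson(lam)."""
    cdf = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(t))
    return max(0.0, 1.0 - cdf)


def mdr_threshold(coverage, alpha=0.001):
    """Smallest count t such that a unique word reaches t with probability < alpha.

    Assumption: counts of a single-copy word are Poisson with mean `coverage`.
    """
    t = 1
    while poisson_sf(t, coverage) >= alpha:
        t += 1
    return t


def mask_repeats(reads, k, threshold):
    """Mask every k-mer whose count across all reads is >= threshold."""
    counts = Counter(r[i:i + k] for r in reads for i in range(len(r) - k + 1))
    masked = []
    for r in reads:
        chars = list(r)
        for i in range(len(r) - k + 1):
            # Look up the k-mer in the original read, not the masked copy.
            if counts[r[i:i + k]] >= threshold:
                chars[i:i + k] = "X" * k
        masked.append("".join(chars))
    return masked


if __name__ == "__main__":
    # The criterion becomes stricter (a higher count is required) as coverage grows.
    for c in (0.5, 1.0, 2.0, 5.0):
        print(c, mdr_threshold(c))
    print(mask_repeats(["ACGTACGT", "ACGTTTTT"], k=4, threshold=3))
```

Under this assumption the count threshold rises with coverage, which is consistent with the abstract's point that the criteria for MDRs must be inferred separately for each shotgun coverage.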