

Journal of Theoretical Biology 247 (2007) 687–694

www.elsevier.com/locate/yjtbi

Prediction of protein coding regions by the 3-base periodicity analysis of a DNA sequence

Changchuan Yin, Stephen S.-T. Yau*

Department of Mathematics, Statistics and Computer Science, The University of Illinois at Chicago, M/C 249, Chicago, IL 60607-7045, USA

Received 22 October 2006; received in revised form 24 March 2007; accepted 26 March 2007

Available online 10 April 2007

Abstract

With the exponential growth of genomic sequences, there is an increasing demand to accurately identify protein coding regions (exons) from genomic sequences. Despite much progress in the identification of protein coding regions by computational methods during the last two decades, the performance and efficiency of the prediction methods still need to be improved. In addition, it is indispensable to develop different prediction methods, since combining different methods may greatly improve the prediction accuracy. A new method to predict protein coding regions is developed in this paper based on the fact that most exon sequences have a 3-base periodicity, while intron sequences do not have this unique feature. The method computes the 3-base periodicity and the background noise of the stepwise DNA segments of the target DNA sequences using nucleotide distributions in the three codon positions of the DNA sequences. Exon and intron sequences can be identified from trends of the ratio of the 3-base periodicity to the background noise in the DNA sequences. Case studies on genes from different organisms show that this method is an effective approach for exon prediction.

© 2007 Elsevier Ltd. All rights reserved.

Keywords: Exon; Intron; 3-Base periodicity; Fourier transform

1. Introduction

An important step in genomic annotation is to identify protein coding regions of genomic sequences, which is a challenging problem, especially in the study of eukaryote genomes. In a eukaryote genome, protein coding regions (exons) are usually not continuous, but are flanked by noncoding regions (introns). Because no obvious sequence features separate exons from introns, effectively distinguishing protein coding regions from noncoding regions is a challenging problem in bioinformatics.

During the last two decades, a variety of computational algorithms have been developed to predict exons (for reviews, see Fickett and Tung, 1992; Fickett, 1996; Zhang, 2002; Mathé et al., 2002). Most of the exon-finding algorithms are based on statistical methods, which usually use training data sets from known exon and intron sequences to compute prediction functions. For example, the GenScan algorithm (Burge and Karlin, 1997) measures distinct statistical features of exons and introns within genomes and employs them in prediction via a hidden Markov model (HMM); the MZEF method (Zhang, 1997) was developed for predicting protein coding regions using quadratic discriminant analysis of different sequence characteristics of exons and introns. As combining different gene prediction methods may greatly increase the accuracy of the prediction, the development of different effective gene prediction algorithms is one of the fundamental efforts in gene prediction study.

* Corresponding author. Tel./fax: +1 312 996 3065.

E-mail address: yau@uic.edu (S.S.-T. Yau).

0022-5193/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.jtbi.2007.03.038

In recent years, signal processing approaches have attracted significant attention in genomic DNA research and have become increasingly important for elucidating genome structures, because they may identify hidden periodicities and features that cannot be revealed easily by conventional statistical methods. After converting symbolic DNA sequences to numerical sequences, signal processing tools, typically the discrete Fourier transform (DFT) or wavelet analysis, can be applied to the numerical vectors to study the frequency domain of the sequences (Anastassiou, 2000; Wang and Johnson, 2002; Kauer and Blöcker, 2003; Vaidyanathan and Yoon, 2004). Using signal processing methods, a variety of gene prediction algorithms have been developed (Tiwari et al., 1997; Anastassiou, 2000; Kotlar and Lavner, 2003; Jin, 2004; Gao et al., 2005). Tiwari et al. (1997) explored the measure of spectral content (SC) in DNA sequences based on the fact that the 3-base periodicity, identified as a pronounced peak at the frequency N/3 of the Fourier power spectrum of the DNA sequences (N is the length of the DNA sequence), is prevalent in most protein coding regions, but does not exist in noncoding regions (Tsonis et al., 1991; Voss, 1992; Chechetkin and Turygin, 1995; Dodin et al., 2000). Anastassiou (2000) presented an optimized SC measure of DNA sequences for gene prediction. Kotlar and Lavner (2003) utilized a spectral rotation measure based on the arguments of the DFT to develop a novel gene prediction algorithm, which was later improved by Jin (2004). Gao et al. (2005) combined the 3-base periodicity and the fractal features of DNA sequences to improve gene prediction methods.

Most DFT-based gene-finding algorithms use a short-sequence window approach (Tiwari et al., 1997; Yan et al., 1998; Anastassiou, 2000), in which a fixed-length window is slid along a DNA sequence to compute the Fourier power spectrum. However, this approach has limitations. A small window causes more statistical oscillation, resulting in more prediction errors, whereas a large window may miss small exons or introns. The arbitrary choice of window size makes the short-sequence window Fourier technique subject to bias. Furthermore, the short-sequence window Fourier transform requires much longer CPU time. This becomes a challenging problem when finding genes in whole genomes, as direct computation of Fourier transforms is time consuming.

It was demonstrated that the 3-base periodicity in a DNA sequence is partly caused by the unbalanced nucleotide distributions in the three codon positions in the sequence (Fickett, 1982; Fickett and Tung, 1992; Tiwari et al., 1997; Yin and Yau, 2005). In an exon sequence, the nucleotide distribution in the three codon positions is unbalanced, while in an intron sequence, the nucleotides are distributed uniformly over the three codon positions. The reason for the unbalanced distribution is that proteins prefer special amino acid compositions, and thus nucleotide usage in a coding region is highly biased (Fickett and Tung, 1992; Tiwari et al., 1997; Yin and Yau, 2005). This paper presents an extension of the current gene prediction algorithms (Tiwari et al., 1997; Anastassiou, 2000), called the EPND method (exon prediction via nucleotide distributions), which is based on the peak at the frequency N/3 of the DFT and the frequencies of the nucleotides in the three codon positions (position asymmetry measure) within known genes. The algorithm is tested in this paper for identifying exons/introns within known genes from several organisms. Case studies indicate that the method described in this paper is an effective protein coding region prediction method in terms of accuracy and efficiency.

2. Methods and algorithms

2.1. Fourier spectrum analysis of DNA sequences

A symbolic DNA sequence, denoted as $x(0), x(1), \ldots, x(N-1)$, is first converted to four binary indicator sequences, $u_A(n)$, $u_T(n)$, $u_C(n)$, and $u_G(n)$, which indicate the presence or absence of the four nucleotides A, T, C, and G at the $n$th position, respectively (Voss, 1992; Tiwari et al., 1997; Anastassiou, 2000). For instance, the indicator sequence $u_A(n) = 0001010111\ldots$ indicates that the nucleotide A is in positions 4, 6, 8, 9, and 10 of the DNA sequence.

The DFT converts a signal in the signal domain to a set of new values in the frequency domain. For a signal of length $N$, $f(n)$, $n = 0, 1, \ldots, N-1$, its DFT is defined as follows:

$$F(k) = \sum_{n=0}^{N-1} f(n)\, e^{-i \frac{2\pi}{N} n k}, \tag{2.1}$$

where $i = \sqrt{-1}$. The DFT power spectrum of a signal at the frequency $k$ is defined as:

$$PS(k) = |F(k)|^2, \quad k = 0, 1, 2, \ldots, N-1, \tag{2.2}$$

where $F(k)$ is the $k$th DFT coefficient.

The DFT power spectrum of a DNA sequence is the sum of the power spectra of its four binary indicator sequences (Silverman and Linsker, 1986; Tiwari et al., 1997; Anastassiou, 2000):

$$PS(k) = PS_A(k) + PS_T(k) + PS_C(k) + PS_G(k), \tag{2.3}$$

where $PS_A(k)$, $PS_T(k)$, $PS_C(k)$ and $PS_G(k)$ are the Fourier power spectra of the four indicator sequences $u_A(n)$, $u_T(n)$, $u_C(n)$ and $u_G(n)$, respectively. Due to the symmetry property of the DFT spectrum of real-number signals, the figures in this paper only show half of the Fourier spectrum of DNA sequences.
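The definitions in Eqs. (2.1)-(2.3) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `indicator`, `dft` and `power_spectrum` and the toy sequence are assumptions made for illustration, and the DFT is evaluated directly in O(N^2) time, which is only practical for short sequences.

```python
import cmath

BASES = "ATCG"

def indicator(seq, base):
    """Binary indicator sequence u_base(n): 1 where seq[n] == base, else 0."""
    return [1 if s == base else 0 for s in seq]

def dft(signal):
    """Direct O(N^2) evaluation of Eq. (2.1)."""
    n_len = len(signal)
    return [
        sum(f * cmath.exp(-2j * cmath.pi * n * k / n_len)
            for n, f in enumerate(signal))
        for k in range(n_len)
    ]

def power_spectrum(seq):
    """Eq. (2.3): sum of |F(k)|^2 over the four indicator sequences."""
    total = [0.0] * len(seq)
    for base in BASES:
        for k, coeff in enumerate(dft(indicator(seq, base))):
            total[k] += abs(coeff) ** 2
    return total

# Hypothetical, perfectly periodic toy sequence of length N = 30.
toy = "ATG" * 10
ps = power_spectrum(toy)

# Location of the largest peak among the non-zero frequencies in the
# first half of the spectrum; for this toy input it falls at k = N/3.
peak = max(range(1, len(toy) // 2 + 1), key=lambda k: ps[k])
print(peak, len(toy) // 3)
```

For a real genomic sequence the peak at N/3 is far less clean than in this toy case, and a fast Fourier transform would normally replace the direct sum.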

2.2. Computing the 3-base periodicity and background noise from nucleotide distributions of a DNA sequence

The asymmetry in the nucleotide distributions in the three codon positions and its connection to the DFT peak at N/3 in coding regions were addressed by Fickett (Fickett, 1982; Fickett and Tung, 1992). The 3-base periodicity magnitude and background noise can be directly computed from the nucleotide distributions (Fickett and Tung, 1992; Yin and Yau, 2005). Let $F_{x1}, F_{x2}, F_{x3}$ be the occurrence frequencies of the nucleotide $x \in \{A, T, C, G\}$ in the first, the second and the third codon positions, respectively. Then the 3-base periodicity magnitude can be computed as follows:

$$PS(N/3) = \sum_{x = A,T,C,G} \left[ F_{x1}^2 + F_{x2}^2 + F_{x3}^2 - \left( F_{x1} F_{x2} + F_{x1} F_{x3} + F_{x2} F_{x3} \right) \right]. \tag{2.4}$$
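As a rough check of Eq. (2.4), the sketch below computes the 3-base periodicity once from the codon-position counts and once from the DFT coefficient at k = N/3, and the two values should agree up to rounding. It assumes that the occurrence frequencies $F_{xi}$ are raw counts and that the input contains only A, T, C and G; the function names and the test fragment are hypothetical.

```python
import cmath

BASES = "ATCG"

def codon_position_counts(seq):
    """F_x1, F_x2, F_x3: counts of each base at codon positions 1, 2, 3.

    Assumes seq contains only A, T, C, G (other symbols raise KeyError).
    """
    counts = {b: [0, 0, 0] for b in BASES}
    for n, base in enumerate(seq):
        counts[base][n % 3] += 1
    return counts

def periodicity_from_counts(seq):
    """Eq. (2.4), evaluated from the codon-position counts."""
    return sum(
        f1 * f1 + f2 * f2 + f3 * f3 - (f1 * f2 + f1 * f3 + f2 * f3)
        for f1, f2, f3 in codon_position_counts(seq).values()
    )

def periodicity_from_dft(seq):
    """Same quantity from the indicator DFT at k = N/3 (Eqs. 2.1-2.3).

    At k = N/3 the exponent -2*pi*i*n*k/N reduces to -2*pi*i*n/3.
    """
    total = 0.0
    for base in BASES:
        coeff = sum(
            cmath.exp(-2j * cmath.pi * n / 3)
            for n, s in enumerate(seq) if s == base
        )
        total += abs(coeff) ** 2
    return total

# Hypothetical 21-base fragment (length divisible by 3).
fragment = "ATGGCCATTGTAATGGGCCGC"
from_counts = periodicity_from_counts(fragment)
from_dft = periodicity_from_dft(fragment)
print(from_counts, round(from_dft, 6))
```

The count-based route needs a single pass over the sequence and no complex arithmetic, which is what makes a stepwise evaluation over growing prefixes inexpensive.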


The background noise of a DNA sequence of length $N$, represented as the average power spectrum $E$ over all the frequencies, is determined mainly by the length of the DNA sequence (Yin and Yau, 2005). Thus, the ratio of the 3-base periodicity signal to the background noise of a DNA sequence, denoted as $SN(N)$, is defined as follows:

$$SN(N) = \frac{PS(N/3)}{N}. \tag{2.5}$$

$SN(N)$ can be interpreted as the strength of the 3-base periodicity per nucleotide. Based on computational simulation of computer-generated sequences, and verified with 12 exons/introns from yeast and C. elegans, it was shown that $SN(N)$ is equal to or larger than 2 for most exon sequences (Yin and Yau, 2005; also refer to Fig. 3), while it is less than 2 for most intron sequences. The threshold value of the signal-to-noise ratio is set to 2 in the gene-finding algorithm in this paper.
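One way to see why the background noise depends on the sequence length is Parseval's identity applied to the indicator sequences. The short derivation below is a sketch under two assumptions that the text does not state explicitly: every position holds exactly one of A, T, C, G, and $E$ averages $PS(k)$ over all $N$ frequencies $k = 0, \ldots, N-1$. The symbol $N_x$ (the number of occurrences of nucleotide $x$) is introduced here only for illustration.

```latex
% Assumptions: u_x(n) is 0 or 1, so u_x(n)^2 = u_x(n);
% N_x = number of occurrences of nucleotide x; sum_x N_x = N.
\begin{align*}
\sum_{k=0}^{N-1} PS_x(k)
  &= N \sum_{n=0}^{N-1} u_x(n)^2
   = N \sum_{n=0}^{N-1} u_x(n)
   = N\,N_x
  && \text{(Parseval)} \\
E = \frac{1}{N} \sum_{k=0}^{N-1} PS(k)
  &= \frac{1}{N} \sum_{x \in \{A,T,C,G\}} N\,N_x
   = \sum_{x \in \{A,T,C,G\}} N_x
   = N \\
SN(N) = \frac{PS(N/3)}{E}
  &= \frac{PS(N/3)}{N}
\end{align*}
% Worked case: for the repeat (ATG)^m, N = 3m and Eq. (2.4) with raw
% counts gives PS(N/3) = 3m^2, hence SN(N) = m.
```

Under these assumptions a perfectly periodic 90-base repeat of ATG has $SN = 30$, well above the threshold of 2, while a homopolymer has $PS(N/3) = 0$ and therefore $SN = 0$.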

2.3. Algorithm for exon prediction by nucleotide distribution (EPND)

For a DNA sequence of length $N$, let $D_k$ denote the DNA walk sequence of length $k$, i.e., $D_k$ is the sub-region of the DNA sequence ranging from the beginning to position $k$. To find exon regions and intron regions within the given DNA sequence, the EPND algorithm is developed as follows; the flowchart of the algorithm is shown in Fig. 1.

1. Set $k = 1$.

2. Compute the nucleotide distributions $F_{xi}$ ($x \in \{A, T, C, G\}$, $i \in \{1, 2, 3\}$) of $D_k$ in the three codon positions. The nucleotide distribution of a DNA walk sequence of length $k$ can be obtained recursively from that of the DNA walk sequence of length $k-1$ by adding the nucleotide occurring at position $k$.

3. Compute the magnitude of the 3-base periodicity $PS(k/3)$ in $D_k$ based on formula (2.4).

4. Compute the ratio of the 3-base periodicity to the background noise, $SN(k) = PS(k/3)/k$, within the DNA sequence $D_k$.

5. Increase $k$ by 1 and repeat steps 2 to 4 until $k = N$.

6. Compute the slope of SN at each position on the SN plot as follows: since most exon or intron sequences in a genome are longer than 50 base pairs, the slope at the $i$th position is computed as $(SN(i) - SN(i-50))/50$, where $i$ runs from 51 to $N$.

7. Assign the nucleotide at each position to an exon or intron region as follows: if the slope at the position is larger than 0 and SN is larger than or equal to 2, mark the nucleotide at the position as an exon nucleotide; otherwise, mark it as an intron nucleotide.

8. Reduce local noise. If a DNA region of less than 50 base pairs is identified as an intron in step 7 and is flanked by two exon regions, this region is often a false negative and is reset as an exon region; similarly, if a DNA region of less than 50 base pairs is identified as an exon in step 7 and is flanked by two intron regions, this region is often a false positive and is reset as an intron region.
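Steps 1 to 8 can be sketched in Python as follows. This is an illustrative reading of the description above, not the authors' code. Several details are assumptions: the first 50 positions, for which no slope is defined, are labelled as intron; symbols other than A, T, C, G are skipped when counting; and in step 8 short intron runs are corrected before short exon runs. All function names are hypothetical.

```python
from itertools import groupby

BASES = "ATCG"

def sn_profile(seq):
    """Steps 1-5: SN(k) = PS(k/3)/k for every prefix D_k, k = 1..N."""
    counts = {b: [0, 0, 0] for b in BASES}
    profile = []
    for k, base in enumerate(seq.upper(), start=1):
        if base in counts:  # assumption: ambiguous symbols are skipped
            counts[base][(k - 1) % 3] += 1
        ps = sum(
            f1 * f1 + f2 * f2 + f3 * f3 - (f1 * f2 + f1 * f3 + f2 * f3)
            for f1, f2, f3 in counts.values()
        )
        profile.append(ps / k)
    return profile

def flip_short_runs(labels, value, min_len):
    """Step 8: flip runs of `value` shorter than min_len that have a
    neighbouring run on both sides (i.e. are flanked by the other label)."""
    out = list(labels)
    runs = [(v, len(list(g))) for v, g in groupby(labels)]
    pos = 0
    for idx, (v, length) in enumerate(runs):
        interior = 0 < idx < len(runs) - 1
        if v == value and length < min_len and interior:
            out[pos:pos + length] = [not value] * length
        pos += length
    return out

def epnd(seq, lag=50, threshold=2.0):
    """Return one boolean per nucleotide: True = exon, False = intron."""
    sn = sn_profile(seq)
    labels = [False] * len(seq)  # assumption: positions 1..lag are intron
    for i in range(lag, len(seq)):  # 0-based i is position i + 1 = 51..N
        slope = (sn[i] - sn[i - lag]) / lag          # step 6
        labels[i] = slope > 0 and sn[i] >= threshold  # step 7
    labels = flip_short_runs(labels, False, lag)  # short introns -> exon
    labels = flip_short_runs(labels, True, lag)   # short exons -> intron
    return labels

# Hypothetical, perfectly periodic input; real genes are far noisier.
toy_labels = epnd("ATG" * 100)
print(sum(toy_labels), len(toy_labels))
```

Because the codon-position counts are updated incrementally, the whole SN profile is obtained in a single pass over the sequence, without recomputing a Fourier transform for each prefix.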

2.4. Improving prediction accuracy using different starting points

For a long DNA sequence that may contain more than two exons (or two introns), such as exon–intron–exon, the accumulated signal-to-noise ratio of the last exon will become low, especially when the intron in between is long, which may affect the accuracy of the prediction. The algorithm can be improved by dividing a DNA sequence into sub-regions. In addition, to reduce false exons and false introns, we apply the algorithm at different arbitrary starting points so that each nucleotide may be tested multiple times. The following algorithm is developed to improve exon prediction accuracy when using the EPND method:

1. If a DNA sequence is longer than 2000 base pairs (bp), divide it into sub-sequences of 2000 base pairs.

2. For each 2000 base pair sub-sequence, let P1 = 1, P2 = 401, P3 = 801, P4 = 1201, P5 = 1601 and P6 = 2000 be six evenly spaced points.

3. Identify exon or intron nucleotides using the EPND method on the sub-sequence between points Pi and P6, where i = 1, 2, 3, 4, 5. Thus each nucleotide after point P3 is tested at least three times using the EPND method from different start points. A nucleotide is identified as an exon nucleotide when it is predicted to be in an exon region in the majority of the tests.
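The voting scheme above can be sketched as a small wrapper around any per-nucleotide predictor. The sketch assumes a callable `predict(subsequence)` that returns one boolean per nucleotide (True for exon); `majority_vote`, `toy_predict`, the handling of a final sub-sequence shorter than 2000 bp, and the rule that ties count as intron are all assumptions rather than details given in the text.

```python
def majority_vote(seq, predict, chunk=2000, step=400):
    """Label each nucleotide by majority vote over several start points.

    Within every `chunk`-long sub-sequence the predictor is run from the
    0-based offsets 0, 400, 800, 1200, 1600 (points P1..P5) to the end of
    the sub-sequence (P6). A nucleotide is an exon when more than half of
    the runs covering it say so (assumption: ties count as intron).
    """
    labels = []
    for offset in range(0, len(seq), chunk):
        sub = seq[offset:offset + chunk]
        exon_votes = [0] * len(sub)
        tests = [0] * len(sub)
        for start in range(0, len(sub), step):
            for j, is_exon in enumerate(predict(sub[start:])):
                tests[start + j] += 1
                exon_votes[start + j] += bool(is_exon)
        labels.extend(2 * v > t for v, t in zip(exon_votes, tests))
    return labels

def toy_predict(sub):
    """Hypothetical stand-in predictor: everything is exon except the
    first 50 positions, which mimics a predictor with a 50-base warm-up."""
    return [j >= 50 for j in range(len(sub))]

voted = majority_vote("A" * 2000, toy_predict)
print(sum(voted), len(voted))
```

With five start points, nucleotides in the last three fifths of a sub-sequence are covered by at least three runs, so the warm-up region of any single run can be outvoted there, whereas the first fifth is still judged by one run only.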

2.5. Database and measures for performance evaluation

The data set used for the evaluation of the performance of the EPND method is Xpro (Gopalan et al., 2004), which contains the eukaryotic protein coding DNA sequences from GenBank release 139. The data set was downloaded from the Xpro website as flat files (Xpro version: v.1.2, 2004, http://origin.bic.nus.edu.sg/xpro). One file, exonse—intron—139.gz, contains protein coding regions (exons), and the other file, intronseq—intron—139.gz, contains non-protein coding regions (introns). Both files consist of DNA sequences and the corresponding header information, which indicates the gene locus, the organism that the genes belong to, intron lengths and their corresponding positions within the genes. Based on the intron positions in the header sections, introns are joined with the corresponding exons to form a full original gene structure beginning with a start codon and ending with a stop codon. The full-length genes are used in this study to test algorithm performance.

The performance of the EPND algorithm is measured in terms of sensitivity, specificity and accuracy, which are defined in the literature as follows (Burset and Guigo, 1996). The sensitivity is $Sn = TP/(TP + FN)$ and the specificity is $Sp = TN/(TN + FP)$, where TP is the true positive, which is the length of nucleotides of correctly

