Bioinformatics - protocols

To locate the required tools, look for a purple E in the list of links, or use the shortcuts provided.

Construction and interpretation of a protein sequence alignment

Below are the sequences of several known FH2 domains of fungal and metazoan origin. Align them using CMBI CLUSTAL W or NPS@ClustalW. Use the fast algorithm with default parameters (in the CMBI version, you may leave out the phylogenetic tree option to speed up the analysis). Note the format of the sequences: this is the FASTA format, in which each record starts with a header line beginning with ">", followed by one or more lines of sequence.

>p140mDia
YKPEVQLRRPNWSKFVAEDLSQDCFWTKVKEDRFENNELFAKLTLAFSAQTKTSKAKKDQEGGEEKKSVQ
KKKVKELKVLDSKTAQNLSIFLGSFRMPYQEIKNVILEVNEAVLTESMIQNLIKQMPEPEQLKMLSELKE
EYDDLAESEQFGVVMGTVPRLRPRLNAILFKLQFSEQVENIKPEIVSVTAACEELRKSENFSSLLELTLL
VGNYMNAGSRNAGAFGFNISFLCKLRDTKSADQKMTLLHFLAELCENDHPEVLKFPDELAHVEKASRVSA
ENLQKSLDQMKKQIADVERDVQNFPAATDEKDKFVEKMTSFVKDAQEQYNKLRMMHSNMETLYKELGDYF
VFDPKKLSVEEFFMDLHNFRNMFLQAVKENQKRRETEEKMRRAKLAKEKAEKERLEKQQKREQLIDMNAE
GDETGVMDSLLEALQSGAAFRRKR
>Diaphanous Drosophila
WDVKNPMKRANWKAIVPAKMSDKAFWVKCQEDKLAQDDFLAELAVKFSSKPVKKEQKDAVDKPTTLTKKN
VDLRVLDSKTAQNLAIMLGGSLKHLSYEQIKICLLRCDTDILSSNILQQLIQYLPPPEQLKRLQEIKAKG
EPLPPIEQFAATIGEIKRLSPRLHNLNFKLTYADMVQDIKPDIVAGTAACEEIRNSKKFSKILELILLLG
NYMNSGSKNEAAFGFEISYLTKLSNTKDADNKQTLLHYLADLVEKKFPDALNFYDDLSHVNKASRVNMDA
IQKAMRQMNSAVKNLETDLQNNKVPQCDDDKFSEVMGKFAEECRQQVDVLGKMQLQMEKLYKDLSEYYAF
DPSKYTMEEFFADIKTFKDAFQAAHNDNVRVREELEKKRRLQEAREQSAREQQERQQRKKAVVDMDAPQT
QEGVMDSLLEALQTGSAFGQRNRQARRQRPAGAERRAQLSRSRSRTRVTNGQLMTREMILNEVLGSA
>Fugu formin
IKTKFRLPVFNWTALKPNQINGTVFNEIDDERELELERFEELFKTRAQGPIMDLSCTKSKVAQKAVNKVT
ILDANRSKNLAITLRKANKTFDLKTLPVDFVECLMRFLPTEMEVKALRQYERERRPLDQLAEEDRFMLLF
SKIERLTQRMNIITFIGNFSDNVAMLTPQLNAIIAASASVKSSPKLKRMLEIILALGNYMNSSKRGCVYG
FKLQSLDLLLDTKSTDRKMTLLHYIALIVKEKYPELANFYNELHFVDKAAAVSLENVLLDVRELGKGMDL
IRRECSLHDHSVLKGFLQASDTQLDKVQRDAKTAEEAFNNVVNYFGESAKTAPPSVFFPVFVRFLKAYKD
AVEENELRKKQEQAMREKLLAEEAKQQDPKVQAQKKRQQQHELIAELRKRQAKDHRPVYEGKDGTIEDII
TVLK
>Cappuccino Drosophila
PPTAPPATKEIWTEIEETPLDNIDEFTELFSRQAIAPVSKPKELKVKRAKSIKVLDPERSRNVGIIWRSL
HVPSSEIEHAIYHIDTSVVSLEALQHMSNIQATEDELQRIKEAAGGDIPLDHPEQFLLDISLISMASERI
SCIVFQAEFEESVTLLFRKLETVSQLSQQLIESEDLKLVFSIILTLGNYMNGGNRQRGQADGFNLDILGK
LKDVKSKESHTTLLHFIVRTYIAQRRKEGVHPLEIRLPIPEPADVERAAQMDFEEVQQQIFDLNKKFLGC
KRTTAKVLAASRPEIMEPFKSKMEEFVEGADKSMAKLHQSLDECRDLFLETMRFYHFSPKACTLTLAQCT
PDQFFEYWTNFTNDFKDIWKKEITSLLNELMKKSKQAQIESRRNVSTKVEKSGRISLKERMLMRRSKN
>Bni1 yeast
PRPHKKLKQLHWEKLDCTDNSIWGTGKAEKFADDLYEKGVLADLEKAFA
AREIKSLASKRKEDLQKITFLSRDISQQFGINLHMYSSLSVADLVKKILN
CDRDFLQTPSVVEFLSKSEIIEVSVNLARNYAPYSTDWEGVRNLEDAKPP
EKDPNDLQRADQIYLQLMVNLESYWGSRMRALTVVTSYEREYNELLAKLR
KVDKAVSALQESDNLRNVFNVILAVGNFMNDTSKQAQGFKLSTLQRLTFI
KDTTNSMTFLNYVEKIVRLNYPSFNDFLSELEPVLDVVKVSIEQLVNDCK
DFSQSIVNVERSVEIGNLSDSSKFHPLDKVLIKTLPVLPEARKKGDLLED
EVKLTIMEFESLMHTYGEDSGDKFAKISFFKKFADFINEYKKAQAQNLAA
EEEERLYIKHKKIVEEQQKRAQEKEKQKENSNSPSSEGNEEDEAEDRRAV
MDKLLEQLKNA
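
If you want to check the sequences before submitting them, the FASTA records above can be read with a few lines of Python. This is only a minimal sketch: the function name parse_fasta is chosen here for illustration, and the example input contains just the first lines of two of the records, truncated for brevity.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence lines are joined without breaks
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Truncated example input: only the first lines of two records from above.
example = """>p140mDia
YKPEVQLRRPNWSKFVAEDLSQDCFWTKVKEDRFENNELFAKLTLAFSAQTKTSKAKKDQEGGEEKKSVQ
KKKVKELKVLDSKTAQNLSIFLGSFRMPYQEIKNVILEVNEAVLTESMIQNLIKQMPEPEQLKMLSELKE
>Bni1 yeast
PRPHKKLKQLHWEKLDCTDNSIWGTGKAEKFADDLYEKGVLADLEKAFA
"""

for name, seq in parse_fasta(example):
    print(name, len(seq))
```

Printing the name and length of each record is a quick way to confirm that no header was merged into a sequence when the text was pasted.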
 

Sequence similarity searches and domain structure analysis

Basic sequence manipulation

The following DNA sequence corresponds to one of the A. thaliana genomic loci found in the previous exercise. The sequence was assembled from two files, each in a different format, so you have to convert it to FASTA format first.

   1 TTTAATAAAA TAAAAATCCA CTCGCATTTT TATTTTCAAC ATTGTGCGTA    50
  51 CGGTGCAATT CAATGAACAG TGTTTACTTT CAGTGTGTAC ACTTCTGCGG   100
 101 ACTATTACAA AGTCCACGTC TTATCCTACG TGTTATAATC TCATATGTTA   150
 151 CTGTCTGAAA TGGACCCCAC TACGTAAAAA TAAAATTAAG AATCAACCAC   200
 201 TCTTCTTCCA TCACCTCTTT TGGCTTTCTC TCTACTCTCT CTACTACTCT   250
 251 CTCACCATCA CTGAGTTAAG AGAACAAACC AAAAACAAAA TTATCAAACC   300
 301 ATCACCAGCA GAATCTTAGC TGGATTCATC ACTCTATTCA AAAAGTTTCT   350
 351 CTCTTCTCTT TTCTCAGATC TTGAACTCTT GAAGAAGAAA GAAGAAGATA   400
 401 ACACAATGCT CTTCTTCTTA TTCTTCTTCT ACTTACTCTT ATCTTCATCC   450
 451 TCCGATCTAG TCTTCGCCGA CCGTCGTGTA CTCCACGAAC CATTCTTCCC   500
 501 TATAGATTCA CCACCACCGT CACCACCATC ACCACCACCA CTTCCTAAAC   550
 551 TACCATTCTC TTCAACCACT CCTCCATCTT CATCAGACCC AAATGCTTCT   600
 601 CCTTTCTTCC CTTTATACCC TTCATCTCCA CCACCACCTT CTCCAGCCTC   650
 651 CTTCGCTTCT TTTCCGGCGA ATATCTCATC TCTAATCGTC CCTCACGCCA   700
 701 CTAAATCCCC ACCTAACTCC AAAAAACTCC TTATCGTCGC TATCTCCGCC   750
 751 GTTTCCTCCG CTGCTTTAGT CGCTCTACTT ATCGCTTTAC TCTATTGGCG   800
 801 AAGAAGCAAA CGTAACCAAG ATCTTAACTT CTCCGATGAT AGCAAAACAT   850
 851 ACACCACCGA CAGTAGCCGC CGTGTCTACC CTCCTCCTCC GGCAACGGCG   900
 901 CCTCCAACAC GACGCAATGC GGAGGCTAGA AGTAAACAGA GGACCACCAC   950
 951 GAGCTCCACC AATAACAACA GCTCTGAGTT TCTTTACTTA GGAACAATGG  1000
1001 TGAATCAAAG AGGAATCGAT GAACAATCTC TTAGTAATAA TGGATCAAGC  1050
1051 TCAAGAAAAC TTGAATCTCC AGATCTTCAA CCACTTCCTC CATTGATGAA  1100
1101 ACGAAGTTTC CGTTTAAATC CAGATGTTGG TTCAATCGGA GAAGAAGATG  1150
1151 AAGAAGATGA GTTTTACTCT CCACGTGGCT CACAAAGCGG GCGAGAACCG  1200
1201 TTAAACCGGG TCGGACTTCC GGGTCAAAAT CCTAGATCTG TTAACAATGA  1250
1251 CACTATCTCT TGCTCATCTT CAAGCTCTGG TTCACCAGGA AGATCAACAT  1300
1301 TTATCAGTAT CTCTCCTTCA ATGAGTCCTA AGAGATCTGA ACCAAAACCG  1350
1351 CCGGTTATCT CCACACCAGA ACCGGCGGAG TTAACCGATT ATAGATTTGT  1400
1401 TCGGTCTCCG TCACTGTCGT TAGCTTCTTT ATCGTCGGGA TTGAAAAACT  1450
1451 CCGATGAAGT AGGATTGAAT CAAATCTTTA GATCTCCGAC GGTTACATCT  1500
1501 CTAACAACTT CACCGGAGAA TAACAAAAAA GAGAACTCTC CATTATCATC  1550
1551 TACTTCAACT TCACCGGAAC GACGACCAAA TGATACACCA GAAGCTTACT  1600
TGAGATCTCCGTCGCATTCTTCTGCTTCTACATCACCGTATAGATGTTTT
CAGAAATCTCCGGAGGTCTTACCGGCGTTTATGAGTAATCTCCGGCAAGG
TTTGCAATCTCAGTTACTATCTTCTCCTTCTAACTCTCATGGAGGACAAG
GTTTCCTTAAGCAGTTAGATGCATTACGTTCTCGTTCACCGTCGTCGTCT
TCTTCTTCTGTTTGTTCTTCACCGGAGAAAGCTTCTCATAAGTCACCAGT
TACATCTCCTAAGTTATCTTCCCGGAATTCGCAGTCTCTATCATCTTCTC
CGGATAGAGATTTTAGTCATAGCTTAGATGTATCACCACGGATATCGAAC
ATTTCACCTCAAATTTTACAGTCTCGTGTGCCTCCGCCTCCTCCTCCTCC
CCCACCGTTGCCGTTGTGGGGACGACGGAGTCAGGTGACTACTAAAGCGG
ACACAATCTCGAGACCGCCTTCTCTTACACCGCCTTCACATCCTTTTGTG
ATCCCATCTGAAAACTTACCAGTGACTTCGTCTCCTATGGAGACTCCAGA
GACGGTTTGTGCGAGTGAGGCGGCGGAGGAAACTCCGAAACCGAAGCTAA
AGGCGTTACATTGGGATAAAGTTAGAGCAAGTTCGGATCGTGAGATGGTT
TGGGATCATCTTCGATCAAGCTCTTTCAAGTGAGTTAATGTGACATACTC
GTTTATATGATACTATATGCTTTTAGTGAGAATGTGGTTGTTGAGATTAT
GAATGTGGTTTGCAGATTAGATGAGGAGATGATTGAGACGTTGTTTGTGG
CGAAGTCGTTAAACAACAAACCAAATCAGAGTCAGACAACTCCAAGATGT
GTTCTCCCGAGCCCGAACCAAGAGAACAGAGTCCTGGACCCGAAGAAGGC
TCAGAATATTGCCATCTTGCTTCGTGCACTTAATGTCACTATAGAAGAAG
TTTGTGAGGCTCTTCTTGAAGGTAAACTATGCTGTCACATACATAGTTTC
TCATTTTCTTCTCCTTTGATCTCCAGAATTAGAGTTCTTATGCATTTGTT
AATGGTTTTTCGATGATATGGTTGAGTTATTCTGAAAGCTTTGCTTCTTT
GATGGTGTGGAGATTCTTGGTTACATTGATGTTCTTAGTTATGCTTTTTC
AGGCAATGCTGATACACTGGGGACTGAACTTCTTGAGAGCTTACTGAAGA
TGGCACCGACAAAAGAAGAAGAGCGCAAGTTGAAAGCGTACAATGATGAT
TCGCCTGTTAAGCTTGGACATGCTGAGAAATTCCTTAAGGCAATGTTGGA
CATCCCTTTCGCCTTTAAAAGAGTTGATGCAATGCTCTATGTAGCCAACT
TTGAGTCCGAGGTTGAATACTTGAAGAAATCTTTTGAGACTCTTGAGGTA
TATATTACAAGCTATTCTCTCTCTTTTTACCATATGGTTGTATTGTAACA
GATTATGACTTCATTTCTATTGTTTGTGTAGGCTGCTTGTGAAGAACTGA
GGAACAGTAGGATGTTCTTAAAGCTTCTTGAAGCGGTTCTAAAGACAGGA
AACCGTATGAACGTTGGAACAAACCGAGGAGATGCACATGCGTTCAAGCT
TGATACACTTCTCAAGCTAGTCGATGTCAAAGGCGCTGATGGGAAAACAA
CTCTCTTGCATTTCGTTGTACAAGAGATAATCCGAGCAGAAGGCACACGT
CTCTCAGGTAACAATACACAAACAGATGACATTAAATGCCGGAAACTAGG
TCTCCAAGTTGTATCAAGTCTCTGTTCTGAGCTTAGTAACGTCAAGAAAG
CTGCTGCGATGGACTCAGAAGTACTAAGCAGCTACGTCTCCAAGCTTTCT
CAAGGCATTGCCAAGATCAACGAAGCAATCCAAGTCCAATCAACAATCAC
AGAAGAAAGCAACAGTCAGAGGTTTTCGGAATCGATGAAAACGTTTCTGA
AAAGAGCTGAGGAAGAGATCATCAGAGTACAAGCTCAAGAGAGCGTAGCG
TTATCACTTGTAAAAGAAATCACAGAGTATTTCCATGGAAACTCGGCTAA
AGAAGAAGCGCATCCGTTTAGAATATTCTTGGTGGTTAGAGACTTCCTTG
GAGTAGTAGACAGAGTTTGCAAAGAAGTAGGGATGATAAACGAAAGAACA
ATGGTTAGTTCTGCTCATAAGTTTCCTGTTCCAGTGAATCCAATGATGCC
ACAACCTCTTCCTGGACTCGTTGGACGAAGACAATCTTCTTCTTCTTCGT
CGTCGTCTTCAACCTCTTCGTCTGATGAAGACGAACATAACTCAATCTCA
TTAGTTTCTTAAGGTGAGATCTCAGCTTTGTCTGTGCATGTTGTTGTAAA
AAGTATCCAGTATTGGATTGTTTTGTCATAATAGATTTAAATATATATAT
ATAGAGGGAGGGAATTAATGACAGAAACAAAGAAGTGTTTTTCTTTTCTG
CATTTGTGTAAAAAAAATAATATAGGTTTACCTTAAAATTTGTTCATCTT
AAATTAATAATTTAAGAATCAAATAAATTTGTTTATCTGAACCGTGTGTA
CCACGAAAGAATGTGAGAGCAAACATATTACTTACTTACCCTTCGTTGCT
GAATATAATGATCATTATAAATCACTACCTCCAGTACCTTCTACCTTCTT
CAAAGAACCTTGTTGGATTTGAACCAAAGTTGGAACATAATTGACGAGAG
GTGAGCATCTAGATTCTGCATCGTGATGATGATCCACTTTTATCTATTTA
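
The conversion can also be done with a short script instead of a web tool. The sketch below assumes that everything in the raw text that is not a letter (position numbers, spaces, line breaks) is formatting and can be discarded. The function name to_fasta and the header text "A_thaliana_locus" are made up for this example.

```python
import re


def to_fasta(raw, header, width=60):
    """Strip numbering and whitespace from raw sequence text and wrap it as FASTA."""
    # Keep letters only; digits, spaces and line breaks are treated as layout.
    residues = re.sub(r"[^A-Za-z]", "", raw).upper()
    lines = [residues[i:i + width] for i in range(0, len(residues), width)]
    return ">" + header + "\n" + "\n".join(lines) + "\n"


# One line in each of the two formats seen above (numbered blocks, plain text).
sample = """   1 TTTAATAAAA TAAAAATCCA CTCGCATTTT TATTTTCAAC ATTGTGCGTA    50
TGAGATCTCCGTCGCATTCTTCTGCTTCTACATCACCGTATAGATGTTTT
"""

print(to_fasta(sample, "A_thaliana_locus"))
```

Because only letters are kept, the same function handles both input formats, so the whole sequence can be pasted in one piece.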

Gene building: searching for coding sequences in chromosomal DNA

Key protein sequence:
>gi|6691125|gb|AAF24497.1|AF213696_1 FH protein NFH2 [Nicotiana tabacum]
MVFPFFFFLLFLFCSTHCISFAAVSAHNRRVLHESFFPIDSPPPSQPPIPAPPAPPTPYPFQPSTPDNNN
PFFPTYRSPPPPPPPPSPSSLVSFPANISDINLPNTSKSKHVSSKLIITAITCVLAAIIVLSIAICLHAK
KRRRHFNDPKTQRSDNSNRLNHGSSKNDGNTNNSIPKLQQPSQTSSEFLYLGTIVNSHGGINSGSNPDTA
PSSRKMASPELRPLPPLNGRNLSQNYRNTRNDDDFYSTEESVGYIESSFGAGSLSRRGFAAVEVNKFVGS
SLSGSDSSSSSGSGSPNRSVSLSISPPVSVSPKRESCSRPKSPELIAVVTPPPPQRPPPPPPPFVHGPQV
KVTANESPVLISPMEKNDQNVENHSIEKNEEKSEEILKPKLKTLHWDKVRASSDCEMVWDQLKSSSFKLN
EEMIETLFVVKNPTLNTSATAKHFVVSSMSQENRVLDPKKSQNIAILLRVLNGTTEEICEAFLEGNAENI
GTELLEILLKMAPSKEEERKLKEYKDDSPFKLGPAEKFLKAVLDIPFAFKRIDAMLYISNFDYEVDYLGN
SFETLEAACEELRSSRMFLKLLEAVLKTGNRMNVGTNRGDAHAFKLDTLLKLVDVKGADGKTTLLHFVVQ
EIIKSEGARLSGGNQNHQQSTTNDDAKCKKLGLQVVSNISSELINVKKSAAMDSEVLHNDVLKLSKGIQN
IAEVVRSIEAVGLEESSIKRFSESMNRFMKVAEEKILRLQAQETLAMSLVKEITEYVHGDSAREEAHPFR
IFMVVKDFLMILDCVCKEVGTINERTIVSSAQKFPVPVNPNLQPVISGFRAKRLHSSSDEESSSP
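
One simple way to get a feeling for where coding sequence may lie is to translate the genomic DNA in each reading frame and compare the result with the key protein by eye. The sketch below is not the tool intended for this exercise; it only illustrates the idea, using the standard genetic code. The function name translate is chosen for this example, and the test fragment is copied from positions 401-450 of the A. thaliana sequence listed earlier on this page.

```python
# Standard genetic code, built in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna, frame=0):
    """Translate the forward strand of dna starting at offset frame (0, 1 or 2)."""
    dna = dna.upper()
    protein = []
    for pos in range(frame, len(dna) - 2, 3):
        # Codons with ambiguous bases are reported as X; stop codons as *.
        protein.append(CODON_TABLE.get(dna[pos:pos + 3], "X"))
    return "".join(protein)


# Positions 401-450 of the genomic sequence above.
fragment = "ACACAATGCTCTTCTTCTTATTCTTCTTCTACTTACTCTTATCTTCATCC"

for frame in range(3):
    print(frame, translate(fragment, frame))
```

Only the forward strand is translated here; a gene on the opposite strand would require the reverse complement first, and introns will interrupt the match between the translation and the key protein.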
 