Bioinformatics - protocols  B

To locate the required tools, look for a blue B in the list of links or use the shortcuts provided.

Basic sequence manipulation

You have just sequenced the following Arabidopsis cDNA fragment. Note the format of the sequence: a header line starting with ">" (e.g. >mysequence), followed by the sequence itself on the lines below. This is the FASTA format.

>cDNA_B_At
CCGCGGTGGCGGCCGCTCTAGAACTAGTGGATCCCCCGGGCTGCAGGATC
CCGGTGACCCCGGCAAAGCTTGCTTAATCCGAAGACGTTTCTGTTTCATC
TTCTTAAATCCGGGCCAACNGCGTTTACGAGACTAAACGCGTTTCTCTTT
AGGGCTTAATTATTATCCAGAGATGGCTCATCATAGCAAATGTTTACAAA
CGTTGGATTTAGCTTGTAAAGAGCTGAGATCTCGTGGCTTGTTTGTGAAG
CTTTTGGAGGCAATACTTAAAGCTGGAAACAGAATGAACGCGGGTACCGC
GAGAGGAAACGCTCAAGCGTTTAATCTAACCGCGCTTTTGAAGCTTTCGG
ATGTTAAAAGCGTTGATGGGAAGACTTCTTTGCTTAACTTTGTAGTGGAG
GAAGTTGTTAGATCGGAAGGAAAACGTTGTGTTATGAATAGAAGAAGCCA
TAGCTTAACACGAAGCGGTAGTAGTAACTACAATGGTGGTAATAGTAGTC
TTCAGGTTATGTCGAAAGAAGAGCAAGAGAAAGAGTACTTGAAGCTTGGT
TTACCAGTTGTTGGTGGATTGAGCTCTGAGTTTTCAAACGTGAAGAAAGC
TGCTTGTGTGGACTATGAAACGGTTGTTGCAACTTGTTCTGCTCTTGCGG
TTAGAGCGAAAGATGCGAAAACGGTGATTGGAGAATGTGAAGATGGAGAA
GGAGGGAGGTTTGTGAAAACGATGATGACGTTTCNTGATTCGGTAGAGGA
AGAGGTGAAAATAGCGAAAGGTGAAGAGAGGAAAGTGATCCCTGA
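
The FASTA layout described above is simple enough to read with a few lines of Python. The following is a minimal sketch, not part of the original protocol: the function names parse_fasta and ambiguous_positions are made up for illustration, and the example record contains only the first three lines of the fragment above. It splits the text into header and sequence and reports where the sequencer could not call a base (any letter other than A, C, G or T, such as N).

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence data may span many lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def ambiguous_positions(seq):
    """Return 1-based positions of bases other than A, C, G, T."""
    return [i for i, base in enumerate(seq.upper(), start=1) if base not in "ACGT"]


# First three sequence lines of the cDNA fragment, for illustration only.
example = """>cDNA_B_At
CCGCGGTGGCGGCCGCTCTAGAACTAGTGGATCCCCCGGGCTGCAGGATC
CCGGTGACCCCGGCAAAGCTTGCTTAATCCGAAGACGTTTCTGTTTCATC
TTCTTAAATCCGGGCCAACNGCGTTTACGAGACTAAACGCGTTTCTCTTT
"""

for name, seq in parse_fasta(example):
    print(name, len(seq), "bases; ambiguous at", ambiguous_positions(seq))
```

Running the same functions on the complete fragment shows every uncalled base, which is worth knowing before you use the sequence in a similarity search.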

Sequence similarity searches and domain structure analysis

Construction and interpretation of a protein sequence alignment

Go back to the window with the CD-search results and click on the conserved domain(s) found. Below are the sequences that were used to produce the first ("most similar 10") alignment from above. Save them into a text file with the extension *.aa and create an alignment using MACAW.

>Bprotein
SGGETSKQVKLKPLHWDKVNPDSDHSMVWDKIDRGSFSFDGDLMEALFGYVAVGKKSPEQ
GDEKNPKSTQIFILDPRKSQNTAIVLKSLGMTREELVESLIEGNDFVPDTLERLARIAPT
KEEQSAILEFDGDTAKLADAETFLFHLLKSVPTAFTRLNAFLFRANYYPEMAHHSKCLQT
LDLACKELRSRGLFVKLLEAILKAGNRMNAGTARGNAQAFNLTALLKLSDVKSVDGKTSL
LNFVVEEVVRSEGKRCVMNRRSHSLTRSGSSNYNGGNSSLQVMSKEEQEKEYLKLGLPVV
GGLSSEFSNVKKAACVDYETVVATCSALAVRAKDAKTVIGECEDGEGGRFVKTMMTFLDS
VEEEVKIAKGEERKVMELVKRTTDYYQAGAVTKGKNPLHLFVIVRDFLAMVDKVCLDIMR
NMQRRK
>gi|6691125 Nicotiana tabacum NFH2
EKSEEILKPKLKTLHWDKVRASSDCEMVWDQLKSSSFKLNEEMIETLFVVKNPTLNTSAT
AKHFVVSSMSQENRVLDPKKSQNIAILLRVLNGTTEEICEAFLEGNAENIGTELLEILLK
MAPSKEEERKLKEYKDDSPFKLGPAEKFLKAVLDIPFAFKRIDAMLYISNFDYEVDYLGN
SFETLEAACEELRSSRMFLKLLEAVLKTGNRMNVGTNRGDAHAFKLDTLLKLVDVKGADG
KTTLLHFVVQEIIKSEGARLSGGNQNHQQSTTNDDAKCKKLGLQVVSNISSELINVKKSA
AMDSEVLHNDVLKLSKGIQNIAEVVRSIEAVGLEESSIKRFSESMNRFMKVAEEKILRLQ
AQETLAMSLVKEITEYVHGDSAREEAHPFRIFMVVKDFLMILDCVCKEVGTINERTI
>gi|6041849 Arabidopsis BAC F21O3
EGTTDRPKPKLKPLPWDKVRPSSRRTNTWDRLPYNSSNANSKQRSLSCDLPMLNQESKVL
DPRKSQNVAVLLTTLKLTTNDVCQALRDGHYDALGVELLESLARVAPSEEEEKKLISYSD
DSVIKLAPSERFLKELLNVPFVFKRVDALLSVASFDSKVKHLKRSFSVIQAACEALRNSR
MLLRLVGATLEAGMKSGNAHDFKLEALLGLVDIKSSDGRTSILDSVVQKITESEGIKGLQ
VVRNLSSVLNDAKKSAELDYGVVRMNVSKLYEEVQKISEVLRLCEETGHSEEHQWWKFRE
SVTRFLETAAEEIKKIEREEGSTLFAVKKITEYFHVDPAKEEAQLLKVFVIVRDFLKILE
GVCKKMEVTSSLA
>gi|6225268 DIAPHANOUS PROTEIN HOMOLOG 1
PKKLYKPEVQLRRPNWSKLVAEDLSQDCFWTKVKEDRFENNELFAKLTLTFSAQTKTKKD
QEGGEEKKSVQKKKVKELKVLDSKTAQNLSIFLGSFRMPYQEIKNVILEVNEAVLTESMI
QNLIKQMPEPEQLKMLSELKDEYDDLAESEQFGVVMGTVPRLRPRLNAILFKLQFSEQVE
NIKPEIVSVTAACEELRKSESFSNLLEITLLVGNYMNAGSRNAGAFGFNISFLCKLRDTK
STDQKMTLLHFLAELCENDYPDVLKFPDELAHVEKASRVSAENLQKNLDQMKKQISDVER
DVQNFPAATDEKDKFVEKMTSFVKDAQEQYNKLRMMHSNMETLYKELGEYFLFDPKKLSV
EEFFMDLHNFRNMFLQAVKENQKRRKTEEKMRRAKLAKEKAEKERLEKQQKREQLIDMNA
EGDETGVMDSLLEALQSGAAFRR
>gi|544344 FORMIN 4 (LIMB DEFORMITY PROTEIN).
RKPAIEPSCPMKPLYWTRIQINDKSQDAAPTLWDSLEEPHIRDTSEFEYLFSKDTTQQKK
KPLSEAYEKKNKVKKIIKLLDGKRSQTVGILISSLHLEMKDIQQAIFTVDDSVVDLETLA
ALYENRAQEDELTKIRKYYETSKEEDLKLLDKPEQFLHELAQIPNFAERAQCIIFRAVFS
EGITSLHRKVEIVTRASKGLLHMKSVKDILALILAFGNYMNGGNRTRGQADGYSLEILPK
LKDVKSRDNGMNLVDYVVKYYLRYYDQEAGTDKSVFPLPEPQDFFLASQVKFEDLLKDLR
KLKRQLEASEQQMKLVCKESPREYLQPFKDKLEEFFKKAKKEHKMEESHLENAQKSFETT
VGYFGMKPKTGEKEVTPSYVFMVWFEFCSDFKTIWKRESKNISKER
>gi|2281090 unknown protein [Arabidopsis thaliana]
EKKVETMKPKLKTLHWDKVRASSSRVMVWDQIKSNSFQVNEEMIETLFKVNDPTSRTRDG
VVQSVSQENRFLDPRKSHNIAILLRALNVTADEVCEALIEGNSDTLGPELLECLLKMAPT
KEEEDKLKELKDDDDGSPSKIGPAEKFLKALLNIPFAFKRIDAMLYIVKFESEIEYLNRS
FDTLEAATGELKNTRMFLKLLEAVLKTGNRMNIGTNRGDAHAFKLDTLLKLVDIKGADGK
TTLLHFVVQEIIKFEGARVPFTPSQSHIGDNMAEQSAFQDDLELKKLGLQVVSGLSSQLI
NVKKAAAMDSNSLINETAEIARGIAKVKEVITELKQETGVERFLESMNSFLNKGEKEITE
LQSHGDNVMKMVKEVTEYFHGNSETHPFRIFAVVRDFLTILDQVCKEVGRVNERTV
>gi|1061334 Drosophila melanogaster cappuccino
RKSAVNPPKPMRPLYWTRIVTSAPPAPRPPSVANSTDSTENSGSSPDEPPAANGADAPPT
APPATKEIWTEIEETPLDNIDEFTELFSRQAIAPVSKPKELKVKRAKSIKVLDPERSRNV
GIIWRSLHVPSSEIEHAIYHIDTSVVSLEALQHMSNIQATEDELQRIKEAAGGDIPLDHP
EQFLLDISLISMASERISCIVFQAEFEESVTLLFRKLETVSQLSQQLIESEDLKLVFSII
LTLGNYMNGGNRQRGQADGFNLDILGKLKDVKSKESHTTLLHFIVRTYIAQRRKEGVHPL
EIRLPIPEPADVERAAQMDFEEVQQQIFDLNKKFLGCKRTTAKVLAASRPEIMEPFKSKM
EEFVEGADKSMAKLHQSLDECRDLFLETMRFYHFSPKACTLTLAQCTPDQFFEYWTNFTN
DFKDIWKKEITSLLNEL
>gi|5080823 Hypothetical protein [Arabidopsis thaliana]
GKTEDPTQPKLKPLHWDKMNPDASRSMVWHKIDGGSFNFDGDLMEALFGYVARKPSESNS
VPQNQTVSNSVPHNQTYILDPRKSQNKAIVLKSLGMTKEEIIDLLTEGHDAESDTLEKLA
GIAPTPEEQTEIIDFDGEPMTLAYADSLLFHILKAVPSAFNRFNVMLFKINYGSEVAQQK
GSLLTLESACNELRARGLFMKLLEAILKAGNRMNAGTARGNAQAFNLTALRKLSDVKSVD
AKTTLLHFVVEEVVRSEGKRAAMNKNMMSSDNGSGENADMSREEQEIEFIKMGLPIIGGL
SSEFTNVKKAAGIDYDSFVATTLALGTRVKETKRLLDQSKGKEDGCLTKLRSFFESAEEE
LKVITEEQLRIMELVKKTTNYYQAGALKERNLFQLFVIIRDFLGMVDNACSEIARNQRKQ
Q
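
When interpreting an alignment, a rough but useful figure is the percentage of identical residues between two aligned rows. The sketch below is an illustration only and makes no assumption about MACAW's output format: it works on any two aligned rows of equal length, with "-" assumed as the gap character. The function name percent_identity is hypothetical. The two example segments are copied from the Bprotein and NFH2 entries above and are assumed here to line up without gaps.

```python
def percent_identity(row_a, row_b, gap="-"):
    """Percent of identical residues over columns where neither row has a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = identical = 0
    for a, b in zip(row_a.upper(), row_b.upper()):
        if a == gap or b == gap:
            continue  # gapped columns are not counted
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Segments around the shared GNRMN motif, taken from the sequences above.
BPROTEIN_SEGMENT = "LLEAILKAGNRMNAGTARGNAQAFNLTALLKL"
NFH2_SEGMENT = "LLEAVLKTGNRMNVGTNRGDAHAFKLDTLLKL"

print(f"{percent_identity(BPROTEIN_SEGMENT, NFH2_SEGMENT):.1f}% identical")
```

Identity over a short conserved block is much higher than over the full-length proteins, so quote the region you measured along with the number.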

Gene building: searching for coding sequences in chromosomal DNA

The following DNA sequence corresponds to the sense strand of the locus in the A. thaliana Chromosome 1 from which your initial cDNA has been derived:

>locusB
cccctataaaaagtattaaaaaggactgatacaataatgtatataaatat
cctaaaagatcttaattttgtaaatttattgttgtatattctaaacccgc
aatattagaatgatgatttagtaaacaagaaagacaaaataaataattaa
ttttagctagaaaagatgaaataaacactcatgatttaagccatacaaat
cgaagccccttgggttcagcatttctcaccaagtaaataccatcacctct
ggaaacccatttacgtacttgaccacatcttttattagcggctcctctgt
atgctctccatatgttatacacactatgatgccttaagatttattcacga
cgatttaatcagatacgcttatggattgccaaagatgatgccatctactt
agagaaaaacaatggaaagcgagaacgcatgtataattggaataaaaatt
aatatggttttcatatatctaaaaaattggacatttgaagccttaataaa
ttatactatgtaaaaatacttgtttatgaatgtaaattataataaattac
gatttaattagggaaatattgactatatatttcacccaaatattgaatgt
aaattttattttccaatacttttgcacatttaagaaattttcggatgtat
ttcctaaagaatattaccttttttgttttttaaaccatgcctttttgttt
tacacgttcataaatgcatgttccatacgcattaccataatttaatttga
acttaattttctctaggaatggtgatgatccactaccactatcattgatt
tcattccatattcctttgaccgactgaaattacgttggaaatagtatatt
ttgatgaataatttatttactcggaaaaaagaggtcaagttattaatagt
aagtacatatacattatcaattaagaattcaattgagttttaaggaaaat
cctattaatttgtttggtattcggtatttgttagttctaaggaattgaat
ttcccgattatacatcattataacgttctcaagttccaaacttgcaaccc
acattttgtcgatattctcaaatgtgaattcattcaatttcccatagaaa
acataaatttgcacttaaagttaacaattgaaatcgtatctaaatgggaa
tgtttttggcttttagtgttagacttccaaagcgtcaaaaatatttctag
aaagagcacaaaaaataagcaacgccactacttttggacaaagtcaacga
taacacacatcaaccgcaccagctccataaaagtccatctcacgaaaacg
attctagtcaaactacctaaaacacccttatatttacatacaacccaatc
ccactaacaagggtattttcgtcaatcacaaaatttatcaccgacccggg
aagaagaagaagaacagatcaactaatttctgctttcaactccacattaa
accaaaacctccaaaaagaatcatttatttaaattatcttcccgttttaa
gttcctgagatttttgggaattgtaaatttgaagaaaattaaacaaagac
gtgttttcatttttttttttgtttcctttattgatctctctctatctctc
taaatgagctaaatcgttaatggctgccatgtttaatcatccatggccta
atttaaccctaatttacttcttcttcatcgtcgttttaccattccaatca
ctttctcaatttgattctcctcaaaatatcgaaactttcttccccatctc
ttcactctcccctgttccaccaccgcttcttccaccttcgtcaaacccat
ctccgccgtcgaataattcatcatcttcggataaaaaaacaatcaccaaa
gctgtccttataacagcagcaagtactttacttgtagctggagttttctt
cttctgcctccaaagatgtatcatcgcacggagacggagagacagagttg
gaccagtcagagtcgaaaacactttacctccgtatcctcctcctccgatg
acgtcggcggcggtgactacgactactttggctagagaaggattcacgag
gtttggtggtgtgaaaggtttgattcttgatgagaatggtcttgatgtgt
tgtattggagaaagctacagagtcagagagaaagaagtgggagtttcagg
aaacagatcgtcaccggagaagaagaagacgagaaagaagttatttatta
caagaacaagaagaaaacagagcccgttacagagattcctcttcttagag
gaagatcatctacttctcacagtgttatccataacgaagatcatcagccg
ccaccgcaggtgaaacagagtgaaccaacaccaccaccgccaccaccgtc
aattgcggtgaaacagagtgcaccaacgccatcgccacctcctccgatta
agaagggttcttcaccatcgccaccgccacctccaccggtgaaaaaggtt
ggagctttatcatcatcagcttcgaaaccaccacctgcgccggttagagg
agcaagtggaggagagacttcgaaacaagtaaagttgaagcctttacatt
gggataaagtaaaccctgattccgatcattcaatggtttgggacaaaatc
gatcgtggatcattcaggtatatatttatttcgaaagttagggcttttgc
ttcaatcaattgaaaaaaccctaatttgtttttgtttcttctcagtttcg
atggcgatttaatggaagctctgtttggatacgttgccgtggggaagaaa
tcaccagaacaaggcgatgagaaaaaccctaaatcaacgcaaatattcat
acttgatccgagaaagtctcaaaacacagcgattgtgctcaaatcattag
gtatgacacgtgaagagcttgttgaatcactcatagaaggaaacgatttc
gtgccagacactcttgagaggttagctagaatagctccaacgaaagaaga
acaatcagccattcttgaattcgacggtgacacggcaaagcttgctgatg
cggagacgtttctgtttcatcttcttaaatccgtgccaaccgcgtttacg
agactaaacgcgtttctctttagggctaattattatccagagatggctca
tcatagcaaatgtttacaaacgttggatttagcttgtaaagagctgagat
ctcgtggcttgtttgtgaagcttttggaggcaatacttaaagctggaaac
agaatgaacgcgggtaccgcgagaggaaacgctcaagcgtttaatctaac
cgcgcttttgaagctttcggatgttaaaagcgttgatgggaagacttctt
tgcttaactttgtagtggaggaagttgttagatcggaaggaaaacgttgt
gttatgaatagaagaagccatagcttaacacgaagcggtagtagtaacta
caatggtggtaatagtagtcttcaggttatgtcgaaagaagagcaagaga
aagagtacttgaagcttggtttaccagttgttggtggattgagctctgag
ttttcaaacgtgaagaaagctgcttgtgtggactatgaaacggttgttgc
aacttgttctgctcttgcggttagagcgaaagatgcgaaaacggtgattg
gagaatgtgaagatggagaaggagggaggtttgtgaaaacgatgatgacg
tttcttgattcggtagaggaagaggtgaaaatagcgaaaggtgaagagag
gaaagtgatggagcttgtgaaacgtacaacggattattatcaagcaggag
ctgttacaaaggggaagaatccacttcatttgtttgttatcgttagagat
tttcttgccatggttgataaagtttgcttagatattatgagaaatatgca
gaggaggaaggttggtagtccgatatcgccttcttcgcagcggaatgcgg
tgaaattcccggttttgcctccgaatttcatgtcggacagagcttggagt
gattctggtgggtcggattctgatatgtgagagtcaagatttgttatatg
taaatactaaatagtagaagcattttgggtattgattagcattgaaagat
gttgaattgtttatagatttatcagtccaaagcattggacttgagtataa
tttgttccttgtataaataaacaattttgctttaagacctttccatgttt
atgaacatgtcttctttaacttcacatagaccttttgtttacgtaagaac
taataatactaaattgtttgataattctaaatgtgaaagtgaaccactat
atagtgtgaacttggctttattgaattctttttaaaaaaatttctccaga
gctttagatgtaggagttaatattttcacctaacatagcctcttttttat
gtttctctatcaactaacactaaatttgtggatgaagactaaattaacat
aagtttatctattaactaacaacctaccagtttgatgcttgtaaatatga
aacttcaacgttataaagactatatggtgtgaactttttatccatcttta
ttgacttttaaaattttcttaatttgagtaaacaaaagcagaagcttttt
aaaggatgcaggagttgatttttgtatatgaacaaaacatatacttctcc
cttagacgaatttggagctatcattcttggtttcaaactttttaataatt
tgagctttaaagcaaaatggcaactttatattgatcactagtccacaaca
ctttctctgccttttcctcaatagcaacgcgtagtcaagaagaagaacgt
gtttaacatggaccaatcttgattaagataatagtatgatcaaatgctta
tataaacacactaaaaaggaatcaaatttaaccattccacaaatcaccaa
caaaatttaatgaatcatgtctctgcttctaaagatgttattattttcct
tattcttcttctatatggcttcaatttctcaatgctcagacccaaccggt
ggacagtttagcttcaacggttacttgtacaccgatggagttgcggatct
aaacccggacggtttgttcaaactcataacttcaaagaca
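
To see why a coding sequence cannot simply be read off chromosomal DNA, it can help to remove one intron by hand. The following is a deliberately naive sketch, not how gene prediction programs work. It takes a short stretch of locusB (the three lines beginning gatcgtggatcattcag, the last one truncated), tries every candidate intron that starts with GT and ends with AG, and keeps those whose removal gives a reading frame that translates to a stretch of Bprotein (DRGSFSFDGDLMEALFGYV). The function names, the minimum intron length of 20 bases, and the assumption that the stretch starts in frame are illustrative choices.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}


def translate(dna):
    """Translate DNA in frame 1; unknown codons become X, stops become *."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def candidate_introns(genomic, peptide, min_len=20):
    """Return (start, end) pairs, 0-based and end-exclusive, of GT...AG
    stretches whose removal makes the translation start with `peptide`."""
    g = genomic.upper()
    hits = []
    for start in range(len(g) - 1):
        if g[start:start + 2] != "GT":
            continue
        for end in range(start + min_len, len(g) + 1):
            if g[end - 2:end] != "AG":
                continue
            spliced = g[:start] + g[end:]
            if translate(spliced).startswith(peptide):
                hits.append((start, end))
    return hits


# Stretch of locusB copied from the sequence above (third line truncated).
GENOMIC = (
    "gatcgtggatcattcaggtatatatttatttcgaaagttagggcttttgc"
    "ttcaatcaattgaaaaaaccctaatttgtttttgtttcttctcagtttcg"
    "atggcgatttaatggaagctctgtttggatacgtt"
)
PEPTIDE = "DRGSFSFDGDLMEALFGYV"  # stretch of Bprotein

for start, end in candidate_introns(GENOMIC, PEPTIDE):
    print(f"intron at {start + 1}..{end} of the stretch, {end - start} bases long")
```

Real gene-building tools combine splice-site signals, codon usage and similarity to known proteins or cDNAs. This brute-force search only works here because the protein fragment is already known.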

Key protein sequence:
>gi|6691125|gb|AAF24497.1|AF213696_1 FH protein NFH2 [Nicotiana tabacum]
MVFPFFFFLLFLFCSTHCISFAAVSAHNRRVLHESFFPIDSPPPSQPPIPAPPAPPTPYPFQPSTPDNNN
PFFPTYRSPPPPPPPPSPSSLVSFPANISDINLPNTSKSKHVSSKLIITAITCVLAAIIVLSIAICLHAK
KRRRHFNDPKTQRSDNSNRLNHGSSKNDGNTNNSIPKLQQPSQTSSEFLYLGTIVNSHGGINSGSNPDTA
PSSRKMASPELRPLPPLNGRNLSQNYRNTRNDDDFYSTEESVGYIESSFGAGSLSRRGFAAVEVNKFVGS
SLSGSDSSSSSGSGSPNRSVSLSISPPVSVSPKRESCSRPKSPELIAVVTPPPPQRPPPPPPPFVHGPQV
KVTANESPVLISPMEKNDQNVENHSIEKNEEKSEEILKPKLKTLHWDKVRASSDCEMVWDQLKSSSFKLN
EEMIETLFVVKNPTLNTSATAKHFVVSSMSQENRVLDPKKSQNIAILLRVLNGTTEEICEAFLEGNAENI
GTELLEILLKMAPSKEEERKLKEYKDDSPFKLGPAEKFLKAVLDIPFAFKRIDAMLYISNFDYEVDYLGN
SFETLEAACEELRSSRMFLKLLEAVLKTGNRMNVGTNRGDAHAFKLDTLLKLVDVKGADGKTTLLHFVVQ
EIIKSEGARLSGGNQNHQQSTTNDDAKCKKLGLQVVSNISSELINVKKSAAMDSEVLHNDVLKLSKGIQN
IAEVVRSIEAVGLEESSIKRFSESMNRFMKVAEEKILRLQAQETLAMSLVKEITEYVHGDSAREEAHPFR
IFMVVKDFLMILDCVCKEVGTINERTIVSSAQKFPVPVNPNLQPVISGFRAKRLHSSSDEESSSP
 
 
 