Friday, October 17, 2014

Multiple alignment implementation using Gibbs sampling (C++)

Gibbs sampling

1. Multiple alignment
2. Gibbs sampling
3. What is 'Motif'?
4. process of Gibbs sampling
5. Implementation of Gibbs sampling (C++ programming)
6. Result (execution of program)

1. Multiple alignment

DNA sequence alignment is arranging the sequences of DNA, RNA, Protein to find the property of sequence. Because Eukaryotes has repeated sequences that were homologous to each other and has similar properties.

Generally, we make N1 by N2 matrix to align sequences (N1, N2 is length of each sequence), and find the best alignment using dynamic programming. So it needs O(n^2) or at least O( n*log(n) ) time for algorithm. Such way of alignment is good method to align 2 sequence alignment, but when we need to alignment more than 3 sequences, then it's bad way. because it need very much time. (n-length, k-sequences need O(n^k) time. 

figure 1.1 Multiple align/ment

if we use Gibbs sampling for multiple alignment, we can find arranged sequences in less time. In this posting, I'd like to see what is Gibbs sampling and how to implement gibbs sampling using C++ programming.

2. Gibbs sampling

Actually, Multiple alignment problem is NP, so it's hard to find best result. So Gibbs divise heuristic method to do multiple alignment (Gibbs sampling). It can't find best result but one of the part of result was obtained by gibbs sampling.



3. Motif: Repeated pattern of sequences in Protein, DNA

Sequence motif (Wikipedia, the free encyclopedia)
In genetics, a sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is conjectured to have, a biological significance. For proteins, a sequence motif is distinguished from a structural motif, a motif formed by the three-dimensional arrangement of amino acids, which may not be adjacent.

ex)
CTACCTTTGACTAGTCCGTTCTAGCG
AGCTAAGTCGGTTCCTAGCATGCGTG
TTCTACCGCAGTCAGCTCCGATGCGC
GAGTGAGTTCACGTAGCCGTAGCTAA
TTGCTAAGTGAGTTCACGGCTTGGCT
data 3.1 DNA sequences

It's not same perfectly, but similar sequences are important key for multiple alignment. Motif also has meaning for genetics. From motif, we can find DNA sequence of transcription factor, and we can predict the binding site of various sequences.

4. Process of Gibbs sampling

1) The necessary Data
  1. Multiple DNA / RNA / Protein sequence
  2. Motif width.
  3. The number of Motif

2). Randomly, we choose the start point of motif in sequence.
for example we assume that there are 10-length motif in 100-length sequence, there are 91 number of cases
 ( start point: 1~91 in 100-length sequence.).
1~91, 총 91개의 위치에 해당되는 경우의수가 존재합니다.

3). Compose PSSM (Position specific score matirx).

I made example of PSSM using under 5 DNA sequences.
(18 length sequences DNA, and 8 length Motif)

Seq 1. GTAAACAATATTTATAGC
Seq 2. AAAATTTACCTTAGAAGG
Seq 3. CCGTACTGTCAAGCGTGG
Seq 4. TGAGTAAACGACGTCCCA
Seq 5. TACTTAACACCCTGTCAA

data 4.1. input DNA seuqnce of gibbs sampling


The number of Motif = 8
Choose the start point of motif in sequence ramdomly.
{s1, s2, s3, s4, s5 | 7, 11, 9, 4, 1}

then,

Seq 1. GTAAACAATATTTATAGC
Seq 2. AAAATTTACCTTAGAAGG
Seq 3. CCGTACTGTCAAGCGTGG
Seq 4. TGAGTAAACGACGTCCCA
Seq 5. TACTTAACACCCTGTCAA


Highlighted 8-length is motif that is randomly choosen by gibbs sampling.
PSSM table consist of  selected motifs. but one sequence is excluded.
(In this example,second sequence was excluded.)

Example of PSSM table is below. 



















Figure 4.2 - PSSM table for using data 4.1
From the randomly selected motif, we just calculate most frequent nucleotide like Figure 4.2.
Consensus String is the best motif sequence in this example. (It will be repeated many times)

5) We calculate the score of Consensus String in PSSM
The sequence which was excluded is second sequence. (AAAATTTACCTTAGAAGG)














And now, we just calculate score for all start point of sequence (1~11. because the length of seuqnce is 18, and motif is 8.)
The way to get the score for first case (AAAATTTACCTTAGAAGG) is below.











Figure 4.2 scoring - result ex1)

0.25 * 0.5 * 0.5 * 0.75 * 0.5 * 0.25 * 0.25 * 0.5 = 0.000732421875
Like this, we calculate the score for all the start point, and choose the best score. (first case)

5) After getting the score, we repeat the step 3) to step 5). after that we can compare the score each case.
if ( new_score >= best_score )
    than best_score = new_score

The more time we repeat, the higher score was obtained.


5. Implementation of Gibbs sampling (C++ programming)


#include <iostream>
#include <string.h>
#include <process.h>
#include <vector>
#include <iomanip>

using namespace std;

#define size_motif 8
#define count_DNA 5
#define DNA_size 18
#define n_repeat 20000

class DNAs{
private:
int n_DNA; //The number of DNA
vector<string> DNA; //DNA vector.

public:
DNAs():n_DNA(0){}; //Constructor: n_DNA initiation.
        void add_string(string input){
DNA.push_back(input);
n_DNA++;
}

void print_strings(){
for(int i=0;i<n_DNA;i++){
cout<<DNA[i].c_str()<<endl;
}
}

string get_DNA(int input){
if(input>n_DNA){
return false;
}
return DNA[input];
}

string get_subDNA(int i,int pos){
return DNA[i].substr(pos,size_motif);
// i-th sequence(vector DNA), pos is start point of motif
};

int main(){
   DNAs DNA;
   string tmp_motif;
   string best_motifs[count_DNA]; //The motif of best score case.
   string best_motifs_choosed_DNA; // one excluded sequence of PSSM
   string best_consensus_motif;
   double best_score=0;

   DNA.add_string("GTAAACAATATTTATAGC");
    DNA.add_string("AAAATTTACCTTAGAAGG");
    DNA.add_string("CCGTACTGTCAAGCGTGG");
    DNA.add_string("TGAGTAAACGACGTCCCA");
    DNA.add_string("TACTTAACACCCTGTCAA");
    //////////////////////////////////////// DNA input

    for(int k=0;k<n_repeat;k++){ //n_repeat. in this case 20000.
      double PSSM[4][size_motif]={0,};
       double ratio[4]={0,};
       tmp_motif.clear();
       double score=1;
       int pos_motif[count_DNA]={0,};
       string get_sequence[count_DNA];

       for(int i=0;i<count_DNA;i++){
          for(int j=0;j<DNA_size;j++){
             switch((char)DNA.get_DNA(i)[j]){
             case 'A':
                ratio[0]++;
                break;
             case 'C':
                ratio[1]++;
                break;
             case 'G':
                ratio[2]++;
                break;
             case 'T':
                ratio[3]++;
                break;
          }
       }
    }
    ratio[0]/=(count_DNA*DNA_size); //A ratio
    ratio[1]/=(count_DNA*DNA_size); //C ratio
    ratio[2]/=(count_DNA*DNA_size); //G ratio
    ratio[3]/=(count_DNA*DNA_size); //T ratio
    int choosed_sequence = rand() % count_DNA//randomly choose one DNA. it will be excluded.
                // this sequence was used for scoring to find motif.

    for(int i=0;i<count_DNA;i++)
       pos_motif[i]=rand()%(DNA_size - size_motif+1) + 1;
                //randomly choose the start point of motif for all DNA


    for(int i=0;i<count_DNA;i++){
       if( i == choosed_sequence )
          continue;
       get_sequence[i] = DNA.get_subDNA(i,pos_motif[i]-1);
       for(int j=0;j<size_motif;j++){
          if(get_sequence[i][j] == 'A')
             PSSM[0][j]++;
          else if(get_sequence[i][j] == 'C')
             PSSM[1][j]++;
             else if(get_sequence[i][j] == 'G')
             PSSM[2][j]++;
          else if(get_sequence[i][j] == 'T')
             PSSM[3][j]++;
       }
    }
    for(int i=0;i<size_motif;i++){
       for(int j=0;j<4;j++){
          PSSM[j][i]/=count_DNA-1;//ratio if nucleotide.
       }
    }
/////////////  PSSM table.

/////////////////////////////////////////
                /*cout.setf(ios::fixed,ios::floatfield);
cout.setf(ios::showpoint);

for(int i=0;i<4;i++){
for(int j=0;j<size_motif;j++){
cout.precision(3);
cout<<setw(4)<<PSSM[i][j]<<" ";
}cout<<endl;
}
 * ////////////// print PSSM table.

    for(int i=0;i<size_motif;i++){
       double max=0;
      int pos;//ACGT 0,1,2,3
       for(int j=0;j<4;j++){
          if(PSSM[j][i] > max){
             max=PSSM[j][i];
             pos=j;
          }
       }
       switch(pos){
       case 0:
          tmp_motif+='A';
          break;
       case 1:
            tmp_motif+='C';
            break;
       case 2:
          tmp_motif+='G';
          break;
       case 3:
          tmp_motif+='T';
          break;
       }
    }
    //PSSM_ Consensus string
    best_consensus_motif = tmp_motif.c_str();


    for(int i=0;i<DNA_size-size_motif+1;i++){
      score=1;
       string score_motif = DNA.get_subDNA(choosed_sequence,i);
       for(int j=0;j<size_motif;j++){
           switch( score_motif[j] ){
           case 'A':
              score *= PSSM[0][j];
             break;
           case 'C':
         score *= PSSM[1][j];
          break;
       case 'G':
          score *= PSSM[2][j];
          break;
       case 'T':
          score *= PSSM[3][j];
          break; 
       }//score is multiplication of ratio for all nuclotide
    }
    if(score > best_score){
       cout<<score_motif.c_str()<<endl;
       cout<<score<<endl;
       best_score = score;
       best_motifs_choosed_DNA = score_motif;
       for(int i=0;i<count_DNA;i++){
          best_motifs[i] = get_sequence[i];
       }
     }//best score: save motif sequence and score
   }
}
cout<<"Best scores: "<<best_score<<endl;
cout<<"Consensus motif: "<<best_consensus_motif.c_str()<<endl;
for(int i=0;i<count_DNA;i++){
   if(best_motifs[i].size() > 0)
       cout<<i+1<<": "<<best_motifs[i].c_str()<<endl;
    else
       cout<<i+1<<": "<<best_motifs_choosed_DNA.c_str()<<" (choosed)"<<endl;
  }
}


6. Result (execution of program)



















Seq 1. GTAAACAATATTTATAGC
Seq 2. AAAATTTACCTTAGAAGG
Seq 3. CCGTACTGTCAAGCGTGG
Seq 4. TGAGTAAACGACGTCCCA
Seq 5. TACTTAACACCCTGTCAA

It is result of gibbs sampling (20000 repeated).
The best score is 0.00878906, and Motif sequence is TAAAACAA


Another example (data 3.1 above).
CTACCTTTGACTAGTCCGTTCTAGCG
AGCTAAGTCGGTTCCTAGCATGCGTG
TTCTACCGCAGTCAGCTCCGATGCGC
GAGTGAGTTCACGTAGCCGTAGCTAA
TTGCTAAGTGAGTTCACGGCTTGGCT



DNA_size = 26
size_motif = 6
n_repeat = 50000.




















Sequence 1,3,4,5 was founded exactly, but sequence 2 is not that I want.
But it could be found by setting higher n_repeat. but the more time would be needed.

If you have question about gibbsn sampling and my programming source(c++), than reply.
I would response your question.