Monday, January 5, 2015

Shotgun sequencing Implementation c++, and example

Shotgun sequencing: A method used for sequencing long DNA strands.
DNA is divided into small segments, which are sequenced using the chain termination method. Multiple overlapping reads for the target DNA are obtained by performing several rounds of this fragmentation and sequencing. Computer programs then use the overlapping ends of different reads to assemble them into a continuous sequence.
 
 
Should be noticed when you do the shotgun sequencing.

1. Incomplete coverage.
all the fragment will be covering the entire DNA sequence.

2. Sequencing errors
If there are sequencing errors, shotgun sequencing result will be wrong with origin DNA sequence.

3. Repetitive DNA
in Repetitive site, there are a number of overlab cases.
ACGTTTTT, TTTTTACG two stings have a number of overlab cases.
ACGTTTTTACG
ACGTTTTTTACG
ACGTTTTTTTACG
...

4. Unknown orientation
When you read the base, wrong direction of DNA fragment can cause problems with the assembly process.

 


ORF: Open reading frame.
ORF has the potential to code for a protein or peptide. but All the ORF is not included in the exon.
ORF begins with Start codon(initiation codon): ATG, and ends with Stop codon(TGA, TAA, TAG).
 
 
 
I've set an example for you.
Use under in a sentence, and do shutgun sequencing.
Then you can get a one full sequence.
If you can, translate the DNA sequence to protein sequence.
 
ex)
CGGGTTTCGGGCTCATGCCCTACGATGCGT
GACCGTGTTGAGAATATCGGGTTTCGGGCT
GAATATCGGGTTTCGGGCTCATGCCCTACG
TCATGCCCTACGATGCGTAACGGACTAGTA
GCTCATGCCCTACGATGCGTAACGGACTAG
GAGAATATCGGGTTTCGGGCTCATGCCCTA
GTTTCGGGCTCATGCCCTACGATGCGTAAC
GTTGAGAATATCGGGTTTCGGGCTCATGCC
CGTGTTGAGAATATCGGGTTTCGGGCTCAT
ATCGGGTTTCGGGCTCATGCCCTACGATGC
CATGGACCGTGTTGAGAATATCGGGTTTCG
TTTCGGGCTCATGCCCTACGATGCGTAACG
GTGTTGAGAATATCGGGTTTCGGGCTCATG
ATATCGGGTTTCGGGCTCATGCCCTACGAT
ATGGACCGTGTTGAGAATATCGGGTTTCGG
GACCGTGTTGAGAATATCGGGTTTCGGGCT
TATCGGGTTTCGGGCTCATGCCCTACGATG
GTTTCGGGCTCATGCCCTACGATGCGTAAC
GTTGAGAATATCGGGTTTCGGGCTCATGCC
AGAATATCGGGTTTCGGGCTCATGCCCTAC
TGTTGAGAATATCGGGTTTCGGGCTCATGC
CCGTGTTGAGAATATCGGGTTTCGGGCTCA
ACCGTGTTGAGAATATCGGGTTTCGGGCTC
TCGGGCTCATGCCCTACGATGCGTAACGGA
TCGGGTTTCGGGCTCATGCCCTACGATGCG
GTGTTGAGAATATCGGGTTTCGGGCTCATG
TCGGGTTTCGGGCTCATGCCCTACGATGCG
ATGGACCGTGTTGAGAATATCGGGTTTCGG
TTTCGGGCTCATGCCCTACGATGCGTAACG
GAGAATATCGGGTTTCGGGCTCATGCCCTA
CGTGTTGAGAATATCGGGTTTCGGGCTCAT
CATGGACCGTGTTGAGAATATCGGGTTTCG
TCGGGTTTCGGGCTCATGCCCTACGATGCG
AATATCGGGTTTCGGGCTCATGCCCTACGA
TCGGGCTCATGCCCTACGATGCGTAACGGA
ACCGTGTTGAGAATATCGGGTTTCGGGCTC
GCTCATGCCCTACGATGCGTAACGGACTAG
AGAATATCGGGTTTCGGGCTCATGCCCTAC
TGTTGAGAATATCGGGTTTCGGGCTCATGC
CATGGACCGTGTTGAGAATATCGGGTTTCG
TCGGGCTCATGCCCTACGATGCGTAACGGA
TTGAGAATATCGGGTTTCGGGCTCATGCCC
ATCGGGTTTCGGGCTCATGCCCTACGATGC
 
 
Shotgun sequencing result:
CATGGACCGTGTTGAGAATATCGGGTTTCGGGCTCATGCCCTACGATGCGTAACGGACTAGTA
 
 
MTVLRISGFG
 
The following source is my implementation of Shutgun sequencing.
Thanks.
 




#include <iostream>
#include <algorithm>
#include <string>
#include <vector>
#include <map>
#include <set>
#include <queue>
#include <list>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <sstream>
#include <fstream>
#include <ctime>
#include <windows.h>

#define BUFSIZE 1024

using namespace std;

void mkSPtable(const char* pattern, int *SPtable, int m)
{
 int i, k;  

 k=-1;
 SPtable[0]=-1;

 for(i=1 ; i<m-1 ; i++)
 {
  while(k>=0 && pattern[k+1]!=pattern[i])
  {
   k=SPtable[k];
  }

  if(pattern[k+1]==pattern[i])
  {
   k++;
  }
  SPtable[i]=k;
 }
}

int kmp(const char* text, const char* pattern, int n, int m)
{
 int SPtable[BUFSIZE];
 int i, j;

 mkSPtable(pattern, SPtable, m);

 j=-1;

 for(i=0 ; i<=n-1 ; i++)
 {
  while(j>=0 && pattern[j+1]!=text[i])
  {
   j=SPtable[j];
  }

  if(pattern[j+1]==text[i])
  {
   j++;
  }

  if(j==m-1)
  {
   return 1;
  }
 }

 return 0;
}
///////////////////////////KMP 알고리즘을 이용한 string matching

int main()
{
srand(time(0));
vector<string> v;

ofstream output;
ifstream input;
string in;
output.open("./output.txt");
input.open("./input.txt");
while(! input.eof() ){
input >> in;
v.push_back(in);
}
v.pop_back();

int sz=v[0].size();
char tmp1[100];
char tmp2[100];
for(int i=sz ;i>0; i--)
{
for(int Cur=0; Cur<v.size() ; Cur++)
{
for(int j=Cur+1 ; j<v.size() ; j++)
{
if(kmp(v[Cur].substr(0, i).c_str(), v[j].substr(sz-i, i).c_str(), i, i))
{
v[Cur]=v[j].substr(0, sz-i)+v[Cur];
v.erase(v.begin()+j);
output<<i<<": Left comb"<<endl;
Cur--;
break;
}
if(kmp(v[Cur].substr(v[Cur].size()-i, i).c_str(), v[j].substr(0, i).c_str(), i, i))
{
v[Cur]=v[Cur]+v[j].substr(i, v[j].size()-i);
v.erase(v.begin()+j);
output<<i<<" right comb "<<endl;
Cur--;
break;
}
}
}
}
output<<"v.size() :  "<<v.size()<<endl;
output<<"======================"<<endl;

output << v[0] << endl;
 

No comments:

Post a Comment