CLC Genomics Workbench

Read mapping

The read mapper in CLC Genomics Workbench supports both short and long reads, including paired-end reads, and handles Sanger, 454, Solexa, Helicos, and SOLiD sequencing data.

The parameters for the assembly algorithm are:

  • Gap cost: The cost of creating a gap. Setting the gap cost higher will favor fewer gaps, and fewer reads will then be assembled.
  • Match score: The score awarded for a match.
  • Mismatch cost: The cost of creating a mismatch. Raising this value will reduce the number of reads with mismatches, and more reads will be left unassembled.
  • Identity: Set the minimum fraction of identity between the read and the reference sequence. If you want the reads to have, for example, at least 90% identity with the reference sequence in order to be included in the final contig, set this value to 0.9.
  • Length: Set the minimum fraction of a read's length that must match the reference sequence. Setting a value of 0.5 means that at least half the read needs to match the reference sequence for the read to be included in the final contig.
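
The interplay of these parameters can be illustrated with a minimal Python sketch. It is not the Workbench's actual implementation: the additive scoring scheme, the numeric values, the way identity is computed (over the aligned part of the read only), and the names Alignment, alignment_score and passes_filters are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one read aligned to the reference."""
    read_length: int     # total number of bases in the read
    aligned_length: int  # number of read bases that take part in the alignment
    matches: int         # aligned positions where read and reference agree


def alignment_score(matches, mismatches, gaps,
                    match_score=1, mismatch_cost=2, gap_cost=3):
    """Assumed additive scheme: reward matches, penalize mismatches and gaps.

    The default values are placeholders, not the Workbench defaults.
    """
    return matches * match_score - mismatches * mismatch_cost - gaps * gap_cost


def passes_filters(aln, min_length_fraction=0.5, min_identity=0.9):
    """Apply the Length and Identity thresholds to one alignment."""
    if aln.read_length == 0 or aln.aligned_length == 0:
        return False
    length_fraction = aln.aligned_length / aln.read_length
    identity = aln.matches / aln.aligned_length
    return length_fraction >= min_length_fraction and identity >= min_identity
```

Under these assumptions, a 100-base read with 60 aligned bases of which 57 match has a length fraction of 0.6 and an identity of 0.95, so it would be kept with the thresholds 0.5 and 0.9. Raising the mismatch or gap cost lowers the score of any alignment that contains mismatches or gaps, which is why fewer reads end up assembled.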

Screenshot 1: Assembly parameters.

You can read more about the assembly parameters in our Bioinformatics explained article on multiple alignments.

You can specify how Non-specific matches should be treated. The concept of Non-specific matches refers to a situation where a read aligns at more than one position. In this case you have two options:

  • Random: This will place the read in one of the positions randomly.
  • Ignore: This will not include the read in the final contig.
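
The two options above can be sketched in a few lines of Python. This is only an illustration: the function name place_read, the mode strings, and the representation of candidate positions as a list of integers are hypothetical and do not come from the Workbench.

```python
import random


def place_read(candidate_positions, mode="random", rng=None):
    """Return the position chosen for a read, or None if it is left out.

    candidate_positions: reference positions where the read aligns.
    mode: "random" or "ignore" (hypothetical names for the two options).
    rng: optional random.Random instance, useful for reproducible runs.
    """
    if not candidate_positions:
        return None  # the read does not align anywhere
    if len(candidate_positions) == 1:
        return candidate_positions[0]  # a specific match, nothing to decide
    if mode == "ignore":
        return None  # non-specific match: leave the read out of the contig
    if mode == "random":
        chooser = rng if rng is not None else random
        return chooser.choice(candidate_positions)
    raise ValueError(f"unknown mode: {mode!r}")
```

Passing a seeded random.Random instance as rng makes the random placement repeatable between runs of the sketch.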

Screenshot 2: Conflict resolution and annotation.

If there is a conflict between reads, i.e. a position where there is disagreement about which base is correct, you can specify how the contig sequence should reflect the conflict:

  • Vote (A, C, G, T): The conflict will be resolved by counting instances of each nucleotide and then letting the majority decide the nucleotide in the contig.
  • Unknown nucleotide (N): The contig will be assigned an 'N' character in all positions with conflicts.
  • Ambiguity nucleotides (R, Y, etc.): The contig will display an ambiguity nucleotide reflecting the different nucleotides found in the reads.
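
The three conflict-resolution options can be sketched as follows. This is an illustrative sketch, not the Workbench's code: the function name consensus_base and the mode strings are hypothetical, the ambiguity table follows the standard IUPAC nucleotide codes, and the tie-breaking rule for the vote (the tied base seen first wins) is an assumption, since the text does not say how ties are handled.

```python
from collections import Counter

# Standard IUPAC ambiguity codes for sets of two or more nucleotides.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H",
    frozenset("ACG"): "V", frozenset("ACGT"): "N",
}

VALID_BASES = {"A", "C", "G", "T"}


def consensus_base(column, mode="vote"):
    """Resolve one contig position from the read bases covering it.

    column: iterable of single-character bases, one per read.
    mode: "vote", "unknown" or "ambiguity" (hypothetical option names).
    """
    bases = [b.upper() for b in column if b.upper() in VALID_BASES]
    if not bases:
        return "N"  # no usable base at this position
    distinct = frozenset(bases)
    if len(distinct) == 1:
        return bases[0]  # all reads agree, so there is no conflict
    if mode == "vote":
        # Majority decides; on a tie the base seen first wins (assumption).
        return Counter(bases).most_common(1)[0][0]
    if mode == "unknown":
        return "N"
    if mode == "ambiguity":
        return IUPAC[distinct]
    raise ValueError(f"unknown mode: {mode!r}")
```

For a position covered by the read bases A, A and G, the vote gives A, the unknown-nucleotide option gives N, and the ambiguity option gives R (the IUPAC code for A or G).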

For paired-end reads, the user specifies a distance interval between the two sequences in a pair. The assembly process uses this interval to determine how far apart it can expect the two reads to be.
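
A minimal sketch of such a distance check is shown below. The function name pair_within_interval and the choice to measure distance between the two mapped start positions are assumptions for illustration; the text does not state how the Workbench measures the distance.

```python
def pair_within_interval(pos_first, pos_second, min_distance, max_distance):
    """Check whether two mapped reads of a pair lie an acceptable distance apart.

    pos_first, pos_second: mapped start positions on the reference (assumed
    convention; other conventions, such as outer-end distance, are possible).
    min_distance, max_distance: the user-specified interval, inclusive.
    """
    if min_distance > max_distance:
        raise ValueError("min_distance must not exceed max_distance")
    distance = abs(pos_second - pos_first)
    return min_distance <= distance <= max_distance
```

With an interval of 180 to 250, for example, a pair mapped at positions 1000 and 1200 would be accepted, while a pair mapped at 1000 and 1600 would not.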