I am a first-year Computer Science graduate student at UIC. Last year, I graduated with a bachelor's degree in Computer Science and a minor in Linguistics. For the past 3 years, I have been working as a full-stack developer at LeadFuze, a marketing startup. In terms of undergraduate research experience, I have worked on statistical topic models (mostly LDA) and dialogue systems. I am fluent in Java, Python, GoLang, and C#. During the research analysis phase of my projects, I also had the chance to work in RStudio and SAS. Why am I taking this course? I am obsessed with animals, especially cats, and I wanted to learn a bit more about how computer science can be applied to the life sciences. Right now, I am set on doing my graduate research thesis on NLP, but I want to see if this course will change my mind.
Hello there :) I am a Master's student in the Bioengineering department at UIC with a specialization in Bioinformatics. I have a Bachelor's degree in Computer Science. I have always enjoyed biology, so I chose to get into bioinformatics to apply my computer science skill set in a field I like. I have experience working with various research teams and medical data. I chose this course because it allows me to delve into the computational aspects of biology.
I'm a Master's student in Computer Science. I like to understand why things work the way they do.
I love science, for it can show how intelligent (or dumb) humans can be ;) Curiosity killed the cat, but satisfaction brought it back. => Let nothing hold you back from learning!
I took this course out of my genuine love for biology, which I have had since an early age. I hope to acquire new knowledge and have some fun along the way!
Find a biological database of your choice and download 2 closely related entries from it. How did you determine that the entries are closely related? What is the "distance" between the entries? How did you define and compute it?
For our very first project task, we selected an exceedingly simple domain: the Iris Data Set. This dataset contains measurements of various physical traits of flowers from different species belonging to the Iris genus.
The dataset can be found here.
The dataset we have picked contains information about samples belonging to the Iris genus of flowers. Based on their attributes, or traits, we can predict which species each sample belongs to, making it an ideal dataset for learning classification.
There are 150 instances of different flowers. For each instance, there exist 4 attributes and a class attribute which tells us which species the instance actually belongs to. Each attribute is a numerical value that quantifies the particular trait and the class label is a categorical value.
Below is an example of what the dataset looks like for the first few instances.
Our initial analysis allowed us to conclude that there were no missing values. This makes our data as accurate and reliable as possible. We also found that there were no duplicate values. This is great news because it avoids any confusion due to unusual similarity scores and saves unnecessary computation. Since every data instance was unique, finding the similarity between 2 entries was straightforward and gave reliable values.
As can be seen from the image of sample data above, all of the attributes are useful in calculating similarity. (Hooray!) Every instance, i.e., flower, has its own sepal width, sepal length, petal width, and petal length, and all of these properties can be used to calculate similarity. The class label is also instrumental in determining whether 2 instances are similar, but in order to keep a uniform measure of similarity, we decided to exclude it from our initial computations.
We used a Python script with pandas to import our dataset and prepare it for computing similarity (i.e., exclude the class label).
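A minimal sketch of that preparation step (the file name and column names are assumptions based on the UCI distribution of the dataset, not necessarily our exact script):

```python
import pandas as pd

# Column names follow the UCI Iris documentation (assumed here).
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
iris = pd.read_csv("iris.data", header=None, names=cols)

# Drop the categorical class label so only the four numeric traits remain.
features = iris.drop(columns=["class"])
print(features.head())
```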
After preliminary analysis, it was easy to conclude that we needed to find some sort of distance measure. We tried out Manhattan distance as well as Euclidean distance between the attribute values of different instances.
Our data instances have 4 attribute values each, all of them numerical. So, the best way to find the similarity between 2 entries was to treat both instances as vectors (1-D arrays) and compute the Manhattan and Euclidean distances between them.
Manhattan distance = Σᵢ |Val_ij − Val_ik|
Euclidean distance = √( Σᵢ (Val_ij − Val_ik)² )
where Val_ij is the value of attribute i for data instance j and Val_ik is the value of attribute i for data instance k.
Both turned out to be equally effective distance measures, but it seemed more convenient to use Euclidean as our primary distance measure purely by convention: the L2 norm is the most popular distance measure.
We achieved these aforementioned computations through simple dataframe operations in Python.
So, as we can see from the above example, the Euclidean and Manhattan distances between the two entries [73] and [98] are 1.997 and 3.099, respectively.
We used the distance module from SciPy to calculate both the Euclidean and Manhattan distances. The cityblock and euclidean functions used in the example above are from that module.
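For illustration, a small sketch of such a pairwise computation with those SciPy functions, reusing the `features` dataframe prepared earlier (the row indices 73 and 98 mirror the example above):

```python
from scipy.spatial.distance import cityblock, euclidean

# Each flower is treated as a 4-dimensional vector of its attribute values.
a = features.iloc[73].values
b = features.iloc[98].values

manhattan = cityblock(a, b)   # L1 norm: sum of absolute differences
euclid = euclidean(a, b)      # L2 norm: root of summed squared differences
print(manhattan, euclid)      # ~3.099 and ~1.997, matching the example above
```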
The following table represents the first 26 records from the flower class Iris-Versicolor.
ID | Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) | Class |
---|---|---|---|---|---|
a | 7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
b | 6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
c | 6.9 | 3.1 | 4.9 | 1.5 | Iris-versicolor |
d | 5.5 | 2.3 | 4.0 | 1.3 | Iris-versicolor |
e | 6.5 | 2.8 | 4.6 | 1.5 | Iris-versicolor |
f | 5.7 | 2.8 | 4.5 | 1.3 | Iris-versicolor |
g | 6.3 | 3.3 | 4.7 | 1.6 | Iris-versicolor |
h | 4.9 | 2.4 | 3.3 | 1.0 | Iris-versicolor |
i | 6.6 | 2.9 | 4.6 | 1.3 | Iris-versicolor |
j | 5.2 | 2.7 | 3.9 | 1.4 | Iris-versicolor |
k | 5.0 | 2.0 | 3.5 | 1.0 | Iris-versicolor |
l | 5.9 | 3.0 | 4.2 | 1.5 | Iris-versicolor |
m | 6.0 | 2.2 | 4.0 | 1.0 | Iris-versicolor |
n | 6.1 | 2.9 | 4.7 | 1.4 | Iris-versicolor |
o | 5.6 | 2.9 | 3.6 | 1.3 | Iris-versicolor |
p | 6.7 | 3.1 | 4.4 | 1.4 | Iris-versicolor |
q | 5.6 | 3.0 | 4.5 | 1.5 | Iris-versicolor |
r | 5.8 | 2.7 | 4.1 | 1.0 | Iris-versicolor |
s | 6.2 | 2.2 | 4.5 | 1.5 | Iris-versicolor |
t | 5.6 | 2.5 | 3.9 | 1.1 | Iris-versicolor |
u | 5.9 | 3.2 | 4.8 | 1.8 | Iris-versicolor |
v | 6.1 | 2.8 | 4.0 | 1.3 | Iris-versicolor |
w | 6.3 | 2.5 | 4.9 | 1.5 | Iris-versicolor |
x | 6.1 | 2.8 | 4.7 | 1.2 | Iris-versicolor |
y | 6.4 | 2.9 | 4.3 | 1.3 | Iris-versicolor |
z | 6.6 | 3.0 | 4.4 | 1.4 | Iris-versicolor |
Plotting the Manhattan and Euclidean distances between the (n-1)th and the nth record, where n = 26.
Looking at the line graph above, there seems to be a correlation between the Manhattan and Euclidean distance measures. We will now use a scatter plot of Manhattan distance as a function of Euclidean distance to see whether they correlate. If they do, we will also try to find a regression line that best describes their relation. Note that N = 51.
Result
R Calculation
r = 19.87 / √((11.138)(37.087)) = 0.9776
The value of r is 0.9776. Therefore, in the Iris domain, there is a strong positive correlation between the Manhattan and Euclidean distance measures.
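As a sketch of how r and the regression line can be computed (the choice of consecutive-record pairs here is illustrative; the exact pairs follow the plot described above):

```python
import numpy as np
from scipy.stats import linregress
from scipy.spatial.distance import cityblock, euclidean

# Distances between consecutive records, reusing the prepared dataframe.
pairs = [(features.iloc[i].values, features.iloc[i + 1].values) for i in range(51)]
x = np.array([euclidean(a, b) for a, b in pairs])   # Euclidean distances
y = np.array([cityblock(a, b) for a, b in pairs])   # Manhattan distances

fit = linregress(x, y)
print(f"r = {fit.rvalue:.4f}")                          # correlation coefficient
print(f"y = {fit.intercept:.3f} + {fit.slope:.3f}*x")   # least-squares regression line
```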
Regression Line: y = −0.074 + 1.784x

We were instructed to review a paper from the RECOMB conference, so we chose a paper from RECOMB 2013 titled "IPED: Inheritance Path-based Pedigree Reconstruction Algorithm Using Genotype Data". The paper can be found here.
The paper focuses on finding an efficient way to do pedigree reconstruction, i.e., inference of family trees, which is a fundamental problem in genetics. It discusses the drawbacks of existing algorithms with respect to space and time complexity. Existing algorithms perform poorly when the number of generations exceeds a certain threshold (4, according to the paper) and also perform poorly for inbred populations. The paper puts forward an algorithm that efficiently handles both inbred and outbred populations as well as very large numbers of generations.
The paper first introduces the concept of pedigree graphs and the necessary nomenclature associated with them. It then discusses evaluating the relationship of pairs of individuals for both kinds of pairs: extant and ancestral. The IBD (identity-by-descent) measure is central to this evaluation.
For outbred populations, M = 2(g − 1), where g is the generation; for inbred populations, M is calculated by an algorithm. For chromosome pairs (i, j), they calculate the average V(i, j) over all combinations of i and j.
To determine the relationship of an ancestral pair (k, l), a similar strategy is used, but the weighted average of the relationship scores over all possible pairs of chromosomes in the two ancestral individuals is calculated.
Also, to find E(IBD) and Var(IBD), one needs to find the number of meioses (M). The authors propose an algorithm that deals with inheritance paths and their lengths from a shared ancestor. They use Inheritance Path Pairs (IPPs) to represent the length of a path and the number of such paths between an ancestor and an extant individual. They use a hash table to store these IPP tuples, which speeds up lookups. To compute the IPPs, they use a dynamic programming algorithm that uses the IPPs of the current generation to calculate the IPPs of the next generation: the IPPs of a child are simply a merge of the IPPs of its parents. This algorithm has time complexity linear in the pedigree height.
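To make the idea concrete, here is a toy sketch of the child-from-parents merge as we understand it (our own illustrative reading, not the authors' code; an IPP is represented here as a mapping from path length to the number of inheritance paths of that length):

```python
from collections import defaultdict

def child_ipps(parent_ipps):
    """Merge the parents' inheritance-path pairs into the child's: each parental
    path of length L contributes a child path of length L + 1, and the counts of
    equal-length paths are summed (hash-table lookups keep this fast)."""
    merged = defaultdict(int)
    for ipps in parent_ipps:
        for length, count in ipps.items():
            merged[length + 1] += count
    return dict(merged)

# Toy example: each parent is two meioses away from the shared ancestor.
mother = {2: 1}
father = {2: 1}
print(child_ipps([mother, father]))   # {3: 2} -> two inheritance paths of length 3
```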
Given the IPPs for two individuals, the number of meioses (M) can be calculated, which in turn can be used to calculate the expected value and variance of IBD. Once the relationship scores have been calculated, siblings can be assigned to the same parents with the help of a max-clique algorithm that finds a maximum clique of siblings. The reconstruction is considered accurate only if the distance between each pair of extant individuals is the same in the original and the reconstructed pedigree. Results from the paper suggest that, in outbreeding cases, IPED produced better pedigree charts and was also faster than the traditional COP, and it does not rely on any threshold value. In inbreeding cases, IPED performs far better than the CIP or COP algorithms as the number of generations increases. In conclusion, the authors state that IPED is a very efficient algorithm for pedigree reconstruction that is independent of threshold values and sampling. They specifically highlight how well the algorithm performs for inbreeding cases.
We think that the problem was interesting and fun. IPED did a great job when tested against the inbred cases, but not considering "half-siblings" makes the problem less complex and the solution incomplete. The authors also state that the algorithm is not optimal; going for an approximation makes the algorithm vulnerable to breaking in specific cases, such as the presence of common inheritance path pairs among individuals. The approximation of the mean and variance of IBD length from the average number of meioses renders it non-optimal.
We think that the approach is commendable, but we do not yet have in-depth knowledge about the cases in question, so we don't have a case to present for a better approach. We were impressed with the drastic improvement in accuracy that IPED brought to reconstructing pedigree graphs for inbred populations.
Clustered regularly interspaced short palindromic repeats, known as CRISPR, were first found in bacterial/prokaryotic cells. They are part of a built-in natural immune system that the bacterial cell uses to fight off viral invasions.
In the 1980s, scientists started noticing a strange pattern in some bacterial genomes. They found short palindromic repeating sequences, each 28-37 base pairs long, separated by other short sequences (32-38 base pairs) that were later named "spacer sequences". These spacers turned out to be copies of DNA from viruses that had previously attacked the bacterium or its ancestors. The repeats together with the spacers are known as the "CRISPR locus".
During a phage attack, the virus injects its own DNA into the bacterial cell. In defense, the bacterium responds by activating the CRISPR-Cas9 system, a natural defense mechanism within its cell. There are 3 steps to this immune system.
Since proteins are mostly universal, this discovery has opened up the possibility of editing/mutating the genome of any organism by injecting the Cas9 protein along with a guide RNA (CRISPR) into the target cells.
Though CRISPR has been one of the greatest breakthroughs in technology, it comes with many downsides. It is one of the easiest gene-engineering techniques discovered, but scientists have also found that it causes many mutations that are not supposed to occur. Changes in unintended target areas are often referred to as off-target effects.[4] One of the main reasons for off-target mutations is that CRISPR-Cas9 cleaves DNA sequences that are merely similar to the target sequence.[3] To reduce these problems, we could try to maximize cleavage efficiency and minimize off-target effects. Our survey considers various research papers that have been published to address exactly this.
Our aim is to learn in detail about CRISPR, how it works, and to survey the many machine learning methods that have been applied to making CRISPR cleavage efficient and predicting off-target effects of CRISPR. We plan to carry out our survey gradually. First, we will focus on learning in depth about CRISPR-Cas9's working mechanism and how it is used in genetic engineering. Then we will focus on the machine learning methods that have been implemented to make CRISPR more efficient and accurate. We do not plan to implement the techniques ourselves, due to the lack of data and of the technology required to test the accuracy of the methods even if they were implemented. We will put our results on our website and also present our findings in class.
This paper discusses how CRISPR systems with dual-RNA:Cas9 can be used for editing bacterial genomes. The authors' argument mainly revolves around the fact that this CRISPR system can be used for genome editing rather than merely cleaving the genome at specific points.
The authors reprogram the specificity by changing the sequence of the short CRISPR RNA (crRNA) to make single- and multi-nucleotide changes carried on editing templates. The authors worked with S. pneumoniae and E. coli in their analyses. It is shown that mutations can be introduced by transforming a template DNA fragment that recombines into the genome and eliminates recognition of the target by the endonuclease (Cas9). It is also observed that including several different crRNAs allows multiple mutations to be introduced simultaneously.
To introduce specific changes in the genome, one must use an editing template carrying mutations that avert cleavage by the endonuclease. This is easy to achieve when performing gene insertion. But when gene fusion or the generation of single-nucleotide mutations is desired, this is possible only by introducing mutations in the editing template that alter either the PAM or the protospacer sequence.
When mutating the PAM, there is only one criterion to follow: avoid mutating NGG to NAG or NNGGNN. All other random mutations of the PAM are functional. Mutations in the seed sequence (the 8-12 bp adjacent to the PAM) can also prevent cleavage, but there are restrictions on which nucleotide changes work at each position, and these restrictions vary between spacers. Hence, mutating the PAM, when possible, should be the preferred strategy.
The specificity and versatility of editing using the CRISPR-Cas system rely on several unique properties of the Cas9 endonuclease: (i) its target specificity can be programmed with a small RNA, without the need for enzyme engineering, (ii) target specificity is very high, determined by a 20-bp RNA-DNA interaction with low probability of nontarget recognition, (iii) almost any sequence can be targeted, the only requirement being the presence of an adjacent NGG sequence, (iv) almost any mutation in the NGG sequence, as well as mutations in the seed sequence of the protospacer, eliminates targeting.
The CRISPR/Cas9 system is used in gene editing, making possible gene knock-out, knock-in, gene inactivation, non-homologous end-joining, as well as homology-directed repair. Though CRISPR/Cas9 provides so many ways of editing genes, there are also a few drawbacks, one of them being the possibility of off-target slicing of the gene. The authors' goal is to develop an interactive tool that is easy to use and can evaluate and identify all candidate sgRNA target sites along with their off-target quality.
To achieve this, the authors set up a web tool called CRISPR/Cas9 target online predictor (CCTop), built with HTML and CGI scripts and using Python for all the processing.
They used the Bowtie short-read aligner to search for off-target sites. It takes in an index and a few parameters, and outputs the sequences that align with the input sequence. These alignments are then scored to indicate the likely stability of the sgRNA/DNA heteroduplex. They also found that this likelihood decreases as a mismatch gets closer to the PAM sequence.
The formula used to score an off-target sequence is

Score_off-target = Σ_mismatches 1.2^pos

where pos is the position of each mismatch counted from the 5' end.
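A small sketch of this scoring rule (the 1-based indexing from the 5' end and the toy sequences are our assumptions for illustration):

```python
def cctop_off_target_score(sgrna, site):
    """Sum 1.2**pos over all mismatch positions, counting pos from the 5' end."""
    return sum(1.2 ** pos
               for pos, (a, b) in enumerate(zip(sgrna, site), start=1)
               if a != b)

# Toy example with mismatches at positions 5 and 10 (hypothetical sequences).
print(cctop_off_target_score("GACGTTAGCA", "GACGATAGCT"))
```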
After computing scores, each off-target site is assigned to the gene it is closest to. For this, exon coordinate files for each organism are handled with the bx-python library used in this project. Essentially, the off-target sites are checked against nearby genes: only exons closer than 100 kb to a target site are assigned as its gene, otherwise the gene is set to N/A; if they overlap, the distance is set to 0. After gene assignment, each sgRNA target site is scored. The sgRNA target sites are ranked based on how many predicted off-target sites they have and how those sites affect the off-target genes. The ranking uses a single score calculated with the following formula.
Score = ( Σ_off-targets [ log10(dist) + Score_off-target ] ) / total_off_targets − total_off_targets

where dist is the distance of each off-target site to its closest exon. Off-target sites with no associated exon are not considered in this score.
Conclusion: The web tool was tested for gene inactivation as well as non-homologous and homology-directed repair. The tool is also constantly improved based on available resources. CCTop was designed to evaluate and present users with possible sgRNA target sites, and it does what it promises. It is user-friendly for first-time, amateur users while at the same time providing enough flexibility for experts.
Overall, the tool has been well received in the scientific community.
CRISPR-Cas9 is an adaptive immune system present in bacteria to fight against phage attacks. This self-defense mechanism within the bacterial cell uses a single guide RNA (sgRNA) along with the Cas9 endonuclease enzyme to eliminate bacteriophage/virus DNA within the cell. The Cas9 nuclease cleaves DNA that is complementary to the sgRNA. The concept of using an sgRNA with the Cas9 protein to remove DNA gave rise to the human-controlled CRISPR-Cas9 gene-editing technique.
Compared to other gene-editing techniques like zinc-finger nucleases and transcription activator-like effector nucleases, CRISPR-Cas9 is a more modern and simpler technique providing more versatility in genome engineering. However, given the age of the technique, CRISPR-Cas9 has not been studied as extensively as older techniques. Scientists found that CRISPR can cleave unwanted DNA sequences that are not an exact match to the input sgRNA. These mismatched sites, known as off-targets, can cause serious genetic damage.
Experiment Purpose & Design
This paper investigates a number of techniques for designing an optimal sgRNA that minimizes off-targets and maximizes activity. In this summary, we will discuss the sequence-related and point-wise mutation analysis the authors performed while examining CRISPR-Cas9 activity.

They conducted the CRISPR-Cas9 experiment on mammalian cells using a library targeting the coding sequence of human CD33 with all possible sgRNAs, regardless of PAM.
Results
They categorized sgRNA mutations in 3 ways.
Cutting Frequency Determination Scoring (CFD)
The authors created a 2-dimensional matrix in which each cell represents the percent activity for a given mismatched nucleotide pair and mismatch position.
The paper's definition:
Cutting Frequency Determination (CFD) score is calculated by using the percent activity values from the matrix. For example, if the interaction between the sgRNA and DNA has a single rG:dA mismatch in position 6, then that interaction receives a score of 0.67. If there are two or more mismatches, then individual mismatch values are multiplied together. For example, an rG:dA mismatch at position 7 coupled with an rC:dT mismatch at position 10 receives a CFD score of 0.57 × 0.87 = 0.50.
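A minimal sketch of that multiplication (only the two matrix entries quoted above are filled in; the full percent-activity matrix comes from the paper's supplementary data):

```python
# (mismatch type, position) -> percent activity; values quoted in the text above.
percent_activity = {
    ("rG:dA", 7): 0.57,
    ("rC:dT", 10): 0.87,
}

def cfd_score(mismatches):
    """Multiply the percent-activity values of all observed mismatches."""
    score = 1.0
    for mismatch in mismatches:
        score *= percent_activity[mismatch]
    return score

print(cfd_score([("rG:dA", 7), ("rC:dT", 10)]))   # 0.57 * 0.87 ~= 0.50
```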
In the 1980s, scientists noticed short palindromic repeats in bacterial/prokaryotic genomes. These palindromic sequences were separated by short sequences (32-38 nucleotide base pairs) later known as "protospacers". In 1993, Francisco Mojica discovered that the protospacer sequences separating the palindromes in the bacterial genome were identical to DNA snippets from the genomes of bacteriophages and viruses. He coined the term "CRISPR" (Clustered Regularly Interspaced Short Palindromic Repeats) and named the bacterial genomic region containing these palindromes and protospacers the "CRISPR locus" or "CRISPR arrays". This led to the discovery that bacteria keep DNA copies of bacteriophages in the CRISPR locus and later use an adaptive immune system to remove those DNA snippets from their own genome. When the virus attacks the bacterium again, it produces RNA segments complementary to the protospacers in the CRISPR arrays. These RNA sequences are also known as "single guide RNAs", aka sgRNAs. To remove each viral DNA snippet, the bacterium uses a Cas9 complex containing an sgRNA along with the Cas9 endonuclease enzyme. The sgRNA guides Cas9 to bind to a PAM sequence located on the opposite strand of the target DNA. Once the sgRNA finds the complementary DNA sequence, Cas9, being a nuclease, edits the DNA by cutting and removing the double-stranded complementary DNA from the bacterial genome. This naturally occurring self-defense mechanism in bacterial cells is known as "CRISPR-Cas9".
The CRISPR-Cas9 system in bacteria has been adapted by humans as a gene-editing technique. CRISPR-Cas9 works similarly in a research laboratory environment. Scientists start by creating a single guide RNA sequence complementary to their target DNA/gene. They then bind the sgRNA to the Cas9 enzyme to create the Cas9 complex, which is later injected into the target cells. There are a number of techniques for delivering the Cas9 complex into cells and the body.
Once the Cas9 complex is injected into the cell, it binds to the target DNA sequence. This can be used to mutate genes or to turn them on or knock them out by adding, removing, or replacing nucleotide base pairs at the target site. Even though the natural CRISPR-Cas9 system in bacteria only removes DNA, scientists have found ways to replace and add DNA sequences by integrating it with other natural mechanisms like a cell's homology-directed repair.
Even though CRISPR-Cas9 is simpler and more time-efficient than older gene-editing techniques, it is not as thoroughly studied. CRISPR-Cas9 is not perfect, and it can cleave unintended DNA sequences that are not complementary to the sgRNA. These sites are known as "off-targets". Previous studies found that CRISPR-Cas9 can tolerate sequence mismatches. This paper introduces a novel machine-learning algorithm called CRISTA that considers a number of features to estimate the propensity of a DNA site being cleaved by a given sgRNA.
Earlier research on this topic analyzed off-target effects on DNA sequences that were similar to the sgRNA sequence. Since these target sites were pre-selected based on their sequence similarity, the results of those papers were biased and limited to pairwise sequence similarity.
As a result, unbiased research was conducted on genome-wide data, where target sites were included whether or not they were similar to the sgRNA. This led to the discovery of other features (not limited to sequence similarity) that impact the cleavage propensity of genomic sites. Scientists found that off-targets can be located at unexpected sites.
Previous computational work has also been done on designing sgRNAs to minimize off-targets by calculating scores such as CCTop and CFD (details are in the Progress tab). These algorithms, however, do not go beyond pairwise sequence similarity and neglect features concerning the genomic context surrounding the target site and the thermodynamics of the sgRNA.
The discoveries made by these unbiased research projects make it clear that, in order to predict the cleavage frequency of a target genomic site given an sgRNA, a number of features need to be considered, not just basic sequence features. CRISTA, the proposed machine-learning solution, therefore incorporates a wide set of features into its predictive model to calculate the cleavage propensity of a genomic site given an sgRNA.
Since the authors did not want to make the mistake of pre-selecting genomic sites based on their similarity to the sgRNA, they chose to use datasets from experiments that focused on genome-wide data. This not only ensured that their research was free of the pairwise-similarity bias, it also made it possible to account for important features affecting cleavage frequency that were neglected by previous computational methods. They used five different datasets; each dataset had a list of sgRNAs and the cleaved genomic/DNA sequences along with the number of times each was cleaved. All datasets combined had 25 unique sgRNAs and 872 genomic sites cleaved by those sgRNAs.
Since the datasets were obtained under different experimental conditions, the cleavage frequencies were not directly comparable. The authors used linear regression to transform the cleavage frequencies to a common scale.
Previous research found that counting mismatches between the sgRNA and the target DNA is not enough to capture sequence similarity, since the DNA and RNA can also have bulges/gaps. This means the cleaved DNA could be up to 3 nucleotide base pairs longer or shorter than the complementary sgRNA: if the sgRNA is 20 bp, the cleaved DNA could be 17-23 bp long. To account for these bulges, the authors modified the Needleman-Wunsch pairwise alignment algorithm. Since sequence similarity is not just a matter of mismatches, rules had to be created to decide the penalties and whether a mismatch takes precedence over a gap. To learn the best parameters, different combinations of parameter values were evaluated against the cleavage frequencies, and the parameter values giving the highest averaged squared Pearson correlation (r²) were chosen.
CRISPR Target Assessment (CRISTA) is a tool for predicting the cleavage propensity of potential genomic targets given a specified sgRNA. It is based on a regression model using the Random Forest algorithm and, in doing so, allows us to examine the important features that explain the variation in cleavage efficiency. The machine-learning algorithm relies on assembling a training dataset that encompasses a range of data inputs as well as incorporating a set of features that can be used to predict cleavage efficiencies. The algorithm can also be used to distinguish cleaved from uncleaved sites using a classification learning scheme.
The training dataset initially comprised a set of uncleaved sites, representing sites that were not cleaved by each sgRNA. Theoretically, one could take the entire genome, excluding the set of cleaved sites of each sgRNA, to represent the uncleaved set, but only uncleaved genomic sites with sufficient sequence complementarity to each sgRNA are included, for meaningful analysis. Each sgRNA from the dataset was aligned to sites following the NGG or NAG motifs in the genome using the modified Needleman-Wunsch algorithm described above. Then, only the sites with an alignment score > 14.75 (95% of cleaved instances have on average 16.7 matched bases) were retained for further analysis. The number of sites in the uncleaved set varies from 3,000 to 70,000 per sgRNA, and it is noted that this procedure could introduce some noise for targets where the reference genome is not identical to the genome used in the experiments.
The combined training dataset was assembled from experimentally validated cleavage sites together with uncleaved sites. The dependent variable was the cleavage efficiency for each cleaved sample and zero for each uncleaved sample. To keep the data as unbiased as possible, the majority class (uncleaved sites) was undersampled and the minority class (cleaved sites) was oversampled. Each set of cleaved samples (targets corresponding to a single sgRNA) was oversampled using bootstrapping to twice its original size, and an equally sized uncleaved set was chosen at random to form the combined training set. This process was repeated, and the results were averaged over 100 sampled datasets.
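A rough sketch of one such resampling round, under our reading of the procedure (the index arrays and random seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_training_set(cleaved_idx, uncleaved_idx):
    """Bootstrap the cleaved targets of one sgRNA to twice their original size,
    then draw an equally sized random subset of the uncleaved sites."""
    boot = rng.choice(cleaved_idx, size=2 * len(cleaved_idx), replace=True)
    sub = rng.choice(uncleaved_idx, size=2 * len(cleaved_idx), replace=False)
    return np.concatenate([boot, sub])

# Toy indices: 10 cleaved sites, 5000 candidate uncleaved sites for one sgRNA.
print(resample_training_set(np.arange(10), np.arange(10, 5010))[:25])
```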
A wide range of possible explanatory attributes was computed. These include features specific to the target site (e.g., type of PAM sequence, nucleotide composition and GC content, chromatin structure, CpG islands, gene expression levels of coding regions), features specific to the sgRNA (e.g., secondary structure), and features concerning the similarity between the sgRNA and the target (e.g., number and spatial distribution of mismatches and bulges).
Given the training dataset and set of features, CRISTA is implemented using the RandomForestRegressor in Python's scikit-learn library. The score provided by CRISTA represents the log number of sequencing reads identified, which in turn serves as a proxy for cleavage frequency, referred to as the inferred cleavage propensity. This score is continuous but can also be used for binary classification into potentially cleaved and uncleaved sites for a given sgRNA, by choosing a fixed threshold based on the scores CRISTA assigns to observed cleaved sites. For example, 95% of cleaved sites in the dataset used by CRISTA obtained a score higher than 0.12, while 50% surpassed a score of 0.4. In the validation set, the corresponding thresholds were 0.39 and 0.54. These values can be used to set a lenient or strict threshold.
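A minimal sketch of this regression setup with scikit-learn (the feature matrix, response values, and hyperparameters here are stand-ins, not CRISTA's actual ones):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 12))   # stand-in feature matrix: one row per (sgRNA, site) pair
y = rng.random(200)         # stand-in transformed cleavage frequencies (0 if uncleaved)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

scores = model.predict(X[:5])   # inferred cleavage propensities
cleaved = scores > 0.12         # lenient classification threshold quoted above
print(scores, cleaved)
```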
The prediction performance of CRISTA was evaluated using cross-validation. A leave-one-sgRNA-out cross-validation procedure was devised such that in each iteration the samples of a single sgRNA were excluded and used as the test set. The algorithm, trained on the rest of the data, was then used to predict cleavage propensities for the test set. Each iteration of the cross-validation has a preliminary step: the pairwise alignment parameters were first optimized, as previously described, using only the training set, and then used to recompute the pairwise alignment features for the training and test sets. Several metrics were used for evaluation, namely AUC-ROC and Pearson's correlation coefficient (r²); see below for detailed results. The evaluation is computed over the original set of cleaved sites for each sgRNA (without bootstrapping) and an equally-sized sample of uncleaved sites.
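A sketch of the leave-one-sgRNA-out loop using scikit-learn's LeaveOneGroupOut (synthetic stand-in data; the per-fold re-optimization of alignment parameters described above is omitted):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X, y = rng.random((200, 12)), rng.random(200)   # stand-ins, as in the sketch above
groups = rng.integers(0, 10, size=200)          # which sgRNA each sample belongs to

r2_per_sgrna = []
for train, test in LeaveOneGroupOut().split(X, y, groups):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train], y[train])
    r, _ = pearsonr(y[test], model.predict(X[test]))
    r2_per_sgrna.append(r ** 2)

print(np.mean(r2_per_sgrna))   # averaged r^2 across held-out sgRNAs
```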
To learn about the independent importance of the various features, the number of features can be reduced by applying a forward selection procedure. Features are added iteratively by examining the performance of leave-one-out cross-validation for incremental sets of features: start with the feature that gives the highest r² score on its own, and in every iteration add the next best feature to the set. This procedure is repeated for 15 iterations, Random Forest regression is applied to the resulting set of features, and the relative importance of each feature is extracted.
The introduction of gaps affected 18% of the targets in the training dataset. Concretely, 87 of 491 sites contain 1.1 bulges on average, or, if we consider the entire dataset, 175 out of 872 sites contain 1.23 bulges on average.
Allowing gaps gave r² = 0.34 between the pairwise alignment score and the observed cleavage frequencies, versus r² = 0.27 when no gaps were allowed. The sequence-alignment parameters that gave the best results were match = 1, mismatch = 0, and gap = -1.25. Mismatches are thus not penalized, while matches are rewarded highly, so a longer complementary stretch is preferred.
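For reference, a generic Needleman-Wunsch global alignment score with those parameter values (a plain textbook version; the paper's specific modifications for RNA/DNA bulges and PAM handling are not reproduced here):

```python
import numpy as np

MATCH, MISMATCH, GAP = 1.0, 0.0, -1.25   # best-performing values reported above

def alignment_score(sgrna, site):
    """Global alignment score via standard Needleman-Wunsch dynamic programming."""
    n, m = len(sgrna), len(site)
    dp = np.zeros((n + 1, m + 1))
    dp[:, 0] = np.arange(n + 1) * GAP
    dp[0, :] = np.arange(m + 1) * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1, j - 1] + (MATCH if sgrna[i - 1] == site[j - 1] else MISMATCH)
            dp[i, j] = max(diag, dp[i - 1, j] + GAP, dp[i, j - 1] + GAP)
    return dp[n, m]

# A one-base DNA bulge costs one gap but keeps the remaining 20 bases matched.
print(alignment_score("GACGTTAGCATCGATCGATC", "GACGTTAGCAATCGATCGATC"))
```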
Including bulges also meant reconsidering the PAM sites before finalizing the alignments; this led to 33, 17, and 22 shifts in the Tsai, Kleinstiver, and Frock datasets, respectively.
Also, the pairwise similarity score accounts for only 34% of the variation among cleaved sites, so the decision was made to include more features in the prediction process.
CRISTA uses a Random Forest regression model. The authors evaluated the prediction performance of CRISTA with the leave-one-sgRNA-out cross-validation procedure.
Consistency was tested by calculating the prediction accuracy of all the tools against all sgRNAs: CRISTA r² = 0.8 (sd = 0.13), CFD r² = 0.65 (sd = 0.2), OptCD r² = 0.32 (sd = 0.28), CCTop r² = 0.46 (sd = 0.25). The averaged Spearman correlation coefficients were CRISTA = 0.88, CFD = 0.77, OptCD = 0.76, and CCTop = 0.72.
First, the study was not biased towards any of the data sources. The authors conducted a leave-one-study-out cross-validation to check the accuracy of CRISTA. This also allowed them to check the compatibility between the different datasets.
The performance of the model was evaluated against data resembling the training set as well as against new data.
From the results, it was decided to eliminate the Frock et al. dataset, as it was not compatible with the other datasets. For all the other datasets, when used as test data, CRISTA gave results where r² was always greater than 0.8 and the AUC for the ROC and PRC curves was close to 1.
The prediction accuracy on known data and on new data did not differ much, implying that the algorithm did well on both known and brand-new data.
The inclusion of uncleaved sites had an impact on the prediction performance of the model. This was tested with the leave-one-sgRNA-out procedure while retaining only cleaved sites in the training set; the accuracy was lower than that of the original CRISTA.
A forward selection process was used to select the group of important features. Three major clusters of features were noticed, related to pairwise similarity, nucleotide content, and DNA geometry/PAM context, as described below. The pairwise alignment score was the first feature to be selected.
Among the pairwise similarity features, the number of mismatches (with many attributes describing the type of mismatch), their positions, and the number of DNA/RNA bulges appear important.
Among the nucleotide-content features, the positions of certain nucleotides were recognized as important factors affecting the sensitivity of CRISPR-Cas9. The nucleotide in the second position upstream of the PAM, the couple of nucleotides around positions 4-5 (where cleavage occurs), and the nucleotides in the first five positions upstream of the PAM all contribute to the prediction accuracy.
Among the DNA-geometry and PAM-related features, targets located within DNaseI hypersensitive sites and within an exon were selected. The measure describing the width of the minor groove around the PAM site was selected, as was DNA enthalpy, which is said to be of great importance in predicting Cas9 efficacy.
To validate the performance of CRISTA, an unbiased validation dataset is needed. Targeted-sequencing datasets from the Cho et al. and Wang et al. studies were chosen. They consist of 170 samples of on-target, off-target, and uncleaved sites across 12 sgRNAs.
CRISTA achieved an average Pearson r² = 0.68, AUC-ROC = 0.7, and AUC for precision-recall = 0.72. CRISTA performed better than the CCTop and CFD scores. OptCD scored higher than CRISTA because of its binary nature: it assigns 1 to all on-targets and 0 to all other sites, whereas CRISTA, CCTop, and CFD produce a continuous scale.
Mutagenesis: Process in which genetic information of an organism is changed, leading to mutation
Nuclease: An enzyme that cleaves the chains of nucleotides in nucleic acids into smaller units.
Transcription: A process in which DNA is read by an RNA polymerase to produce a complementary, antiparallel RNA strand called the primary transcript.
Cleave: A chemical reaction that breaks the DNA sequence into two at a specific point.
Phage: A virus that parasitizes a bacterium by infecting it and reproducing inside it.
Homologous: DNA that is identical or compatible.
Recombination: An exchange between homologous regions of DNA.
Oligonucleotide: Polynucleotide whose molecules contain a relatively small number of nucleotides.
Point Mutation: Mutation of a single nucleotide in the sequence.
Locus: A portion of the genome in bacteria.
Protospacer: Target sequence in the foreign genetic material.
Motif: A sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is thought to have, a biological significance.
PAM: Protospacer adjacent motif, part of the CRISPR target sequence.
Endonuclease: A nuclease that cleaves within a nucleotide chain (rather than from its ends).
Non-homologous end joining: After a DNA strand is sliced, if there is no donor DNA (donor DNA: specific desired target DNA) to insert in the gap, the genes will repair themselves, which might cause a gene mutation due to a missing or additional base in the sequence. This type of repair is called non-homologous end joining.
Homology directed repair: After a DNA strand is sliced, if a homologous donor DNA is present and used to rejoin the sliced DNA, the repair is called homology-directed repair.
Off-target: refers to nonspecific and unintended genetic modifications
sgRNA: Single guide RNA. It consists of two parts: a variable crRNA sequence, usually 20 base pairs long (read from the 3' to the 5' end), and a constant tracrRNA that binds to the Cas9 enzyme, which acts as molecular scissors to slice double-stranded DNA.
Heteroduplex: The more stable alignment, with a low percentage of mismatched base pairs.
Purines: the bases adenine and guanine present in DNA and RNA.