I am a first-year Computer Science graduate student at UIC. Last year, I graduated with a bachelor's degree in Computer Science and a minor in Linguistics. For the past 3 years, I have been working as a full-stack developer at LeadFuze, a marketing startup. In terms of undergraduate research experience, I have worked on statistical topic models (mostly LDA) and dialogue systems. I am fluent in Java, Python, GoLang, and C#. During the research analysis phase of my projects, I also had the chance to work in RStudio and SAS. Why am I taking this course? I am obsessed with animals, especially cats, and I wanted to learn a bit more about how computer science can be applied to the life sciences. Right now, I am set on doing my graduate research thesis on NLP, but I want to see if this course will change my mind.
Hello there :) I am a Master's student in the Bioengineering department at UIC with a specialization in Bioinformatics. I have a Bachelor's degree in Computer Science. I have always enjoyed biology, so I chose to get into bioinformatics to apply my computer science skill set in a field I like. I have experience working with various research teams and medical data. I chose this course because it allows me to delve into the computational aspects of biology.
I'm a Master's student in Computer Science. I like to understand why things work the way they do.
I love science, for it can show how intelligent (or dumb) humans can be ;) Curiosity killed the cat, but satisfaction brought it back. => Let nothing hold you back from learning!
I took this course out of my genuine love for biology, which I have had since an early age. I hope to acquire new knowledge and have some fun along the way!
Find a biological database of your choice and download 2 closely related entries from it. How did you determine that the entries are closely related? What is the "distance" between the entries? How did you define and compute it?
For our very first project task, we selected an exceedingly simple domain: the Iris Data Set. This dataset contains measurements of various physical traits of flowers from different species belonging to the Iris genus.
The dataset can be found here.
The dataset we have picked contains information about samples belonging to the Iris genus of flowers. Based on their attributes, or traits, we can predict which species each sample belongs to, making it an ideal dataset for learning classification.
There are 150 instances of different flowers. For each instance, there exist 4 attributes and a class attribute which tells us which species the instance actually belongs to. Each attribute is a numerical value that quantifies the particular trait and the class label is a categorical value.
Below is an example of what the dataset looks like for the first few instances.
Our initial analysis allowed us to conclude that there were no missing values. This makes our data as accurate and reliable as possible. We also found that there were no duplicate values. This is great news because it avoids any confusion due to unusual similarity scores and saves unnecessary computation. Since every data instance was unique, finding the similarity between 2 entries was straightforward and gave reliable values.
As can be seen from the image of sample data above, all of the attributes are useful in calculating similarity. (Hooray!) Every instance, i.e., flower, has its own sepal width, sepal length, petal width, and petal length, and all of these properties can be used to calculate similarity. The class label is also instrumental in determining whether 2 instances are similar, but in order to keep a uniform measure of similarity, we decided to exclude it from our initial computations.
We used a Python script with pandas to import our dataset and prepare it for computing similarity (i.e., exclude the class label).
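A minimal sketch of that preparation step (the file name and column names are assumptions based on the UCI distribution of the dataset, not necessarily our exact script):

```python
import pandas as pd

# Column names follow the UCI Iris documentation (assumed here).
cols = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
iris = pd.read_csv("iris.data", header=None, names=cols)

# Drop the categorical class label so only the four numeric traits remain.
features = iris.drop(columns=["class"])
print(features.head())
```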
After preliminary analysis, it was easy to conclude that we needed to find some sort of distance measure. We tried out Manhattan distance as well as Euclidean distance between the attribute values of different instances.
Our data instances have 4 attribute values each, all of them numerical. So, the best way to find the similarity between 2 entries was to treat both instances as vectors (1-D arrays) and compute the Manhattan and Euclidean distances between them.
Manhattan distance = Σᵢ |Val_ij − Val_ik|
Euclidean distance = √( Σᵢ (Val_ij − Val_ik)² )
where Val_ij is the value of attribute i for data instance j and Val_ik is the value of attribute i for data instance k.
Both turned out to be equally effective distance measures, but it seemed more convenient to use Euclidean as our primary distance measure purely by convention: the L2 norm is the most popular distance measure.
We achieved these aforementioned computations through simple dataframe operations in Python.
So, as we can see from the above example, the Euclidean and Manhattan distances between the two entries [73] and [98] are 1.997 and 3.099, respectively.
We used the distance module from SciPy to calculate both the Euclidean and Manhattan distances. The cityblock and euclidean functions used in the example above are from that module.
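For illustration, a small sketch of such a pairwise computation with those SciPy functions, reusing the `features` dataframe prepared earlier (the row indices 73 and 98 mirror the example above):

```python
from scipy.spatial.distance import cityblock, euclidean

# Each flower is treated as a 4-dimensional vector of its attribute values.
a = features.iloc[73].values
b = features.iloc[98].values

manhattan = cityblock(a, b)   # L1 norm: sum of absolute differences
euclid = euclidean(a, b)      # L2 norm: root of summed squared differences
print(manhattan, euclid)      # ~3.099 and ~1.997, matching the example above
```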
The following table represents the first 26 records from the flower class Iris-Versicolor.
ID | Sepal Length (cm) | Sepal Width (cm) | Petal Length (cm) | Petal Width (cm) | Class |
---|---|---|---|---|---|
a | 7.0 | 3.2 | 4.7 | 1.4 | Iris-versicolor |
b | 6.4 | 3.2 | 4.5 | 1.5 | Iris-versicolor |
c | 6.9 | 3.1 | 4.9 | 1.5 | Iris-versicolor |
d | 5.5 | 2.3 | 4.0 | 1.3 | Iris-versicolor |
e | 6.5 | 2.8 | 4.6 | 1.5 | Iris-versicolor |
f | 5.7 | 2.8 | 4.5 | 1.3 | Iris-versicolor |
g | 6.3 | 3.3 | 4.7 | 1.6 | Iris-versicolor |
h | 4.9 | 2.4 | 3.3 | 1.0 | Iris-versicolor |
i | 6.6 | 2.9 | 4.6 | 1.3 | Iris-versicolor |
j | 5.2 | 2.7 | 3.9 | 1.4 | Iris-versicolor |
k | 5.0 | 2.0 | 3.5 | 1.0 | Iris-versicolor |
l | 5.9 | 3.0 | 4.2 | 1.5 | Iris-versicolor |
m | 6.0 | 2.2 | 4.0 | 1.0 | Iris-versicolor |
n | 6.1 | 2.9 | 4.7 | 1.4 | Iris-versicolor |
o | 5.6 | 2.9 | 3.6 | 1.3 | Iris-versicolor |
p | 6.7 | 3.1 | 4.4 | 1.4 | Iris-versicolor |
q | 5.6 | 3.0 | 4.5 | 1.5 | Iris-versicolor |
r | 5.8 | 2.7 | 4.1 | 1.0 | Iris-versicolor |
s | 6.2 | 2.2 | 4.5 | 1.5 | Iris-versicolor |
t | 5.6 | 2.5 | 3.9 | 1.1 | Iris-versicolor |
u | 5.9 | 3.2 | 4.8 | 1.8 | Iris-versicolor |
v | 6.1 | 2.8 | 4.0 | 1.3 | Iris-versicolor |
w | 6.3 | 2.5 | 4.9 | 1.5 | Iris-versicolor |
x | 6.1 | 2.8 | 4.7 | 1.2 | Iris-versicolor |
y | 6.4 | 2.9 | 4.3 | 1.3 | Iris-versicolor |
z | 6.6 | 3.0 | 4.4 | 1.4 | Iris-versicolor |
Plotting the Manhattan and Euclidean distances between the (n-1)th and the nth record, where n = 26.
Looking at the line graph above, there seems to be a correlation between the Manhattan and Euclidean distance measures. We will now use a scatter plot of Manhattan distance as a function of Euclidean distance to see whether they correlate. If they do, we will also try to find a regression line that best describes their relation. Note that N = 51.
Result
R Calculation
r = 19.87 / √((11.138)(37.087)) = 0.9776
The value of r is 0.9776. Therefore, in the Iris domain, there is a strong positive correlation between the Manhattan and Euclidean distance measures.
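As a sketch of how r and the regression line can be computed (the choice of consecutive-record pairs here is illustrative; the exact pairs follow the plot described above):

```python
import numpy as np
from scipy.stats import linregress
from scipy.spatial.distance import cityblock, euclidean

# Distances between consecutive records, reusing the prepared dataframe.
pairs = [(features.iloc[i].values, features.iloc[i + 1].values) for i in range(51)]
x = np.array([euclidean(a, b) for a, b in pairs])   # Euclidean distances
y = np.array([cityblock(a, b) for a, b in pairs])   # Manhattan distances

fit = linregress(x, y)
print(f"r = {fit.rvalue:.4f}")                          # correlation coefficient
print(f"y = {fit.intercept:.3f} + {fit.slope:.3f}*x")   # least-squares regression line
```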
Regression Line: y = −0.074 + 1.784x

We were instructed to review a paper from the RECOMB conference, so we chose a paper from RECOMB 2013 titled "IPED: Inheritance Path-based Pedigree Reconstruction Algorithm Using Genotype Data". The paper can be found here.
The paper focuses on finding an efficient way to do pedigree reconstruction, i.e., inference of family trees, which is a fundamental problem in genetics. It discusses the drawbacks of existing algorithms with respect to space and time complexity. Existing algorithms perform poorly when the number of generations exceeds a certain threshold (4, according to the paper) and also perform poorly for inbred populations. The paper puts forward an algorithm that efficiently handles both inbred and outbred populations as well as very large numbers of generations.
The paper first introduces the concept of pedigree graphs and the necessary nomenclature associated with them. It then discusses evaluating the relationship of pairs of individuals for both kinds of pairs: extant and ancestral. The IBD (identity-by-descent) measure is central to this evaluation.
For outbred populations, M = 2(g − 1), where g is the generation; for inbred populations, M is calculated by an algorithm. For chromosome pairs (i, j), they calculate the average V(i, j) over all combinations of i and j.
To determine the relationship of an ancestral pair (k, l), a similar strategy is used, but the weighted average of the relationship scores over all possible pairs of chromosomes in the two ancestral individuals is calculated.
Also, to find E(IBD) and Var(IBD), one needs to find the number of meioses (M). The authors propose an algorithm that deals with inheritance paths and their lengths from a shared ancestor. They use Inheritance Path Pairs (IPPs) to represent the length of a path and the number of such paths between an ancestor and an extant individual. They use a hash table to store these IPP tuples, which speeds up lookups. To compute the IPPs, they use a dynamic programming algorithm that uses the IPPs of the current generation to calculate the IPPs of the next generation: the IPPs of a child are simply a merge of the IPPs of its parents. This algorithm has time complexity linear in the pedigree height.
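To make the idea concrete, here is a toy sketch of the child-from-parents merge as we understand it (our own illustrative reading, not the authors' code; an IPP is represented here as a mapping from path length to the number of inheritance paths of that length):

```python
from collections import defaultdict

def child_ipps(parent_ipps):
    """Merge the parents' inheritance-path pairs into the child's: each parental
    path of length L contributes a child path of length L + 1, and the counts of
    equal-length paths are summed (hash-table lookups keep this fast)."""
    merged = defaultdict(int)
    for ipps in parent_ipps:
        for length, count in ipps.items():
            merged[length + 1] += count
    return dict(merged)

# Toy example: each parent is two meioses away from the shared ancestor.
mother = {2: 1}
father = {2: 1}
print(child_ipps([mother, father]))   # {3: 2} -> two inheritance paths of length 3
```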
Given the IPPs for two individuals, the number of meioses (M) can be calculated, which in turn can be used to calculate the expected value and variance of IBD. Once the relationship scores have been calculated, siblings can be assigned to the same parents with the help of a max-clique algorithm that finds a maximum clique of siblings. The reconstruction is considered accurate only if the distance between each pair of extant individuals is the same in the original and the reconstructed pedigree. Results from the paper suggest that, in outbreeding cases, IPED produced better pedigree charts and was also faster than the traditional COP, and it does not rely on any threshold value. In inbreeding cases, IPED performs far better than the CIP or COP algorithms as the number of generations increases. In conclusion, the authors state that IPED is a very efficient algorithm for pedigree reconstruction that is independent of threshold values and sampling. They specifically highlight how well the algorithm performs for inbreeding cases.
We think that the problem was interesting and fun. IPED did a great job when tested against the inbred cases, but not considering "half-siblings" makes the problem less complex and the solution incomplete. The authors also state that the algorithm is not optimal; going for an approximation makes the algorithm vulnerable to breaking in specific cases, such as the presence of common inheritance path pairs among individuals. The approximation of the mean and variance of IBD length from the average number of meioses renders it non-optimal.
We think that the approach is commendable, but we do not yet have in-depth knowledge about the cases in question, so we don't have a case to present for a better approach. We were impressed with the drastic improvement in accuracy that IPED brought to reconstructing pedigree graphs for inbred populations.
Clustered regularly interspaced short palindromic repeats, known as CRISPR, were first found in bacterial/prokaryotic cells. They are part of a built-in natural immune system that the bacterial cell uses to fight off viral invasions.
In the 1980s, scientists started noticing a strange pattern in some bacterial genomes. They found short palindromic repeating sequences, each 28-37 base pairs long, separated by other short sequences (32-38 base pairs) that were later named "spacer sequences". These spacers turned out to be copies of DNA from viruses that had previously attacked the bacterium or its ancestors. The repeats together with the spacers are known as the "CRISPR locus".
During a phage attack, the virus injects its own DNA into the bacterial cell. In defense, the bacterium responds by activating the CRISPR-Cas9 system, a natural defense mechanism within its cell. There are 3 steps to this immune system.
Since proteins are mostly universal, this discovery has opened up the possibility of editing/mutating the genome of any organism by injecting the Cas9 protein along with a guide RNA (CRISPR) into the target cells.
Though CRISPR has been one of the greatest breakthroughs in technology, it comes with many downsides. It is one of the easiest gene-engineering techniques discovered, but scientists have also found that it causes many mutations that are not supposed to occur. Changes in unintended target areas are often referred to as off-target effects.[4] One of the main reasons for off-target mutations is that CRISPR-Cas9 cleaves DNA sequences that are merely similar to the target sequence.[3] To reduce these problems, we could try to maximize cleavage efficiency and minimize off-target effects. Our survey considers various research papers that have been published to address exactly this.
Our aim is to learn in detail about CRISPR, how it works, and to survey the many machine learning methods that have been applied to making CRISPR cleavage efficient and predicting off-target effects of CRISPR. We plan to carry out our survey gradually. First, we will focus on learning in depth about CRISPR-Cas9's working mechanism and how it is used in genetic engineering. Then we will focus on the machine learning methods that have been implemented to make CRISPR more efficient and accurate. We do not plan to implement the techniques ourselves, due to the lack of data and of the technology required to test the accuracy of the methods even if they were implemented. We will put our results on our website and also present our findings in class.
This paper discusses how CRISPR systems with dual-RNA:Cas9 can be used for editing bacterial genomes. The authors' argument mainly revolves around the fact that this CRISPR system can be used for genome editing rather than merely cleaving the genome at specific points.
The authors reprogram the specificity by changing the sequence of the short CRISPR RNA (crRNA) to make single- and multi-nucleotide changes carried on editing templates. The authors worked with S. pneumoniae and E. coli in their analyses. It is shown that mutations can be introduced by transforming a template DNA fragment that recombines into the genome and eliminates recognition of the target by the endonuclease (Cas9). It is also observed that including several different crRNAs allows multiple mutations to be introduced simultaneously.
To introduce specific changes in the genome, one must use an editing template carrying mutations that avert cleavage by the endonuclease. This is easy to achieve when performing gene insertion. But when gene fusion or the generation of single-nucleotide mutations is desired, this is possible only by introducing mutations in the editing template that alter either the PAM or the protospacer sequence.
When mutating the PAM, there is only one criterion to follow: avoid mutating NGG to NAG or NNGGNN. All other random mutations of the PAM are functional. Mutations in the seed sequence (the 8-12 bp adjacent to the PAM) can also prevent cleavage, but there are restrictions on which nucleotide changes work at each position, and these restrictions vary between spacers. Hence, mutating the PAM, when possible, should be the preferred strategy.
The specificity and versatility of editing using the CRISPR-Cas system rely on several unique properties of the Cas9 endonuclease: (i) its target specificity can be programmed with a small RNA, without the need for enzyme engineering, (ii) target specificity is very high, determined by a 20-bp RNA-DNA interaction with low probability of nontarget recognition, (iii) almost any sequence can be targeted, the only requirement being the presence of an adjacent NGG sequence, (iv) almost any mutation in the NGG sequence, as well as mutations in the seed sequence of the protospacer, eliminates targeting.
The CRISPR/Cas9 system is used in gene editing, making possible gene knock-out, knock-in, gene inactivation, non-homologous end-joining, as well as homology-directed repair. Though CRISPR/Cas9 provides so many ways of editing genes, there are also a few drawbacks, one of them being the possibility of off-target slicing of the gene. The authors' goal is to develop an interactive tool that is easy to use and can evaluate and identify all candidate sgRNA target sites along with their off-target quality.
To achieve this, the authors set up a web tool called CRISPR/Cas9 target online predictor (CCTop), built with HTML and CGI scripts and using Python for all the processing.
They used the Bowtie short-read aligner to search for off-target sites. It takes in an index and a few parameters, and outputs the sequences that align with the input sequence. These alignments are then scored to indicate the likely stability of the sgRNA/DNA heteroduplex. They also found that this likelihood decreases as a mismatch gets closer to the PAM sequence.
The formula used to score an off-target sequence is

Score_off-target = Σ_mismatches 1.2^pos

where pos is the position of each mismatch counted from the 5' end.
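A small sketch of this scoring rule (the 1-based indexing from the 5' end and the toy sequences are our assumptions for illustration):

```python
def cctop_off_target_score(sgrna, site):
    """Sum 1.2**pos over all mismatch positions, counting pos from the 5' end."""
    return sum(1.2 ** pos
               for pos, (a, b) in enumerate(zip(sgrna, site), start=1)
               if a != b)

# Toy example with mismatches at positions 5 and 10 (hypothetical sequences).
print(cctop_off_target_score("GACGTTAGCA", "GACGATAGCT"))
```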
After computing scores, each off-target site is assigned to the gene it is closest to. For this, exon coordinate files for each organism are handled with the bx-python library used in this project. Essentially, the off-target sites are checked against nearby genes: only exons closer than 100 kb to a target site are assigned as its gene, otherwise the gene is set to N/A; if they overlap, the distance is set to 0. After gene assignment, each sgRNA target site is scored. The sgRNA target sites are ranked based on how many predicted off-target sites they have and how those sites affect the off-target genes. The ranking uses a single score calculated with the following formula.
Score = ( Σ_off-targets [ log10(dist) + Score_off-target ] ) / total_off_targets − total_off_targets

where dist is the distance of each off-target site to its closest exon. Off-target sites with no associated exon are not considered in this score.
Conclusion: The web tool was tested for gene inactivation as well as non-homologous and homology-directed repair. The tool is also constantly improved based on available resources. CCTop was designed to evaluate and present users with possible sgRNA target sites, and it does what it promises. It is user-friendly for first-time, amateur users while at the same time providing enough flexibility for experts.
Overall, the tool has been well received in the scientific community.
CRISPR-Cas9 is an adaptive immune system present in bacteria to fight against phage attacks. This self-defense mechanism within the bacterial cell uses a single guide RNA (sgRNA) along with the Cas9 endonuclease enzyme to eliminate bacteriophage/virus DNA within the cell. The Cas9 nuclease cleaves DNA that is complementary to the sgRNA. The concept of using an sgRNA with the Cas9 protein to remove DNA gave rise to the human-controlled CRISPR-Cas9 gene-editing technique.
Compared to other gene-editing techniques like zinc-finger nucleases and transcription activator-like effector nucleases, CRISPR-Cas9 is a more modern and simpler technique providing more versatility in genome engineering. However, given the age of the technique, CRISPR-Cas9 has not been studied as extensively as older techniques. Scientists found that CRISPR can cleave unwanted DNA sequences that are not an exact match to the input sgRNA. These mismatched sites, known as off-targets, can cause serious genetic damage.
Experiment Purpose & Design
This paper investigates a number of techniques for designing an optimal sgRNA that minimizes off-targets and maximizes activity. In this summary, we will discuss the sequence-related and point-wise mutation analysis the authors performed while examining CRISPR-Cas9 activity.

They conducted the CRISPR-Cas9 experiment on mammalian cells using a library targeting the coding sequence of human CD33 with all possible sgRNAs, regardless of PAM.
Results
They categorized sgRNA mutations in 3 ways.
Cutting Frequency Determination Scoring (CFD)
The authors created a 2-dimensional matrix in which each cell represents the percent activity for a given mismatched nucleotide pair and mismatch position.
The paper's definition:
Cutting Frequency Determination (CFD) score is calculated by using the percent activity values from the matrix. For example, if the interaction between the sgRNA and DNA has a single rG:dA mismatch in position 6, then that interaction receives a score of 0.67. If there are two or more mismatches, then individual mismatch values are multiplied together. For example, an rG:dA mismatch at position 7 coupled with an rC:dT mismatch at position 10 receives a CFD score of 0.57 × 0.87 = 0.50.
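A minimal sketch of that multiplication (only the two matrix entries quoted above are filled in; the full percent-activity matrix comes from the paper's supplementary data):

```python
# (mismatch type, position) -> percent activity; values quoted in the text above.
percent_activity = {
    ("rG:dA", 7): 0.57,
    ("rC:dT", 10): 0.87,
}

def cfd_score(mismatches):
    """Multiply the percent-activity values of all observed mismatches."""
    score = 1.0
    for mismatch in mismatches:
        score *= percent_activity[mismatch]
    return score

print(cfd_score([("rG:dA", 7), ("rC:dT", 10)]))   # 0.57 * 0.87 ~= 0.50
```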
In the 1980s, scientists noticed short palindromic repeats in bacterial/prokaryotic genomes. These palindromic sequences were separated by short sequences (32-38 nucleotide base pairs) later known as "protospacers". In 1993, Francisco Mojica discovered that the protospacer sequences separating the palindromes in the bacterial genome were identical to DNA snippets from the genomes of bacteriophages and viruses. He coined the term "CRISPR" (Clustered Regularly Interspaced Short Palindromic Repeats) and named the bacterial genomic region containing these palindromes and protospacers the "CRISPR locus" or "CRISPR arrays". This led to the discovery that bacteria keep DNA copies of bacteriophages in the CRISPR locus and later use an adaptive immune system to remove those DNA snippets from their own genome. When the virus attacks the bacterium again, it produces RNA segments complementary to the protospacers in the CRISPR arrays. These RNA sequences are also known as "single guide RNAs", aka sgRNAs. To remove each viral DNA snippet, the bacterium uses a Cas9 complex containing an sgRNA along with the Cas9 endonuclease enzyme. The sgRNA guides Cas9 to bind to a PAM sequence located on the opposite strand of the target DNA. Once the sgRNA finds the complementary DNA sequence, Cas9, being a nuclease, edits the DNA by cutting and removing the double-stranded complementary DNA from the bacterial genome. This naturally occurring self-defense mechanism in bacterial cells is known as "CRISPR-Cas9".
The CRISPR-Cas9 system in bacteria has been adapted by humans as a gene-editing technique. CRISPR-Cas9 works similarly in a research laboratory environment. Scientists start by creating a single guide RNA sequence complementary to their target DNA/gene. They then bind the sgRNA to the Cas9 enzyme to create the Cas9 complex, which is later injected into the target cells. There are a number of techniques for delivering the Cas9 complex into cells and the body.
Once the Cas9 complex is injected into the cell, it binds to the target DNA sequence. This can be used to mutate genes or to turn them on or knock them out by adding, removing, or replacing nucleotide base pairs at the target site. Even though the natural CRISPR-Cas9 system in bacteria only removes DNA, scientists have found ways to replace and add DNA sequences by integrating it with other natural mechanisms like a cell's homology-directed repair.
Even though CRISPR-Cas9 is simpler and more time-efficient than older gene-editing techniques, it is not as thoroughly studied. CRISPR-Cas9 is not perfect, and it can cleave unintended DNA sequences that are not complementary to the sgRNA. These sites are known as "off-targets". Previous studies found that CRISPR-Cas9 can tolerate sequence mismatches. This paper introduces a novel machine-learning algorithm called CRISTA that considers a number of features to estimate the propensity of a DNA site being cleaved by a given sgRNA.
Earlier research on this topic analyzed off-target effects on DNA sequences that were similar to the sgRNA sequence. Since these target sites were pre-selected based on their sequence similarity, the results of those papers were biased and limited to pairwise sequence similarity.
As a result, unbiased research was conducted on genome-wide data, where target sites were included whether or not they were similar to the sgRNA. This led to the discovery of other features (not limited to sequence similarity) that impact the cleavage propensity of genomic sites. Scientists found that off-targets can be located at unexpected sites.
Previous computational work has also been done on designing sgRNAs to minimize off-targets by calculating scores such as CCTop and CFD (details are in the Progress tab). These algorithms, however, do not go beyond pairwise sequence similarity and neglect features concerning the genomic context surrounding the target site and the thermodynamics of the sgRNA.
The discoveries made by these unbiased research projects make it clear that, in order to predict the cleavage frequency of a target genomic site given an sgRNA, a number of features need to be considered, not just basic sequence features. CRISTA, the proposed machine-learning solution, therefore incorporates a wide set of features into its predictive model to calculate the cleavage propensity of a genomic site given an sgRNA.
Since the authors did not want to make the mistake of pre-selecting genomic sites based on their similarity to the sgRNA, they chose to use datasets from experiments that focused on genome-wide data. This not only ensured that their research was free of the pairwise-similarity bias, it also made it possible to account for important features affecting cleavage frequency that were neglected by previous computational methods. They used five different datasets; each dataset had a list of sgRNAs and the cleaved genomic/DNA sequences along with the number of times each was cleaved. All datasets combined had 25 unique sgRNAs and 872 genomic sites cleaved by those sgRNAs.
Since the datasets were obtained under different experimental conditions, the cleavage frequencies were not directly comparable. The authors used linear regression to transform the cleavage frequencies to a common scale.
Previous research found that counting mismatches between the sgRNA and the target DNA is not enough to capture sequence similarity, since the DNA and RNA can also have bulges/gaps. This means the cleaved DNA could be up to 3 nucleotide base pairs longer or shorter than the complementary sgRNA: if the sgRNA is 20 bp, the cleaved DNA could be 17-23 bp long. To account for these bulges, the authors modified the Needleman-Wunsch pairwise alignment algorithm. Since sequence similarity is not just a matter of mismatches, rules had to be created to decide the penalties and whether a mismatch takes precedence over a gap. To learn the best parameters, different combinations of parameter values were evaluated against the cleavage frequencies, and the parameter values giving the highest averaged squared Pearson correlation (r²) were chosen.
CRISPR Target Assessment (CRISTA) is a tool for predicting the cleavage propensity of potential genomic targets given a specified sgRNA. It is based on a regression model using the Random Forest algorithm and, in doing so, allows us to examine the important features that explain the variation in cleavage efficiency. The machine-learning algorithm relies on assembling a training dataset that encompasses a range of data inputs as well as incorporating a set of features that can be used to predict cleavage efficiencies. The algorithm can also be used to distinguish cleaved from uncleaved sites using a classification learning scheme.
The training dataset initially comprised a set of uncleaved sites, representing sites that were not cleaved by each sgRNA. Theoretically, one could take the entire genome, excluding the set of cleaved sites of each sgRNA, to represent the uncleaved set, but only uncleaved genomic sites with sufficient sequence complementarity to each sgRNA are included, for meaningful analysis. Each sgRNA from the dataset was aligned to sites following the NGG or NAG motifs in the genome using the modified Needleman-Wunsch algorithm described above. Then, only the sites with an alignment score > 14.75 (95% of cleaved instances have on average 16.7 matched bases) were retained for further analysis. The number of sites in the uncleaved set varies from 3,000 to 70,000 per sgRNA, and it is noted that this procedure could introduce some noise for targets where the reference genome is not identical to the genome used in the experiments.
The combined training dataset was assembled from experimentally validated cleavage sites together with uncleaved sites. The dependent variable was the cleavage efficiency for each cleaved sample and zero for each uncleaved sample. To keep the data as unbiased as possible, the majority class (uncleaved sites) was undersampled and the minority class (cleaved sites) was oversampled. Each set of cleaved samples (targets corresponding to a single sgRNA) was oversampled using bootstrapping to twice its original size, and an equally sized uncleaved set was chosen at random to form the combined training set. This process was repeated, and the results were averaged over 100 sampled datasets.
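A rough sketch of one such resampling round, under our reading of the procedure (the index arrays and random seed are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def resample_training_set(cleaved_idx, uncleaved_idx):
    """Bootstrap the cleaved targets of one sgRNA to twice their original size,
    then draw an equally sized random subset of the uncleaved sites."""
    boot = rng.choice(cleaved_idx, size=2 * len(cleaved_idx), replace=True)
    sub = rng.choice(uncleaved_idx, size=2 * len(cleaved_idx), replace=False)
    return np.concatenate([boot, sub])

# Toy indices: 10 cleaved sites, 5000 candidate uncleaved sites for one sgRNA.
print(resample_training_set(np.arange(10), np.arange(10, 5010))[:25])
```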
A wide range of possible explanatory attributes was computed. These include features specific to the target site (e.g., type of PAM sequence, nucleotide composition and GC content, chromatin structure, CpG islands, gene expression levels of coding regions), features specific to the sgRNA (e.g., secondary structure), and features concerning the similarity between the sgRNA and the target (e.g., number and spatial distribution of mismatches and bulges).
Given the training dataset and set of features, CRISTA is implemented using the RandomForestRegressor in Python's scikit-learn library. The score provided by CRISTA represents the log number of sequencing reads identified, which in turn serves as a proxy for cleavage frequency, referred to as the inferred cleavage propensity. This score is continuous but can also be used for binary classification into potentially cleaved and uncleaved sites for a given sgRNA, by choosing a fixed threshold based on the scores CRISTA assigns to observed cleaved sites. For example, 95% of cleaved sites in the dataset used by CRISTA obtained a score higher than 0.12, while 50% surpassed a score of 0.4. In the validation set, the corresponding thresholds were 0.39 and 0.54. These values can be used to set a lenient or strict threshold.
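A minimal sketch of this regression setup with scikit-learn (the feature matrix, response values, and hyperparameters here are stand-ins, not CRISTA's actual ones):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.random((200, 12))   # stand-in feature matrix: one row per (sgRNA, site) pair
y = rng.random(200)         # stand-in transformed cleavage frequencies (0 if uncleaved)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

scores = model.predict(X[:5])   # inferred cleavage propensities
cleaved = scores > 0.12         # lenient classification threshold quoted above
print(scores, cleaved)
```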
The prediction performance of CRISTA was evaluated using cross-validation. A leave-one-sgRNA-out cross-validation procedure was devised such that in each iteration the samples of a single sgRNA were excluded and used as the test set. The algorithm, trained on the rest of the data, was then used to predict cleavage propensities for the test set. Each iteration of the cross-validation has a preliminary step: the pairwise alignment parameters were first optimized, as previously described, using only the training set, and then used to recompute the pairwise alignment features for the training and test sets. Several metrics were used for evaluation, namely AUC-ROC and Pearson's correlation coefficient (r²); see below for detailed results. The evaluation is computed over the original set of cleaved sites for each sgRNA (without bootstrapping) and an equally-sized sample of uncleaved sites.
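A sketch of the leave-one-sgRNA-out loop using scikit-learn's LeaveOneGroupOut (synthetic stand-in data; the per-fold re-optimization of alignment parameters described above is omitted):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X, y = rng.random((200, 12)), rng.random(200)   # stand-ins, as in the sketch above
groups = rng.integers(0, 10, size=200)          # which sgRNA each sample belongs to

r2_per_sgrna = []
for train, test in LeaveOneGroupOut().split(X, y, groups):
    model = RandomForestRegressor(n_estimators=100, random_state=0)
    model.fit(X[train], y[train])
    r, _ = pearsonr(y[test], model.predict(X[test]))
    r2_per_sgrna.append(r ** 2)

print(np.mean(r2_per_sgrna))   # averaged r^2 across held-out sgRNAs
```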
To learn about the independent importance of the various features, the number of features can be reduced by applying a forward selection procedure. Features are added iteratively by examining the performance of leave-one-out cross-validation for incremental sets of features: start with the feature that gives the highest r² score on its own, and in every iteration add the next best feature to the set. This procedure is repeated for 15 iterations, Random Forest regression is applied to the resulting set of features, and the relative importance of each feature is extracted.
The introduction of gaps affected 18% of the targets in the training dataset. Concretely, 87 of 491 sites contain 1.1 bulges on average, or, if we consider the entire dataset, 175 out of 872 sites contain 1.23 bulges on average.
Allowing gaps gave r² = 0.34 between the pairwise alignment score and the observed cleavage frequencies, versus r² = 0.27 when no gaps were allowed. The sequence-alignment parameters that gave the best results were match = 1, mismatch = 0, and gap = -1.25. Mismatches are thus not penalized, while matches are rewarded highly, so a longer complementary stretch is preferred.
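For reference, a generic Needleman-Wunsch global alignment score with those parameter values (a plain textbook version; the paper's specific modifications for RNA/DNA bulges and PAM handling are not reproduced here):

```python
import numpy as np

MATCH, MISMATCH, GAP = 1.0, 0.0, -1.25   # best-performing values reported above

def alignment_score(sgrna, site):
    """Global alignment score via standard Needleman-Wunsch dynamic programming."""
    n, m = len(sgrna), len(site)
    dp = np.zeros((n + 1, m + 1))
    dp[:, 0] = np.arange(n + 1) * GAP
    dp[0, :] = np.arange(m + 1) * GAP
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = dp[i - 1, j - 1] + (MATCH if sgrna[i - 1] == site[j - 1] else MISMATCH)
            dp[i, j] = max(diag, dp[i - 1, j] + GAP, dp[i, j - 1] + GAP)
    return dp[n, m]

# A one-base DNA bulge costs one gap but keeps the remaining 20 bases matched.
print(alignment_score("GACGTTAGCATCGATCGATC", "GACGTTAGCAATCGATCGATC"))
```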
Including bulges also meant reconsidering the PAM sites before finalizing the alignments; this led to 33, 17, and 22 shifts in the Tsai, Kleinstiver, and Frock datasets, respectively.
Also, the pairwise similarity score accounts for only 34% of the variation among cleaved sites, so the decision was made to include more features in the prediction process.
CRISTA uses a Random Forest regression model. The authors evaluated the prediction performance of CRISTA with the leave-one-sgRNA-out cross-validation procedure.
Consistency was tested by calculating the prediction accuracy of all the tools against all sgRNAs: CRISTA r² = 0.8 (sd = 0.13), CFD r² = 0.65 (sd = 0.2), OptCD r² = 0.32 (sd = 0.28), CCTop r² = 0.46 (sd = 0.25). The averaged Spearman correlation coefficients were CRISTA = 0.88, CFD = 0.77, OptCD = 0.76, and CCTop = 0.72.
First, the study was not biased towards any of the data sources. The authors conducted a leave-one-study-out cross-validation to check the accuracy of CRISTA. This also allowed them to check the compatibility between the different datasets.
The performance of the model was evaluated against data resembling the training set as well as against new data.
From the results, it was decided to eliminate the Frock et al. dataset, as it was not compatible with the other datasets. For all the other datasets, when used as test data, CRISTA gave results where r² was always greater than 0.8 and the AUC for the ROC and PRC curves was close to 1.
The prediction accuracy on known data and on new data did not differ much, implying that the algorithm did well on both known and brand-new data.
The inclusion of uncleaved sites had an impact on the prediction performance of the model. This was tested with the leave-one-sgRNA-out procedure while retaining only cleaved sites in the training set; the accuracy was lower than that of the original CRISTA.
A forward selection process was used to select the group of important features. Three major clusters of features were noticed, related to pairwise similarity, nucleotide content, and DNA geometry/PAM context, as described below. The pairwise alignment score was the first feature to be selected.
Among the pairwise similarity features, the number of mismatches (with many attributes describing the type of mismatch), their positions, and the number of DNA/RNA bulges appear important.
Among the nucleotide-content features, the positions of certain nucleotides were recognized as important factors affecting the sensitivity of CRISPR-Cas9. The nucleotide in the second position upstream of the PAM, the couple of nucleotides around positions 4-5 (where cleavage occurs), and the nucleotides in the first five positions upstream of the PAM all contribute to the prediction accuracy.
Among the DNA-geometry and PAM-related features, targets located within DNaseI hypersensitive sites and within an exon were selected. The measure describing the width of the minor groove around the PAM site was selected, as was DNA enthalpy, which is said to be of great importance in predicting Cas9 efficacy.
To validate the performance of CRISTA, an unbiased validation dataset is needed. Targeted-sequencing datasets from the Cho et al. and Wang et al. studies were chosen. They consist of 170 samples of on-target, off-target, and uncleaved sites across 12 sgRNAs.
CRISTA achieved an average Pearson r² = 0.68, AUC-ROC = 0.7, and AUC for precision-recall = 0.72. CRISTA performed better than the CCTop and CFD scores. OptCD scored higher than CRISTA because of its binary nature: it assigns 1 to all on-targets and 0 to all other sites, whereas CRISTA, CCTop, and CFD produce a continuous scale.
Mutagenesis: Process in which genetic information of an organism is changed, leading to mutation
Nuclease: An enzyme that cleaves the chains of nucleotides in nucleic acids into smaller units.
Transcription: A process in which DNA is read by an RNA polymerase to produce a complementary, antiparallel RNA strand called the primary transcript.
Cleave: A chemical reaction that breaks the DNA sequence into two at a specific point.
Phage: A virus that parasitizes a bacterium by infecting it and reproducing inside it.
Homologous: DNA that is identical or compatible.
Recombination: An exchange between homologous regions of DNA.
Oligonucleotide: Polynucleotide whose molecules contain a relatively small number of nucleotides.
Point Mutation: Mutation of a single nucleotide in the sequence.
Locus: A portion of the genome in bacteria.
Protospacer: Target sequence in the foreign genetic material.
Motif: A sequence motif is a nucleotide or amino-acid sequence pattern that is widespread and has, or is thought to have, a biological significance.
PAM: Protospacer adjacent motif, part of the CRISPR target sequence.
Endonuclease: A nuclease that cleaves within a nucleotide chain (rather than from its ends).
Non-homologous end joining: After a DNA strand is sliced, if there is no donor DNA (donor DNA: specific desired target DNA) to insert in the gap, the genes will repair themselves, which might cause a gene mutation due to a missing or additional base in the sequence. This type of repair is called non-homologous end joining.
Homology directed repair: After a DNA strand is sliced, if a homologous donor DNA is present and used to rejoin the sliced DNA, the repair is called homology-directed repair.
Off-target: refers to nonspecific and unintended genetic modifications
sgRNA: Single guide RNA. It consists of two parts: a variable crRNA sequence, usually 20 base pairs long (read from the 3' to the 5' end), and a constant tracrRNA that binds to the Cas9 enzyme, which acts as molecular scissors to slice double-stranded DNA.
Heteroduplex: The more stable alignment, with a low percentage of mismatched base pairs.
Purines: the bases adenine and guanine present in DNA and RNA.