How many CRISPR systems are there? Probably thousands. Many of them can be detected by scanning large quantities of genomic data from bacteria considered “rare”, such as those collected in breweries or in the waters of Antarctic lakes. This is demonstrated by a recent US study whose authors, using a dedicated cluster-analysis algorithm, tracked down 188 of them.
When we talk about big data applied to genomics (“genomic big data”), we mean the set of data concerning the structures and functions of the genomes of plant, animal and human organisms, including the sequences of molecules in genes and the interactions between those molecules and proteins.
This is a gigantic and complex body of data collected by geneticists, biologists and biotechnologists from all over the world, with the aim of analyzing it to study therapies for the treatment of genetic diseases, define new genetic markers and develop personalized medicines.
The National Institutes of Health (NIH), an agency of the US Department of Health and Human Services, is among the organizations responsible for managing databases containing genomic big data shared globally, including those relating to bacteria.
The National Center for Biotechnology Information of the NIH, in collaboration with researchers from the McGovern Institute for Brain Research and the Broad Institute – both affiliated with the Massachusetts Institute of Technology (MIT) – developed an algorithm capable of classifying bacterial genome data. This method led to the identification of no fewer than 188 new types of CRISPR systems, as illustrated in the article “Uncovering the functional diversity of rare CRISPR-Cas systems with deep terascale clustering”, published in Science on November 24, 2023.
The algorithm used by the research team is based on a “locality-sensitive” categorization technique, which made it possible to select, in the databases examined, similar (not identical) data on bacterial genomes, and then group them into specific categories. During this analysis of genomic big data, an unexpected number of new CRISPR systems was identified, including a type with a longer guide RNA, which in the future could lead to even more precise genome-editing technology for DNA “cut-and-paste” operations. The methodology followed by the study group is an invitation to broaden, in the years to come, the sampling criteria for bacteria, including – as the authors did – collecting water from mines or lakes. This would help enrich current databases with rare genomic big data and give new life to research.
Origin and function of CRISPR systems
Before delving into how big data and genomics are correlated in support of research, let us recall that the universal acronym CRISPR – Clustered Regularly Interspaced Short Palindromic Repeats (literally “grouped and regularly spaced short palindromic repeats”) – refers to a class of DNA segments found in bacteria. These segments are characterized by short repeated sequences that allow the microorganisms to recognize and cut up the genomes of viruses similar to those from which the repeats originally derived. In short, CRISPR represents, for bacteria, a natural form of defense against external attacks.
Studies of this defense mechanism have led, over the years, to increasingly advanced genetic-engineering techniques for manipulating DNA in plant, animal and human organisms.
The first studies on what would only later take the name “CRISPR” date back to 1987, with Japan’s Osaka University as the protagonist. The acronym itself was coined in 2001, to clarify and unambiguously designate the multiple DNA sequences in bacteria that had, until then, been referred to by different terms in the scientific literature.
In the following years, a CRISPR system using the Cas9 protein was identified in a specific bacterium, Streptococcus pyogenes; the protein acts as “molecular scissors” to defend against pathogens.
It was then, in 2012, that the scientists Emmanuelle Charpentier and Jennifer A. Doudna turned this system into a new genome-editing tool, capable – compared with earlier ones – of identifying and cutting target DNA sequences within the genome of a plant, animal or human cell more simply, precisely and quickly, removing them and replacing them with others.
A targeted “genetic cut and paste”, which earned them the 2020 Nobel Prize in Chemistry and paved the way for laboratory research into potential applications in the medical field (diagnostic and therapeutic).
Big data clustering to support genomics
On the subject of big data and genomics for CRISPR research, the starting point of the study group led by the US National Institutes of Health was an observation as simple as it is incisive: «databases containing bacteria are extremely rich in strategic information for biotechnology. But, in recent years, they have reached such proportions as to make it difficult to find the enzymes and molecules of interest within them, and to do so in the most correct way possible».
Hence the need for an algorithm based on big data clustering techniques, capable of selecting and categorizing information drawn from enormous quantities of genomic data, where “clustering” (or “cluster analysis”) refers to those methods whose aim is to group similar elements within a very large and heterogeneous data set.
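To make the idea of clustering concrete, here is a minimal, purely illustrative sketch in Python – not the study’s algorithm – that groups toy “sequences” by the overlap of their fragment sets, using Jaccard similarity and an invented 0.5 threshold:

```python
def jaccard(a, b):
    """Jaccard similarity between two feature sets (0 = disjoint, 1 = identical)."""
    return len(a & b) / len(a | b)

def cluster(items, threshold=0.5):
    """Greedy single-pass clustering: assign each item to the first
    cluster whose representative is similar enough, otherwise open a
    new cluster. (Illustrative only; threshold is arbitrary.)"""
    clusters = []  # list of (representative_features, member_names)
    for name, features in items.items():
        for rep, members in clusters:
            if jaccard(features, rep) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((features, [name]))
    return [members for _, members in clusters]

# Hypothetical genes described by their sets of 3-letter fragments
genes = {
    "g1": {"ATG", "TGC", "GCA"},
    "g2": {"ATG", "TGC", "GCT"},  # similar to g1
    "g3": {"TTT", "TTA", "TAA"},  # unrelated
}
print(cluster(genes))  # → [['g1', 'g2'], ['g3']]
```

The similar pair ends up in one cluster and the unrelated item in another – the same grouping principle, at toy scale, that the study applies to billions of real sequences.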
To be precise, the team employed an algorithm called “Fast Locality-Sensitive Hashing-based clustering” (FLSHclust), developed in the laboratory of Feng Zhang, one of the pioneers of CRISPR research and professor at the Massachusetts Institute of Technology.
The “locality-sensitive” technique it uses made it possible to group together similar but not identical genomic data, probing billions of proteins and DNA sequences over the course of a few weeks rather than months.
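The key to this speed is that locality-sensitive hashing avoids comparing every pair: similar items are made to collide in the same hash bucket. The following sketch, assuming a classic MinHash-plus-banding scheme (not the actual FLSHclust code; sequence names, k-mer size and band counts are invented), shows the principle on toy DNA strings:

```python
import hashlib

def kmers(seq, k=4):
    """Break a sequence into its set of overlapping k-mers."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def _h(seed, s):
    """Deterministic 64-bit hash of string s under a given seed."""
    digest = hashlib.sha1(f"{seed}:{s}".encode()).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(shingles, num_hashes=16):
    """MinHash signature: the minimum hash of the set under each seeded
    hash function; similar sets agree on most signature positions."""
    return [min(_h(seed, s) for s in shingles) for seed in range(num_hashes)]

def lsh_groups(seqs, bands=8, rows=2):
    """Locality-sensitive bucketing: split each signature into bands;
    sequences sharing any whole band land in the same bucket, so similar
    (not only identical) sequences are grouped without all-pairs comparison."""
    buckets = {}
    for name, seq in seqs.items():
        sig = minhash_signature(kmers(seq), num_hashes=bands * rows)
        for b in range(bands):
            key = (b, tuple(sig[b * rows:(b + 1) * rows]))
            buckets.setdefault(key, set()).add(name)
    return [g for g in buckets.values() if len(g) > 1]

seqs = {
    "a": "ATGCGTACGTTAGCATGCGTACGTTAGC",
    "b": "ATGCGTACGTTAGCATGCGTACGTTAGG",  # near-duplicate of "a"
    "c": "TTTTAAAACCCCGGGGTTTTAAAACCCC",  # unrelated
}
groups = lsh_groups(seqs)
# "a" and "b" share almost all k-mers, so some band should bucket them together;
# "c" shares none, so it stays ungrouped
```

Each sequence is hashed only a fixed number of times, so the cost grows linearly with the number of sequences instead of quadratically – which is what makes scanning billions of entries feasible.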
In more detail, starting from a vast range of genomic data relating to bacteria of different types and origins – collected in coal mines, breweries, Antarctic lakes and dog saliva – the algorithm mined three publicly available databases, in which it identified «a surprising number and diversity of CRISPR systems».
Towards overcoming the risk of “off-target” editing
In the years following the discovery of CRISPR-Cas9, research continued along a precise line, aimed at overcoming the system’s critical issues, first of all “off-target” editing due to inaccuracies and errors in DNA “cut-and-paste” operations.
Precisely in this regard, the joint work of the National Institutes of Health and MIT on big data and genomics has allowed – among the 188 systems detected – the identification of CRISPR systems which, using a guide RNA (ribonucleic acid) 32 base pairs long instead of 20, «could be used to develop genome editing technology that is more precise and less prone to off-target editing», as the Science article reports.
The study team also demonstrated in the laboratory that two of these “long-guide” CRISPR systems could, in the future, be used to edit the DNA of human cells, while a third system exhibited a side effect that researchers could exploit to develop a technique for the early diagnosis of infectious diseases. Specifically, the observed side effect consists of «extensive degradation of nucleic acids after the CRISPR protein binds to its target».
The study group also discovered new mechanisms of action for some already known CRISPR systems, and in particular a system that targets RNA and which, in the years to come, could be used in RNA editing, i.e. in the manipulation of gene regulation and expression processes, as well as in protein synthesis. Another great step forward for genetic engineering towards possible applications in early diagnosis.
Big data and genomics: what direction for the future of research?
The study by the National Center for Biotechnology Information of the NIH and the Massachusetts Institute of Technology on big data and genomics has, first of all, the merit of having demonstrated the variety and richness of CRISPR systems that can be found by analyzing the genomic data of bacteria, and that a large part of these systems are present in uncommon bacteria (such as those, in fact, that live in coal mines, in breweries, in Antarctic lakes and in the saliva of dogs). This suggests that genome-editing research should «broaden sampling diversity to continue to expand the diversity of what we can discover», the authors underline. And they continue:
«Some of the microbial systems analyzed come from water collected in coal mines around the world. If we hadn’t looked in that direction, we might never have discovered the new CRISPR systems»
An algorithm like Fast Locality-Sensitive Hashing-based clustering – they comment – can do a lot in the presence of genomic big data from the most disparate origins. In the future, it could also support researchers studying other types of biochemical systems or anyone interested in working with large databases, «to study, for example, how proteins evolve or discover new genes».