The Emerging Role of Big Data and Machine Learning in Drug Discovery


The terms “big data”, “machine learning”, and “artificial intelligence” have been trending in the AI drug discovery space for several years, both in mainstream media and academic press. These new technologies are believed to make drug discovery cheaper, faster, and more productive, while also promising to enable personalized medicine approaches (e.g., advanced biomarkers, improved patient stratification, etc). But how is AI used in drug discovery, and what is the driving force behind this technological transformation?

First, let’s briefly review some of the basic concepts at the heart of these new technologies.

Big Data: Volume, Velocity, and Variety in AI Drug Discovery

The term “big data” is itself largely a marketing label. It describes the abstract concept of holding large volumes of data, obtained from various channels in multiple formats, organized so that it can be quickly accessed, searched, updated, and analyzed to yield useful information.

Today, “big data” is a central strategic concept in most industries, including AI drug discovery, mainly because of the exponential rate of data generation globally — nearly 90% of all data currently available on Earth has been created in the last two years. The computing power required to quickly process huge volumes and varieties of data cannot be achieved via traditional data management architectures using a single server or a server cluster.

Machine Learning: Teaching Computers to Learn in AI Drug Discovery

Machine learning algorithms are programs that let computers adjust themselves to a task, so that a human does not need to explicitly spell out how the task should be performed. The information a machine learning algorithm needs in order to adjust itself is a set of known examples.

One of the revolutionary things about machine learning is that it lets computers learn to perform complex tasks that are hard or even impossible for humans to capture in explicit “if-then” logic. When the program learns from labeled example data, this is called supervised machine learning. Other techniques do not require a labeled training dataset: unsupervised learning finds structure, such as clusters, in unlabeled data, while reinforcement learning proceeds by “trial and error”, guided by a reward signal.
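To make the supervised idea concrete, here is a minimal sketch in Python: instead of hand-writing an “if-then” rule, the program derives a decision cutoff from labeled examples. The feature values, labels, and the single-feature setup are invented purely for illustration.

```python
# A minimal sketch of supervised learning: the program learns a decision
# threshold from labeled examples rather than from a hand-coded rule.

def fit_threshold(features, labels):
    """Pick the cutoff that best separates the two classes."""
    best_cut, best_correct = None, -1
    for cut in sorted(features):
        correct = sum(
            (x >= cut) == (y == 1) for x, y in zip(features, labels)
        )
        if correct > best_correct:
            best_cut, best_correct = cut, correct
    return best_cut

# Labeled training set: 1 = active compound, 0 = inactive (toy data).
X = [0.1, 0.4, 0.35, 0.8, 0.9, 0.75]
y = [0,   0,   0,    1,   1,   1]

cut = fit_threshold(X, y)
predict = lambda x: 1 if x >= cut else 0
print(cut, predict(0.2), predict(0.85))  # learned cutoff, then two predictions
```

Real systems learn far richer decision functions over thousands of features, but the principle is the same: the rule comes from the examples, not from the programmer.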

Since 2012, a specific machine learning technique called deep learning has been taking the AI world by storm. It relies on artificial neural networks of various architectures and on specialized algorithms for training them. In just the three years after 2012, researchers made more progress on several key problems (image understanding, signal processing, voice understanding, and text understanding) than in the preceding twenty-five years.

Exponential progress in big data processing and machine learning has led to a point when a combination of both technologies opened up huge practical potential for a variety of use cases, including data security, financial trading, marketing personalization, fraud detection, natural language processing, smart cars, healthcare, and AI drug discovery. By harnessing the power of AI and big data, researchers can expedite the drug discovery process, make more accurate diagnoses, and develop personalized treatments for patients. This revolution in AI-driven drug discovery is transforming the healthcare landscape and opening new opportunities for researchers, pharmaceutical companies, and healthcare providers alike.

To understand how big data analysis and machine learning algorithms can improve drug discovery outputs, let’s review four stages on the way to successful medicines where the new technologies fit best.

1. Understanding biological systems and diseases

In most cases, a drug discovery program can only be initiated after scientists have come to understand the cause and mechanism behind a particular disease, pathogen, or medical condition.

Without exaggeration, biological systems are among the most complex in the world, and the only way to understand them is to follow a comprehensive approach, looking into multiple organizational “layers”, starting from genes and going all the way to proteins, metabolites, and even the external factors influencing the inner “mechanics”.

In 1990, a group of scientists began the process of decoding the human genome. The project took 13 years and cost $2.7 billion to complete. Often called the Book of Life, the genome could not have been deciphered without massive amounts of computing power and custom software.

The genome is a sort of “instruction manual” for the organism, saying which proteins and other molecules should be produced, when, and why. Complete knowledge of the genome opens the door to a much deeper understanding of our body, what can go wrong with it, and under what circumstances.

However, looking at just genetic information is not enough, since the genome is more like a paper map of the world: although it tells where cities and villages are located, it does not tell who the inhabitants of those cities are, what they do, and how they live. To better understand what is going on, scientists have to go beyond the genome’s one-dimensional view into a multidimensional one, linking the genome with large-scale data about the output of those genes at specific times, in specific places, in response to specific environmental pressures. This is what is called “multi-omic” analysis.

“Omic” here refers to the different “layers” of the biological system: the genome (all the genes in the body, the DNA); the transcriptome (the variety of RNAs and other molecules responsible for “reading” and “executing” genome information); the proteome (all the proteins in the body); the metabolome (all the small molecules); and the epigenome (the multitude of chemical modifications to the DNA, and the factors, including environmental ones, that dictate such changes).
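In data terms, multi-omic analysis amounts to joining records from these layers on a shared gene identifier. The toy sketch below illustrates the idea; the gene names are real, but the locations and abundance values are invented placeholders.

```python
# A toy sketch of "multi-omic" linking: records from different biological
# layers are joined by gene identifier. Values are illustrative only.

genome = {         # genomic layer: gene -> chromosomal locus
    "TP53": "17p13.1",
    "BRCA1": "17q21.31",
}
transcriptome = {  # expression layer: gene -> RNA abundance (arbitrary units)
    "TP53": 12.4,
    "BRCA1": 3.1,
}
proteome = {       # protein layer: gene -> protein abundance (arbitrary units)
    "TP53": 8.7,
    "BRCA1": 0.9,
}

# Integrate the layers into one record per gene.
integrated = {
    gene: {
        "locus": genome[gene],
        "rna": transcriptome.get(gene),
        "protein": proteome.get(gene),
    }
    for gene in genome
}
print(integrated["TP53"])
```

Real multi-omic pipelines face the hard parts this sketch hides: millions of records, mismatched identifiers, missing measurements, and time- and tissue-dependent values.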

Such a multidimensional approach is very promising for understanding the mechanisms of diseases, especially complex ones such as cancer and diabetes, which involve a tangled web of genes, lifestyle factors, and environmental conditions. Whether you smoke or exercise daily influences when those various genes are turned on and off.

Research on biological systems generates enormous amounts of data, which need to be stored, processed, and analyzed. The 3 billion chemical coding units that string together to form a person’s DNA, if entered into an Excel spreadsheet line by line, would produce a table 7,900 miles long. The human proteome contains more than 30,000 distinct proteins identified so far, and the number of small molecules in the body, the metabolites, exceeds 40,000. Mapping data originating from various experiments, associations, and combinations of factors and conditions generates trillions of data points.

This is where big data analysis and machine learning algorithms start to shine, allowing researchers to derive hidden data patterns and find previously unknown dependencies and associations. For example, an automated protocol for large-scale modeling of gene expression data can produce models that predict differential gene expression as a function of compound structure. In contrast to the usual in silico design paradigm, where one interrogates a particular target-based response, such a protocol opens the door to virtual screening and lead optimization for desired multi-target gene expression profiles.
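A heavily simplified sketch of that idea: represent each compound as a binary substructure fingerprint, then predict whether it up- or down-regulates a gene of interest from its most structurally similar neighbor. The fingerprints, labels, and nearest-neighbor model here are all invented stand-ins for the far richer descriptors and models used in practice.

```python
# Sketch of modeling gene expression as a function of compound structure:
# compounds are binary substructure fingerprints (sets of "on" bit indices),
# and the expression response is borrowed from the nearest known compound.

def tanimoto(a, b):
    """Tanimoto similarity between two binary fingerprints."""
    return len(a & b) / len(a | b)

# Training compounds: fingerprint -> observed effect on a gene of interest.
train = [
    ({1, 4, 7, 9},  "up"),
    ({1, 4, 8},     "up"),
    ({2, 3, 5, 6},  "down"),
    ({2, 5, 6, 10}, "down"),
]

def predict_expression(fp):
    """Nearest-neighbor prediction of the expression response."""
    return max(train, key=lambda pair: tanimoto(fp, pair[0]))[1]

print(predict_expression({1, 4, 9}))  # structurally close to the "up" class
print(predict_expression({2, 5, 6}))  # structurally close to the "down" class
```

The payoff of such models, as the text notes, is that one can screen virtual compounds against a desired expression profile rather than against a single target readout.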

The important practical goal of the research described above is to identify a protein or a pathway in the body, a “target”, that plays a major role in the mechanism of a particular disease. It would then be possible to inhibit or otherwise modulate the target with chemical molecules to influence the course of the disease.


2. Finding the “right” drug molecules


In the realm of drug discovery, finding the “right” drug molecules is a crucial step. Once a suitable biological target has been proposed by scientists, it is time to search for molecules that can selectively interact with the target, stimulating the desired effect — a “hit” molecule.

Various screening paradigms exist to identify hit molecules. For example, the popular high-throughput screening (HTS) approach involves screening millions of chemical compounds directly against the drug target. Essentially, it is a “trial and error” method of finding a needle in a haystack. This paradigm requires complex robotic automation, is costly, and has a rather low success rate. Its strength, though, is that it assumes no prior knowledge of the nature of the chemical compounds likely to have activity at the target protein. HTS thus serves as an experimental source of ideas for further research, and it provides useful “negative” results to be taken into account.
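The first-pass analysis of an HTS run is itself a data problem: each well yields an activity readout, and compounds whose signal deviates strongly from the plate distribution are flagged as candidate hits. The sketch below illustrates a simple Z-score cutoff; the plate readings, compound IDs, and cutoff value are invented for illustration.

```python
# Toy HTS hit-picking: flag compounds whose readout lies several standard
# deviations above the plate mean. Readings below are invented.

from statistics import mean, stdev

def pick_hits(readings, z_cutoff=3.0):
    """Return compound IDs whose readout is z_cutoff SDs above the mean."""
    mu, sigma = mean(readings.values()), stdev(readings.values())
    return sorted(
        cid for cid, v in readings.items() if (v - mu) / sigma >= z_cutoff
    )

plate = {"cpd_01": 0.9, "cpd_02": 1.1, "cpd_03": 1.0,
         "cpd_04": 7.5,  # strong signal: candidate hit
         "cpd_05": 0.8, "cpd_06": 1.2}
print(pick_hits(plate, z_cutoff=2.0))
```

Production HTS analysis adds plate-level normalization, positive/negative controls, and replicate handling, but the core of separating signal from noise is the same.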

Other approaches include fragment screening and a more specialized, focused approach: physiological screening, a tissue-based technique that looks for a response aligned with the final desired in vivo effect, as opposed to targeting one specific drug target.

In pursuit of cutting the costs of the above complex laboratory screens and increasing their efficiency and predictability, computational scientists advanced computer-aided drug discovery (CADD) approaches, using pharmacophores and molecular modeling to conduct so-called “virtual” screens of compound libraries. In this approach, millions of compounds can be screened in silico against a known 3D structure of a target protein (structure-based approach); if the structure is unknown, it is possible to identify drug candidates based on knowledge of other molecules known to have activity towards the target of interest (ligand-based approach).
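A ligand-based virtual screen, in its simplest form, ranks library compounds by how closely they match the features of a known active molecule. The sketch below uses sets of named features as a stand-in for real descriptors (pharmacophore points, fingerprints, shape); the molecules, features, and scoring rule are all invented.

```python
# Toy ligand-based virtual screen: rank candidates by the fraction of a
# known active's features they share. Feature sets are invented placeholders.

known_active = {"aromatic_ring", "h_bond_donor", "h_bond_acceptor", "basic_N"}

library = {
    "mol_A": {"aromatic_ring", "h_bond_donor", "basic_N"},
    "mol_B": {"aliphatic_chain", "h_bond_acceptor"},
    "mol_C": {"aromatic_ring", "h_bond_donor", "h_bond_acceptor", "basic_N"},
}

def score(features):
    """Fraction of the reference molecule's features that are present."""
    return len(features & known_active) / len(known_active)

# Highest-scoring candidates go forward to experimental confirmation.
ranked = sorted(library, key=lambda m: score(library[m]), reverse=True)
print(ranked)
```

Structure-based screens replace this feature matching with docking a candidate into the target’s 3D binding site, but the output is the same: a ranked shortlist for the wet lab.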

Drug discovery modeling is a promising area where big data analytics and machine learning algorithms can become “superstars”. By harnessing the power of AI and computational methods, researchers can expedite the search for effective molecules, reduce costs, and improve the predictability of drug discovery processes, ultimately transforming the pharmaceutical industry and healthcare as a whole.

3. Improving preclinical development using artificial intelligence

One of the reasons the pharmaceutical industry experiences a low success rate in research and development is that animal testing of new drug candidates is not very representative of human outcomes. Drugs fail at later stages, which costs investors substantial money and companies wasted time. More crucially, it costs healthy years, and even lives, for people with health issues.

New artificial intelligence algorithms and big data approaches are increasingly being probed as alternatives to in vivo testing. For example, in a study published in Toxicological Sciences, AI was trained to predict the toxicity of thousands of unknown chemicals based on previous animal tests, and its predictions proved as accurate as live animal tests. These are certainly the early days of substituting animal testing with AI-enabled computational models, but the industry is heading in this direction quite rapidly. An especially interesting combination is AI technologies with the organ-on-a-chip approach; the article “Robotic fluidic coupling and interrogation of multiple vascularized organ chips” gives a glimpse of how thinking is developing in this area.
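The core idea behind such toxicity models is “read-across”: infer a new chemical’s toxicity from the most similar compounds with known animal-test results. The sketch below does this with a tiny nearest-neighbor vote; the descriptors, values, and labels are invented, and real models use thousands of descriptors and far larger historical datasets.

```python
# Sketch of "read-across" toxicity prediction: a new chemical inherits the
# majority label of its nearest neighbors in descriptor space. Data invented.

import math

# (logP, molecular weight / 100) -> toxic? from historical animal tests
history = [
    ((1.2, 1.8), True),
    ((1.0, 2.0), True),
    ((3.5, 4.2), False),
    ((3.8, 4.0), False),
    ((1.1, 2.2), True),
]

def predict_toxicity(desc, k=3):
    """Majority vote among the k nearest compounds in descriptor space."""
    nearest = sorted(history, key=lambda h: math.dist(desc, h[0]))[:k]
    votes = sum(label for _, label in nearest)
    return votes > k // 2

print(predict_toxicity((1.15, 1.9)))  # resembles the toxic cluster
print(predict_toxicity((3.6, 4.1)))   # resembles the non-toxic cluster
```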

4. Optimizing clinical trials using AI

Finally, clinical trials are a critical stage of the drug development workflow, with an estimated average success rate of about 11% for drug candidates moving from Phase 1 to approval. Even if a drug candidate is safe and efficacious, clinical trials might fail due to lack of financing, insufficient enrollment, or poor study design [Fogel DB, 2018].

Artificial intelligence (AI) is increasingly perceived as a source of opportunities to improve the operational efficiency of clinical trials and minimize clinical development costs. Typically, AI drug discovery vendors offer their services and expertise in three main areas. AI start-ups in the first area help unlock information from disparate data sources, such as scientific papers, medical records, disease registries, and even medical claims, by applying natural language processing (NLP). This can support patient recruitment and stratification, site selection, and improved clinical study design and understanding of disease mechanisms. As an example, about 18% of clinical studies fail due to insufficient recruitment, as a 2015 study reported.

Another aspect of success in clinical trials is improved patient stratification. Since trial patients are expensive (the average cost of enrolling one patient was $15,700-26,000 in 2017), it is important to predict which patients will derive greater benefit, or face greater risk, from treatment. Top AI drug discovery companies work with multiple data types, such as electronic health records (EHR), omics, and imaging data, to reduce population heterogeneity and increase clinical study power. Vendors can use speech biomarkers to identify neurological disease progression, imaging analyses to track treatment progression, or genetic biomarkers to identify patients with more severe symptoms.
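At its simplest, stratification means grouping patients by biomarker profile so a trial can enroll a more homogeneous subgroup. The sketch below clusters toy two-biomarker profiles with a tiny two-cluster k-means; the patient IDs, values, and “mild”/“severe” interpretation are invented, and real pipelines cluster high-dimensional EHR, omics, and imaging features.

```python
# Toy patient stratification: 2-cluster k-means over two biomarker values
# per patient. All values are invented for illustration.

import math

patients = {
    "p1": (0.9, 1.1), "p2": (1.0, 0.8), "p3": (1.2, 1.0),  # milder profiles
    "p4": (4.1, 3.9), "p5": (3.8, 4.2), "p6": (4.0, 4.0),  # severer profiles
}

def kmeans_2(points, iters=10):
    """Tiny 2-cluster k-means; returns the two sets of patient IDs."""
    ids = list(points)
    c1, c2 = points[ids[0]], points[ids[-1]]  # crude initial centroids
    for _ in range(iters):
        g1 = {i for i in ids
              if math.dist(points[i], c1) <= math.dist(points[i], c2)}
        g2 = set(ids) - g1
        # Recompute each centroid as the mean of its assigned points.
        c1 = tuple(sum(v) / len(g1) for v in zip(*(points[i] for i in g1)))
        c2 = tuple(sum(v) / len(g2) for v in zip(*(points[i] for i in g2)))
    return g1, g2

mild, severe = kmeans_2(patients)
print(sorted(mild), sorted(severe))
```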

AI is also streamlining the operational processes of clinical trials. AI vendors help track patient health from home, monitor treatment response, and monitor patient adherence to trial procedures, thereby decreasing the risk of patient dropouts, which average around 30%. A Phase 3 clinical study typically requires 1,000-3,000 participants, some of them taking a placebo. That is why the development of synthetic control arms, AI models that could replace the placebo-control groups and thus reduce the number of individuals required for clinical trials, might become a novel trend.

In one of the previous newsletter issues, I summarized several AI companies in clinical research: 8 Notable AI Companies in Clinical Research to Watch in 2022.

Concluding Remarks

Only a few years have passed since tech entrepreneur Marc Andreessen penned his famous “Why Software Is Eating the World” essay. Today, a new statement is proving true: “software eats bio”.

New computational technologies and machine learning algorithms are revolutionizing the biopharmaceutical industry and the way drugs are discovered, bringing the promise of AI drug discovery. A systemic understanding of biological processes and disease mechanisms opens the door not only to better drug molecules but also to the whole new concept of personalized medicine, which takes into account individual variability in environment, lifestyle, and genetics across individual patients or groups of patients. Big data and machine learning are the essential technologies that let us hope for true personalized medicine to become a mainstream phenomenon someday.
