How Machine Learning Helps the Drug Discovery Process

Business objective

Drug discovery is a time-consuming, complex, and expensive process that typically takes 10-15 years and costs billions of dollars. In recent years, machine learning (ML) has demonstrated its potential to revolutionize drug discovery by accelerating candidate identification and reducing costs. This case study examines how a pharmaceutical company implemented an ML-based approach to streamline its drug discovery process and identify potential drug candidates more efficiently.

  • Data science
  • Deep learning
  • Machine learning
  • Business intelligence

Company Background

The company is a mid-sized pharmaceutical firm specializing in the development of novel therapeutics for various diseases. In an effort to improve efficiency and reduce costs, it decided to explore ML techniques in its drug discovery process.


Problem Statement

The company sought to accelerate the identification of potential drug candidates in the early stages of drug discovery, specifically during the hit identification and lead optimization phases. They aimed to reduce the time and cost required for these processes, while maintaining or improving the quality of the candidate molecules.



Our Approach

We implemented a three-stage ML-based approach to streamline the company's drug discovery process:

  1. Data collection and preprocessing: The company collected a large dataset containing information on molecular structures, biological activities, and pharmacokinetic properties of known compounds. This dataset was cleaned, preprocessed, and standardized to ensure consistency and quality.
  2. Model development and training: The data science team used this dataset to train various ML models, including deep learning algorithms such as convolutional neural networks (CNNs) and recurrent neural networks (RNNs), as well as traditional ML algorithms such as random forests and support vector machines. They used techniques like cross-validation and hyperparameter optimization to fine-tune the models for optimal performance.
  3. Model validation and deployment: The ML models were tested on a separate, unseen dataset to evaluate their performance in predicting the bioactivity and pharmacokinetic properties of new compounds. Once validated, the models were integrated into the company's drug discovery pipeline to aid in the selection of potential drug candidates.
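The three stages above can be sketched with scikit-learn. The dataset, descriptor count, and model choices below are illustrative assumptions standing in for the company's actual data and configuration:

```python
# Sketch of the three-stage workflow: (1) preprocess a compound dataset,
# (2) train candidate models with cross-validation, (3) validate the best
# model on an unseen hold-out set. All data here is synthetic.
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

rng = np.random.default_rng(42)

# Stage 1: stand-in for the cleaned compound dataset -- each row is a
# compound described by numeric molecular descriptors, with a binary
# active/inactive bioactivity label.
X = rng.normal(size=(500, 20))           # 500 compounds, 20 descriptors
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # synthetic bioactivity label

# Hold out an unseen set for Stage 3 validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Stage 2: train candidate models with 5-fold cross-validation; scaling
# lives inside the pipeline so it is fit only on each training fold.
models = {
    "random_forest": make_pipeline(StandardScaler(),
                                   RandomForestClassifier(random_state=0)),
    "svm": make_pipeline(StandardScaler(), SVC()),
}
cv_scores = {name: cross_val_score(model, X_train, y_train, cv=5).mean()
             for name, model in models.items()}

# Stage 3: refit the best model and evaluate it on the unseen hold-out set.
best_name = max(cv_scores, key=cv_scores.get)
best_model = models[best_name].fit(X_train, y_train)
holdout_accuracy = best_model.score(X_test, y_test)
print(best_name, round(holdout_accuracy, 3))
```

In practice the descriptors would come from a cheminformatics toolkit and the labels from assay data; the structure of the workflow, however, stays the same.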


Tech Stack

To accelerate the drug discovery process, we leveraged Amazon Web Services (AWS), a robust cloud computing platform that offers a wide range of services and tools for machine learning and data processing. Here is an overview of how we used various AWS services to implement the ML-based approach in the company's drug discovery pipeline:

  1. Data storage and management: We used Amazon S3 (Simple Storage Service) to securely store and manage the large datasets containing molecular structures, biological activities, and pharmacokinetic properties of known compounds. With Amazon S3, we could easily scale storage as the dataset grew over time while ensuring data durability and security.
  2. Data preprocessing: AWS Glue, a fully managed extract, transform, and load (ETL) service, was employed to clean, preprocess, and standardize the collected data. Glue enabled the data science team to automate the data transformation process, ensuring data consistency and quality across the datasets.
  3. Machine learning model development and training: We used Amazon SageMaker, a fully managed service for building, training, and deploying ML models. SageMaker provided built-in algorithms and pre-configured environments for popular ML frameworks, such as TensorFlow and PyTorch, enabling the data science team to focus on model development without worrying about infrastructure management. The team used SageMaker's distributed training capabilities to train various ML models, including CNNs, RNNs, random forests, and support vector machines, on the preprocessed data.
  4. Hyperparameter optimization: To fine-tune the ML models for optimal performance, we used Amazon SageMaker’s Hyperparameter Optimization (HPO) feature. HPO automates the process of searching for the best hyperparameters by using techniques such as Bayesian optimization and random search. This allowed the data science team to find the optimal model configuration more quickly and efficiently.
  5. Model validation and deployment: Once the ML models were trained and optimized, we used SageMaker to deploy the models as RESTful APIs, making it easy to integrate the models into the drug discovery pipeline. SageMaker’s built-in model monitoring and endpoint management features ensured that the deployed models were consistently available and reliable for use in the drug discovery process.
  6. Scalability and cost management: By using AWS, we could take advantage of the platform's scalability and pay-as-you-go pricing model. As the company's computational and storage requirements increased, AWS services could be scaled up to meet demand. Additionally, we optimized costs by selecting the most appropriate compute instances for the ML workloads and by using Amazon EC2 Spot Instances to take advantage of unused EC2 capacity at a lower cost.
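The random-search idea behind hyperparameter optimization (step 4) can be sketched locally with scikit-learn's RandomizedSearchCV. The model, search space, and synthetic data below are illustrative assumptions, not the team's actual SageMaker configuration:

```python
# Random-search hyperparameter optimization: sample a fixed number of
# configurations from the search space instead of trying every combination.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 10))   # synthetic molecular descriptors
y = (X[:, 0] > 0).astype(int)    # synthetic bioactivity label

# Candidate values for each hyperparameter; RandomizedSearchCV samples
# n_iter random combinations and cross-validates each one.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [50, 100, 200, 300],
        "max_depth": [2, 4, 6, 8, None],
    },
    n_iter=10,      # number of random configurations to evaluate
    cv=3,
    random_state=0,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

SageMaker's HPO feature applies the same principle at scale, and can additionally use Bayesian optimization to choose each new configuration based on the results of previous trials rather than sampling blindly.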



Conclusion

The successful implementation of machine learning in the drug discovery process demonstrates the potential of ML to revolutionize the pharmaceutical industry. By accelerating the identification of potential drug candidates and reducing costs, ML can enable companies to bring life-saving therapies to patients more quickly and efficiently. As the company continues to refine its ML models and incorporate new data, further improvements in drug discovery outcomes can be expected.
