Session 11: Practical Issues

Date: Thursday 16:00 – 17:30 CET
Chair: Ludovico Boratto (University of Cagliari)

  • [IN] Drug Discovery as a Recommendation Problem: Challenges and Complexities in Biological Decisions
    by Anna Gogleva (AstraZeneca, United Kingdom), Eliseo Papa (AstraZeneca, United Kingdom), Erik Jansson (AstraZeneca, Sweden), and Greet De Baets (AstraZeneca, United Kingdom)

    Drug discovery is notorious for its low success rates [5]. Despite best research efforts, the majority of drugs fail at early stages of development, even before they enter clinical trials. This phenomenon stems from the inherent complexity of biological systems and our poor understanding of human diseases. To improve that understanding, swaths of data have been generated in recent years. Still, data does not easily translate into knowledge or actionable insights. Here we explore how approaches from the recommendation system domain could help scientists comprehend the ever-growing amount of biomedical facts. The aim of these efforts is to make better drug development decisions, which ultimately result in safe and efficient treatments for patients [3].
    Recommendation systems are well established in e-commerce, streaming and social media platforms; in the biomedical domain, however, their usage is limited to a few recent studies [1, 6, 7, 8]. Direct transfer of classic recommendation approaches to the biomedical domain is not trivial. Specifics of the problem space impose numerous challenges for a recommendation system practitioner.

    Regardless of the challenges, the adoption of recommendation systems presents numerous opportunities to support and accelerate drug discovery. Even a slight increase in success rate of drug pipelines will result in a vast number of patients gaining access to safe and effective treatments. Recommendation systems could play a leading role in this process.
    Adding context to experimental data is one class of problems that could benefit from recommenders. In this process, new data is integrated with prior evidence to produce a new hypothesis. In a typical scenario, thousands of genes need to be ranked by their relevance to a disease given new and existing data. As a case study, we focused on finding out why some lung cancer patients develop resistance to treatments. The current protocol for finding resistance markers starts with high-throughput genomic screens that yield an initial list of potential gene candidates, followed by tedious manual curation by several experts to reduce the list to a manageable number for further follow-up.
    To find resistance markers faster and to reduce bias we built a hybrid recommendation system on top of a heterogeneous biomedical knowledge graph [2]. In the absence of continuous feedback and training data, we approached recommendations as a multi-objective optimization problem [4]. Genes were ranked by trading off diverse types of evidence that link them to potential mechanisms of resistance in lung cancer. We used a knowledge graph as the primary source of features, so that the relevance of a gene could be expressed via properties of a graph. Our hybrid feature set also included clinical and pre-clinical data as well as metrics of literature support obtained with natural language processing techniques. This hybrid approach helped to identify novel resistance mechanisms that could have been overlooked by experts due to inherent bias or limited integration of data. Most importantly, our method reduced the time required to prioritise resistance markers from months to minutes and became a standard procedure for processing genomic screens.
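    The abstract does not say which multi-objective method was used. One common label-free way to trade off several kinds of evidence is Pareto ranking, sketched below under that assumption. The gene names, the three evidence columns (for example graph-derived, clinical, and literature scores), and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """True if `a` is at least as good as `b` on every objective
    and strictly better on at least one (higher is better)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def pareto_fronts(scores: Dict[str, Sequence[float]]) -> List[List[str]]:
    """Peel off successive non-dominated fronts; front 0 is the top tier."""
    remaining = dict(scores)
    fronts: List[List[str]] = []
    while remaining:
        # A gene is in the current front if no other remaining gene dominates it.
        front = sorted(
            g for g, s in remaining.items()
            if not any(dominates(t, s) for h, t in remaining.items() if h != g)
        )
        fronts.append(front)
        for g in front:
            del remaining[g]
    return fronts


# Hypothetical evidence scores per gene: [graph, clinical, literature].
evidence = {
    "GENE_A": [0.9, 0.2, 0.5],
    "GENE_B": [0.4, 0.8, 0.6],
    "GENE_C": [0.3, 0.1, 0.4],
    "GENE_D": [0.9, 0.2, 0.4],
}
fronts = pareto_fronts(evidence)
for tier, genes in enumerate(fronts):
    print(tier, genes)
```

    Genes in the first front are those for which no other gene is better on every type of evidence, so no weighting between objectives has to be chosen up front.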
    Another class of problems exists around target identification tasks. The idea here is to find a molecular target, often a gene or a protein, that could be modulated with a drug to treat a disease. As the number of potential targets is large, the search space can be reduced using network propagation on a dedicated subgraph that captures the functional relationships between genes. This approach also requires a set of seed genes, defined based on high-confidence associations with diseases. Disease preferences are then propagated through the network, resulting in a preference distribution over the complete set of genes, which is used to reduce the search space.
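    Network propagation from seed genes is often implemented as a random walk with restart. The abstract does not name the exact algorithm, so the following is a minimal sketch under that assumption; the toy graph, the gene identifiers, and the `restart` value are hypothetical.

```python
from typing import Dict, List, Set


def propagate(adj: Dict[str, List[str]], seeds: Set[str],
              restart: float = 0.3, iters: int = 100) -> Dict[str, float]:
    """Random walk with restart on an undirected graph given as adjacency lists.
    Returns a preference score per gene; the scores sum to 1."""
    nodes = sorted(adj)
    # Initial preference: uniform over the seed genes.
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(iters):
        # Restart term: part of the mass always returns to the seeds.
        nxt = {n: restart * p0[n] for n in nodes}
        for n in nodes:
            nbrs = adj[n]
            if not nbrs:
                # Isolated node: its walking mass stays where it is.
                nxt[n] += (1.0 - restart) * p[n]
                continue
            share = (1.0 - restart) * p[n] / len(nbrs)
            for m in nbrs:
                nxt[m] += share
        p = nxt
    return p


# Hypothetical functional network: a chain G1-G2-G3-G4 plus an isolated G5.
graph = {
    "G1": ["G2"],
    "G2": ["G1", "G3"],
    "G3": ["G2", "G4"],
    "G4": ["G3"],
    "G5": [],
}
scores = propagate(graph, seeds={"G1"})
shortlist = sorted(scores, key=scores.get, reverse=True)[:3]
print(shortlist)
```

    Genes close to the seeds receive more of the propagated preference, so keeping only the top-scoring genes is one way to shrink the search space.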
    In contrast to adding context to experiments, a considerable amount of training data is available to support target identification. For instance, both successful and failed clinical trials can act as a useful source of data for target identification. Such a setting warrants the use of supervised recommendation systems. A supervised approach, however, introduces another machine learning hurdle: trust. Since supervised models are typically "black boxes", their quality must be ascertained indirectly, for example by using a train-test split and estimating the model's performance on the test set. Such quantitative performance metrics are often of little value to a biological expert looking for relevant gene targets. Instead, experts instinctively assess model quality by checking whether a list of recommendations contains a handful of expected genes [9].
    To simultaneously use biologists’ intuitions as training data, while avoiding an overly optimistic trust in model output, we used an ensemble modeling approach. We partitioned training data among multiple models such that each available training gene was omitted from one model’s training data. The model was then permitted to assess this previously unseen gene in constructing its final list of recommendations, while training genes were removed from consideration. Each model therefore produced a list of recommendations based on an incomplete set of genes. A final set of recommendations was then constructed by collating each individual model’s output list. Because these output lists were constructed with biologist input through supervised training, biologists placed a higher degree of trust in the recommendations. This allowed roughly two dozen genes to be fast-tracked for manual assessment and experimental screening.
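    The partitioning scheme described above can be sketched as follows. This is only an illustration: the centroid-distance scorer stands in for whatever supervised model was actually used, and the gene names, feature vectors, `k`, and `top_n` are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List, Sequence


def partition(genes: List[str], k: int) -> List[List[str]]:
    """Split the training genes into k disjoint held-out groups."""
    return [genes[i::k] for i in range(k)]


def fit_scorer(train: List[Sequence[float]]) -> Callable[[Sequence[float]], float]:
    """Stand-in model: score a gene by closeness to the training centroid."""
    dim = len(train[0])
    centroid = [sum(f[d] for f in train) / len(train) for d in range(dim)]
    return lambda f: -sum((x - c) ** 2 for x, c in zip(f, centroid))


def ensemble_recommend(features: Dict[str, Sequence[float]],
                       training_genes: List[str], k: int, top_n: int) -> Counter:
    """Train k models (k >= 2), each with one group of training genes held out,
    and collate their top-n lists into vote counts."""
    votes: Counter = Counter()
    for held_out in partition(training_genes, k):
        train = [g for g in training_genes if g not in held_out]
        scorer = fit_scorer([features[g] for g in train])
        # Genes the model trained on are removed from consideration;
        # held-out training genes and unlabeled genes remain candidates.
        candidates = [g for g in features if g not in train]
        ranked = sorted(candidates, key=lambda g: (-scorer(features[g]), g))
        votes.update(ranked[:top_n])
    return votes


# Hypothetical data: T* are expert-provided training genes, U* are unlabeled.
features = {
    "T1": [1.0, 0.9], "T2": [0.9, 1.0], "T3": [1.0, 1.0], "T4": [0.8, 0.9],
    "U1": [0.95, 0.95], "U2": [0.1, 0.2], "U3": [0.9, 0.8],
}
training = ["T1", "T2", "T3", "T4"]
votes = ensemble_recommend(features, training, k=2, top_n=2)
print(votes.most_common())
```

    Each training gene can be recommended by at most one model, the one that never saw it, so an expected gene showing up in the collated list is evidence of generalization rather than memorization.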
    In summary, the accumulation of large amounts of biomedical data, coupled with the need to comprehend and reason about it, makes drug discovery an attractive field for applying recommendation techniques. Specifics of the problem space and the complexity of biological systems call for efficient recommendation solutions that can operate in unsupervised or weakly supervised settings. At the same time, a strong emphasis on explainability is essential to gain the trust of biomedical experts.

    Full text in ACM Digital Library

  • [PA] Shared Neural Item Representations for Completely Cold Start Problem
    by Ramin Raziperchikolaei (Rakuten, Inc., United States), Guannan Liang (Rakuten, Inc., United States), and Young-joo Chung (Rakuten, Inc., United States)

    Neural networks have recently become popular in recommendation systems for extracting user and item representations. Most previous works follow a two-branch setting, where user and item networks learn user and item representations in the first and second branches, respectively. In the item cold-start problem, where no usage patterns exist for the items, the user network takes the ID/interaction vector as input and the item network takes the item side information (content) as input. In this paper, we show that with this structure, two representations are learned for each item in the training set: one is the output of the item network, and the other is hidden inside the user network and is used for learning user representations. Learning two representations makes training slower and optimization more difficult. We propose to unify the two representations and use only the one generated by the item network. We also show how attention mechanisms fit into our setting and how they can improve the quality of the representations. Our results on public and real-world datasets show that our approach converges faster, achieves higher recall in fewer iterations, and is more robust to changes in the number of training samples than previous works.
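    The idea of using a single item representation can be illustrated with a toy example in which the user representation is built directly from the item network's outputs. This is a rough sketch and not the paper's architecture: the single linear layer, the mean pooling over interacted items, the dot-product score, and all names and numbers are assumptions made for illustration.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]


def item_network(content: Sequence[float], weights: Matrix) -> List[float]:
    """Toy item branch: one linear layer mapping content to a representation."""
    return [sum(w * x for w, x in zip(row, content)) for row in weights]


def user_representation(interacted: List[str],
                        item_content: Dict[str, Sequence[float]],
                        weights: Matrix) -> List[float]:
    """Build the user representation from the SAME item representations,
    here by mean pooling (attention weights could replace the plain mean)."""
    reps = [item_network(item_content[i], weights) for i in interacted]
    dim = len(reps[0])
    return [sum(r[d] for r in reps) / len(reps) for d in range(dim)]


def score(user_rep: Sequence[float], item_rep: Sequence[float]) -> float:
    """Dot-product relevance between a user and an item."""
    return sum(u * v for u, v in zip(user_rep, item_rep))


# Hypothetical content features; "cold" and "other" have no interactions.
item_content = {
    "i1": [1.0, 0.0, 0.0],
    "i2": [1.0, 1.0, 0.0],
    "cold": [1.0, 0.5, 0.0],
    "other": [0.0, 0.0, 1.0],
}
weights = [[1.0, 0.0, -1.0],
           [0.0, 1.0, 0.0]]

user = user_representation(["i1", "i2"], item_content, weights)
cold_score = score(user, item_network(item_content["cold"], weights))
other_score = score(user, item_network(item_content["other"], weights))
print(cold_score, other_score)
```

    Because the user side has no separate per-item parameters, a completely cold item can be scored from its content alone, and only one set of item parameters has to be trained.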

    Full text in ACM Digital Library
