Selected Publications
For a more complete list of my publications, please see my Google Scholar page or my CV.
Yasha Ektefaie, Andrew Shen, Daria Bykova, Maximillian G. Marin, Marinka Zitnik & Maha Farhat
Nature Machine Intelligence, 2024
Evaluating generalizability of artificial intelligence models for molecular datasets
Deep learning has made rapid advances in modelling molecular sequencing data. Despite achieving high performance on benchmarks, it remains unclear to what extent deep learning models learn general principles and generalize to previously unseen sequences. Benchmarks traditionally interrogate model generalizability by generating metadata- or sequence similarity-based train and test splits of input data before assessing model performance. Here we show that this approach mischaracterizes model generalizability by failing to consider the full spectrum of cross-split overlap, that is, similarity between train and test splits. We introduce SPECTRA, the spectral framework for model evaluation. Given a model and a dataset, SPECTRA plots model performance as a function of decreasing cross-split overlap and reports the area under this curve as a measure of generalizability…
Yasha Ektefaie, George Dasoulas, Ayush Noori, Maha Farhat, Marinka Zitnik
Nature Machine Intelligence, 2023
Multimodal learning with graphs
Graph-centric artificial intelligence (graph AI) has achieved remarkable success in modeling interacting systems prevalent in nature, from dynamical systems in biology to particle physics. The increasing heterogeneity of data calls for graph neural architectures that can combine multiple inductive biases. However, combining data from various sources is challenging because appropriate inductive bias may vary by data modality. Multimodal learning methods fuse multiple data modalities while leveraging cross-modal dependencies to address this challenge. Here, we survey 140 studies in graph-centric AI and realize that diverse data types are increasingly brought together using graphs and fed into sophisticated multimodal models. These models stratify into image-, language-, and knowledge-grounded multimodal learning. We put forward an algorithmic blueprint for multimodal graph learning based on this categorization…
Yasha Ektefaie, Avika Dixit, Luca Freschi, Maha Farhat
The Lancet Microbe, 2021
Globally diverse Mycobacterium tuberculosis resistance acquisition: a retrospective geographical and temporal analysis of whole genome sequences
Mycobacterium tuberculosis whole genome sequencing (WGS) data can provide insights into temporal and geographical trends in resistance acquisition and inform public health interventions. We aimed to use a large clinical collection of M tuberculosis WGS and resistance phenotype data to study how, when, and where resistance was acquired on a global scale. We did a retrospective analysis of WGS data. We curated a set of clinical M tuberculosis isolates with high-quality sequencing and culture-based drug susceptibility data (spanning four lineages and 52 countries in Africa, Asia, the Americas, and Europe) using public databases and literature curation. For inclusion, sequence quality criteria and country of origin data were required. We constructed geographical and lineage specific M tuberculosis phylogenies and used Bayesian molecular dating with BEAST, version 1.10.4, to infer the most recent common susceptible ancestor age for 4869 instances of resistance to ten drugs…
Yasha Ektefaie, William Yuan, Deborah A. Dillon, Nancy U. Lin, Jeffrey A. Golden, Isaac S. Kohane & Kun-Hsing Yu
npj Breast Cancer, 2021
Integrative multiomics-histopathology analysis for breast cancer classification
Histopathologic evaluation of biopsy slides is a critical step in diagnosing and subtyping breast cancers. However, the connections between histology and multi-omics status have never been systematically explored or interpreted. We developed weakly supervised deep learning models over hematoxylin-and-eosin-stained slides to examine the relations between visual morphological signal, clinical subtyping, gene expression, and mutation status in breast cancer. We first designed fully automated models for tumor detection and pathology subtype classification, with the results validated in independent cohorts (area under the receiver operating characteristic curve ≥ 0.950). Using only visual information, our models achieved strong predictive performance in estrogen/progesterone/HER2 receptor status, PAM50 status, and TP53 mutation status. We demonstrated that these models learned lymphocyte-specific morphological signals to identify estrogen receptor status…