Deep Learning for Biosequences



Jean-Philippe Vert (Google Research)

Jean-Philippe Vert is a research scientist at Google Brain in Paris and adjunct research professor at PSL Mines ParisTech’s Centre for Computational Biology. Prior to joining Google in 2018, he was a postdoc in computational biology at Kyoto University (2001-2002), research professor and founding director of the Centre for Computational Biology at Mines ParisTech (2003-2018), team leader at the Curie Institute in Paris on computational biology of cancer (2008-2018), Miller visiting professor at UC Berkeley (2015-2016), and research professor at the department of mathematics of Ecole normale superieure in Paris (2016-2018). He graduated from Ecole Polytechnique (1995) and Corps des Mines (1998), and holds a PhD in mathematics from Paris 6 University (2001). His research interests concern the development of statistical and machine learning methods, particularly to model complex, high-dimensional and structured data, with an application focus on computational biology, genomics and precision medicine. His recent contributions include new methods to embed structured data such as strings, graphs or permutations into vector spaces, regularization techniques to learn from limited amounts of data, and computationally efficient techniques for pattern detection and feature selection. He also works on several medical applications in cancer research, including quantifying and modeling cancer heterogeneity, predicting response to therapy, and modeling the genome and epigenome of cancer cells at the single-cell level.



Short Abstract: Deep neural networks are increasingly used to analyze biological sequences, including DNA, RNA and proteins, leading to promising applications in annotation, classification, structure prediction and generation. While the architectures of deep neural networks for biosequences have so far been largely borrowed from the field of natural language processing, I will discuss in this presentation some particularities of biosequences that call for dedicated methodological developments, in particular 1) how to transform a biosequence into a sequence of tokens, 2) how to incorporate some known symmetries of biosequences in the architecture of the model, and 3) how to solve tasks that are specific to biosequences, such as learning to align.
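
To make points 1) and 2) concrete, here is a minimal Python sketch. It is not taken from the talk: it assumes overlapping k-mers as one possible tokenization of DNA, and takes the reverse-complement (double-strand) symmetry as an example of a known symmetry. All function names are hypothetical and chosen only for this illustration.

```python
from collections import Counter

# Illustrative assumption: DNA over the alphabet A, C, G, T only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_tokenize(seq, k=3, stride=1):
    """Split a sequence into k-mer tokens (overlapping when stride < k)."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def reverse_complement(seq):
    """Return the sequence read from the opposite DNA strand."""
    return seq.translate(COMPLEMENT)[::-1]


def canonical_kmer(kmer):
    """Map a k-mer and its reverse complement to the same representative."""
    return min(kmer, reverse_complement(kmer))


def canonical_counts(seq, k=3):
    """Bag of canonical k-mers: a feature map that does not depend on the strand."""
    return Counter(canonical_kmer(t) for t in kmer_tokenize(seq, k))


if __name__ == "__main__":
    seq = "AAGCTTG"
    print(kmer_tokenize(seq))
    print(reverse_complement(seq))
    # Both strands yield the same canonical k-mer counts.
    print(canonical_counts(seq) == canonical_counts(reverse_complement(seq)))
```

Counting canonical k-mers is only one simple way to obtain a strand-independent representation; building the same symmetry into the architecture of a neural network, as the abstract suggests, is a separate design problem that this sketch does not address.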