Welcome!
Since Feb 2021
Keywords: Project planning. Resource management. Strategic thinking.
I lead the Informatics Infrastructure team of the Tree of Life programme, which oversees the implementation and delivery of the genome assembly pipelines, and provides support for large-scale genome analyses for the Tree of Life faculty teams.
My ambition is to provide the most efficient platform for assembling and analysing genomes at an unprecedented scale. The Tree of Life projects will generate tens of thousands of high-quality genomes over the coming years, more than have ever been sequenced! It is a challenging and extremely exciting task that will shape the future of biology.
We are the interface between the Tree of Life teams (assembly production and faculty research) and Sanger’s IT teams, and we work together with the informatics teams of the other programmes. The work involves a wide range of scientific fields and technologies, such as assembly methods, genomics, comparative genomics, cloud computing, and large-scale analyses, with a strong emphasis on metadata tracking, quality controls, and event recording.
I worked in the Comparative Genomics team of the Ensembl genome browser. The team is in charge of comparing genomes to one another, implementing new methods and algorithms (extending our API and database schema), and applying them to new datasets. Scalability was our main focus, as we had to process hundreds of genomes within a limited timescale.
I was also involved in the development of the eHive workflow manager, a system for creating and running workflows on a distributed compute resource. It is now responsible for scheduling and executing in excess of 1,000 CPU years of compute per year in Ensembl.
Oct 2019 to Jan 2021
Keywords: Development management. Technical leadership. Recruitment. Mentorship. API development. Database design and optimisation. Workflow design and development. User support (data, API, workflows). Team size: 3 people.
In this role, I transferred the knowledge I had accumulated over the previous 8 years to the new Project Leader and the developers, whilst supporting them and overseeing the development of the software.
We initiated a massive revamp of our compute workflows and data storage strategy, in order to cope with the scale of data that projects such as the Darwin Tree of Life will generate. Our aim was ambitious: provide comparative analyses on tens of thousands of genomes, and more.
I was still maintaining and contributing to the development of the eHive workflow manager, keeping it the most efficient solution for the Ensembl Comparative Genomics workflows.
May 2014 to Sep 2019
Keywords: Project planning and management. Scientific and public communication. Technical leadership. Recruitment. Mentorship. Development management. Reporting. Data-production planning and operational management. API development. Database design and optimisation. Workflow design and development. Data production under tight deadlines. Processing of large datasets. User support (data, API, workflows). Team size: 2-6 people.
I managed the whole Comparative Genomics team of the Ensembl project, including the development of the eHive workflow manager. We also provided technical support to the other Ensembl teams that were using our software. During this period, we wrote an extensive user manual for eHive.
May 2013 to May 2014
Keywords: Development management. Reporting. Technical advisor. Data-production planning and operational management. API development. Database design and optimisation. Workflow design and development. Data production under tight deadlines. Processing of large datasets. User support (data, API, workflows). Team size: 2 people.
I managed part of the Comparative Genomics team of Ensembl, including the development of the eHive workflow manager, while carrying on with my developer duties. Our work focused on the reconstruction of phylogenetic trees and gene families, improving and extending the software. I also continued to give some Ensembl API workshops.
Jan 2011 to May 2013
Keywords: API development. Database design and optimisation. Workflow design and development. Data production under tight deadlines. Processing of large datasets. User support (data, API, workflows).
I focused on the reconstruction of protein phylogenetic trees, reshaping the API, improving and extending the software. I also gave some Ensembl API workshops.
Sep 2006 to Dec 2010
Title: Reconstruction of ancestral vertebrate genomes
I developed a set of new methods to predict the genome structure of ancestral species (the last common ancestors of any given group of extant species; here, about 50 vertebrate species) at different levels: number of chromosomes, chromosome content, and gene order. We also set up a database and a genome browser, Genomicus, to make the data available to the community. As we use the data from the Ensembl project, Genomicus is updated every 2 months, after each Ensembl release.
Thesis (in French) and presentation (in English) available online.
PhD | Bioinformatics | Sep 2006 to Dec 2010 | École normale supérieure, Paris, France |
MSc | Bioinformatics | Sep 2005 to Aug 2006 | Évry university, France |
MSc | Computer science, Software development, Mathematics | Sep 2003 to Aug 2006 | ENSIIE, Évry, France |
BSc | Mathematics | Sep 2004 to Jun 2005 | Paris Diderot university, France |
Classes préparatoires | Mathematics, Physics, Computer science | Sep 2001 to Jun 2003 | Lycée Louis-le-Grand, Paris, France |
May 2019 to Sep 2019
Title: Using Deep Learning techniques to enhance orthology calls
This project was funded by the 2019 edition of the Google Summer of Code program. Harshit Gupta was selected to develop a machine-learning algorithm to predict orthologies in the Ensembl Genomes Browser organization, under the supervision of Mateus Patricio and myself.
We developed a machine-learning algorithm, using TensorFlow, that predicts orthologies directly from sequence data, without using any phylogenetic methods. The method achieved high accuracy (>90% in most settings) and we are now developing a plan to use it in production in Ensembl.
Project URL: GitHub
May 2016 to Sep 2016
Title: Graphical editor of XML files
This project was funded by the 2016 edition of the Google Summer of Code program. Anuj Khandelwal was selected to work on a graphical workflow editor for eHive using Blockly in the Ensembl Genomes Browser organization, under the supervision of Leo Gordon and myself.
eHive is a system used to run computation pipelines in distributed environments. Currently the eHive workflows are configured in a specific file format that requires basic programming skills. This project aimed at removing this drawback by creating a graphical editor for eHive workflows using Google’s Blockly library.
We envisaged XML as the file format, with a Relax NG specification. The backbone of the graphical editor would be an automated conversion of a Relax NG specification into Blockly blocks and matching rules, so that the Blockly diagrams conform to the specification. The graphical interface would have to be able to import existing XML files, display them as Blockly blocks, edit them, and export the diagram back to XML.
The project submitted to Google is not specific to eHive, and the proposed editor should be able to handle any specification written in the Relax NG schema language.
May 2013
The EMBL postdoc retreat is an official annual EMBL event under the patronage of the EMBL Heads of Units. It promotes scientific exchange among postdocs and provides a platform to address problems relevant to postdocs.
École normale supérieure, Paris, France
Jun 2005 to Aug 2005
I worked on improving the user interface of Exogean, a program for annotating gene structures in eukaryotic genomic DNA.
2005
Design of new software to analyse and manage costs for electronics manufacturers (project and product visualisation, resource management).
2002 to 2003
I implemented a 3D renderer from scratch, i.e. without using any libraries such as OpenGL. The program renders (projects) 3D objects onto a plane as raster images, with colours and transparency as well as lighting and shading. It’s all written in C++.
I also know about ray-tracing, BSP trees, Bezier curves, B-splines.
2001 to 2002
I implemented a JPEG encoder/decoder from scratch, i.e. without using any libraries, working from the documentation of the compression algorithm.
I also wrote a reader / writer for the BMP file format, and a library to manipulate, transform, and apply effects to images.
Association for Project Management (APM), 2017
Course on project management. Principles and approaches have since been applied to all Ensembl management layers to plan and manage projects (development and others).
F-Secure Corporation, 2004