Since the emergence of the hCoV-19 virus (or SARS-CoV-2) responsible for the COVID-19 pandemic, unprecedented efforts have been under way across the world to sequence genomes of this virus and share the data. As of today (9/21/2020), GISAID (Shu et al., 2017) provides access to more than 105,000 full genomes, and ~23,000 are available from the NCBI and the EBI. The first genomes were sequenced in China by the end of December 2019. Their number first increased slowly, and then rapidly once the pandemic had reached all continents. Submissions of several thousand sequences to GISAID in a single day have become common. Moreover, some genomes are submitted incomplete or contain sequencing and assembly errors. These characteristics pose major challenges to bioinformatics, notably for multiple sequence alignment (MSA; Chatzou et al., 2016), which is crucial for subsequent analyses (phylogeny, transmission clusters, mutation study, structure, etc.).
To address this difficulty, we use a profile HMM-based approach (Durbin et al., 1998), which is the norm for HIV (www.hiv.lanl.gov) and is particularly well suited to hCoV-19, as its genome is highly conserved, without known recombination in human hosts (Xiaolu et al., 2020; De Maio et al., 2020). With a profile, each new genome is aligned independently of the others, so adding new data to an existing MSA takes computing time that is linear in the number of input genomes. Moreover, profile-based MSA has proved to be very accurate (Earl et al., 2014; Nute and Warnow, 2016). This approach is implemented in COVID-Align, which is available as a web service and via Docker.
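The linear-time property can be illustrated with a minimal sketch. It is not the COVID-Align implementation: instead of a profile HMM, it uses a plain Needleman-Wunsch global alignment against a single reference, with arbitrary scoring values, and the function names `align_to_reference` and `add_to_msa` are hypothetical. The point it shows is that each new sequence is aligned on its own and projected onto reference coordinates, so the existing rows are never recomputed.

```python
from typing import Dict


def align_to_reference(ref: str, query: str,
                       match: int = 1, mismatch: int = -1, gap: int = -2) -> str:
    """Globally align `query` to `ref` and return the query row in reference
    coordinates: one character per reference position, '-' for deletions;
    insertions relative to the reference are dropped."""
    n, m = len(ref), len(query)
    # score[i][j] = best score for aligning ref[:i] with query[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if ref[i - 1] == query[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)

    # Trace back from the bottom-right corner, building the row in reverse.
    row = []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and ref[i - 1] == query[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            row.append(query[j - 1])   # aligned pair (match or mismatch)
            i -= 1
            j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            row.append("-")            # deletion in the query
            i -= 1
        else:
            j -= 1                     # insertion in the query: not kept
    return "".join(reversed(row))


def add_to_msa(msa: Dict[str, str], ref: str, new_seqs: Dict[str, str]) -> Dict[str, str]:
    """Add sequences to an MSA kept in reference coordinates.
    One independent alignment per new sequence: linear in len(new_seqs)."""
    updated = dict(msa)
    for name, seq in new_seqs.items():
        updated[name] = align_to_reference(ref, seq)
    return updated
```

Because every row has the length of the reference, rows produced at different times can be stacked into a single alignment without realigning earlier sequences. The quadratic dynamic programming used here is only practical for short toy sequences.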
We need your help to improve this web service. Please send your comments and suggestions to: frederic[dot]lemoine[at]pasteur[dot]fr and olivier[dot]gascuel[at]pasteur[dot]fr,
Evolutionary Bioinformatics unit, C3BI USR 3756, Institut Pasteur and CNRS, Paris, France
- To learn more about COVID-Align, please read our help page;
- To infer trees from aligned sequences, do not hesitate to use NGPhylogeny.fr;
- An example analysis is available here. The example comprises 7 aligned sequences plus the automatically added reference sequence (GISAID ID: EPI_ISL_402124). Five of the aligned sequences are samples from human hosts. Three of them belong to clade G, which is associated with the SNP 23403A>G: EPI_ISL_417851 from Iceland, EPI_ISL_421509 from France and EPI_ISL_427121 from Australia. The other two do not belong to clade G: EPI_ISL_418955 from the USA and EPI_ISL_413214 from Australia. These sequences were selected because they show variation in the gene region coding for the receptor-binding domain (RBD) of the Spike protein. The remaining two samples come from animal hosts: EPI_ISL_402131, the bat isolate RaTG13, and EPI_ISL_402131, a sample from pangolin.
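As a rough illustration of how the clade G marker can be read from an alignment, the sketch below looks up the nucleotide at reference position 23403 in an aligned row. It assumes that the alignment is in reference coordinates (column k corresponds to position k of the reference, counted from 1, with no extra insertion columns); this is an assumption of the sketch, not a documented property of the service output. The function names `base_at` and `has_23403G` are hypothetical.

```python
from typing import Optional

G_CLADE_POSITION = 23403  # 1-based reference position of the 23403A>G SNP


def base_at(aligned_row: str, position: int) -> str:
    """Return the upper-case character found at a 1-based reference position.
    Assumes one alignment column per reference position."""
    if not 1 <= position <= len(aligned_row):
        raise ValueError("position lies outside the aligned row")
    return aligned_row[position - 1].upper()


def has_23403G(aligned_row: str, position: int = G_CLADE_POSITION) -> Optional[bool]:
    """True if the row carries G at the given position, False if it carries
    another nucleotide, None if the base is missing (gap or N)."""
    base = base_at(aligned_row, position)
    if base in "-N":
        return None
    return base == "G"
```

A result of `None` keeps incomplete genomes apart from those that clearly lack the mutation, which matters because some submitted genomes contain gaps or ambiguous bases.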
If you use this web service, please cite:
Frédéric Lemoine, Luc Blassel, Jakub Voznica, Olivier Gascuel, COVID-Align: Accurate online alignment of hCoV-19 genomes using a profile HMM, Bioinformatics, btaa871, https://doi.org/10.1093/bioinformatics/btaa871