GISAID began in 2008, after researchers around the world expressed reluctance to place sequence data from their avian influenza surveillance in public-domain databases. Less well-funded scientists were unwilling to give up a new strain only to be scooped on the analysis by another researcher with a multibillion-dollar lab. And as GISAID got more and more data, the people running it had to find a way to identify each sequence and put them all in context with each other. It is now the main data repository for SARS-CoV-2 genomes.
But the world of Covid nomenclature has two other great and noble houses. Nextstrain, based at the Fred Hutchinson Cancer Research Center and the University of Basel, is one. Its organization revolves around clades, large branches on the phylogenetic tree of life. (Nextstrain started out doing the same job for the flu.) Its names have a cheat code: clades are organized by year of discovery and a letter of the alphabet, then by specific mutations of interest. The variant Oliveira’s team found had a bunch of mutations, but N501Y was the important one. (The mutation changes an asparagine, abbreviated as the letter N, to a tyrosine, abbreviated as a Y, at the 501st amino acid of the virus’s spike protein, in the RBD (that is, the receptor binding domain) that attaches to the human ACE2 receptor (that’s angiotensin-converting enzyme 2).)
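For readers who think in code, the mutation shorthand is regular enough to unpack mechanically: one letter for the original amino acid, a position, one letter for the replacement. What follows is a minimal Python sketch of that reading, not anything Nextstrain or GISAID ships. The function name `parse_substitution` and the lookup table `AMINO_ACIDS` are illustrative assumptions, and the sketch handles only simple one-for-one substitutions.

```python
import re

# Standard one-letter amino acid abbreviations.
AMINO_ACIDS = {
    "A": "alanine", "R": "arginine", "N": "asparagine", "D": "aspartic acid",
    "C": "cysteine", "Q": "glutamine", "E": "glutamic acid", "G": "glycine",
    "H": "histidine", "I": "isoleucine", "L": "leucine", "K": "lysine",
    "M": "methionine", "F": "phenylalanine", "P": "proline", "S": "serine",
    "T": "threonine", "W": "tryptophan", "Y": "tyrosine", "V": "valine",
}


def parse_substitution(label):
    """Split a label such as 'N501Y' into (original, position, replacement).

    Hypothetical helper for illustration only; deletions and insertions
    use other notations and are rejected here.
    """
    match = re.fullmatch(r"([A-Z])(\d+)([A-Z])", label)
    if match is None:
        raise ValueError(f"not a simple substitution label: {label!r}")
    old, position, new = match.groups()
    if old not in AMINO_ACIDS or new not in AMINO_ACIDS:
        raise ValueError(f"unknown amino acid code in {label!r}")
    return AMINO_ACIDS[old], int(position), AMINO_ACIDS[new]


original, position, replacement = parse_substitution("N501Y")
print(f"{original} -> {replacement} at position {position}")
```

Fed the label from the paragraph above, the helper returns asparagine, 501, and tyrosine, the same reading spelled out in the parenthetical.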
Easy, right? (Ahem.) But things got even more complicated. The variant the British researchers were seeing had the same mutation, among many others. To distinguish it from Oliveira’s, each was given a new designator: “V1” was tacked onto the UK one and “V2” onto the other. Another similar variant, traced to Manaus, in Brazil, became “V3.”
“We don’t try to name everything. In fact, we’re really explicitly trying not to have more than 10 more names per year, and we’re interested in picking out the things that matter most,” Hodcroft says. “It’s, like, big changes in the tree. When we see groups with different genetics and they spread, even if it takes time, in a region or in the world, we give them a Nextstrain clade.”
This is not what the other bigwig in the space is doing. That is a piece of analytical software called Pangolin, short for “Phylogenetic Assignment of Named Global Outbreak Lineages.” The so-called Pango lineages start with a letter, initially A or B, designating the two earliest divergent lineages of SARS-CoV-2 that emerged from China at the end of 2019 and the beginning of 2020. Each lineage is assigned a number, and its descendants receive an additional number, preceded by a period, but only for three generations. Four or more, and the whole lineage is assigned a new letter. Imagine an Obed-begat-Jesse-and-Jesse-begat-David vibe, but with diagrams and genomic recipes. “The lineages run on a different resolution. You can have very big ones and small ones, but the idea is to capture the emerging edge of the pandemic,” says Áine O’Toole, an evolutionary biologist at the University of Edinburgh who created Pangolin and who is now one of its main developers. “The idea is to have a group of sequences linked to some kind of epidemiological information.”
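The begats are easier to follow with a worked example. The Python sketch below encodes only the rule as described here: a letter, then at most three dotted numbers, with a fresh letter once a fourth would be needed. It is an illustration, not the Pango team’s code. The function names and the `new_prefix` parameter are assumptions; in practice people choose the new letter, not a formula.

```python
MAX_NUMERIC_LEVELS = 3  # per the description above: three dotted numbers at most


def numeric_depth(lineage):
    """Count the dotted numbers after the leading letters, e.g. 'B.1.1.7' -> 3."""
    prefix, *numbers = lineage.split(".")
    valid_prefix = prefix.isalpha() and prefix.isupper()
    if not valid_prefix or not all(n.isdigit() for n in numbers):
        raise ValueError(f"not a lineage-style name: {lineage!r}")
    return len(numbers)


def name_descendant(parent, index, new_prefix):
    """Name the index-th descendant of `parent`.

    `new_prefix` is a hypothetical parameter standing in for the letter
    that would be picked when the name cannot grow any further.
    """
    if numeric_depth(parent) < MAX_NUMERIC_LEVELS:
        return f"{parent}.{index}"      # room left: append another number
    return f"{new_prefix}.{index}"      # fourth level: start over with a new letter


print(name_descendant("B", 1, "C"))         # B.1
print(name_descendant("B.1.1", 7, "C"))     # B.1.1.7
print(name_descendant("B.1.1.28", 1, "P"))  # P.1
```

The last call mirrors a real case: the lineage known as P.1 is shorthand for what would otherwise be written B.1.1.28.1.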
(After publication, O’Toole emailed me to note that although she created the Pangolin software, she did not come up with the Pango nomenclature it uses; that was a larger team. It’s an important distinction that also proves my point about the difficulty of naming things, including the people who name things.)
Pangolin gets a little tricky. Anyone working on a viral genome can use the software to try to determine whether they have something new and where it might fit among all the known lineages (with data pulled from GISAID, just like Nextstrain). But making a final call on whether a strain is truly new and deserves a different place in the scheme, its own Pango lineage, depends on live humans on the team and on suggestions from scientists in the field. “I think maybe this is something we need to work harder on, to try to show that there is a difference between lineage designation and lineage assignment,” says O’Toole. “When we designate lineages, it’s just based on what we know. If you have a new lineage and we haven’t seen it, Pangolin won’t be able to assign it, as it can’t predict which lineages will occur in the future. So there is a lag.”