Abstracts

This page contains the abstracts contributed to ProgGen22. Talks are listed in program order. Posters are listed in order of the presenter's last name.

Plenaries

K01 The Darwin Tree of Life Project and what it will enable using Anopheles mosquitoes as an example

Mara Lawniczak
Wellcome Sanger Institute
The availability of a reference genome for a species forever changes the breadth and depth of work that can be carried out on that species. The Darwin Tree of Life project is aiming to "change biology forever" by sequencing the reference genomes for the 70,000 described eukaryotic species that can be found in Britain and Ireland. I will present an update on the DToL project, which is a collaboration between many scientists from many institutes. I will then use Anopheles mosquitoes to showcase several large-scale projects that have been possible thanks to the long-term availability of reference genomes.
K02 Modelling the evolution of SARS-CoV-2 with applications to public health policy

Erik Volz
Imperial College London
The COVID-19 epidemic in the United Kingdom was driven by a succession of SARS-CoV-2 variants with distinct properties. I will review the major SARS-CoV-2 variants that have shaped the UK epidemic and how each variant was initially detected and characterised. Current changes in testing and sequencing policy will have important ramifications for how quickly SARS-CoV-2 variants can be detected and characterised in the future.
K03 The temporal and genomic scale of selection during hybridization

Graham Coop
UC Davis
Hybrid populations frequently form when previously isolated populations spread back into secondary contact. The genomes of hybrid individuals are a mosaic of ancestry from the parental populations, which is broken into finer pieces over time by recombination. At the same time selection may act quickly to resolve hybrid ancestry and remove maladaptive hybrid combinations of alleles. In the talk we'll discuss how we can resolve the temporal and genomic spatial scale of selection, using theory, simulations, and analyses of empirical data.

Talks

T01 The lingering effects of Neanderthal introgression on human complex traits

April Wei [1,+], Christopher Robles [2,+], Ali Pazokitoroudi [3], Andrea Ganna [4,5,6], Alexander Gusev [7], Arun Durvasula [8,9], Steven Gazal [10], Po-Ru Loh [5,11], David Reich [8,9,5,12] and Sriram Sankararaman [*,2,3,13]
Affiliations: 1 Department of Computational Biology, Cornell University; 2 Department of Human Genetics, UCLA, Los Angeles, CA 90095; 3 Department of Computer Science, UCLA, Los Angeles, CA 90095; 4 Analytical and Translational Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA; 5 Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA; 6 Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA; 7 Dana-Farber Cancer Institute and Harvard Medical School, Boston, MA; 8 Department of Genetics, Harvard Medical School, Boston, Massachusetts 02115, USA; 9 Department of Human Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA; 10 Center for Genetic Epidemiology, Department of Public and Population Health Sciences, USC Keck School of Medicine, Los Angeles, CA 90033; 11 Division of Genetics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts 02115, USA; 12 Howard Hughes Medical Institute, Harvard Medical School, Boston, Massachusetts 02115, USA; 13 Department of Computational Medicine, UCLA, Los Angeles, CA 90095.
Topics: Population genetics, Natural selection, Quantitative genetics, Methods for GWAS, Causal inference in genetic studies
Keywords: Neanderthal Introgression, UK Biobank, Variance Component Analysis, Fine Mapping
The mutations introduced into the ancestors of modern humans from interbreeding with Neanderthals have been suggested to contribute to complex human traits to an unexpected extent. However, testing this hypothesis has been challenging due to the idiosyncratic population genetic properties of introgressed mutations. We developed rigorous methods to assess the contribution of introgressed Neanderthal mutations to heritable trait variation and applied these methods to 235,592 introgressed Neanderthal mutations and 96 distinct phenotypes measured in the UK Biobank. Introgressed mutations make a significant contribution to trait variation, but the contribution of these mutations tends to be significantly depleted relative to modern human mutations. In contrast to previous studies, we find no evidence for elevated heritability. Previous work has suggested that significant associations in introgressed variants are likely driven by linkage with nearby causal modern human variants. We therefore developed a customized statistical fine-mapping methodology for introgressed mutations that led us to identify 112 regions across 47 phenotypes where introgressed mutations are far more certain to have a phenotypic effect. Examination of the fine-mapped mutations reveals their substantial impact on genes that are important for the immune system, development, and metabolism. Our results provide the first rigorous basis for understanding how Neanderthal introgression modulates human complex traits.
T02 Polygenic scores enable discovery of widespread genetic interactions associated with quantitative traits in the UK Biobank

Lino A.F. Ferreira [1, 2], Sile Hu [3], Simon R. Myers [2, 1]
Affiliations: [1] Wellcome Centre for Human Genetics, University of Oxford, [2] Department of Statistics, University of Oxford, [3] Novo Nordisk Research Centre Oxford
Topics: Quantitative genetics, Methods for GWAS
Keywords: Genetic Interactions, Epistasis, Polygenic Scores
Although a huge number of human genetic associations have been discovered, genetic predictions overwhelmingly use additive polygenic scores (PGS), which ignore interactions between loci. Such interactions are expected based on evidence from model organisms but human examples are limited, with the vast number of potentially interacting SNPs hampering discovery power. We developed an approach to overcome this lack of power by testing for interactions between a SNP and groups of other variants, such as those in the PGS, and applied it to 97 quantitative traits in the UK Biobank. We are motivated by regulatory networks, where a SNP impacting expression of one gene can have downstream effects on many others. Our approach is robust to false positives due to nonlinear additive effects or locally clustered associations and is initialised by iteratively building a PGS accounting for all significant linear signals of association. We find widespread, independent interactions: 483 loci across the genome interacting with the PGS of 52 traits. For example, rs3131894 shows no direct association with waist circumference but changes the predictive power of the PGS for this trait; it is an eQTL for ZFP57, HLA-G and HLA-A and associates with coeliac disease risk. Future work will determine which biological components these loci modulate. These findings offer potential to improve PGS performance using non-linear terms and identify determinants of PGS performance variability among populations.
T03 Influences of rare copy number variation on human complex traits

Margaux L.A. Hujoel [1,2], Maxwell A. Sherman [1,2,3], Alison R. Barton [1,2,4], Ronen E. Mukamel [1,2], Vijay G. Sankaran [2,5], Po-Ru Loh [1,2]
Affiliations: [1] Division of Genetics, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, MA, USA, [2] Broad Institute of MIT and Harvard, Cambridge, MA, USA, [3] Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA, [4] Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, MA, USA, [5] Division of Hematology/Oncology, Boston Children's Hospital and Department of Pediatric Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
Topics: Methods for GWAS, Assembly and variant identification
The human genome contains hundreds of thousands of regions exhibiting copy number variation (CNV). However, the phenotypic effects of most such polymorphisms are unknown because only larger CNVs have been ascertainable from SNP-array data generated by large biobanks. We developed a computational approach leveraging haplotype-sharing in biobank cohorts to more sensitively detect CNVs. Applied to UK Biobank, this approach accounted for approximately half of all rare gene inactivation events produced by genomic structural variation. This CNV call set enabled the most comprehensive analysis to date of associations between CNVs and 56 quantitative traits, identifying 269 independent associations (P < 5 × 10^-8) likely to be causally driven by CNVs. Putative target genes were identifiable for nearly half of the loci, enabling new insights into dosage-sensitivity of these genes and implicating several novel gene-trait relationships. These results demonstrate the ability of haplotype-informed analysis to provide insights into the genetic basis of human complex traits.
T04 Evaluation of methods for estimating coalescence times using ancestral recombination graphs

Débora Y. C. Brandt [1], Xinzhu Wei [1], Yun Deng [2], Andrew H. Vaughn [2], Rasmus Nielsen [2, 3, 4, 5]
Affiliations: [1] Department of Computer Science, University of California Los Angeles, Los Angeles, 90095, CA, USA, [2] Center for Computational Biology, University of California, Berkeley, CA 94720, USA, [3] Department of Statistics, University of California Berkeley, Berkeley, CA 94720, USA, [4] GLOBE Institute, University of Copenhagen, Oester Voldgade 5-7 1350 Copenhagen K, Denmark, [5] Department of Integrative Biology, University of California Berkeley, Berkeley, CA 94720, USA
Topics: Population genetics
Keywords: Ancestral Recombination Graph, ARGweaver, Relate, Tsinfer, Tsdate, Simulation, Calibration
The ancestral recombination graph (ARG) is a structure that describes the joint genealogies of sampled DNA sequences along the genome. Recent computational methods have made impressive progress towards scalably estimating whole-genome genealogies. In addition to inferring the ARG, some of these methods can also provide ARGs sampled from a defined posterior distribution. Obtaining good samples of ARGs is crucial for quantifying statistical uncertainty and for estimating population genetic parameters such as effective population size, mutation rate, and allele age. Here, we use simulations to benchmark the estimates of pairwise coalescence times from three popular ARG inference programs: ARGweaver, Relate, and tsinfer+tsdate. We use neutral coalescent simulations to 1) compare the true coalescence times to the inferred times at each locus; 2) compare the distribution of coalescence times across all loci to the expected exponential distribution; 3) evaluate whether the sampled coalescence times have the properties expected of a valid posterior distribution. We find that inferred coalescence times at each locus are more accurate in ARGweaver and Relate than in tsinfer+tsdate. However, all three methods tend to overestimate small coalescence times and underestimate large ones. Lastly, the posterior distribution of ARGweaver is closer to the expected posterior distribution than Relate's, but this higher accuracy comes at a substantial trade-off in scalability.
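
As a rough illustration of comparison (2) in the abstract above: under the standard neutral coalescent, the coalescence time of two lineages, measured in coalescent units, is exponentially distributed with mean 1. The sketch below draws such times and measures their distance to that expectation with a Kolmogorov-Smirnov statistic. The function names, the sample size, the choice of the KS distance and the toy "biased estimator" are all assumptions made for illustration; none of them is taken from the benchmark described in the abstract.

```python
import math
import random


def simulate_pairwise_times(n_loci, seed=0):
    """Draw pairwise coalescence times in coalescent units (mean 1)."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0) for _ in range(n_loci)]


def ks_distance_to_exponential(times, mean=1.0):
    """Largest gap between the empirical CDF and the Exp(mean) CDF."""
    xs = sorted(times)
    n = len(xs)
    worst = 0.0
    for i, x in enumerate(xs):
        cdf = 1.0 - math.exp(-x / mean)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        worst = max(worst, abs(cdf - i / n), abs((i + 1) / n - cdf))
    return worst


true_times = simulate_pairwise_times(2000)

# Hypothetical estimator that inflates small times and shrinks large ones,
# loosely mimicking the bias pattern reported in the abstract.
biased_times = [0.5 + 0.5 * t for t in true_times]

d_true = ks_distance_to_exponential(true_times)
d_biased = ks_distance_to_exponential(biased_times)
print(f"KS distance, true times:   {d_true:.3f}")
print(f"KS distance, biased times: {d_biased:.3f}")
```

The toy estimator keeps the mean at 1, so a comparison of means alone would not detect the distortion; the distance between distributions does.
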
T05 Untangling the ARG

Yan Wong [1], Anastasia Ignatieva [1], Anthony Wilder Wohns [2], Jere Koskela [3], Jerome Kelleher [1]
Affiliations: [1] University of Oxford, [2] Broad Institute, [3] University of Warwick
Topics: Population genetics, Phylogenetics
Keywords: Ancestral Recombination Graph, Trees, Tree Sequence
Ancestral Recombination Graphs, or ARGs, have been of deep theoretical interest for several decades and are now becoming possible to infer at scale. However, terminological confusions abound: the word "ARG" is understood to mean different things by different people. For example, sometimes it is defined as a stochastic process of inheritance (akin to the "coalescent with recombination"); alternatively it is sometimes applied to the graph structure that results from the inheritance process. We aim to clear up this confusion, and following the latter definition, point out a distinction between a recombination graph where the nodes are *events* of different types, and a more flexible definition where the nodes instead represent *genomes* and where the genetic information transmitted through the graph is tracked using annotations on edges rather than annotations on nodes. We show that this second definition allows rapid access to the local trees at each point in the genome, and can be used to represent a much wider range of models of inheritance, as well as dealing with imperfect knowledge of recombination events. We therefore argue that this latter representation – which we call a genome-ARG or gARG, and which can be stored in the well-characterised "tree sequence" data structure – is the best suited for practical analysis and evolutionary inference.
T06 Simple selection inference from pre-estimated genealogies using a likelihood approach

Armin Scheben [1], Adam Siepel [1]
Affiliations: [1] Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
Topics: Population genetics, Natural selection
Keywords: Selection, Ancestral Recombination Graph
Inference of selection from genome sequencing data is a key challenge in population genetics. As genomic data sets grow and inference of ancestral recombination graphs steadily improves, it is increasingly feasible to test selection from pre-estimated genealogies. This potentially enables a considerable advance over classical selection inference methods relying on patterns of allele frequencies or linkage disequilibrium. Methods have begun to emerge for selection inference based on genealogies but they are generally complex and computationally intensive. Here we examine a series of simpler tests based on a likelihood approach that can be implemented in a few lines of code. The tests can be applied separately to individual loci, and applied to subtrees as well as full trees. Using known and inferred genealogies from simulated data, we find that they perform well in comparison to much more computationally intensive methods. We further show that our tests using pre-estimated genealogies are a useful tool to investigate hitchhiking deleterious alleles that arose during recent evolution.
T07 Addressing local ancestry and time since admixture with genealogies and ancient human DNA

Alice Pearson[1], GeoGenetics Meso-Neo Consortium[2], Eske Willerslev[2,3], Richard Durbin[1]
Affiliations: [1] Department of Genetics, University of Cambridge, [2] Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, [3] Department of Zoology, University of Cambridge
Topics: Population genetics
Keywords: Ancestral Recombination Graph, Machine Learning, Admixture
Recently, two new approaches have transformed our understanding of human population history. Firstly, sequencing of ancient genomes gives us a snapshot of past genetic variation. We can therefore make inferences from observed genetic signatures present before historical events such as population bottlenecks and natural selection have obscured them from the modern gene pool. Ancient DNA has thus revealed what cannot be determined from modern genomes alone. Secondly, methods have been developed that aim to reconstruct population genealogies from genetic variation data. Together with an understanding of how evolutionary processes alter genealogies, this has allowed inference of historical and ongoing processes in real-world populations. The latest updates in these approaches now allow us to combine the two to infer genealogies involving both present-day and ancient individuals. I will discuss a new method that uses machine learning and tree sequences built from ancient and present-day genomes from Europeans and West Asians. The method allows us to infer local ancestry along each sample chromosome and subsequently estimate the time since admixture for individuals belonging to ancient European groups. We show that our inferred admixture ages are a better metric than sample ages alone for understanding movements of people across Europe in the past.
T08 Attention-based neural networks for population genetics

Théophile Sanchez, Pierre Jobic, Guillaume Charpiat, Flora Jay
Affiliations: Université Paris-Saclay, Centre national de la recherche scientifique, Institut national de recherche en sciences et technologies du numérique, Laboratoire interdisciplinaire des sciences du numérique
Topics: Population genetics, Deep learning in genomics
Keywords: Population Size Inference, Attention Mechanism
Artificial neural networks (ANNs) have recently offered new perspectives to solve inference problems from high dimensional data in numerous scientific fields, but it is not yet clear which architectures are the most suited to genomic data. Here, we present a new ANN architecture integrating attention mechanisms to infer effective population size history from genomic data. Built upon our previous exchangeable architecture SPIDNA, MixAttSPIDNA adds attention layers that allow computing more expressive and complex features from combinations of haplotypes. The contribution of each haplotype to the features is learned automatically and depends on its content and affinity with the other haplotypes. Likewise, we use this mechanism to automatically perform a voting scheme that aggregates predictions from different genomic regions. This new architecture outperforms approximate Bayesian computation and SPIDNA on simulations while relying directly on raw genetic data and being invariant to haplotype permutation in the input. As a proof-of-concept, we use this architecture to infer the effective population size history of 54 populations from the HGDP dataset (Bergström et al., 2020), and we compare our results to smc++ (Terhorst et al., 2017). This application highlights the ability of the network to handle data with a varying number of haplotypes and to quickly perform predictions for datasets including numerous populations.
T09 Modelling the influence of kinship systems on human genetic diversity

Léa Guyon, Jérémy Guez, Evelyne Heyer, Raphaëlle Chaix
Affiliations: Eco-anthropologie, MNHN
Topics: Population genetics
Keywords: Kinship Systems, Ancient DNA, Modern DNA, Modelling, SLiM
Kinship rules such as descent rules – indicating the group (lineage, clan) to which an individual is affiliated – and post-marital residence rules – determining the place of living of a couple after marriage – vary widely between human populations. In western societies, individuals are generally affiliated to the groups of both their parents (bilateral descent) and choose where to settle after their marriage (neolocality). However, the majority of populations display unilineal descent rules, that is, individuals are affiliated to the group of their father (patrilineal) or to the group of their mother (matrilineal), and are either patrilocal – the wife migrates to her husband’s village – or matrilocal – the husband settles in his wife’s village. Interestingly, human populations are currently mostly patrilineal (~ 40 %) and patrilocal (~ 60 %), but little is known about the evolution of these kinship rules in human history. Hence, we wonder if this overrepresentation of patrilineality and patrilocality always existed and, if not, when they became dominant in human populations. By modelling populations displaying different kinship rules, we evaluate the influence of these cultural practices on genetic diversity and identify relevant diversity estimators that could be applied to ancient DNA, in order to trace back the history of human social organizations in space and time.
T10 Polymorphism-aware estimation of species trees meets RevBayes

Carolin Kosiol [1,6], Rui Borges [1], Bastien Boussau [2], Sebastian Hoehna [3,4], Ricardo J. Pereira [5]
Affiliations: [1] Institut fuer Populationsgenetik, Vetmeduni Vienna; [2] Universite de Lyon, Universite Claude Bernard Lyon 1, CNRS UMR 5558, LBBE; [3] GeoBio-Center, Ludwig-Maximilians-Universitaet Muenchen; [4] Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig-Maximilians Universitaet Muenchen; [5] Division of Evolutionary Biology, Department of Biology II, Ludwig-Maximilians-Universitaet Muenchen; [6] Centre for Biological Diversity, University of St Andrews
Topics: Population genetics, Natural selection, Phylogenetics
Keywords: Species Tree, Bayesian Inference, RevBayes, Polymorphism-Aware Phylogenetic Models, Mutation Bias, Selection, Grasshoppers
The availability of population genomic data through new sequencing technologies gives unprecedented opportunities for estimating important evolutionary forces such as genetic drift, selection, and mutation biases across organisms. Yet, analytical methods that can handle polymorphisms jointly with sequence divergence across species are rare and not easily accessible to empiricists. We implemented polymorphism-aware phylogenetic models (PoMos), an alternative approach for species tree estimation, in the Bayesian phylogenetic software RevBayes. PoMos naturally account for incomplete lineage sorting, which is known to cause difficulties for phylogenetic inference in species radiations, and scale well with genome-wide data. Simultaneously, PoMos can estimate mutation and selection biases. We have applied our methods to resolve the complex phylogenetic relationships of a young radiation of Chorthippus grasshoppers, based on coding sequences. In addition to establishing a well-supported species tree, we found a mutation bias favoring AT alleles and selection bias promoting the fixation of GC alleles, the latter consistent with GC-biased gene conversion. PoMos offer a wide range of models to reconstruct phylogenies and can be easily combined with existing models in RevBayes - e.g., relaxed clock and divergence time estimation - offering new insights into the evolutionary processes underlying molecular evolution and, ultimately, species diversification.
T11 A phylogenetically aware SMC to infer the evolutionary history of species

Iker Rivas-González, Asger Hobolth, Mikkel H. Schierup
Affiliations: Aarhus University
Topics: Population genetics, Phylogenetics
Keywords: Hidden Markov Model, SMC, Incomplete Lineage Sorting
Hidden Markov models are useful to reconstruct the coalescent process and to infer quantities in population genomics. A well-known family of such models are sequentially Markovian coalescents (SMCs), which allow for the estimation of population parameters through time using discretized coalescent times as hidden states. However, SMCs only model a single coalescent event, so they cannot be used for investigating speciation events involving more than two species. Here, we present a new method, ILSMC, which makes use of incomplete lineage sorting (ILS) to extract information about the evolutionary history of three species. The hidden states of ILSMC are four-leaved phylogenetic trees with discretized coalescent times, and the observed states are sites along a four-way genome alignment. Due to the parameterization of the model, ILSMC is able to obtain unbiased estimators of ancestral effective population sizes, speciation times, recombination rates and mutation rates. ILSMC is fully flexible since each discretized time interval can be modeled with different parameter values, and it can thus infer changes in parameters through time. Moreover, ILSMC is extensible, and future additions could involve, for example, modeling introgression in the context of ILS. Here, I will present the framework for inferring the transition and emission probabilities of ILSMC, and apply it to simulated and real data from a four-species alignment.
T12 Nonreversible MCMC for phylogenetics

Jere Koskela
Affiliations: University of Warwick
Topics: Population genetics, Phylogenetics
Keywords: Bayesian Inference, Coalescent, MCMC
Posterior distributions arising out of coalescent-based models of genetic diversity are nearly always intractable. MCMC methods are the gold-standard tool for sampling from them. However, their computational cost scales notoriously badly with problem complexity, so that their practical use is restricted to data sets which are small by modern standards. Over the past 5 years, several classes of nonreversible MCMC methods which use the gradient of the posterior density to guide chains during runs have become increasingly prominent. These methods have better theoretical scaling properties than more classical Metropolis-Hastings algorithms, but cannot be readily implemented for coalescent models because a posterior defined on discrete tree topologies does not have a natural notion of a gradient. I will demonstrate how embedding spaces of coalescent trees into a continuous space facilitates the use of gradient information in algorithm specification, and also provides a simple framework for designing adaptive MCMC algorithms. Comparisons with Metropolis-Hastings on simple coalescent examples show that these methods speed up mixing over the posterior, sometimes dramatically.
T13 Too many signatures? Perhaps you are underestimating the effects of overdispersion.

Ragnhild Laursen, Marta Pelizzola, Asger Hobolth
Affiliations: Aarhus University
Topics: Applications to cancer and other diseases, Deep learning in genomics
Keywords: Cancer Genomics, Cross-Validation, Model Checking, Model Selection, Mutational Signatures, Negative Binomial, Non-Negative Matrix Factorization, Poisson
Mutational signatures are derived from a collection of somatic mutations from cancer genomes using non-negative matrix factorization (NMF). To extract the mutational signatures with NMF we have to determine an error model and a number of mutational signatures for the observed mutational counts. The error model determines the underlying distributional assumption of the data. In most applications, the mutational counts are assumed to be Poisson distributed, but often overdispersion is present. In this talk, I will discuss how this leads to an overestimation of the number of signatures and show why the negative binomial distribution is more appropriate for mutational counts. Furthermore, I will introduce a novel cross-validation procedure to identify the number of signatures, which is less influenced by overdispersion. The discussion is supported by a simulation study, where different amounts of noise are added to the data and the true number of signatures is known. This shows how badly the number of signatures is overestimated when the wrong error model is chosen. In the presence of overdispersion we also show that our cross-validation procedure is more robust at determining the correct number of signatures than state-of-the-art methods, which overestimate the number of signatures.
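
The overdispersion mentioned in the abstract above can be illustrated with a small simulation: Poisson counts have a variance equal to their mean, whereas negative binomial counts (generated here as a gamma-Poisson mixture) have variance mean + mean^2 / size. This is only a sketch of the distributional point; it does not implement NMF or the cross-validation procedure from the talk, and the mean, the size parameter and all function names are assumptions chosen for illustration.

```python
import math
import random


def poisson_draw(rng, lam):
    """Knuth's multiplication method; adequate for moderate lam."""
    threshold = math.exp(-lam)
    k, p = 0, rng.random()
    while p > threshold:
        k += 1
        p *= rng.random()
    return k


def negative_binomial_draw(rng, mean, size):
    """Gamma-Poisson mixture with variance mean + mean**2 / size."""
    lam = rng.gammavariate(size, mean / size)  # shape=size, scale=mean/size
    return poisson_draw(rng, lam)


def dispersion_index(counts):
    """Sample variance divided by sample mean (1 for Poisson data)."""
    m = sum(counts) / len(counts)
    var = sum((c - m) ** 2 for c in counts) / (len(counts) - 1)
    return var / m


rng = random.Random(1)
poisson_counts = [poisson_draw(rng, 20.0) for _ in range(5000)]
nb_counts = [negative_binomial_draw(rng, 20.0, 5.0) for _ in range(5000)]

d_poisson = dispersion_index(poisson_counts)
d_nb = dispersion_index(nb_counts)
print(f"Poisson dispersion index:           {d_poisson:.2f}")
print(f"Negative binomial dispersion index: {d_nb:.2f}")
```

With these illustrative settings the negative binomial counts have a theoretical variance of 20 + 400/5 = 100, five times their mean. A Poisson error model has no parameter to absorb that excess variance.
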
T14 TAFI (Tumor Allele Frequency Interpreter): a new deep learning tool to reveal the evolutionary history of tumors

Verónica Miró Pina, David Castellano, Donate Weghorn
Affiliations: [1,3] Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain [2] University of Arizona [3] Universitat Pompeu Fabra (UPF), Barcelona, Spain
Topics: Population genetics, Applications to cancer and other diseases, Deep learning in genomics
Keywords: Site Frequency Spectrum, Coalescent, Demography, Convolutional Neural Network, Next Generation Sequencing
Tumor progression is a somatic evolutionary process in which population expansion is driven by the accumulation of mutations that promote cell proliferation (cancer drivers). Other mutations, called passengers, with an indifferent fitness effect, subsequently accumulate. The distribution of both driver and passenger variant allele frequencies (VAF) can be used to infer relevant biological parameters, such as mutation rates, or to distinguish between demographic models. However, some technical issues arise when trying to compute these estimators from available data. In large-scale sequencing projects (e.g., PCAWG), the technique used is bulk sequencing, with low read depth. This curtails our ability to call low-frequency variants and yields a general underestimation of the amount of genetic diversity in the tumor. Our goal is to develop an algorithm that can estimate mutation rate and demographic history while accounting for the systematic errors and biases generated by sequencing techniques, mutation calling tools and subsequent filters that are applied to generate current datasets. Our method combines coalescent simulations (using msprime) and deep learning to estimate the true amount of genetic diversity and provide reliable estimates of the tumor’s mutation rate, demography and age. This algorithm could be applied to create a comprehensive study of how different evolutionary parameters vary across tumor types and across individual tumors.
T15 A paternal bias in germline mutation is widespread across amniotes and can arise independently of cell divisions

Marc de Manuel [1,*], Felix Wu [1,2,*], Molly Przeworski [1,2]
Affiliations: [1] Department of Biological Sciences, Columbia University, New York, New York, USA. [2] Department of Systems Biology, Columbia University, New York, New York, USA.
Topics: Population genetics
Keywords: Sex Bias In Germline Mutation
In humans and other mammals, germline mutations are more likely to arise in fathers than in mothers. Although this sex bias has long been attributed to DNA replication errors in spermatogenesis, recent evidence from humans points to the importance of mutagenic processes that do not depend on cell division, calling into question our understanding of this basic phenomenon. Here, we infer the ratio of paternal-to-maternal mutations, alpha, in 42 species of amniotes, from putatively neutral substitution rates of sex chromosomes and autosomes. Despite marked differences in gametogenesis, physiologies and environments across species, fathers consistently contribute more mutations than mothers in all the species examined, including mammals, birds and reptiles. In mammals, alpha is as high as 4 and correlates with generation times; in birds and snakes, alpha appears more stable around 2. These observations can be explained by a simple model, in which mutations accrue at equal rates in both sexes during early development and at a higher rate in the male germline after sexual differentiation, with a conserved paternal-to-maternal ratio across species. Thus, alpha may reflect the relative contributions of two or more developmental phases to total germline mutations, and is expected to depend on generation time even if mutations do not track cell divisions.
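
The two-phase idea in the abstract above can be written as a short calculation. Suppose both sexes accumulate the same number of mutations in an early phase, and afterwards the paternal germline mutates at a fixed multiple of the maternal rate. Then alpha rises from 1 toward that multiple as generation time grows, without any reference to cell divisions. The sketch below is a minimal version of this reasoning. Measuring the late phase in years, the onset age and every parameter value are hypothetical choices for illustration, not estimates from the study.

```python
def alpha(generation_time, early=5.0, maternal_rate=0.5, ratio=4.0, onset=13.0):
    """Paternal-to-maternal mutation ratio under a two-phase sketch.

    early         -- mutations accrued by either sex before `onset` (hypothetical)
    maternal_rate -- maternal mutations per year after `onset` (hypothetical)
    ratio         -- paternal/maternal rate ratio in the late phase (hypothetical)
    onset         -- age in years at which the late phase starts (hypothetical)
    """
    late_years = max(generation_time - onset, 0.0)
    maternal = early + maternal_rate * late_years
    paternal = early + ratio * maternal_rate * late_years
    return paternal / maternal


generation_times = [13.0, 15.0, 20.0, 30.0, 40.0]
alphas = [alpha(g) for g in generation_times]
for g, a in zip(generation_times, alphas):
    print(f"generation time {g:>4.0f} y -> alpha = {a:.2f}")
```

In this sketch alpha equals 1 when reproduction happens at the onset age and approaches, but never reaches, the late-phase ratio for long generation times.
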
T16 Detecting substantial polygenicity of weak selection coefficients among complex traits from UK Biobank and across different human populations by inferring the genome-wide distribution of fitness effects from GWAS summary statistics

Alexander Xue, Yi-Fei Huang, Adam Siepel
Affiliations: Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, One Bungtown Road, Cold Spring Harbor, NY 11724
Topics: Population genetics, Natural selection, Quantitative genetics, Methods for GWAS
Keywords: Stabilizing Selection Of Complex Traits, GWAS Summary Statistics, Likelihood Model, PRF
Current approaches for detecting selection from GWAS data are unable to directly estimate the distribution of fitness effects (DFE). To this end, we introduce ASSESS, an inferential method that exploits the Poisson Random Field (PRF) to model selection coefficients from genome-wide allele count data, while jointly conditioning GWAS summary statistics on a latent distribution of phenotypic effect sizes from genotypes. The likelihood function, which is unified under the assumption of an explicit relationship between fitness and trait effect, is optimized using an EM algorithm to yield a trait’s DFE. To validate the performance of ASSESS, we conducted several simulation experiments under various data configurations, demographic scenarios, and genomic architectures. We find consistent behavior in accurately recovering the underlying selection history, as well as a high degree of robustness to a range of assumption violations of our conceptual framework. Additionally, we applied ASSESS to publicly available data for an array of human traits in both European and non-European populations. We discover a pattern of polygenicity estimates much higher than in previous investigations, which we attribute to the PRF’s sensitivity to weaker selection coefficients. Our in silico demonstration as well as the empirical insight gained here illustrate the potential of ASSESS to satisfy an increasing need for powerful yet convenient population genomic inference from GWAS summary statistics.
T17 Guaranteeing unbiasedness in selection tests based on polygenic scores.

Discussion
Jennifer Blanc, Jeremy J. Berg
Affiliations: Department of Human Genetics, The University of Chicago
Topics: Population genetics, Natural selection, Quantitative genetics, Methods for GWAS
Population stratification is a well-studied problem in genome-wide association studies, leading to biases in the estimated strength of phenotypic association for individual genetic variants. While current methods to correct for stratification are generally effective in reducing significant false positive associations, even subtle biases in effect size estimates can accumulate across loci, leading to systematic biases in polygenic scores. In turn, these biases in the distribution of polygenic scores can lead to false positives in downstream analyses, such as tests for polygenic selection. Using theory from population and statistical genetics, together with simulations, we show that it is possible to guarantee the unbiasedness of polygenic selection tests without needing to achieve the much more difficult task of guaranteeing that the effect sizes are completely unbiased. We show that by analyzing GWAS and test panels jointly in a unified framework, we can leverage the observed overlap in population structure between the two samples so as to protect the GWAS from stratification biases along the relevant axis of shared structure. More generally, our results have implications beyond tests for selection, as any analysis that attempts to quantify the covariance between polygenic scores and demographic or environmental variables is subject to the same type of stratification biases, and can therefore benefit from our framework.
T18 The relationship between the distribution of fitness effects and the distribution of mutation rates

Discussion
David Castellano, Ryan Gutenkunst
Affiliations: University of Arizona
Topics: Population genetics, Natural selection
Keywords: Distribution Of Fitness Effects
The distribution of fitness effects (DFE) of new mutations is a key parameter in molecular evolution. The DFE has been explored across different species and across different genes within a species. Recent work has compared the DFE for sites with different levels of conservation across a multispecies alignment, finding that the more conserved a site, the more likely mutations occurring at this site are deleterious. In this work, we compare the DFE of sites with different mutation rates in humans based on their 3mer sequence context. For each mutation type, we compute its mutation rate using de novo mutations from trio data and its DFE using the distribution of allele frequencies of synonymous and non-synonymous mutations occurring at the same sequence context. We find that 92 mutation types (out of 96 possible mutation types) can be found both as synonymous and non-synonymous. We correlate the mutation rate and the mean selective effect of those 92 mutation types, then we compute the unweighted DFE (that is, an artificial DFE in which the mutation spectrum is flat) and compare it with the observed DFE in the Yoruba population and the expected DFE under the current germline mutation spectrum. Our work highlights the importance of the distribution of mutation rates across sites to explain the observed distribution of fitness effects in humans.
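The count of 96 mutation types follows from collapsing complementary strands: 2 pyrimidine reference bases, 3 alternative bases each, and 16 flanking-base combinations. A minimal sketch of that enumeration is given below; the `A[C>T]G` label format is a common convention and is assumed here, not taken from the abstract.

```python
from itertools import product

BASES = "ACGT"

def mutation_types():
    """Enumerate strand-collapsed single-base substitutions in 3mer context."""
    types = []
    for ref in "CT":  # pyrimidine reference, so each substitution is counted once
        for alt in BASES:
            if alt == ref:
                continue
            for five_prime, three_prime in product(BASES, repeat=2):
                types.append(f"{five_prime}[{ref}>{alt}]{three_prime}")
    return types
```

Calling `mutation_types()` returns the 96 labels, which can then be used as keys for per-type mutation rates and allele frequency spectra.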
T19 TL-GWAS removes unnecessary assumptions and model dependencies from population genetics analysis

Discussion
Ava Khamseh, Olivier Labayle Pabet [1,4], Kelsey J Tetley-Campbell [1,4], Mark J van der Laan [3], Chris P Ponting [1], Sjoerd V Beentjes [1,2], Ava Khamseh [1,4]
Affiliations: [1] University of Edinburgh, MRC Human Genetics Unit, Edinburgh, United Kingdom, [2] University of Edinburgh, School of Mathematics, Edinburgh, United Kingdom, [3] University of California Berkeley, Public Health, Berkeley, CA, [4] University of Edinburgh, School of Informatics, Edinburgh, United Kingdom
Topics: Population genetics, Quantitative genetics, Methods for GWAS, Deep learning in genomics
Keywords: Targeted Learning, GWAS, Mathematical Guarantees
We present a unified statistical workflow, Targeted Learning-GWAS (TL-GWAS), for estimating effect sizes and epistatic interactions in genome-wide association studies of polygenic traits. Our approach is based on Targeted Learning, a framework for estimation which integrates mathematical statistics, machine learning and causal inference to provide mathematical guarantees and realistic p-values. TL-GWAS defines effect sizes and interactions of genomic variants on traits in a model-independent manner, thus avoiding all-too-common model-misspecification whilst taking advantage of a library of parametric and state-of-the-art non-parametric algorithms. TL-GWAS data-adaptively incorporates confounders and sources of population stratification, accounts for population dependence and controls for multiple hypothesis testing by bounding any desired type I error rate. We validate the effectiveness of our method by reproducing experimentally verified effect sizes on UK Biobank data, whilst allowing for the discovery of non-linear effect sizes of additional allelic copies on trait or disease. TL-GWAS thus broadens the reach of current GWAS studies by allowing for the classification of the types of SNPs and phenotypes for which non-linearities occur. Our method provides a platform for comparative analyses across biobanks, or integration of multiple biobanks and heterogeneous populations to increase power, whilst controlling for population stratification and multiple hypothesis testing.
T20 Biobank-scale inference of ancestral recombination graphs enables genealogy-based mixed model association of complex traits

Discussion
Pier Palamara, Brian Zhang [1], Arjun Biddanda [1], Pier Francesco Palamara [1, 2]
Affiliations: [1] Department of Statistics, University of Oxford, Oxford, UK, [2] Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
Topics: Population genetics, Methods for GWAS
Keywords: Ancestral Recombination Graph
Accurate inference of gene genealogies from genetic data has the potential to facilitate a wide range of analyses, but is computationally challenging. We introduce a scalable method, called ARG-Needle, that uses genotype hashing and a coalescent hidden Markov model to infer genome-wide genealogies from sequencing or genotyping array data in modern biobanks. We developed strategies to utilize the inferred genome-wide genealogies within linear mixed models to perform association and other complex trait analyses. We validated the accuracy and scalability of ARG-Needle using extensive coalescent simulations and used it to build genome-wide genealogies using genotype data for 337,464 UK Biobank individuals. We performed genealogy-based association analysis in 7 complex traits, detecting more rare and ultra-rare signals (N=133, frequency range 0.0004%-0.1%) than genotype imputation from ~65k sequenced haplotypes (N=65). We validated these signals using 138k exome sequencing samples. ARG-Needle associations strongly tagged (average r=0.72) underlying sequencing variants that are enriched for missense (2.3x) and loss-of-function (4.5x) variation. Compared to imputation, inferred genealogies also capture additional signals for higher frequency variants. These results demonstrate that biobank-scale inference of gene genealogies may be leveraged in the analysis of complex traits, complementing approaches that require the availability of large, population-specific sequencing panels.
T21 Tracking the history of genetic variants through decomposition of ancestry in population-scale sequencing data

Discussion
Patrick K. Albers [1,2], Richard Durbin [2], Nicole Soranzo [1]
Affiliations: [1] Wellcome Sanger Institute, [2] Department of Genetics, University of Cambridge
Topics: Population genetics, Natural selection, Quantitative genetics
Keywords: TMRCA, PSMC, Ancestry Inference, Whole-Genome Sequencing
The genetic variation we share with others derives from mutations in the genomes of our ancestors, in whom the presence or absence of particular variant alleles may have contributed to their ability to pass on genetic material to the next generation. While previous work has focused on using genetic variation to, for example, describe patterns of population structure or infer past demographic events, little is known about the history of variant alleles themselves. We developed a novel approach, Ancestry Decomposition (AD), for ‘simultaneous’ estimation of ancestral relationships among all individuals in a larger population sample. This is based on an improvement to the Pairwise Sequentially Markovian Coalescent (PSMC) methodology for inference of the time to the most recent common ancestor (TMRCA) between individual haplotypes, which by itself we show to maintain high accuracy while being fast and scalable to accommodate larger sequencing datasets. Notably, AD is a “tree-less” method that, at a given time in the past, identifies shared lineages of ancestry in a probabilistic manner, and in which we can track the history of focal alleles within their genealogical context without the need to reconstruct the underlying tree. We use AD to characterise patterns of selection on variant alleles over time, by contrasting gradients of coalescence between ancestral lineages carrying or not carrying a focal allele, which we evaluate through simulation.
T22 Simulating realistic pedigrees

Discussion
Luke Anderson-Trocmé [1], Shadi Zabad [3], Wilder Wohns [4], Jerome Kelleher [5], Simon Gravel [1, 2]
Affiliations: [1] Department of Human Genetics, McGill University, Montreal, QC, Canada, [2] McGill University and Genome Quebec Innovation Centre, Montreal, QC, Canada, [3] Department of Computer Science, McGill University, Montreal, QC, Canada, [4] Broad Institute of MIT and Harvard, [5] Big Data Institute, University of Oxford, Oxford, UK
Topics: Population genetics, Natural selection, Phylogenetics
Keywords: Population Scale Pedigree, Msprime Coalescent Simulation, Assortative Mating
Population scale pedigrees encode the process by which individuals in a population move through space and time. Well-studied models such as Wright-Fisher’s create pedigrees that capture the mean and variance in offspring number but do not capture the idiosyncratic dynamics of real populations. To move beyond this, we took advantage of the French-Canadian pedigree: with over 5M digitized records, it is one of the largest and most complete spatially linked population scale pedigrees available for research. We developed a pedigree-aware extension to the genome simulation software msprime, and show that pedigree-aware simulations in millions of individuals are possible and exquisitely capture real population structure. To pick apart the relative impact of demographic factors influencing population structure, we have developed a forwards-in-time pedigree simulation framework that leverages user-specified migration rates, fertility rates and assortative mating preferences based on relatedness and age disparity. Moreover, the availability of empirical and simulated pedigrees provides us with an unprecedented ability to assess the accuracy of standard population genetic assumptions in large cohorts. This information, together with the ability to generalize pedigree and coalescent simulations, will enable more detailed study of the impact of complex demographic forces on fine-scale population structure.
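For contrast with an empirical pedigree, the baseline Wright-Fisher pedigree mentioned above can be sketched in a few lines. This is a simplified illustration under stated assumptions (constant size, non-overlapping generations, no sexes, no spatial structure, two distinct parents drawn uniformly at random); the function name and record layout are hypothetical and unrelated to the msprime extension described in the abstract.

```python
import random

def wright_fisher_pedigree(pop_size, generations, seed=0):
    """Return records (generation, individual, parent_a, parent_b).

    Parent indices refer to individuals of the previous generation.
    Requires pop_size >= 2 because the two parents are drawn without replacement.
    """
    rng = random.Random(seed)
    pedigree = []
    for generation in range(1, generations + 1):
        for individual in range(pop_size):
            # Uniform random mating: every individual is equally likely to be a parent
            parent_a, parent_b = rng.sample(range(pop_size), 2)
            pedigree.append((generation, individual, parent_a, parent_b))
    return pedigree
```

Because every parent is equally likely, offspring numbers are approximately Poisson distributed, which is the kind of idealization that an empirical pedigree allows one to test.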
T23 Ancestral population structure from admixed individuals

Discussion
Kristian Hanghoej [1], Jonas Meisner [1,2], Anders Albrechtsen [1]
Affiliations: [1] Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, 2200 Copenhagen N, Denmark [2] Copenhagen Research Center for Mental Health, Gentofte Hospital, Gentofte Hospitalsvej 15, 2900 Hellerup, Denmark
Topics: Population genetics, Deep learning in genomics
Keywords: Local Ancestry, Admixed Populations
Large genetic studies often contain populations where almost all individuals are admixed, such as the Native American populations in the 1000 Genomes Project (1KG). Population genetic analyses of such scenarios are difficult, but one approach is to mask part of the genome so only a single ancestry remains. However, there are several challenges with this approach. Current methods for inference of local ancestry need predefined reference panels and do not scale well for large sample sizes. Subsequent analyses, like PCA, have issues dealing with missing data when there are no overlapping sites with information, a consequence of splitting an admixed individual into masked versions of its ancestries. Here, we present a framework for analyzing such data. We developed HaploNet to model the latent haplotype clusters in an unsupervised manner using a variational autoencoder. From the haplotype clusters, our software FATASH infers the local ancestry using an HMM without relying on a reference panel. Lastly, we perform a probabilistic PCA where each individual is represented with each of its ancestries, allowing the individual to be in the PCA several times. We apply this framework to simulations to show its accuracy and to the 1KG data, where we explore the ancestral population structure. In a single PCA, we show that the Native American individuals in the 1KG have both ancestry that clusters with one of several Native American populations and ancestry that clusters with southern European ancestry.
T24 Sex-specific population structure in baboons revealed by hundreds of complete X-chromosomes

Discussion
Erik Fogh Sørensen [1], Jeffrey Rogers [2], Christian Roos [3], Dietmar Zinner [3], Sascha Knauf [3], Julia Fischer [3], Clifford J Jolly [4], Jane Philips-Conroy [5], Idrissa Chuma [6], Julius Keyyu [7], Primate Sequencing and Conservation Initiative [2,3], Kasper Munch [1]
Affiliations: [1] Aarhus University, [2] Baylor College of Medicine, [3] German Primate Center, [4] New York University, [5] Washington University, [6] Tanzania National Parks, [7] Tanzania Wildlife Research Institute.
Topics: Population genetics, Natural selection, Phylogenetics
Keywords: Chromopainter, Ancestry, Admixture
Baboons (Papio) contain multiple hybridizing species. Their divergence is similar in timeframe to the divergences among Homo species. Baboons thus present an opportunity to study species dynamics similar to those in humans. Baboons have complex migration patterns, as well as highly varied social structures. Male-biased admixture leads to characteristic signals on the X chromosome, as it spends less time in males than autosomes. We sequenced 225 wild baboons at 30X depth, sampled from 19 locations. Shapeit4 was used to phase the dataset. Haplotype-based analysis was performed with Chromopainter, fineSTRUCTURE and Globetrotter to infer population structure and recent admixture events. Chromopainter's paintings were further analyzed to find distinct patterns. The findings are that admixture is present in most populations. The X chromosome contains a higher proportion of admixed ancestry. The X chromosome also harbours longer stretches without admixture than the autosomes, suggesting a role in hybrid incompatibility and positive selection in hybrid zones.
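The reasoning behind the X-chromosome signal can be made concrete with a standard single-pulse expectation: two thirds of X chromosomes are carried by females and one third by males, whereas autosomes are shared equally. The sketch below assumes one admixture pulse, equal sex ratio, and no drift or selection; the function name and arguments are illustrative and do not come from the abstract.

```python
def expected_admixed_ancestry(female_fraction, male_fraction):
    """Expected source-population ancestry after a single admixture pulse.

    female_fraction: share of breeding females that came from the source population
    male_fraction:   share of breeding males that came from the source population
    Returns (autosomal_ancestry, x_ancestry).
    """
    autosomal = (female_fraction + male_fraction) / 2.0
    # Females carry two X chromosomes and males one, so females weigh twice as much
    x_linked = (2.0 * female_fraction + male_fraction) / 3.0
    return autosomal, x_linked
```

For example, with 10% of females and 30% of males from the source population, the expected autosomal ancestry is 0.2 while the expected X ancestry is about 0.167, so male-biased admixture alone lowers X ancestry relative to the autosomes.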
T25 Quantifying allo-coprophagy from gut microbiome shotgun sequencing data

Discussion
Hélène Tonnelé [1], Felipe Morillo [1], NIDA Center for GWAS in Outbred Rats [2], Center for Microbiome Innovation [2], Amelie Baud [1]
Affiliations: [1] Center for Genomic Regulation, Barcelona, Spain. [2] University of California San Diego, CA, USA
Topics: Quantitative genetics, Causal inference in genetic studies, Pathogen genomics
Keywords: Gut Microbiome, Illumina Sequencing, Variance Decomposition
Individuals living together (e.g. human couples, laboratory and farm animals sharing an enclosure) tend to have similar gut microbiomes. Whether this results from similar exposures to environmental factors or from the transfer of bacteria through physical contact or shared surfaces, however, remains an open question - an important one if probiotics are to become widely used in medicine. In laboratory rodents, allo-coprophagy (animal eating the feces of a cage mate) gives ample opportunity for bacterial transfer. In order to investigate the impact of allo-coprophagy, we relied on detecting DNA from cage mates in a focal rat’s gut. We studied 1,041 outbred rats that were genotyped and whose cecal (gut) content was characterised by shallow shotgun sequencing. Using the reads mapping to bacterial genomes first, we quantified all bacterial taxa and used mixed models to estimate shared environmental (cage) effects. Those were significant and large, as expected. Using the reads mapping to the rat reference genome next, we estimated allo-coprophagy as the proportion of reads showing discrepancies with the rat’s own genotypes but consistent with the genotypes of its cage mates. We found very low levels of such mixtures (0.5% on average), and those did not differ from mixtures between non-cage mates, which reflect technical contamination (e.g. robot spill-up). Thus, we found no evidence of allo-coprophagy, suggesting it is unlikely to explain the large cage effects that were detected.
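The read-based estimate described above can be sketched as follows. This is a simplified illustration, not the authors' pipeline: the data layout (reads as `(site, allele)` pairs, genotypes as sets of alleles per site), the choice of denominator (all reads at sites genotyped in the focal animal), and the function name are assumptions.

```python
def cage_mate_read_fraction(reads, own_genotypes, mate_genotypes):
    """Fraction of reads inconsistent with the focal animal but consistent with a cage mate.

    reads:          iterable of (site, allele) observed in the focal animal's sample
    own_genotypes:  dict mapping site -> set of alleles carried by the focal animal
    mate_genotypes: list of such dicts, one per cage mate
    """
    informative = 0
    foreign = 0
    for site, allele in reads:
        if site not in own_genotypes:
            continue  # site not genotyped in the focal animal, so not informative
        informative += 1
        if allele in own_genotypes[site]:
            continue  # read agrees with the focal animal's own genotype
        if any(allele in mate.get(site, ()) for mate in mate_genotypes):
            foreign += 1
    return foreign / informative if informative else 0.0
```

Running the same function with genotypes of non-cage mates in place of `mate_genotypes` would give the technical contamination baseline that the abstract compares against.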
T26 Social Evolution in the Spider Genus Stegodyphus

Discussion
Jilong Ma, Mikkel Heide Schierup, Palle Villesen, Jesper Bechsgaard, Trine Bilde
Affiliations: [1] Aarhus University
Topics: Population genetics, Assembly and variant identification
Keywords: Social Evolution, De Novo Assembly, Stegodyphus
The evolution of sociality has occurred independently in many organisms. The evolutionary triggers of the transition and its consequences for genome evolution are not well known, but are likely to dynamically change and reinforce the syndrome. Social transition is often associated with a biased sex ratio, reproductive skewness and inbreeding, resulting in reduced effective population size and thus reduced purifying selection for maintaining genome integrity. The genome changes at the early stage of social transition can better reveal the triggers of social evolution and potential risk factors. The spider genus Stegodyphus is well suited for investigating the evolution of sociality at an early stage. Phylogenetic analysis has demonstrated three independent transitions from subsociality to social lifestyle during the past 0.5-3 million years. We have performed de novo genome assembly of three pairs of subsocial and social species together with a subsocial outgroup. We used 30X PacBio HiFi and Hi-C followed by Hifiasm and 3D-DNA analyses to assemble the genomes into accurate complete chromosome level assemblies for the genomes ranging from 2.5 to 3.2 Gb in size. We present initial results of the comparative genome evolution based on the seven-species genome alignment, and demonstrate relaxed purifying selection on each of the branches leading to social species. We relate our results to the question: Are there common influences on the genome from the social transition?

Posters

P01 Evolutionary fitness effects of loss-of-function mutations in humans and their pathogenic effects at present

Visit poster
Ipsita Agarwal [1,2], Zach Fuller [3], Molly Przeworski [2]
Affiliations: [1] University of Oxford, [2] Columbia University, [3] 23andMe
Topics: Population genetics, Natural selection
Given the huge sample sizes now available for human exome sequences, we can estimate the fitness effects of loss-of-function (LOF) mutations, reflective of their average effect over many different backgrounds over evolutionary time, and begin to relate those to pathogenic effects of those mutations in individuals at present. Here, we estimate posterior distributions for the strength of selection on genic loss-of-function using observed frequencies of LOF mutations in 56,855 individuals, given an estimated mutation rate and a realistic demographic model. We then consider the distribution of fitness effects (DFE) for de novo LOF mutations identified in individuals with severe diseases that impose a heavy realized fitness burden, including autism and developmental disorders. As expected, these cohorts show an enrichment of deleterious mutations, with the majority of evolutionary fitness effects estimated to be in excess of 10%. Interestingly, however, the study design has a big influence on just how deleterious the detected mutations are; for instance, mutations identified in probands in simplex family studies of autism are more deleterious on average than those identified in multiplex family studies, and mutations identified in female probands are more deleterious on average than those identified in males. These results suggest that different ascertainment schemes provide a lens through which to view different parts of the DFE of causal mutations.
P02 Simultaneous inference of parental admixture proportions and admixture times from unphased local ancestry calls

Visit poster
Siddharth Avadhanam, Amy Williams
Affiliations: Cornell University
Topics: Population genetics
Keywords: Local Ancestry Inference, Hidden Markov Model
Population genetic analyses of local ancestry tracts routinely assume that the ancestral admixture process is identical for both parents of an individual, an assumption that may be invalid when considering recent admixture. Here we present Parental Admixture Proportion Inference (PAPI), a Bayesian tool for inferring the admixture proportions and admixture times for each parent of a single admixed individual. PAPI analyzes unphased local ancestry tracts and has two components: a binomial model that exploits homozygous ancestry regions to infer parental admixture proportions, and a hidden Markov model (HMM) that infers admixture times from tract lengths. The HMM accounts for unobserved within-ancestry recombination by approximating pedigree crossover dynamics, thus enabling inference of parental admixture times. In simulations, PAPI's admixture proportion estimates deviate from the truth by 0.047 on average, outperforming ANCESTOR by 46%. Moreover, PAPI's admixture time estimates were strongly correlated with the truth (R=0.76). We ran PAPI on African American genotypes from the PAGE study (N=5,786) and found strong evidence of assortative mating by ancestry proportion: couples' ancestry proportions are closer to each other than expected by chance (P<10^-6), and are highly correlated (R=0.87). We anticipate that PAPI will be useful in studying the population dynamics of admixture and will also be of interest to individuals seeking to learn about their personal genealogies.
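To see why homozygous ancestry regions carry information about the two parents separately, consider a simplified moment-based sketch rather than PAPI's actual binomial model. It assumes two ancestries, A and B, and that each parent independently transmits ancestry A at a locus with probability equal to that parent's admixture proportion; the function name and inputs are hypothetical.

```python
import math

def parental_proportions(frac_hom_a, frac_hom_b):
    """Solve for the two parental ancestry-A proportions (p1 <= p2).

    Under the independence assumption:
      fraction homozygous A = p1 * p2
      fraction homozygous B = (1 - p1) * (1 - p2)
    so p1 + p2 = 1 + hom_a - hom_b, and p1, p2 are roots of a quadratic.
    """
    total = 1.0 + frac_hom_a - frac_hom_b   # p1 + p2
    product = frac_hom_a                    # p1 * p2
    # Clamp at zero so sampling noise cannot produce a negative discriminant
    discriminant = max(total * total - 4.0 * product, 0.0)
    root = math.sqrt(discriminant)
    return (total - root) / 2.0, (total + root) / 2.0
```

The sketch cannot tell which proportion belongs to which parent, and it ignores linkage between neighbouring loci, which a tract-based model can exploit.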
P03 Stochastic networks models for unsupervised phenotyping in high-dimensional genomic data

Visit poster
Tom Bartlett[1], Swati Chandna[2], Sandipan Roy[3]
Affiliations: [1] Department of Statistical Science, University College London [2] Department of Economics, Mathematics and Statistics, Birkbeck University of London [3] Department of Mathematical Sciences, University of Bath
Topics: Applications to cancer and other diseases, Single cell -omics
Similar levels of sparsity can be observed in adjacency matrices that represent observations of a network, as can be observed in count matrices that represent observations of genomic sequencing count data (where sparsity refers to density of non-zero entries). Hence, modelling techniques developed for stochastic networks, that can be extended to bipartite networks with multiedges, are well specified (if exchangeability is assumed) to model asymmetric matrices of count data. Based on this observation, we have made use of the latest theory of stochastic networks [1], to model the observed distributions of genomic sequencing data in the eigenspace of the graph Laplacian [2]. Thus, we have used this theory to develop novel techniques for modelling genomic data, understanding that clusters of cells (like nodes) will follow multivariate-Gaussian distributions in this space. Hence, we propose a new method for unsupervised classification of novel cell-types [2], and show it is able to identify more subtle and precise distinctions (as confirmed by domain specialists) between cellular phenotypes than are made by leading software available in computational biology [3]. [1] Rubin-Delanchy P, et al. arXiv preprint arXiv:1709.05506. 2017 Sep 16. [2] Bartlett TE, Jia P, Chandna S, Roy S. Nature Scientific reports. 2021 Dec 8;11(1):1-0. [3] Satija R, et al. Nature biotechnology. 2015 May;33(5):495-502.
P04 Analyzing gene co-occurrences accounting for linked descent in bacterial populations

Visit poster
Franz Baumdicker, Christian Resl
Affiliations: University of Tuebingen
Topics: Population genetics, Natural selection, Methods for GWAS, Phylogenetics
Keywords: Pangenomes, Bacterial Evolution
The genome of many prokaryotes is highly flexible. The total set of genes in a bacterial population, the pan-genome, is therefore often significantly larger than any individual genome from the population. While some essential genes occur in all individuals, other genes occur only in a subset of the population. A prominent example of such dispensable genes are genes that confer resistance to antibiotic treatments. The evolutionary forces determining the presence and absence of genes are thus of great interest. The composition of individual genomes is by no means a combination of independently evolving genes. Prokaryotes do not recombine their genomes as frequently as diploids. Consequently, their clonal relationship, i.e. the phylogenetic tree of the population, strongly affects the likelihood of observed gene combinations. On the other hand, the selective benefit or cost of genes within individual genomes depends on the presence of other genes. For a metabolic pathway, a complete set of genes is often necessary. Conversely, some genes might exclude each other such that only one of the two genes can be present in any individual. We present the software Goldfinder, a probabilistic approach that accounts for the dependencies created by the clonal species tree of a microbial population. Goldfinder can be used to find gene pairs that co-occur significantly more or less often than expected, while accounting for the probability of co-occurrence due to linked descent.
P05 Evaluation of Bayesian genealogy-based coalescent models for demographic inference

Visit poster
Ronja Jessica Billenstein [1, 2], Sebastian Höhna [1, 2]
Affiliations: [1] GeoBio-Center LMU, Ludwig-Maximilians-Universität München, Richard-Wagner-Str. 10, 80333 Munich, Germany, [2] Department of Earth and Environmental Sciences, Paleontology & Geobiology, Ludwig-Maximilians-Universität München, Richard-Wagner-Str. 10, 80333 Munich, Germany.
Topics: Population genetics
Keywords: Demographic Inference, Coalescent
Coalescent processes are widely applied for inferring demographic histories, especially population sizes through time. From maximum likelihood estimation of a simple, constant process, more complex models were developed. These models include for example varying population size functions, the division into several intervals with their own function each, and allowing for different sampling points in time. We implemented various genealogy-based coalescent processes in the Bayesian software RevBayes. These cover a wide range from simple basic demographic models such as a constant, linear or exponential population size trajectory to user-defined trajectories with an individual number of intervals that can each have one of the basic models attached. Additionally, using skyline models, the user can choose between piecewise constant or piecewise linear models with a variety of possibilities on how to set the prior for the population sizes or on how the interval change points should be placed. Our implementation in RevBayes allows for extreme flexibility and includes the largest number of demographic models for inference within the same software. We applied several of the models to infer the demographic history of horses. Interestingly, the method of interval change point positioning and whether heterochronous or isochronous data were used had the highest impact on the resulting population size trajectories, yielding qualitatively different conclusions.
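The simplest case mentioned above, maximum likelihood estimation under a constant population size, has a closed form. The sketch below assumes an isochronous sample, a known genealogy, and time scaled so that each pair of lineages coalesces at rate 1/N; the function names and the `(lineages, duration)` interval encoding are assumptions for this example and are not RevBayes syntax.

```python
import math

def coalescent_log_likelihood(intervals, pop_size):
    """Log-likelihood of coalescent waiting times under a constant population size.

    intervals: list of (k, t), where k lineages persisted for time t
               before the next coalescent event.
    """
    log_likelihood = 0.0
    for k, t in intervals:
        rate = k * (k - 1) / 2.0 / pop_size  # total coalescence rate with k lineages
        log_likelihood += math.log(rate) - rate * t
    return log_likelihood

def mle_population_size(intervals):
    """Closed-form maximizer: sum of (pairs * duration) divided by the number of events."""
    return sum(k * (k - 1) / 2.0 * t for k, t in intervals) / len(intervals)
```

A piecewise-constant skyline model applies the same likelihood within each interval group, with its own population size per group.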
P06 High resolution species assignment of Anopheles mosquitoes using a genus-wide targeted sequencing approach

Visit poster
Marilou Boddé, Alex Makunin [2], Lemonde Bouafou [3,4], Nil Rahola [3,4], Marc F Ngangue [4], Boris Makanga [5], Diego Ayala [3,4], Richard Durbin [1], Mara Lawniczak [2]
Affiliations: 1. Dept. of Genetics, University of Cambridge, Cambridge, UK 2. Wellcome Sanger Institute, Hinxton, UK 3. MIVEGEC, Univ. Montpellier, CNRS, IRD, Montpellier, France 4. CIRMF, Franceville, Gabon 5. IRET, Libreville, Gabon
Topics: Deep learning in genomics
Keywords: Anopheles Mosquitoes, VAE, Species Assignment
Our lab previously developed a genus-wide targeted amplicon sequencing panel, called ANOSPP, to facilitate large-scale monitoring of Anopheles populations. Combining information from these amplicons allows for a more nuanced species assignment than single gene (e.g. COX1) barcoding, which is desirable in the light of Anopheles' permeable species boundaries. We present a hierarchical species assignment method called NNoVAE, for Nearest Neighbours (NN) and Variational Autoencoders (VAE), working on these amplicon sequences. The NN step assigns a sample to a species group by comparing the amplicon sequences to a reference database. The VAE step is required to distinguish between closely related species and has sufficient resolution to reveal population structure within species. In tests on independent samples with over 80% amplicon coverage, NNoVAE correctly classifies to species level 98% of samples within the An. gambiae complex and 89% of samples outside the complex. We will discuss NNoVAE's assignments on a labelled reference dataset and a large dataset of more than 3000 wild-caught mosquitoes from Gabon. ANOSPP and NNoVAE will be used to survey Anopheles species diversity and Plasmodium transmission patterns through space and time on a large scale, with plans to analyse half a million mosquitoes in the next five years.
P07 Disentangling the effects of natural selection, mutation bias, demography, and GC-biased gene conversion on codon usage

Visit poster
Rui Borges, Ioanna Kotari [1, 2], Carolin Kosiol [3], Rui Borges [1]
Affiliations: [1] Institut für Populationsgenetik, Vetmeduni Vienna, Austria, [2] Vienna Graduate School of Population Genetics, Vetmeduni Vienna, Austria, [3] Centre for Biological Diversity, University of St Andrews, United Kingdom
Topics: Population genetics, Natural selection, Phylogenetics
Keywords: Codon Models, Codon Usage Bias, Natural Selection, Mutation Bias, Demography, GC-biased Gene Conversion
Codon models are one of the main tools used to infer selection on protein-coding genes. They have been popularized in comparative genomic studies by their extensive use in genome-wide selection scans. However, such models have significant limitations that are increasingly being recognized. The main one is that current codon models are still very simplistic, in the sense that they usually ignore the population-level processes (e.g., demography and nucleotide usage biases) by which coding sequences evolve. Here, we present a codon model that integrates forces such as reversible and biased mutations, genetic drift, GC-bias, and directional selection operating on the 64 codons. By properly modeling the population genetics of codon evolution, we can determine the extent to which codon usage is shaped by natural selection. In particular, we find the stationary distribution of our codon model and use it to test the effects of demography, mutation biases, and GC-biased gene conversion on the frequency of the different codons. We discuss the implications of our theoretical results more broadly, namely, in light of some widely-used heuristics of codon usage.
P08 R-Orthogonalization: a joint testing procedure for sparse regression problems and tree association

Ryan Christ [1,2], Xinxin Wang [2], Ira Hall [1] and David Steinsaltz [3]
Affiliations: [1] Yale University, [2] Washington University in St. Louis, [3] Oxford University
Topics: Methods for GWAS
Keywords: Higher Criticism, Sparse Regression, Tree Testing, Li And Stephens
Many applications require testing whether a large set of predictors jointly affect an outcome of interest. Joint likelihood ratio and chi-square tests struggle to maintain power in the sparse signal setting where very few predictors have an effect on the outcome. Sparse signal problems are common in genetics; for example, testing whether a set of variants, or the clades of some tree structure, jointly predict a phenotype. Tests that target sparse alternatives, such as Tukey's Higher Criticism, require that the predictors follow an orthogonal design, and attempts to generalise these approaches to non-orthogonal designs have yet to yield powerful procedures. We propose R-Orthogonalization (RO), a testing procedure for non-orthogonal designs that maintains power under sparse alternatives, and admits background covariates. After jointly testing a set of predictors, RO also returns a minimally adjusted outcome vector that is independent of the p-value of the first test. This allows RO to be applied iteratively, or in conjunction with other testing procedures, with the resulting independent p-values being easily combined. We demonstrate the power of this approach in genetics, including iterating RO up a tree structure over diploid samples to capture and combine sparse effects on its edges. This approach can be incorporated into an ancestry-testing framework that naturally accommodates polyploidy and various ancestry-inference engines, including tsinfer and RELATE.
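For readers unfamiliar with Higher Criticism, the classical statistic for independent p-values (the orthogonal-design case mentioned above) can be sketched as follows. This is the textbook form, not the RO procedure; the function name higher_criticism and the default alpha0 are assumptions made for illustration.

```python
import math


def higher_criticism(p_values, alpha0=0.5):
    """Classical Higher Criticism statistic for independent p-values.

    Uses the standard form max_i sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i)))
    over the smallest alpha0 fraction of the sorted p-values.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")
    sorted_p = sorted(p_values)
    limit = max(1, int(alpha0 * n))
    best = -math.inf
    for i, p in enumerate(sorted_p[:limit], start=1):
        # Skip degenerate p-values, where the variance term vanishes.
        if p <= 0.0 or p >= 1.0:
            continue
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best


# Evenly spread p-values stand in for the null; three tiny p-values mimic a sparse signal.
null_p = [(i - 0.5) / 100 for i in range(1, 101)]
sparse_p = [1e-6, 1e-6, 1e-6] + null_p[3:]
print(higher_criticism(null_p), higher_criticism(sparse_p))
```

The statistic reacts strongly to a handful of very small p-values even when the bulk looks uniform, which is why it suits sparse alternatives.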
P09 SigNet: three ANN-based tools for extracting and analyzing mutational processes in tumor genomes

Claudia Serrano Colome [1,2], Oleguer Canal Anton [3], Donate Weghorn [1,2]
Affiliations: [1] Centre for Genomic Regulation (CRG), [2] Universitat Pompeu Fabra (UPF), [3] KTH Royal Institute of Technology in Stockholm.
Topics: Applications to cancer and other diseases, Deep learning in genomics
Keywords: Signature Decomposition, COSMIC Signatures, Artificial Neural Networks
Mutations that occur during cancer development can be classified into different mutational processes. This is done using the mutations’ sequence contexts and statistical tools that decompose the observed mutation spectra into statistically independent mutational signatures. Many so-called refitting algorithms have been developed to find linear combinations of these signatures in a given tumor. However, these algorithms were developed years ago and they have not been tested for the newest catalogs of signatures. Here I present SigNet: a collection of data-based algorithms useful in the study of mutational processes. The first module is called SigNet Refitter, a method that uses Deep Neural Networks to do signature refitting. This algorithm outperforms the existing methods, is easily adaptable to any new catalog of mutational signatures and provides very accurate prediction intervals for the contribution of each signature in a single tumor. Moreover, we designed a method called SigNet Generator, to synthetically create realistic-looking data based on the known correlations between signatures, which allows us to create a labeled dataset. Finally, since these known correlations are susceptible to change and more signatures are left to be identified, we also implemented a method capable of identifying samples that do not preserve such correlations. We refer to it as SigNet Detector. SigNet, with its three modules, provides useful tools for the field of signature decomposition.
P10 Inference of population size changes, split-rejoin times, and admixture proportions from a single diploid genome sequence

Trevor Cousins, Aylwyn Scally, Richard Durbin
Affiliations: University of Cambridge
Topics: Population genetics
Keywords: Coalescent, Structure, Population Size Changes, PSMC
Many existing demographic inference methods that attempt to discover the changes in a population’s effective size over time, such as the pairwise sequentially Markovian coalescent (PSMC), assume that a population evolved under panmixia. If a population experienced structure at some period in the past, these methods thus exhibit bias in their estimation of the population’s effective size. We investigate the underlying parameters of PSMC’s hidden Markov model and show that the transition matrix contains information that can reveal the presence of ancestral population structure. Specifically, we demonstrate that even if a panmictic and a structured population have an identical coalescent rate, the transition matrix has enough information in some cases to distinguish the two evolutionary histories. Using just a single diploid sequence and leveraging the information provided in the transition matrix, we develop a new method that seeks to infer not just changes in effective population size over time, but also split-rejoin times and admixture proportions. Applying our method to numerous human individuals, we explore to what extent the observed increase in inverse coalescence rate 100-300k years ago is more consistent with population structure.
P12 A new framework for efficiently sampling ancestral recombination graphs

Yun Deng, Rasmus Nielsen
Affiliations: Center for Computational Biology, UC Berkeley
Topics: Population genetics, Quantitative genetics
Keywords: Ancestral Recombination Graph, Sequentially Markovian Coalescent, Decoupling
The ancestral recombination graph (ARG) is a structure that describes the joint genealogies of sampled DNA sequences along the genome. Accurate inference and sampling of the ARG, especially for large datasets, has the potential to substantially improve analyses in population genetics. Several previous methods have been developed for this purpose, but they are either not computationally scalable or do not allow accurate sampling of topologies according to a well-defined posterior. Here we provide a new framework for ARG inference based on a principled way of approximating ARGs under the Sequentially Markovian Coalescent (SMC) model. This new framework provides a balance between inference accuracy and computational speed. It can also facilitate the incorporation of various demographic models and allow sampling from a variety of priors for topologies and coalescence times.
P13 Detecting selection in admixed populations using hapFLK

María Inés Fariello, Gastón Rijo [1,2], Bertrand Servin [3], Maria Ines Fariello [4,5]
Affiliations: [1] CICADA, Universidad de la República, [2] Human Evolutionary Genetics, Institut Pasteur, [3] GenPhySE, INRAe, [4] Facultad de Ingeniería, Universidad de la República, [5] UBi, Institut Pasteur Montevideo
Topics: Population genetics, Natural selection
Keywords: Admixture, Selection Footprints
hapFLK is a test originally designed to detect selection in population trees based on haplotypic information from multiple populations. As it is based on the joint distribution of the allele frequencies of the tested populations, estimating their variance-covariance matrix is central to estimating the distribution under neutrality. In the original test, the variance-covariance matrix was obtained from the branch lengths of the Neighbor-Joining population tree. To extend this test to admixed populations, we proposed three different strategies: using the test without taking the admixture events into account, estimating the variance-covariance matrix using TreeMix, and estimating the variance-covariance matrix empirically. For the empirical estimation of the variance-covariance matrix, we first estimate the frequency of the ancestral alleles based on the Neighbor-Joining tree and then estimate the covariances of each pair of populations. Through simulations we show that the best strategy for detecting selection is to use the empirical variance-covariance matrix. We could not reach the same power as when using the theoretical variance-covariance matrix, showing that the main problem lies in the estimation of this matrix. As shown previously, it was easier to detect hard sweeps than soft sweeps.
P14 Some asymptotic results for the Kingman coalescent

Martina Favero, Henrik Hult
Affiliations: University of Warwick
Topics: Population genetics
Keywords: Coalescent, Parent Dependent Mutations, Weak Convergence, Large Sample Size, Importance Sampling
As the sample size grows to infinity, we study the asymptotic behaviour of some sequences related to the Kingman coalescent, under a general finite-alleles, parent-dependent mutation mechanism. We start by showing that the sampling probabilities under the coalescent decay polynomially in the sample size. The degree of the polynomial depends on the number of types in the model, and its coefficient on the stationary density of the dual Wright-Fisher diffusion. Then we present a weak convergence result for a sequence of Markov chains that are composed of block counting jump chains, counting-mutations components and cost components. Finally, we illustrate how these results may be linked to the analysis of asymptotic properties of backward sampling algorithms, in particular the asymptotic behaviour of importance sampling weights, since these are closely related to the cost chains.
P15 Haplotype-based reconstruction of recent effective population size in modern and ancient DNA samples reveals fine-scale demographic history

Romain Fournier [1], David Reich [2, 3, 4, 5], Pier Palamara [1,6]
Affiliations: [1] Department of Statistics, University of Oxford, Oxford, United Kingdom, [2] Department of Genetics, Harvard Medical School, Harvard, Boston, USA, [3] Broad Institute of Harvard and MIT, Cambridge, USA, [4] Department of Human Evolutionary Biology, Harvard University, Cambridge, USA, [5] Howard Hughes Medical Institute, Harvard Medical School, Boston, USA, [6] Wellcome Centre for Human Genetics, University of Oxford, Oxford, United Kingdom
Topics: Population genetics
Keywords: Effective Population Size, Identity By Descent, Linkage Disequilibrium, Ancient DNA
Inference of demographic events of the recent past provides new insights into human evolutionary history. The sharing of Identical-By-Descent (IBD) haplotypes may be used to infer recent effective population size fluctuations, but accurate IBD detection is difficult in ancient DNA (aDNA) data and in underrepresented populations with limited reference data. We developed an efficient non-parametric method for inferring recent effective population sizes, called HapNe. HapNe computes an approximate maximum-a-posteriori estimator based on either IBD sharing (HapNe-IBD), or on linkage disequilibrium (HapNe-LD), which does not require phasing and can be computed in low-coverage aDNA data. We simulated a range of demographic scenarios and compared HapNe with IBDNe, an IBD-based method, and GONE, a recent LD-based method. Despite not relying on phasing information, LD-based methods were almost as accurate as IBD-based methods applied to ground truth IBD data, and outperformed IBD-based methods when IBD sharing was inferred. HapNe-LD tended to produce more accurate results than other approaches in the past 50 generations, while requiring fewer computational resources. HapNe may be applied to individuals with heterogeneous sampling time and is robust to missingness and reasonable levels of admixture. We applied HapNe to several populations from both modern (1000 Genomes) and aDNA data sets, detecting multiple instances of recent effective population size variation across these groups.
P16 Haplotype based testing for a better understanding of the selective architecture

Andreas Futschik [1], Howard Chen [2], Marta Pelizzola [3]
Affiliations: [1] Johannes Kepler University Linz, [2] University of Veterinary Medicine Vienna, [3] Aarhus University, Denmark
Topics: Population genetics, Natural selection
Keywords: Experimental Evolution, Multiple Hypothesis Testing
The identification of genomic regions affected by selection is an important goal in population genetics. If temporal data are available, allele frequency changes at SNP positions are often used. When a large number of SNP positions is tested, a multiple testing correction is needed to avoid false positives due to genetic drift. Here we provide a new testing approach that uses haplotype frequencies instead of allele frequencies. With this approach less multiple testing correction is needed, which leads to tests with higher power, especially when the number of candidate haplotypes is small or moderate. For a larger number of haplotypes, we propose haplotype block based methods. The use of haplotypes also permits a better understanding of selective signatures. For this purpose we propose post-hoc tests for the number and difference in strength of the selected haplotypes.
P17 Inference of recent admixture history and parental admixture proportions from genotype and low depth sequencing data

Genís Garcia-Erill, Kristian Hanghøj, Rasmus Heller, Anders Albrechtsen
Affiliations: Department of Biology, University of Copenhagen, Ole Maaløes Vej 5, 2200 Copenhagen N, Denmark
Topics: Population genetics
Interbreeding between previously isolated populations or species is a common process when populations come into secondary contact. For admixture that took place within the last few generations, the ancestry of each allele in a diploid locus is contingent on the admixture pedigree. Hence, this paired allelic ancestry information can be used to infer the admixture proportions of the parents and to identify the recent admixture pedigree. Here we introduce a maximum likelihood method with two complementary models that infer an individual’s paired ancestry proportions and the admixture proportions of the individual’s parents, respectively, without access to genetic data from the parents. The method works with genotypes or genotype likelihood data and does not rely on information about the chromosomal position. This makes it appropriate for RAD-seq and low depth sequencing data and means that it does not require a chromosome-level genome assembly. We show how we can infer the admixture pedigrees most likely to produce the inferred paired ancestries, and how this can help time recent admixture events and distinguish them from more ancient ones. We validate the performance of the method using admixed family trios from the 1000 Genomes Project. In addition, we show its applicability to identifying recent hybrids from RAD-seq data of Grant’s gazelle (Nanger granti and N. petersii) and whole genome low depth data of waterbuck (Kobus ellipsiprymnus), which are a mixture of up to 4 populations.
P18 Inferring reproduction number from genomic and epidemiological data using MCMC methods

Alicia Gill
Affiliations: Department of Statistics, University of Warwick
Topics: Population genetics, Phylogenetics, Pathogen genomics
Genomic data is increasingly being used to understand infectious disease epidemiology. Unfortunately, phylogenetic trees used to represent the genomic variation are not easy to relate to epidemiological processes such as the reproduction number R(t). Supposing that the epidemic can be modelled as a birth-death process, we use Markov chain Monte Carlo (MCMC) methods to infer its parameters using a dated phylogeny and various qualities of prevalence data. When we have accurate data on prevalence over time, we show that the Metropolis-Hastings algorithm performs well in inferring the parameters. When there is no prevalence data, we have implemented a pseudo-marginal MCMC to infer the parameters using only the phylogeny. The pseudo-marginal method is also able to incorporate partial or noisy prevalence data, which results in improved inference when compared to only observing a dated phylogeny.
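As a generic illustration of the Metropolis-Hastings step, the sketch below runs a random-walk sampler on a deliberately simple stand-in problem: inferring a single event rate from exponential waiting times. It does not use the birth-death likelihood or the pseudo-marginal scheme of the abstract, and the names log_target and metropolis_hastings, the flat prior on the log rate, and the toy data are all assumptions.

```python
import math
import random


def log_target(theta, n_events, total_time):
    """Log posterior of theta = log(rate) for exponential waiting times.

    Assumes a flat prior on theta, so the posterior of the rate is Gamma(n_events, total_time).
    """
    return n_events * theta - total_time * math.exp(theta)


def metropolis_hastings(n_events, total_time, n_iter=20000, step=0.5, seed=1):
    """Random-walk Metropolis-Hastings on theta; returns sampled rates."""
    rng = random.Random(seed)
    theta = 0.0
    current = log_target(theta, n_events, total_time)
    samples = []
    for _ in range(n_iter):
        # Symmetric Gaussian proposal, so the Hastings ratio reduces to the target ratio.
        proposal = theta + rng.gauss(0.0, step)
        candidate = log_target(proposal, n_events, total_time)
        if rng.random() < math.exp(min(0.0, candidate - current)):
            theta, current = proposal, candidate
        samples.append(math.exp(theta))
    return samples


# Toy data: 20 events observed over a total time of 10 units.
samples = metropolis_hastings(n_events=20, total_time=10.0)
kept = samples[2000:]  # discard burn-in
posterior_mean = sum(kept) / len(kept)
print(posterior_mean)
```

Under these assumptions the exact posterior mean is n_events / total_time, which gives a simple check that the chain is behaving sensibly.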
P19 Understanding and Inferring Cultural Transmission of Reproductive Success

Jeremy Guez [1,2], Jean Cury [2], François Bienvenu [3], Bruno Toupance [1], Evelyne Heyer [1], Frederic Austerlitz [1], Flora Jay [2]
Affiliations: [1] MNHN, CNRS, Université de Paris [2] Université Paris-Saclay, CNRS, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400, Orsay, France. [3] Institute for Theoretical Studies, ETHZ
Topics: Population genetics, Deep learning in genomics
Keywords: Demographic Inference, Cultural Modeling, Neural Networks
Cultural Transmission of Reproductive Success (CTRS) designates a type of positive correlation in progeny size between parents and children. This process, detected in numerous human and animal populations, results from various non-genetic factors such as the transmission of social rank or resources correlated with reproductive success. Here, we model and simulate CTRS to study its effect on the evolution of genomic diversity through time. We show that it has a double impact on population genetics: (1) effective population size decreases when CTRS starts, mimicking a population contraction, and increases back to its original value when CTRS stops, yielding an expansion-like effect; (2) coalescent trees (at the core of most population genetics studies) are imbalanced during and shortly after CTRS. Consequently, CTRS impacts site frequency spectra (SFS) and creates a strong bias in SFS-based demographic inference. Correctly inferring CTRS from genomic data is thus crucial, not only for understanding the cultural history of populations but also for accurately reconstructing other processes of broad interest. We illustrate how exchangeable convolutional neural networks trained on simulated genomes enable the joint estimation of CTRS and demographic parameters. Similar to recent findings on the intricacy of demography, selection, and biased gene conversion, our study highlights the importance of integrating multiple processes, including CTRS, when studying evolution from genomic data.
P20 Demographic History Inference and the Polyploid Continuum

Ryan Gutenkunst, Paul D. Blischak [1], Mathews Sajan [1], Michael S. Barker [1], Ryan N. Gutenkunst [1]
Affiliations: [1] University of Arizona
Topics: Population genetics
Keywords: Autopolyploid, Allopolyploid, Dadi
Polyploidy is an important generator of evolutionary novelty, occurring within a single lineage (autopolyploidy) or by hybridization between different lineages (allopolyploidy). Historically, these scenarios have been treated as completely separate cases based on chromosome pairing, with autopolyploids pairing among all homologous chromosomes and allopolyploids pairing only among chromosomes from the same parent lineage. But if they are similar enough, chromosomes from distinct lineages may pair and thus alleles may be exchanged between subgenomes. Quantitatively inferring demographic history and rates of subgenomic exchange within polyploids may shed light on the evolutionary history of key species, including many crops. We thus developed diffusion models for genetic variation in polyploids with variable inheritance patterns and implemented them in the dadi software. We used SLiM simulations to validate our models and test their inference properties, focusing on the case in which reads from two subgenomes cannot be distinguished bioinformatically. We then applied these models to infer demographic history and subgenome exchange for shepherd's purse (Capsella bursa-pastoris), a recently formed allotetraploid, to investigate the timing and impact of hybridization and genome duplication.
P21 sstar: A Python package for detecting archaic admixture from population genetic data with S*

Xin Huang, Martin Kuhlwilm
Affiliations: Department of Evolutionary Anthropology, University of Vienna, Djerassiplatz 1, 1030 Vienna, Austria
Topics: Population genetics
Keywords: Introgression, Archaic Admixture, Population Genetics
S* is a widely used statistic for detecting archaic admixture from population genetic data. To apply S*, previous studies used freezing-archer, a repository of Python 2 and R scripts originally developed to detect Neanderthal and Denisovan introgression into modern humans. However, users need to customize freezing-archer themselves when studying other species or populations. Here, we implemented a new Python 3 package named sstar, with better performance than freezing-archer. Reliability was also improved after fixing bugs in freezing-archer. Compared with freezing-archer, sstar is more user-friendly, versatile and well-documented.
P22 dadi-cli: Automated and distributed computational inference of demographic history and distributions of fitness effects from population genetic data

Xin Huang [1,2], Travis J. Struck [2], Sean W. Davey [2], Ryan N. Gutenkunst [2]
Affiliations: [1] University of Vienna [2] University of Arizona
Topics: Population genetics
Keywords: Demographic Inference, Distribution Of Fitness Effects, Population Genetics
dadi is a flexible Python package for inferring demographic history and the distribution of fitness effects (DFE) from population genetic data based on diffusion theory. It has been widely used in various species. However, using dadi requires knowledge of Python, patience to tune model parameters, and a large amount of computing resources. Here, we implemented dadi-cli to help users apply dadi in their research through a robust and user-friendly command line interface. With dadi-cli, users can automatically infer demographic or DFE models and distribute the dadi inference pipeline across high-performance computing clusters or cloud platforms.
P23 hapCon: Estimating contamination of ancient genomes by copying from reference haplotypes

Yilei Huang, Harald Ringbauer
Affiliations: Max Planck Institute for Evolutionary Anthropology
Topics: Population genetics
Keywords: Ancient DNA, Hidden Markov Model, Population Genetics
Human ancient DNA (aDNA) studies have surged in recent years, revolutionizing the study of the human past. Typically, aDNA is preserved poorly, making such data prone to contamination from other human DNA. Therefore, it is important to rule out substantial contamination before proceeding to downstream analysis. As most aDNA samples can only be sequenced to low coverages (<1x average depth), computational methods that can robustly estimate contamination in the low coverage regime are needed. However, the ultra low-coverage regime (0.1x and below) remains challenging for existing approaches. We present a new method, hapCon, to estimate contamination in aDNA for male modern humans. It utilizes a Li&Stephens haplotype copying model for haploid X chromosomes, with mismatches modeled as genotyping error or contamination. We assessed hapCon on simulated and down-sampled empirical aDNA data for samples up to 45,000 years old and from diverse ancestries. Our results demonstrate that hapCon outperforms commonly used tools for estimating contamination, with substantially lower variance and narrower confidence intervals, especially in the low coverage regime. We found that hapCon provides useful contamination estimates for coverages as low as 0.1x for SNP capture data (1240k) and 0.02x for whole genome sequencing data (WGS), substantially extending the coverage limit of previous male X chromosome based contamination estimation methods.
P24 Threading sequences into inferred genealogies with recombination

Anastasia Ignatieva, Simon Myers
Affiliations: University of Oxford
Topics: Population genetics
The problem of large-scale genealogy reconstruction has seen significant recent progress, with the development of several tools for inferring sequences of local trees (or ancestral recombination graphs) compatible with a given input dataset. In the absence of recombination, the classical problem of phylogenetic placement (adding an additional sequence to a fixed reference phylogeny) has received substantial interest over the past several decades, motivated in part by the scalability it offers when dealing with very large trees, and the possibility of adding new data without repeating computation. The equivalent problem in the presence of recombination has only recently begun gaining momentum. We present a new approximate but principled method for accurately threading a sequence into an inferred genealogy (with the method being, in some sense, exact when the reference genealogy is the ground truth). We use the timing of the mutations that are shared by the reference genealogy and the new sequence to construct a state space of possible nodes with which the new sequence can coalesce at each genomic location. Simulation studies with human-like parameters demonstrate that the method scales well as the sample size increases, running in a fraction of the time compared to de novo genealogy reconstruction, while maintaining excellent accuracy.
P25 tskit - The Tree Sequence Toolkit

Ben Jeffery, Tskit developers
Affiliations: Big Data Institute, Oxford University
Topics: Population genetics, Phylogenetics
Keywords: Tree Sequence, Software
The ability to store and analyse related genetic sequences is an essential requirement for simulation, inference, and analysis in both population genetics and phylogenetics. The recent introduction of the succinct tree sequence data structure has provided a way to achieve this at population scale. Here we present the tskit software, a high-performance, portable, open-source, community-driven library. Tskit allows the creation, manipulation and analysis of succinct tree sequences, with first-class support for provenance and user-defined metadata. Tskit enables a common foundation across software projects that use succinct tree sequences, which results in unprecedented interoperability, reproducibility and maintainability.
P26 CellDrift: identifying cellular and temporal patterns of perturbation responses from single-cell data

Kang Jin [1,2], Daniel Schnell [1,5], Guangyuan Li [1,2], Surya Prasath [1,2,3], Rhonda Szczesniak [4], Bruce J Aronow [1,2,3]
Affiliations: [1] Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center [2] Department of Biomedical Informatics, University of Cincinnati [3] Department of Electrical Engineering and Computer Science, University of Cincinnati [4] Division of Biostatistics and Epidemiology, Cincinnati Children's Hospital Medical Center [5] Heart Institute and Center for Translational Fibrosis Research, Cincinnati Children's Hospital Medical Center
Topics: Functional genomics, Single cell -omics, Deep learning in genomics
Keywords: Single Cell Sequencing, Perturbation, Temporal Pattern, Generalized Linear Model, Gaussian Process, Functional Data Analysis
Shifted gene programs are the key to understanding perturbation responses in single-cell RNA sequencing experiments. With the increasing complexity of perturbational experiments, generative models, such as scGen and CPA, have been used to interrogate perturbation latent features utilizing the power of deep neural networks. However, a lack of interpretability still prevents biologists from straightforwardly understanding perturbation responses. Here we present CellDrift, a generalized linear model (GLM) that accounts for major covariates, including perturbation groups, cell types, and their interactions in perturbational single-cell data. We utilized Gaussian Process (GP) and functional Principal Component Analysis (fPCA) based on the results of GLM for perturbational studies with time series and identified temporal patterns of gene programs in perturbation responses. To illustrate CellDrift, we applied the model to a COVID-19 and sepsis blood single-cell dataset and identified disease-specific temporal patterns of interferon responses in sepsis and COVID-19 patients of different severity levels.
P27 A scalable framework for robust linear mixed model association testing

Georgios Kalantzis [1], Ali Pazokitoroudi [2], Han Chen [3,4], Sriram Sankararaman [2,5,6], Pier Palamara [1,7]
Affiliations: [1] Department of Statistics, University of Oxford, Oxford, UK, [2] Department of Computer Science, UCLA, Los Angeles, USA, [3] Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas, USA, [4] Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA, [5] Department of Human Genetics, David Geffen School of Medicine, UCLA, Los Angeles, USA, [6] Department of Computational Medicine, David Geffen School of Medicine, UCLA, Los Angeles, USA, [7] Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
Topics: Quantitative genetics, Methods for GWAS
Keywords: GWAS, Association, LMM, Variance Components, UK Biobank, Population Stratification
Linear mixed models (LMMs) offer state-of-the-art performance in genome-wide association studies but are computationally demanding when the number of analyzed individuals and traits increases. We developed a framework, MTMA, that remains scalable while preserving statistical power and robustness to confounding. MTMA relies on moment-based multiple variance component estimation and can be used to analyze multiple quantitative traits in parallel. We simulated traits with realistic MAF/LD-dependent architectures, using sets of 50K UK Biobank (UKB) samples of varying levels of relatedness or population structure. We compared MTMA with BOLT-LMM, Regenie, FastGWA, and linear regression (linReg). In samples of homogeneous ancestry MTMA achieved an increase in association power similar to that of BOLT-LMM-Inf compared to linReg (+1.5-4% χ2 on top SNPs). In structured samples MTMA and BOLT-LMM-Inf yielded a better calibration of false positive rates compared to other approaches. MTMA was the most scalable LMM method in analyses of up to 50K individuals and remained scalable in larger data sets (e.g. ~5 days for 20 traits, N=446K), where other LMM approximations were faster. We also analyzed highly structured multi-ancestry UKB samples, where different association strategies led to trade-offs between the number of detected signals and their replication rate in Biobank Japan, highlighting the importance of developing scalable and robust LMM algorithms for heterogeneous biobanks.
P28 Machine learning based approaches for phenotypic prediction in natural populations.

Sakshi Khaiwal, Matteo De Chiara, Jonas Warringer, Dr Agnese Seminara, Dr Marco C. Lagomarsino, Dr Gianni Liti
Affiliations: Université Côte d'Azur
Topics: Population genetics, Quantitative genetics, Methods for GWAS, Deep learning in genomics
Keywords: Machine Learning, Phenotype Prediction, S. cerevisiae, Complex Phenotypes
Complex phenotypes are controlled by multiple genetic loci, the environment, and their interactions. To understand how the genetic variation of an organism influences its phenotype, we explore methods from machine learning (ML) to obtain models for predicting quantitative phenotypes. In this context, we use the budding yeast, S. cerevisiae, as the model organism, for which we have obtained genomic and phenomic data. We use information at the genomic, proteomic, and phenomic scales as the input features for the prediction of the phenotypes. At the genomic scale, we have pan-genomic information such as variations in sequences as well as in genome content, such as presence/absence, copy number variation, and loss of function. In addition, we have information about the gene expression level for 1100 genes, which can be considered as intermediate phenotypes. The regression methods used for the prediction are Elastic Net, Ridge regression, Support Vector Machine (SVM), and Gradient Boosting Machine (GBM). The models are optimized using a combination of randomized hyper-parameter tuning and cross-validation on a part of the dataset. The ML pipeline was used to predict phenotypes in ~200 environmental conditions, with different genetic and phenotypic level information as the features. We observed a large variation in the prediction accuracy of different phenotypes, suggesting that some phenotypes are easier to predict than others.
P29 Genetic relatedness through the lens of tree sequences

Visit poster
Brieuc Lehmann [1], Gregor Gorjanc [2], Jerome Kelleher [3], Peter Ralph [4], Georgia Tsambos [5]
Affiliations: [1] University College London, [2] University of Edinburgh, [3] University of Oxford, [4] University of Oregon, [5] University of Melbourne
Topics: Population genetics, Quantitative genetics
Keywords: Ancestral Recombination Graph, Tree Sequence, Principal Component Analysis, Genetic Relatedness
Genetic relatedness is a central concept in genetics, with this year marking the 100th anniversary of the seminal paper by Sewall Wright on 'Coefficients of inbreeding and relationship'. Genetic relatedness between individuals, or between populations, is of interest in its own right in population and quantitative genetics and their applications to human, animal, or plant settings. It also plays a crucial role in many downstream analyses, such as principal components analysis (PCA) in population genetics and phenotype prediction in quantitative genetics. Despite its importance, however, there is no single, agreed-upon measure of genetic relatedness, with numerous efforts having been made to first characterise the concept and subsequently estimate it from the available data. At its most general, genetic relatedness refers to some notion of similarity between individuals, where similarity can be defined according to pedigree, genotype, or genealogy. Here, we attempt to harmonise different notions of genetic relatedness via the tree sequence encoding of Ancestral Recombination Graphs. A tree sequence represents the relationships between a set of DNA sequences and provides an efficient way of storing genetic variation data. It also provides a natural way to describe genetic relatedness via the ancestral relationships encoded in the tree sequence. We demonstrate this using simple examples, and illustrate how to perform common analyses such as PCA directly on the tree sequence.
P30 Reference-free deconvolution of cell mixtures in spatial transcriptomics data

Visit poster
Qunhua Li, Roopali Singh, Xiang Zhu
Affiliations: Pennsylvania State University
Topics: Single cell -omics
Keywords: Spatial Transcriptomics, Deconvolution, Reference-Free, NMF, Bayesian
Recent advances in spatial transcriptomics technologies have enabled the mapping of cell types to their spatial locations in tissues. Lack of single-cell resolution in spatial transcriptomics (ST) technologies produces gene expression from a mixture of potentially heterogeneous cells at each spot. Here, we present RETROFIT, a novel Bayesian non-negative matrix factorization framework to decompose cell type mixtures in ST data free of external single-cell expression references. RETROFIT outperforms existing reference-based methods in estimating cell type proportions and reconstructing gene expressions in simulations with varying spot size and sample heterogeneity, irrespective of the quality or availability of the single-cell reference. RETROFIT recapitulates known cell-type localization patterns in a Slide-seq dataset of mouse cerebellum without using any single-cell data. RETROFIT identifies biologically relevant spatial and temporal patterns in a Visium dataset of adult and fetal human intestine, and identifies cell-type specific genes associated with each developmental stage, revealing insights into human intestine development.
P31 PCAone: ultra-fast and accurate PCA for biobank scale data

Visit poster
Zilong Li, Jonas Meisner, Anders Albrechtsen
Affiliations: Computational and RNA biology section, Department of biology, University of Copenhagen
Topics: Quantitative genetics, Methods for GWAS, Applications to cancer and other diseases, Single cell -omics
Keywords: Principal Component Analysis, Randomized Singular Value Decomposition (RSVD), GWAS, UK Biobank, Big Data, Single Cell RNA
Principal Component Analysis (PCA) is a widely used linear dimensionality reduction technique in statistics, machine learning and genetics. In the big data era, algorithms that do not require storing all data in memory are needed to overcome RAM limits on large datasets. Randomized Singular Value Decomposition (RSVD) and the Implicitly Restarted Arnoldi Method (IRAM) are popular truncated SVD methods that approximate the full SVD. In this work, we propose a new RSVD algorithm that can achieve high accuracy within a few iterations while running out-of-core. We implemented the new method and faster blockwise versions of the existing IRAM and RSVD in our software PCAone. The benchmarking results show that PCAone is on average 4x, 9x, 12x and 30x faster than PLINK2 (fastPCA), ProPCA, TeraPCA and FlashPCA2, respectively, while using little memory. The accuracy of our new RSVD is comparable to that of IRAM and higher than that of other RSVD methods. Using PCAone, we are able to analyze 330 thousand unrelated UK Biobank individuals with 4 million imputed SNPs (no LD pruning) in 5 hours to calculate the first 40 PCs. The PCs capture population structure, signals of selection (SLC45A2), structural variants and low recombination regions. Moreover, PCAone is a general online PCA framework that supports many different formats and data types. For single cell RNA data with 1.3 million brain cells, PCAone took 35 minutes to run using only 4 GB of memory. PCAone is available at https://github.com/Zilong-Li/PCAone.
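For readers unfamiliar with randomized SVD, the following is a minimal in-memory sketch of a generic Halko-style RSVD with power iterations. It is not the new algorithm proposed in the abstract and it does not run out-of-core; the function name `randomized_svd` and the parameters `oversample` and `n_power_iter` are illustrative assumptions, not part of PCAone.

```python
import numpy as np


def randomized_svd(A, k, oversample=10, n_power_iter=2, seed=0):
    """Approximate the top-k SVD of A (n x m) with a generic randomized scheme.

    Illustrative sketch only: all names and defaults are assumptions.
    """
    rng = np.random.default_rng(seed)
    n, m = A.shape
    ell = min(k + oversample, n, m)  # sketch width, capped by the matrix size

    # Project A onto ell random directions and orthonormalise the result.
    Q, _ = np.linalg.qr(A @ rng.standard_normal((m, ell)))

    # Power iterations sharpen the separation between leading and trailing
    # singular values; re-orthonormalising each time keeps this stable.
    for _ in range(n_power_iter):
        Q, _ = np.linalg.qr(A.T @ Q)
        Q, _ = np.linalg.qr(A @ Q)

    # Solve the small (ell x m) problem exactly and map back to n dimensions.
    B = Q.T @ A
    U_small, s, Vt = np.linalg.svd(B, full_matrices=False)
    U = Q @ U_small
    return U[:, :k], s[:k], Vt[:k]


if __name__ == "__main__":
    # Toy example: a noisy low-rank matrix standing in for centred genotypes.
    rng = np.random.default_rng(42)
    X = rng.standard_normal((500, 8)) @ rng.standard_normal((8, 200))
    X += 0.01 * rng.standard_normal(X.shape)
    U, s, Vt = randomized_svd(X, k=8)
    print("leading singular values:", np.round(s, 2))
```

Each pass touches `A` only through matrix products, which is the property that lets methods of this family stream the data in blocks instead of holding it in memory.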
P32 spaceNNtime: a neural network to predict the date and sampling location of ancient genomes

Visit poster
Moisès Coll Macià [1, 2], Graham Gower [2], Martin Petr [2] and Fernando Racimo [2]
Affiliations: [1] Bioinformatics Research Centre (BiRC), Aarhus University, Aarhus, Denmark [2] Lundbeck GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Copenhagen, Denmark
Topics: Population genetics, Deep learning in genomics
Keywords: Spatiotemporal Inference, Artificial Neural Network, aDNA, Simulation
Mutations arise in the genetic material transmitted in each generation, creating a unique molecular pattern that will only be inherited by the descendants of that lineage. Since individuals that are geographically close are more likely to mate, genetic kinship is generally correlated with geographical location. Moreover, because mutations accumulate over time, they can be used as temporal checkpoints, i.e. the presence or absence of a mutation in a sample bounds its date of existence. Therefore, the location and time in which an individual lived are encoded in their genome and kept in ancient remains. In this project, we develop spaceNNtime, an artificial neural network to predict both the date and the sampling location of ancient DNA samples. In its first phase, we assess the accuracy and precision of the method by training and testing it on simulated individuals. We then test its performance by introducing confounding factors such as sample size variability, lack of reference populations, increasing the mobility rate per individual or simulating genotype likelihoods (with varying error rates) instead of called genotypes. Our preliminary results show that predicting both space and time in the same model increases accuracy for time but not space. We expect spaceNNtime to perform well on real data, and it opens up the possibility of identifying misclassified museum samples, finding long-distance migrants and offering alternative ways to date ancient samples.
P33 Deep Generative Modeling of Pleiotropy in the Genetic Architecture of Traits

Visit poster
Rahul Mehta, Jeremy Berg
Affiliations: University of Chicago
Topics: Population genetics, Natural selection, Quantitative genetics, Deep learning in genomics
Keywords: Deep Generative Model, Pleiotropy, Selection
Genome-wide association studies (GWAS) provide a powerful, yet simple link to describe complex traits by their underlying genetic architecture. Several recent studies have demonstrated that natural selection plays an important role in shaping the distribution of the genetic variation uncovered by GWAS. A key challenge that remains is reconciling how selection acts on traits when so many of these traits are pleiotropic. Given that genetic variation is a result of small effects across the genome, pleiotropy can cause spurious signals of selection resulting in biological misinterpretations. Here, we motivate and formalize a Bayesian framework to model the impact of natural selection on the genetic architecture of complex traits. In particular, we deconstruct the genetic architecture as a series of distributions that unify linkage disequilibrium regression methods with Poisson random field methods and a model for the effect of pleiotropy on the relationship between effect on trait and selection coefficient, without making any assumptions on the distribution of the selection parameters. This results in a generative model that uses GWAS summary statistics to estimate both the distribution of selection coefficients for loci within a specific trait, as well as the strength of stabilizing selection acting on that trait throughout human evolutionary history.
P34 Coalescent census population under selection: full-likelihood inference, applications and implications

Visit poster
Caetano Souto Maior Mendes
Affiliations: [1] BCAM - Basque Center for Applied Mathematics, Bilbao, Spain, [2] Laboratory of Systems Genetics, National Heart Lung and Blood Institute, National Institutes of Health, Bethesda, MD, United States of America
Topics: Population genetics, Natural selection, Phylogenetics, Pathogen genomics
Keywords: Structured Coalescent, Natural Selection, Census Population Size, SARS-CoV-2
The Coalescent process is the workhorse of population genetics. Full-likelihood inference, though the standard for estimating population genetic (or any) parameters, is often not employed when it is difficult to specify an analytical likelihood function, for instance when complex demographies, structure, or selection are present. Approximation schemes such as Approximate Bayesian Computation (ABC) can be used to circumvent some of those hurdles, and Machine Learning algorithms have recently allowed inference without requiring a subjective choice of data summaries, though none of the approaches can completely replace likelihood-based inference, which remains the gold standard. For that reason, population geneticists continue to use the approach and rely on simplifications and constructs like the effective population size (Ne). Here, I show that, by extending frameworks that allow arbitrarily complex population models, explicit full-likelihood nonparametric estimation of census population size and selection is possible. In a nutshell, that is accomplished by extending the Bayesian Skyline Model to include births, and exploiting a structured coalescent to describe variation and selection; i.e. selection is readily described by subgroups with higher birth rates. I also discuss potential applications to sequence-based pandemic surveillance, its caveats and implications.
P35 Robust supervised machine learning for population genetic inference with domain adaptation

Visit poster
Ziyi Mo [1, 2], Adam Siepel [1]
Affiliations: [1] Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, [2] School of Biological Sciences, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY
Topics: Population genetics, Natural selection, Deep learning in genomics
A series of supervised machine learning methods have recently been proposed to address a range of problems in population genetics. Despite their much-improved performance over traditional statistical methods, model mis-specification remains the Achilles’ heel of this new paradigm. Here, we propose that domain adaptation techniques can be a powerful tool to mitigate the effect of model mis-specification on the performance of these methods. We demonstrated that the Deep Reconstruction-Classification Network (DRCN) helped our previously proposed SIA model achieve a better performance in identifying selective sweeps when the parameters used for simulating training data mismatch those underlying the generation of the data at test time. We anticipate that this approach will be widely applicable and can be an important tool to build confidence in the results when adopting supervised machine learning methods for inference.
P36 A balanced measure shows superior performance of pseudobulk methods over mixed models and pseudoreplication approaches in single-cell RNA-sequencing analysis

Visit poster
Alan Murphy, Nathan Skene
Affiliations: [1] UK Dementia Research Institute at Imperial College London, London W12 0BZ, UK [2] Department of Brain Sciences, Imperial College London, London W12 0BZ, UK
Topics: Single cell -omics
Keywords: Single-Cell RNA-Seq Differential Expression, Pseudobulk, Pseudoreplication, Mixed Models
Multiple approaches have been proposed for the analysis of differential expression of single-cell RNA-sequencing data which can be broadly grouped into pseudoreplication, pseudobulk aggregation and mixed models. Recently, Zimmerman et al. [1] proposed the use of mixed models over pseudobulk approaches, reporting improved performance on a novel simulation approach of hierarchical single-cell expression data. However, their reported results could not prove the superiority of mixed models as they are based on separate calculations of type 1 (performance of the models on non-differentially expressed genes) and type 2 error (performance on differentially expressed genes). To correctly benchmark the models, a reanalysis using a balanced measure of performance, considering both the type 1 and type 2 errors (both the differentially and non-differentially expressed genes), is necessary. Our reanalysis demonstrates that pseudobulk approaches are far from being too conservative and are, in fact, the best performing models based on this simulated dataset for the analysis of single-cell expression data. [1] Zimmerman, K. D., Espeland, M. A. & Langefeld, C. D. A practical solution to pseudoreplication bias in single-cell studies. Nat. Commun. 12, 738 (2021).
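The abstract does not name the balanced measure used. As one common example of a measure that combines type 1 and type 2 errors in a single score, the sketch below computes the Matthews correlation coefficient (MCC) from a method's p-values and the simulated ground truth. The choice of MCC, the function names and the 0.05 threshold are assumptions made for illustration.

```python
import math


def confusion_counts(pvalues, is_de, alpha=0.05):
    """Count TP, FP, TN, FN when genes with p < alpha are called significant.

    `is_de` holds the simulated truth (True = differentially expressed).
    """
    tp = fp = tn = fn = 0
    for p, truth in zip(pvalues, is_de):
        called = p < alpha
        if called and truth:
            tp += 1
        elif called and not truth:
            fp += 1
        elif not called and truth:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def matthews_corrcoef(tp, fp, tn, fn):
    """MCC in [-1, 1]; returns 0.0 when any marginal total is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts: a conservative method versus a liberal one.
    conservative = matthews_corrcoef(tp=50, fp=0, tn=900, fn=50)
    liberal = matthews_corrcoef(tp=90, fp=200, tn=700, fn=10)
    print(f"conservative: {conservative:.3f}, liberal: {liberal:.3f}")
```

Because the score penalises false positives and false negatives together, a method cannot look good simply by being very conservative or very liberal, which is the point the abstract makes against evaluating the two error types separately.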
P37 GADMA2: New Features for Efficient and Flexible Demographic Inference

Visit poster
Ekaterina Noskova, Vladimir Ulyantsev
Affiliations: ITMO University, 49 Kronverkskiy Pr., St. Petersburg 197101, Russia
Topics: Population genetics
Keywords: Demographic History, Demographic Inference
Inference of complex demographic histories requires parameterized models specified manually by the researcher. With an increased variety of methods and tools, each with its own interface, model specification becomes tedious and error-prone. Moreover, the optimization algorithms used to find optimal parameters sometimes turn out to be inefficient. The open source software GADMA addresses these problems, providing automatic demographic inference. It proposes a common interface for several simulation engines and provides global optimization of parameters based on a genetic algorithm. The initial version of GADMA featured only two simulation engines: ∂a∂i and moments. Both of these engines have been improved since GADMA’s initial release. For example, ∂a∂i introduced inference of inbreeding coefficients, started to support demographic histories involving four and five populations and enabled GPU support. GADMA has to keep up with these changes, support new engines and further enhance its optimization algorithms. Here, we introduce new features of GADMA2, the second version of GADMA software. The new simulation engines include momi2 and momentsLD. Optimization based on a genetic algorithm was improved by tuning hyperparameters. Moreover, a new flexible interface for specification of demographic history parameters enables inference of selection and inbreeding coefficients. We provide a full overview of GADMA2 enhancements and demonstrate several examples of their usage.
P38 Yeast cell cycle specificity impact on demography inference

Visit poster
Louis Ollivier
Affiliations: LISN - Université Paris-Saclay
Topics: Population genetics, Natural selection
Keywords: Recombination, Sexual Reproduction, Selection, SFS
The yeast Saccharomyces cerevisiae serves as a eukaryotic model organism for classical genetics and genomics to understand the principles of the eukaryotic life cycle. The canonical life cycle of S. cerevisiae consists of a regular alternation between haploid (n) and diploid (2n) phases, known as a haplodiplophasic cycle. However, recent studies suggest that outcrossing events are less frequent than previously thought: they are estimated to occur at frequencies ranging from 1 per 1000 generations to 1 per 50000 generations. This would lead to fewer recombination events in the population, and hence to less heterozygosity and less diversity. In this work, we investigated the effect of this specificity of the S. cerevisiae cell cycle on demographic inference. For instance, we observe interesting results such as an excess of polymorphic sites at frequency 0.5 in simulations of populations with this cell cycle specificity.
P39 Impact of Cultural Transmission of Reproductive Success on whole-genome data: simulation study and ABC inference

Visit poster
Ferdinand Petit [1,2], Jérémy Guez [1,2], Fanny Pouyet [2], Evelyne Heyer [1], Bruno Toupance [1], Flora Jay [2], Frédéric Austerlitz [1]
Affiliations: [1] UMR7206 CNRS/MNHN/Université de Paris Eco-Anthropologie, Musée de l'Homme, 75116 Paris, France [2] Université Paris-Saclay, CNRS, INRIA, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400, Orsay, France.
Topics: Population genetics, Deep learning in genomics
Keywords: Demographic Inference, Approximate Bayesian Computation
Previous demographic and genetic studies demonstrated the occurrence in multiple populations of Cultural Transmission of Reproductive Success (CTRS), a positive correlation between the sibship size of an individual and his/her number of children, linked to sociocultural processes. CTRS affects coalescent tree shapes, making them imbalanced. This property was used previously to infer patrilinear and matrilinear CTRS from Y chromosome and mtDNA polymorphism data, respectively. We aim to extend this methodology to nuclear whole-genome data, for which, unlike uniparental data, coalescent trees vary across the genome. We first infer these trees from genomic data, and then compute their imbalance indices. To this end, we compared through simulations the imbalance indices of the true coalescent trees and of those inferred by two programs (tsinfer and Relate), in simulated populations with various demographic histories and CTRS levels, using the forward-in-time simulation program SLiM. We showed that both packages infer tree imbalance levels along the genome correctly, with better performance for tsinfer. We then evaluated how these inferred tree imbalance indices could be used as summary statistics in approximate Bayesian computation, to infer CTRS from genomic data. Our results indicate that such estimation is possible for medium to high levels of CTRS and that the strength and starting time of CTRS are accurately predicted even in the presence of confounding demographic factors.
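The abstract does not say which imbalance indices were computed. As an illustration of what such a summary statistic looks like, the sketch below computes the Colless index, a standard measure of tree imbalance, for a binary tree written as nested pairs. Both the choice of index and the nested-tuple representation are assumptions made for this example.

```python
def colless_index(tree):
    """Return (number of leaves, Colless index) for a binary tree.

    A tree is either a leaf (any non-tuple object) or a pair (left, right).
    The Colless index sums, over all internal nodes, the absolute difference
    between the number of leaves in the left and right subtrees.
    """
    if not isinstance(tree, tuple):
        return 1, 0
    left, right = tree
    n_left, c_left = colless_index(left)
    n_right, c_right = colless_index(right)
    return n_left + n_right, c_left + c_right + abs(n_left - n_right)


if __name__ == "__main__":
    balanced = (("a", "b"), ("c", "d"))      # fully symmetric tree
    caterpillar = ((("a", "b"), "c"), "d")   # maximally imbalanced tree
    for name, tree in [("balanced", balanced), ("caterpillar", caterpillar)]:
        n_leaves, index = colless_index(tree)
        print(name, n_leaves, index)
```

Computed on each local tree along the genome, an index of this kind gives a vector of values that can be summarised (for example by its mean and variance) before being passed to an ABC procedure.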
P40 slendr: a framework for simulating spatio-temporal population genomic data on geographic landscapes

Visit poster
Martin Petr [1], Benjamin C. Haller [2], Peter Ralph [3], Fernando Racimo [1]
Affiliations: [1] Lundbeck Foundation GeoGenetics Centre, GLOBE Institute, University of Copenhagen, Denmark, [2] Department of Computational Biology, Cornell University, Ithaca, NY, USA, [3] Institute of Ecology and Evolution, University of Oregon, Eugene, OR, USA
Topics: Population genetics, Phylogenetics
Keywords: Simulation, Spatio-Temporal Statistics, Tree Sequence, Demographic Inference, Ancient DNA
One of the goals of population genetics is to understand how evolutionary forces shape patterns of genetic variation over time. Because evolution acts across time and space, many evolutionary processes have a spatial component, acting through phenomena such as isolation by distance, local mate choice, or uneven distribution of resources. However, the spatial dimension of evolution is often neglected, partly due to the lack of tools designed for building and evaluating complex spatial population genetic models. To address this issue, we present a framework for simulating spatially-explicit genomic data, implemented in a new R package called slendr (www.slendr.net) using a tailor-made built-in back end written in SLiM, a forward population genetic simulator. This framework allows one to programmatically and visually encode population ranges and temporal dynamics (i.e., population displacements, expansions, and contractions) on real Earth landscapes or abstract, user-defined maps, and to specify population splits and gene-flow events, all using a simple declarative language. Additionally, slendr supports simulating traditional, random-mating non-spatial models using an alternative coalescent msprime back end. Together with an R-idiomatic interface to the tskit tree sequence library, slendr opens up the possibility of efficient, reproducible simulation of spatio-temporal genomic data entirely within the R environment, which we will demonstrate with several practical examples.
P41 Joint estimation of migration rates and effective population sizes from trio coalescence times in inferred tree sequences

Visit poster
Nathaniel S. Pope
Affiliations: Pennsylvania State University
Topics: Population genetics
Keywords: Tree Sequence, Demographic Inference, Composite Likelihood
A tree sequence is a succinct representation of the genealogical relationships among recombinant DNA sequences, that can be directly inferred from phased genetic polymorphisms in a sample of organisms. The marginal genealogies embedded within a tree sequence carry information about the joint demographic history of the populations to which samples belong. However, probabilistic models for entire tree sequences are often intractable under even moderately complicated demographic scenarios. We introduce a composite likelihood method based on the first coalescence times of labelled trios, that can be used to jointly estimate effective population sizes and asymmetric migration rates in a piece-wise constant, multi-population demographic model. The underlying likelihood is calculated via a continuous-time Markov process that tracks the movement of three lineages between populations until the first coalescence event. This results in a nonlinear state-space model with a structure that facilitates cheap calculation of the gradient with respect to the demographic parameters, allowing for fast optimization via quasi-Newton methods. With simulated data, we demonstrate that this model is identifiable so long as migration is low, and that estimates of demographic parameters are asymptotically consistent. The computational complexity of the method is independent of the size of the tree sequence, and the summary statistics used as inputs can be efficiently harvested using the tskit library.
P42 Spatial models in population genetics

Visit poster
Peter Ralph, Alison Etheridge [1], Tom Kurtz [2], Ian Letter [1], Gilia Patterson [3], Terence Tsui [1]
Affiliations: [1] Department of Statistics, University of Oxford, [2] Departments of Mathematics and Statistics, University of Wisconsin - Madison [3] Institute of Ecology and Evolution, University of Oregon
Topics: Population genetics, Natural selection
Keywords: Spatial Modeling, Geography
Models of spatially explicit populations with intrinsic population regulation are challenging to set up and work with, but vitally important for understanding ongoing ecological processes. I will present a general class of Bolker-Pacala models in which fecundity, mortality, and/or establishment rates depend on local population densities, discuss considerations for stability, and describe how lineages move in these models. This is an important step in understanding how to use genomics to inform inferences of ongoing spatial processes. Simulations are done with SLiM and lineages are recorded using the tree sequence.
P43 Prediction of evolutionary constraint by genomic annotations improves prioritization of causal variants in maize

Visit poster
Guillaume P. Ramstein [1], Edward S. Buckler [2, 3]
Affiliations: [1] Center for Quantitative Genetics and Genomics at Aarhus University, [2] Institute for Genomic Diversity at Cornell University, [3] USDA-ARS
Topics: Quantitative genetics, Causal inference in genetic studies, Functional genomics, Deep learning in genomics
Keywords: Comparative Genomics, Machine Learning, Genomic Prediction, Zea Mays
Traditional quantitative genetics analyses have successfully identified genomic regions responsible for genetic variability, but they have generally lacked the resolution to detect the exact genomic sites causing differences among individuals. To detect causal polymorphisms at single-site resolution, plant geneticists can rely on nucleotide conservation, as a proxy for fitness effect of mutations. However, its usefulness can be limited, due to missing data and low depth of multiple-sequence alignments. In this study, we used genomic annotations to accurately predict nucleotide conservation across Angiosperms. Using only sequence analysis, we annotated non-synonymous mutations in about 25,824 maize gene models, with information from bioinformatics (SIFT scores, GC content, transposon insertion, k-mer frequency) and deep learning (protein representations by UniRep). Predictions of nucleotide conservation by our genomic annotations were validated by experimental information at polymorphisms: within-species conservation, chromatin accessibility, gene expression and Gene Ontology enrichment. Importantly, predicted nucleotide conservation also improved genomic prediction for grain yield (+5% and +38% prediction accuracy within and across maize panels, respectively). Therefore, our approach – Prediction of mutation Impact by Calibrated Nucleotide Conservation (PICNC) – may effectively prioritize the single-site polymorphisms most likely to impact important agronomic traits in maize.
P44 Improved Estimation of the Site Frequency Spectrum from Low-Depth Sequencing Data

Visit poster
Malthe Sebro Rasmussen [1], Genís Garcia-Erill [1], Thorfinn Sand Korneliussen [2], Carsten Wiuf [3], Anders Albrechtsen [1]
Affiliations: [1] Department of Biology, University of Copenhagen, [2] Lundbeck GeoGenetics Centre, GLOBE Institute, University of Copenhagen, [3] Department of Mathematical Sciences, University of Copenhagen
Topics: Population genetics
Keywords: Site Frequency Spectrum, Optimisation, Expectation-Maximisation, NGS
The site frequency spectrum ("SFS") is the joint distribution of allele frequencies from one or more populations. Several important summary statistics are functions of the SFS, which is also used for inference on selection and demography. It is straightforward to estimate the SFS directly from called genotypes, but this is known to bias the SFS when working from low-depth next-generation sequencing ("NGS") data. To address this problem, the SFS may be estimated from genotype likelihoods using an expectation-maximization ("EM") algorithm, and this has become the standard approach in studies based on low-depth NGS data. The standard EM approach requires holding all data in RAM and iterating over it dozens of times; this places significant demands on computational resources as sample size and number of sites increase, to the point where it is often impossible to run without significantly downsampling the input data. In addition, the standard SFS EM is prone to overfitting in the multidimensional setting. To address these issues, we have developed a new, stochastic "window EM" algorithm to estimate the SFS. Briefly, the algorithm optimizes smaller blocks of data and averages these over larger windows. Using this method, we show that we are able to effectively estimate the SFS much faster and with constant, trivial RAM usage. At the same time, the window-based method induces regularization to greatly reduce overfitting, and we demonstrate that this improves downstream inference.
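To make the baseline concrete, here is a minimal sketch of the standard EM for a one-dimensional SFS, the approach the abstract describes as holding all data in memory and iterating over it repeatedly. It is not the window EM proposed by the authors. The input format (for each site, the likelihood of the data given each possible sample allele count) and all names are assumptions made for illustration.

```python
def em_sfs(site_likelihoods, n_iter=100, tol=1e-8):
    """Estimate a 1D SFS by EM from per-site sample allele count likelihoods.

    site_likelihoods[s][j] is assumed to be P(data at site s | allele count j).
    Every site is assumed to have positive likelihood for at least one count
    that currently has non-zero weight in the SFS.
    """
    n_cat = len(site_likelihoods[0])
    n_sites = len(site_likelihoods)
    sfs = [1.0 / n_cat] * n_cat  # start from a flat spectrum

    for _ in range(n_iter):
        expected = [0.0] * n_cat
        # E-step: posterior over allele counts at each site given current SFS.
        for lik in site_likelihoods:
            joint = [p * l for p, l in zip(sfs, lik)]
            total = sum(joint)
            for j in range(n_cat):
                expected[j] += joint[j] / total
        # M-step: the new SFS is the average posterior across sites.
        new_sfs = [e / n_sites for e in expected]
        change = max(abs(a - b) for a, b in zip(new_sfs, sfs))
        sfs = new_sfs
        if change < tol:
            break
    return sfs


if __name__ == "__main__":
    # Three toy sites for one diploid individual (allele counts 0, 1, 2).
    toy = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.3, 0.6]]
    print([round(x, 3) for x in em_sfs(toy)])
```

Every iteration loops over all sites, which is why memory use and run time grow with the size of the input in this formulation.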
P45 An improved statistical test for conserved RNA helices

Visit poster
Elena Rivas
Affiliations: Harvard University
Topics: Functional genomics
Keywords: Evolutionary Conservation Of RNA Structure, Functional RNAs
Many functional RNAs depend on 3D structures. To distinguish RNA sequences with biologically relevant structures from those without, we use evolutionary conservation of RNA structure, which induces pairwise covariation in RNA alignments. However, phylogeny also induces covariation, even in the absence of a conserved structure. Our method, R-scape, simulates alignments to obtain the probability distribution of covariation due to phylogeny alone. R-scape infers significance considering base pairs as independent units, but RNA base pairs cluster together, forming 3D helices. To increase sensitivity, we present a method that aggregates base-pair significance to infer significance for RNA helices. We show that aggregated helix statistics provide a more robust signal, while still showing lack of evidence for conserved structure in some important non-coding RNAs such as HOTAIR and XIST.
P46 TxEWAS: a novel approach to remove bias and improve power in gene-environment interaction studies

Visit poster
Michal Sadowski, Mike Thompson, Richard Border, Sriram Sankararaman, Noah Zaitlen
Affiliations: University of California Los Angeles
Topics: Methods for GWAS, Functional genomics
Keywords: Gene-Environment Interactions, GxE, TWAS, DGLM, Pharmacogenetics
Though gene-environment interactions (GxE) are believed to be common across a wide range of phenotypes, few statistical interactions of this type have been identified by genome-wide association studies. This is because interaction effects at individual loci are typically small, and statistical power to detect them is further reduced by the multiple testing burden. Moreover, when identified, associations at individual SNPs are often difficult to interpret functionally. To improve these aspects of GxE association studies, we extended the TWAS framework to test for GxE effects in an imputed expression-trait association model (TxEWAS). TxEWAS first performs a GxE test using DGLM (logistic regression) for continuous (binary) traits, then employs a hierarchical FDR correction to call significant associations across multiple tissues. Importantly, TxEWAS accounts for the heterogeneity of phenotype variance, a thus far underappreciated source of bias in conventional GxE methods. Realistic simulations show that failing to account for this bias can cause a threefold increase in the false positive rate. We applied TxEWAS to 8 drug exposure-response pairs in the UK Biobank, where responses were measures of targets or side effects of the tested drugs, and identified 192 significant interacting genes. Together, these results demonstrate the power of the TxEWAS approach, and in particular, its potential to pinpoint genes that interact with drug treatments for more focused experimental testing.
P47 Benchmarking X chromosome identity-by-descent detection compared to the autosomes

Visit poster
Jens Sannerud, Amy L. Williams
Affiliations: Cornell University
Topics: Population genetics
Keywords: Identity By Descent, X Chromosome
The natural connection between the X chromosome and sample sex makes X chromosome genotypes a powerful resource for sex-informed pedigree inference and demographic analyses. However, the potential issues of relatively low heterozygosity and possibly lower quality for X chromosome data threaten its effective use. Such issues may differentially affect the identification of identity-by-descent (IBD) segments on the X chromosome as compared to the autosomes, hampering interpretation of X chromosome IBD-based analyses. To quantify the quality of X chromosome IBD calling, we simulated 35,000 pairs of half-first cousins using X chromosome haplotypes from UK Biobank individuals as founders. We then performed IBD calling in these pairs using a battery of eight popular IBD detection tools. In parallel, we simulated the same type of data for chromosome 10 and evaluated the quality of the X chromosome IBD calls compared to those of chromosome 10. Our preliminary results suggest that the sensitivity and positive predictive values (PPVs) across all methods were lower for the X chromosome segments than for chromosome 10. These early results suggest that the X chromosome should be treated with special consideration, and not merely as an extension of the autosomes, for the purposes of IBD calling. In accounting for these differences, we intend to improve the utility of the X chromosome for pedigree studies in the future.
P48 Inference of natural selection and allele age from allele frequency time series data using exact simulation techniques

Jaromir Sant [1], Paul A. Jenkins [1,2,3], Jere Koskela [1], Dario Spanò [1]
Affiliations: [1] Department of Statistics, University of Warwick, [2] Department of Computer Science, University of Warwick, [3] The Alan Turing Institute
Topics: Population genetics, Natural selection
Keywords: Wright-Fisher Diffusion, MCMC, Bayesian, Exact Inference, Selection
A standard problem in population genetics is to infer evolutionary and biological parameters such as the effective population size, allele age, and strength of natural selection from DNA samples extracted from a contemporary population. That all samples come only from the present-day has long been known to limit statistical inference; there is potentially more information available if one also has access to ancient DNA so that inference is based on a time-series of historical changes in allele frequencies. In this talk I will introduce a Markov Chain Monte Carlo method for Bayesian inference from allele frequency time-series data based on an underlying Wright-Fisher diffusion model of evolution. The chief novelty is that we show this method to be exact in the sense that it is possible to enable mixing by augmenting the state space with the unobserved diffusion trajectory, despite the fact that the transition function of the diffusion is intractable. We develop an efficient method in which trajectory updates and accept/reject probabilities can be calculated without error, and illustrate its performance on simulated data. This is joint work with Paul Jenkins, Jere Koskela, and Dario Spano (University of Warwick).
P49 Fast genome-wide inference of pairwise coalescence times

Regev Schweiger, Richard Durbin
Affiliations: Department of Genetics, University of Cambridge
Topics: Population genetics, Phylogenetics
Keywords: PSMC, Coalescent, Ancestral Recombination Graph
The pairwise sequentially Markovian coalescent (PSMC) algorithm and its extensions infer the coalescence time of two homologous chromosomes at each position along the genome. This inference is used in reconstructing demographic histories, detecting selection signatures, genome-wide association studies, genotype imputation and more. Inference of coalescence times between each pair of haplotypes in a large dataset is of great interest, as these times may provide rich information about the population structure and history of the sample. Moreover, they could be used to construct an ancestral recombination graph, a sequence of local genealogies documenting the genealogical history of each locus across the sample. To facilitate such large-scale analyses, we introduce a new method, Gamma-SMC, which has the potential to be at least an order of magnitude faster. To obtain this speedup, we represent each posterior coalescence time distribution succinctly as a Gamma distribution with just two parameters, whereas in PSMC and its extensions it is held as a vector over discrete intervals of time. Thus, Gamma-SMC has constant time complexity per site, with no dependence on the number of discrete time states. Additionally, due to this continuous representation, our method can infer times spanning several orders of magnitude, and as such is robust to parameter misspecification. We describe how this approach works and illustrate its performance on simulated and real data.
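The appeal of a two-parameter representation can be illustrated with standard Gamma-Poisson conjugacy. The sketch below assumes a non-recombining stretch of sequence, a Gamma(shape, rate) belief about the pairwise coalescence time, and a Poisson number of pairwise differences with mean theta * num_sites * t, where theta is an assumed mutation rate per site per unit of coalescence time. It ignores recombination entirely and is not the Gamma-SMC implementation; all names are hypothetical.

```python
def gamma_posterior(shape, rate, num_differences, num_sites, theta):
    """Update a Gamma(shape, rate) belief about the coalescence time t.

    Assumes the number of pairwise differences over num_sites sites is
    Poisson with mean theta * num_sites * t, and that no recombination
    occurs within the stretch (an illustrative simplification).
    """
    return shape + num_differences, rate + theta * num_sites


def gamma_mean(shape, rate):
    """Mean of a Gamma distribution in the shape/rate parameterisation."""
    return shape / rate


def gamma_variance(shape, rate):
    """Variance of a Gamma distribution in the shape/rate parameterisation."""
    return shape / rate ** 2
```

Starting from an Exponential(1) prior, which is Gamma(1, 1), observing 3 differences over 1000 sites with theta = 0.001 gives a Gamma(4, 2) posterior with mean 2. The belief is carried by two numbers throughout, rather than by a vector over discretised time intervals.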
P50 A Genomics England reference panel of 156,390 haplotypes and the improved imputation of UK Biobank

Sinan Shi [1], Simone Rubinacci [2], Loukas Moutsianas [3, 4], Alex Stuckey [3], Mark Caulfield [3, 4], Simon Myers [1], Jonathan Marchini [5]
Affiliations: [1] University of Oxford, [2] University of Lausanne, [3] Genomics England, [4] Queen Mary University of London, [5] Regeneron Genetics Center
Topics: Methods for GWAS, Assembly and variant identification
Keywords: Imputation, Rare Variants Analysis, GWAS
The Genomics England (GEL) 100,000 Genomes Project has sequenced over 85,000 genomes across England. We generated a GEL haplotype reference panel comprising 342 million autosomal variants and 156,390 haplotypes from 79,195 individuals of diverse ancestries. We exploit both the sample size and the relatedness structure among individuals, 61.3% of whom have at least one sequenced first-degree relative, to allow high-precision haplotype phasing. We observe improvements in imputation performance in some populations. In samples of British origin, mean imputation r^2 at an allele frequency of 1e-4 is 0.45, 0.67 and 0.74 when using the HRC, TOPMed and GEL reference panels, respectively. In samples of South Asian origin, mean imputation r^2 at an allele frequency of 1e-4 is 0.04, 0.24 and 0.61 when using the HRC, TOPMed and GEL reference panels, respectively. We then used the reference panel to impute the UK Biobank dataset, which was previously imputed at 39 million variants using an HRC+UK10K reference panel. Mean info scores across SNPs in both the GEL and HRC+UK10K reference panels were 0.65 and 0.61, respectively. In the lower allele frequency bins the differences are larger. For example, for SNPs with allele frequency between 0.01% and 0.1%, mean info scores were 0.88 and 0.66 for the GEL and HRC+UK10K reference panels, respectively. The GEL-imputed UK Biobank dataset is being made available to all approved researchers of the UK Biobank. We will also report results of imputed GWAS experiments on height, BMI and blood traits.
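For readers unfamiliar with the accuracy metric quoted above, imputation r^2 at a variant is commonly computed as the squared Pearson correlation between true genotypes and imputed dosages across samples. The following is a minimal sketch of that common definition; the function name is hypothetical, and the exact pipeline used to produce the numbers in this abstract is not described here.

```python
def imputation_r2(true_genotypes, imputed_dosages):
    """Squared Pearson correlation between true genotypes and imputed dosages.

    Both inputs hold one value per sample for a single variant, e.g. true
    genotypes coded 0/1/2 and dosages in the interval [0, 2].
    """
    n = len(true_genotypes)
    if n == 0 or n != len(imputed_dosages):
        raise ValueError("inputs must be non-empty and of equal length")
    mean_true = sum(true_genotypes) / n
    mean_dose = sum(imputed_dosages) / n
    cov = sum((t - mean_true) * (d - mean_dose)
              for t, d in zip(true_genotypes, imputed_dosages))
    var_true = sum((t - mean_true) ** 2 for t in true_genotypes)
    var_dose = sum((d - mean_dose) ** 2 for d in imputed_dosages)
    if var_true == 0 or var_dose == 0:
        # r^2 is undefined when either vector is constant (e.g. monomorphic).
        return float("nan")
    return cov * cov / (var_true * var_dose)
```

Per-variant values of this kind are typically averaged within allele frequency bins, which is how figures such as a mean r^2 at a given allele frequency are usually reported.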
P51 Demographic inference for temporally structured samples using the coalescent with recombination

Casper Siu, Aylwyn Scally
Affiliations: University of Cambridge
Topics: Population genetics
Learning demographic history from present-day and ancient molecular sequence data simultaneously may require modelling the temporal separation of samples. We use the ancestral recombination graph at two loci along a pair of sequences to analyse the effects of temporal separation on inference frameworks that exploit patterns of haplotype variation. (1) We develop a recursive backwards-in-time method to compute a range of two-locus statistics under time-stratification and a piecewise-constant population size history. We find that previously inferred population size trajectories alone fail to predict these two-locus statistics in time-stratified European whole-genome sequences. (2) We consider the problem of inferring population size history from pairwise coalescence time data along a present-day and an ancient sequence. In principle, these data are also informative about population size in the time period more recent than the age of the ancient sample. By simulation, we estimate inferential limits for different recombination rate and ancient sample age parameter regimes.
P53 bacRelate: Powerful inference of bacterial evolution by building genealogies

Leo Speidel [1,2], Hongjin Wu [3], Chao Yang [3], Daniel Falush [3]
Affiliations: [1] Genetics Institute, University College London, UK, [2] Francis Crick Institute, UK, [3] Institut Pasteur of Shanghai, Chinese Academy of Sciences, People’s Republic of China
Topics: Population genetics, Phylogenetics, Pathogen genomics
Keywords: Genealogies, Bacteria
Bacteria are our constant companions and our interactions with them determine whether we maintain good health or succumb to disease. The basis of all genomic analysis is a description of how organisms are related to each other. Bacteria evolve by mutation and by incorporation of DNA from other lineages into their genomes. Each change has the potential to modify the phenotypic properties of a bacterium, as well as providing a marker that can be used to reconstruct how it and the DNA inside it spread from host to host. We have developed a new approach for inferring the genealogical relationships of bacteria from their observed genetic variation by adapting Relate, which was originally designed to infer genome-wide genealogies in humans at scale. In our new approach, we account for the differences in how genetic material is transmitted in humans and bacteria by jointly inferring genome-wide clonal relationships and local genealogical trees incorporating non-clonal inheritance through recombination. This allows us to detect the shorter recombination tracts common in bacteria with higher precision, and will allow us to accurately map patterns of gene flow in different parts of bacterial genomes and infer selective pressures using tens of thousands of sequenced strains.
P54 Genome-wide classification of epigenetic signal reveals regions of enriched heritability in complex immune traits

Miriam Stricker [1,+], Calliope Dendrou [4], Weijiao Zhang [4], Wei-Yi Cheng [2], Satu Nahkuri [3,+,*], Pier Francesco Palamara [1,4,+,*]
Affiliations: [1] Statistics Department, University of Oxford, 24-29 St Giles’, Oxford, OX1 3LB, United Kingdom, [2] Roche Pharma Research & Early Development Informatics, Roche Innovation Center New York, 7th Floor, 150 Clove Road, Little Falls, New Jersey 07424, USA, [3] Roche Pharma Research & Early Development Informatics, Roche Innovation Center Zürich, Roche Glycart AG, Wagistrasse 10, 8952 Schlieren, Switzerland, [4] Wellcome Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN, United Kingdom. [+] Contact: miriam.stricker@wolfson.ox.ac.uk, satu.nahkuri@roche.com, palamara@stats.ox.ac.uk. [*] Jointly supervised this work
Topics: Applications to cancer and other diseases, Functional genomics
Keywords: Human Immune System, Epigenetics
Epigenetics is known to play a key role in the regulation of adaptive and innate immune-system (AIIS) relevant genes that dictate the therapeutic course for many diseases, but the link between high-dimensional epigenetic data and AIIS traits has not been fully explored. We built a convolutional neural network (ConvNet) classifier, called EpiNN, that leverages genome-wide epigenetic data to detect AIIS-relevant regions. We trained EpiNN using ChromHMM data for a handcrafted list of 477 AIIS genes (AUPRC=0.94) and scanned the human genome for new AIIS regions to build a genome-wide annotation of predicted AIIS relevance. Regions with predicted relevance >0.5 (N=1964) were enriched for immune-related function (GO PANTHER p=7e-21). We used LD-score regression coupled with genome-wide association statistics for 176 traits (avg N=262k) and detected significant heritability enrichments (|τ*| p<0.05/176) for 20 out of 26 immune-related traits. These effects remained significant after conditioning on 97 functional and evolutionary annotations, as well as the Roadmap and ChromHMM annotations used to train EpiNN. In a meta-analysis of 64 independent traits, immune-related traits and diseases were 3.45x more enriched for heritability (p=9e-92) than non-immune system traits, highlighting the specificity of AIIS-relevant regions detected using EpiNN. These results underscore the promise of leveraging supervised learning algorithms and epigenetic data to detect regions implicated in specific classes of heritable traits.
P55 Inferring demographic history from allele frequency spectra with multi-layer perceptron regressors

Linh N. Tran[1,2], Connie K. Sun[2], Mathews Sajan[2], Ryan N. Gutenkunst[2]
Affiliations: [1] Genetics Graduate Interdisciplinary Program, [2] Department of Molecular and Cellular Biology, University of Arizona
Topics: Population genetics, Deep learning in genomics
Keywords: Demographic Inference, Machine Learning
Our group previously developed dadi, a software package for inferring demographic history. Using dadi requires considerable understanding of the software and can be computationally expensive. In this work, we aim to improve ease of use and lower the computational burden for users through supervised machine learning. For each dadi-supported demographic model, we simulate the expected allele frequency spectrum (AFS) under different demographic parameter values with dadi and train the scikit-learn Multi-layer Perceptron Regressor (MLPR) algorithm to infer these parameters from an input AFS. We demonstrate that the trained MLPRs can infer the population-size-change parameters very well (ρ≈0.98) and other parameters, such as migration rate and time of demographic event, fairly well (ρ≈0.6-0.7). The trained MLPRs also perform well when tested on AFS generated by the msprime simulator, which includes linkage. Importantly, our trained MLPRs provide parameter predictions instantaneously from an input AFS, with accuracy comparable to that of parameters inferred by dadi’s likelihood optimization, while bypassing its computationally intensive evaluation process. We also implement an accompanying method for quantifying the uncertainty of the point estimates output by the trained regressors, using a scikit-learn-compatible package, MAPIE (Model Agnostic Prediction Interval Estimator). We show that this method provides better coverage for all demographic parameters tested compared to traditional bootstrapping.
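The AFS that serves as the regressor input is a simple summary of the data. As an illustration only, the sketch below tabulates an unfolded, single-population AFS from a small haplotype matrix in plain Python. The function name and data layout are assumptions for this example; dadi provides its own spectrum objects, which are not used here.

```python
def allele_frequency_spectrum(haplotypes):
    """Tabulate the unfolded AFS of a sample of haplotypes.

    haplotypes is a list of n equal-length lists of 0/1 alleles, where 1
    marks the derived allele. The result has n + 1 entries; entry k counts
    the sites at which exactly k of the n haplotypes carry the derived allele.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    num_sites = len(haplotypes[0])
    if any(len(h) != num_sites for h in haplotypes):
        raise ValueError("all haplotypes must have the same number of sites")
    afs = [0] * (n + 1)
    for site in range(num_sites):
        derived_count = sum(h[site] for h in haplotypes)
        afs[derived_count] += 1
    return afs
```

For the three haplotypes [0, 1, 1, 0], [0, 1, 0, 0] and [1, 1, 0, 0], the function returns [1, 2, 0, 1]: one site with no derived copies, two singletons, no doubletons, and one fixed site. A vector of this form, suitably normalised, is the kind of input from which a regressor would predict demographic parameters.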
P56 Inference of pairwise coalescence times and allele ages using deep neural networks

Zoi Tsangalidou, Juba Nait Saada [1], Zoi Tsangalidou [1], Miriam Stricker [1], Pier Francesco Palamara [1,2]
Affiliations: [1] Department of Statistics, University of Oxford, Oxford, UK, [2] Wellcome Centre for Human Genetics, University of Oxford, Oxford, UK
Topics: Population genetics, Deep learning in genomics
Keywords: Deep Learning, Interpretability, Genomic Variant Dating, Heritability
Accurate inference of time to the most recent common ancestor (TMRCA) between pairs of individuals and of the age of genomic variants is key in several population genetic analyses, but requires complex analytical modeling. We developed CoalNN, a likelihood-free approach which uses deep neural networks to predict pairwise TMRCAs, recombination breakpoints, and allele age from sequencing, SNP array, or imputed genotype data. CoalNN is trained through simulation, can be adapted to varying parameters such as demographic history, recombination rate maps, genotyping or phasing error rates, and scales to large data sets. CoalNN improved over other available methods for TMRCA and allele age prediction in several simulated scenarios. We used saliency maps and input perturbations to demonstrate the interpretability of its predictions, and transfer learning to efficiently train it on multiple demographic models. We inferred allele ages in all 26 populations from the 1000 Genomes Project data and observed high correlation with published estimates (ARGweaver: r=0.66, Relate: r=0.81). CoalNN MAF-adjusted allele age estimates were also correlated with other evolutionary annotations (e.g. ASMC-avg: r=0.30, LLD-AFR: r=0.31) and were predictive of enriched heritability across multiple complex traits in an LD-score regression analysis. These results underscore the efficacy of using deep neural networks to analyze genealogical relationships in large genomic data sets.
P57 Evolutionary genomics of the Motacilla alba (white wagtails) radiation

Per Unneberg [1], Erik Enbody [2], Tom van der Valk [1], Leif Andersson [1], Per Alström [1]
Affiliations: [1] Uppsala University, [2] UC Santa Cruz
Topics: Population genetics, Natural selection, Assembly and variant identification, Phylogenetics
Keywords: Tree Sequence
The Motacilla wagtails are a classic example of an adaptive radiation that has resulted in an enormous diversity of ornamented male plumage in a short time period. The White Wagtail, Motacilla alba, is a widespread species comprising as many as eight phenotypically defined subspecies across the Northern Hemisphere. Despite this high phenotypic variation, little genetic differentiation is seen between subspecies when sampling thousands of genetic loci, which has made the evolutionary history of Motacilla wagtails difficult to resolve. However, these characteristics provide a potentially powerful model system to describe the molecular basis of sexual selection in a widespread adaptive radiation. Recently, tree sequences have been introduced as a means to store the genetic data for entire populations. In particular, since the data represent the genealogies for all loci in the genome, they can be used to address questions related to population structure and history, as implemented in the method tsinfer. As these tree structures capture haplotype genealogies, they are particularly well suited for investigating recent evolutionary events. Here, we apply tsinfer to the Motacilla wagtails to investigate ancestry and population history. In addition, we attempt to shed light on some inconsistencies in the phylogeny that traditional population genetic approaches fail to resolve, with a particular focus on recent evolutionary history.
P58 Understanding Human Phylogeography Through Reconstructed Genealogies

Andrew Vaughn, Rasmus Nielsen
Affiliations: UC Berkeley
Topics: Population genetics, Phylogenetics
Keywords: Ancestral Recombination Graph, Ancient Human History, Phylogeography
To understand human prehistory, scientists have utilized many diverse disciplines, including linguistics and archaeology. However, with the advent of whole-genome sequencing, and the ever-increasing number of high-quality genomes available, genetic analysis has emerged as a new way to shed light on human history in a way never before possible. This is because hidden in every person’s genome is evidence of their genealogical relationships to other individuals. With the development of methods to infer ancestral recombination graphs, as well as the sequencing of more and more ancient genomes, the information potential of genetic data has been greatly increased. We here use techniques from phylogenetics to model geographic location as a trait evolving along a phylogeny, estimate transition rates across a network of geographic locations, and estimate the locations of an individual’s ancestors. This approach can help answer many questions about human history, such as the timing of different admixture events and the geographic history of selected variants. A benefit of this approach is that we do not rely on a priori models of admixture or the classification of samples into different historic groups. Overall, this presents a very flexible framework for understanding human history that leverages the geographic and temporal diversity of sequences at our disposal.
P59 Sparse Bayesian unbounded linear models with unknown design over finite alphabets

Yuexuan Wang [1], Andreas Futschik [1], Ritabrata Dutta [2]
Affiliations: [1] Department of Applied Statistics, Johannes Kepler University Linz, Austria, [2] Department of Statistics, University of Warwick, Coventry, UK
Topics: Population genetics
Keywords: Haplotype Reconstruction, Bayesian, Unknown Alphabet
In population genetics, the haplotype structure can provide crucial information. However, for many sequencing methods, such as pool sequencing, we only have allele frequency data at hand rather than haplotype information. A method to reconstruct the unknown haplotype structure $S$ and haplotype frequencies $\omega$ from the observed allele frequency data matrix $Y = S\omega+\varepsilon$ was recently proposed by Pelizzola et al. (2021). There, $Y\in [0,1]^{N\times T}$ contains relative allele frequencies for $N$ SNPs from $T$ samples. Since this approach leads only to point estimates, we provide a Bayesian approach to this problem. More specifically, we propose a hierarchical Bayesian model with carefully calibrated hyperparameters and hyper-priors that also gives us credible intervals. In our case, the joint estimation is not unique if we do not place any constraint on the reconstruction. To achieve identifiability in Bayesian inference, we introduce a shrinkage prior. For the situation where the number of haplotypes is unknown, we perform model selection within our Bayesian framework to choose the number of haplotypes adaptively.
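To make the linear model concrete, the noise-free part of $Y = S\omega+\varepsilon$ can be evaluated directly. The sketch below assumes $S$ is an $N\times K$ binary matrix (one column per haplotype, with $K$ a symbol introduced here for the number of haplotypes) and $\omega$ is a $K\times T$ matrix whose columns sum to one. It shows only the forward model, not the Bayesian reconstruction described in the abstract, and the function name is hypothetical.

```python
def expected_allele_frequencies(S, omega):
    """Noise-free allele frequencies Y = S * omega.

    S: N x K list of lists with 0/1 entries; S[i][k] is the allele carried
       by haplotype k at SNP i.
    omega: K x T list of lists; omega[k][t] is the frequency of haplotype k
       in sample t, with each column summing to one.
    Returns an N x T list of lists of allele frequencies.
    """
    num_snps = len(S)
    num_haplotypes = len(omega)
    num_samples = len(omega[0])
    if any(len(row) != num_haplotypes for row in S):
        raise ValueError("each row of S needs one entry per haplotype")
    return [
        [sum(S[i][k] * omega[k][t] for k in range(num_haplotypes))
         for t in range(num_samples)]
        for i in range(num_snps)
    ]
```

With S = [[1, 0], [0, 1], [1, 1]] and omega = [[0.25, 0.5], [0.75, 0.5]], the result is [[0.25, 0.5], [0.75, 0.5], [1.0, 1.0]]. The inverse problem is harder because both S and omega are unknown, which is why constraints or priors are needed for identifiability.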
P60 Timesweeper: Detecting positive selection using population genomic time series

Logan Whitehouse, Dan Schrider
Affiliations: UNC Chapel Hill
Topics: Population genetics, Natural selection, Deep learning in genomics
Keywords: Deep Learning, Positive Selection
Despite decades of research, identifying selective sweeps, the genomic footprints of positive selection, remains a core problem in population genetics. Of the many methods that tackle this problem, few take advantage of the potential of genomic time-series data. This is because in most population genetic analyses only a single time point can be sampled. Recent advancements in sequencing technology, including improvements in extracting and sequencing ancient DNA, have made repeated samplings possible. With these advances in mind, we present Timesweeper, a fast convolutional neural network (CNN)-based tool for identifying sweeps in population genomic time series. Timesweeper works by first simulating time-series data under a user-specified demographic model, then training a 1D CNN on the simulated data, and finally inferring which variants in a real dataset were the direct target of a completed or ongoing sweep. We show that Timesweeper is powerful under multiple demographic and sampling scenarios, works with low-coverage ancient DNA, finds selected variants with high resolution, and has high tolerance for model misspecification. In sum, we show that more accurate inferences of sweeps are possible with genomic time series, as will increasingly be feasible in coming years due to both the sequencing of ancient samples and repeated samplings of populations with short generation times. Methods like Timesweeper will thus help to resolve the controversy over the role of positive selection in the genome.
P61 Joint likelihood-free inference of number of selected SNPs and selection coefficient in an evolving population

Yuehao Xu [1], Andreas Futschik [1], Ritabrata Dutta [2]
Affiliations: [1] Department of Applied Statistics, Johannes Kepler University Linz, Austria, [2] Department of Statistics, University of Warwick, Coventry, UK
Topics: Population genetics
Keywords: Approximate Bayesian Computation, Selection, Population Genetics
Developments in software tools and methodology for likelihood-free inference have brought great potential to inferential studies in population genetics. Approximate Bayesian Computation (ABC) is one of the most popular likelihood-free methods used to infer parameters when the likelihood function is intractable. A common goal in population genetics is the inference of selection coefficients. Some approaches use only data from the current state of the population, but temporal data capture more information about the evolutionary forces. To the best of our knowledge, there has been no detailed investigation of methodology that infers the number of selected SNPs from temporal data. In this work, we present an application of Simulated Annealing ABC (SABC) to inferring the number of selected SNPs and the corresponding selection coefficients from temporal allele frequency data, using summary statistics that are informative about the parameters of interest. The discrepancy is measured by adaptive $\ell_1$-penalized logistic classification. We show that our method can accurately estimate the selection coefficient and the number of selected targets across different population parameters, and that it also provides uncertainty quantification. A comparison with a number of existing methods, such as WFABC (Foll et al., 2015) and CLEAR (Iranmehr et al., 2017), is presented.
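The general idea of ABC on temporal allele frequency data can be sketched with plain rejection sampling, which is deliberately simpler than the Simulated Annealing ABC used in this work. The example assumes a single locus, a haploid Wright-Fisher model, a uniform prior on the selection coefficient, and a mean absolute difference between trajectories as the discrepancy. All function names, the prior range, and the distance are illustrative assumptions.

```python
import random


def simulate_trajectory(s, p0, pop_size, generations, rng):
    """Haploid Wright-Fisher allele frequencies under selection coefficient s."""
    p = p0
    trajectory = [p]
    for _ in range(generations):
        # Deterministic change from selection (relative fitness 1 + s).
        p_selected = p * (1.0 + s) / (1.0 + s * p)
        # Genetic drift: binomial resampling of pop_size individuals.
        count = sum(1 for _ in range(pop_size) if rng.random() < p_selected)
        p = count / pop_size
        trajectory.append(p)
    return trajectory


def distance(trajectory_a, trajectory_b):
    """Mean absolute difference between two equal-length trajectories."""
    return sum(abs(a - b) for a, b in zip(trajectory_a, trajectory_b)) / len(trajectory_a)


def abc_rejection(observed, p0, pop_size, generations, num_draws, tolerance, rng):
    """Keep prior draws of s whose simulated trajectory is close to the data."""
    accepted = []
    for _ in range(num_draws):
        s = rng.uniform(0.0, 0.5)  # assumed uniform prior on s
        simulated = simulate_trajectory(s, p0, pop_size, generations, rng)
        if distance(simulated, observed) <= tolerance:
            accepted.append(s)
    return accepted
```

The accepted values approximate the posterior distribution of s, so their spread gives a crude form of uncertainty quantification. Annealing-based variants aim to reach small tolerances with fewer wasted simulations than this basic scheme.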
P62 Improved Polygenic Risk Estimation with Deep Functional Priors

Shadi Zabad [1], Doruk Cakmakci [1], Simon Gravel [2], Yue Li [1]
Affiliations: [1] School of Computer Science, McGill University, [2] Department of Human Genetics, McGill University
Topics: Quantitative genetics, Methods for GWAS, Deep learning in genomics
Keywords: PRS, Polygenic Risk Scores, Bayesian, Deep Neural Network, Functional Annotations
With recent advances in statistical modeling and the increasing availability of large-scale biobank data, polygenic risk scores (PRSs) have the potential to become a valuable diagnostic tool for a variety of diseases, providing clinically actionable insights and enabling personalized clinical interventions. Despite their promise, with the exception of a few well-studied traits, polygenic scores have not seen wide-scale adoption in clinical practice because the accuracy and portability of existing PRS methods remain limited. Previous work has shown that Bayesian methods for PRS estimation achieve state-of-the-art predictive performance on traits with varying genetic architectures. In this work, we will discuss a new framework for scalable Bayesian polygenic risk modeling called VIPRS (Zabad et al. 2022, in preparation). VIPRS encompasses a set of Bayesian methods for PRS estimation that use Variational Inference for approximate posterior inference. Our analyses of simulated and real phenotypes in the UK Biobank demonstrate that VIPRS achieves competitive predictive performance, outperforming existing Bayesian methods in many settings. Furthermore, the efficiency of the variational approach allows us to scale VIPRS to 10 million SNPs, a major milestone for Bayesian PRS methods. Finally, we will present early results demonstrating that augmenting the baseline VIPRS model with functional and deep functional priors significantly boosts its predictive performance for many phenotypes.
P63 Rapid genotype imputation of biobank-scale whole genome sequence data using tree sequences

Shing Hei Zhan, Yan Wong, Benjamin Jeffery, Jerome Kelleher
Affiliations: Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
Topics: Population genetics
Keywords: Genotype, Imputation, Succinct Tree Sequence, Simulation, Tskit
In whole genome sequence (WGS) datasets, thousands of genotypes often cannot be confidently determined. This problem impairs our ability to find unknown links between rare genotypes and health-related traits, but it can be mitigated by imputation methods that leverage the information in high-quality reference WGS data. Standard methods require approximate solutions to the Li and Stephens model to handle large reference panels consisting of hundreds of thousands of whole genomes. The succinct tree sequence (STS) data structure, a highly compressed representation of the genetic variation and genealogies of a population of individuals, may provide a scalable, accurate approach to biobank-scale imputation. A major advantage of STS is its exact approach to imputation, whereas the standard methods take an approximate approach. In principle, STS enables highly accurate imputation across genotype categories, even the ultra-rare ones that have posed a persistent challenge to the standard methods. Here, we present tsimpute, an STS-based imputation method that is part of the tskit software ecosystem, and conduct a side-by-side comparison of tsimpute and standard methods. We use genotype data simulated under an Out-of-Africa human demographic model to assess the imputation accuracy of tsimpute and standard methods. Alongside tskit, tsimpute holds the potential to boost analysis of biobank-scale WGS data, thereby expediting biomedical research for many years to come.
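For context, the Li and Stephens model treats a query haplotype as an imperfect mosaic of reference haplotypes, and imputation follows from the posterior copying probabilities. The sketch below is a naive haploid forward-backward pass over a tiny reference panel, with a constant per-site switch probability and mismatch probability chosen for illustration. It is not tsimpute and does not use tree sequences; the function name and parameters are hypothetical.

```python
def li_stephens_impute(reference, query, recomb_prob, mismatch_prob):
    """Posterior probability of allele 1 at each site of a haploid query.

    reference: list of H haplotypes, each a list of 0/1 alleles at M sites.
    query: list of M entries, each 0, 1, or None for a missing genotype.
    mismatch_prob must be strictly positive so that no path has zero weight.
    """
    H, M = len(reference), len(query)

    def emission(h, site):
        if query[site] is None:
            return 1.0  # missing data carry no information
        match = reference[h][site] == query[site]
        return 1.0 - mismatch_prob if match else mismatch_prob

    def normalise(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass: start uniformly, then stay or switch to a random haplotype.
    forward = [normalise([emission(h, 0) / H for h in range(H)])]
    for site in range(1, M):
        prev = forward[-1]  # already normalised to sum to one
        forward.append(normalise([
            ((1.0 - recomb_prob) * prev[h] + recomb_prob / H) * emission(h, site)
            for h in range(H)
        ]))

    # Backward pass, rescaled at each site for numerical stability.
    backward = [[1.0] * H for _ in range(M)]
    for site in range(M - 2, -1, -1):
        weighted = [backward[site + 1][h] * emission(h, site + 1) for h in range(H)]
        total = sum(weighted)
        backward[site] = normalise([
            (1.0 - recomb_prob) * weighted[h] + recomb_prob * total / H
            for h in range(H)
        ])

    # Dosage: total posterior weight on reference haplotypes carrying allele 1.
    dosages = []
    for site in range(M):
        posterior = normalise([forward[site][h] * backward[site][h] for h in range(H)])
        dosages.append(sum(posterior[h] for h in range(H) if reference[h][site] == 1))
    return dosages
```

With reference haplotypes [0, 0, 0] and [1, 1, 1] and a query [1, None, 1], the missing middle site receives a posterior close to 1 because both flanking sites favour copying the second haplotype. The cost of this naive version grows linearly with the number of reference haplotypes at every site, which is one reason large panels call for approximations or more compact data structures.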
P64 Recovering signatures of ghost admixture using ancestral recombination graph

Yulin Zhang [1], Arjun Biddanda [2], Colm O’Dushlaine [2], Priya Moorjani [1,3]
Affiliations: [1] Center for Computational Biology, UC Berkeley, CA, [2] 54gene Inc., 1100 15th St NW, Washington, DC 20005, United States, [3] Department of Molecular and Cell Biology, UC Berkeley, CA
Topics: Population genetics
Keywords: Ancestral Recombination Graph, Ghost Admixture
While gene flow from archaic hominins into non-Africans has been well characterized, their contribution to Africans remains less understood. Emerging evidence suggests most African groups derive ~2–20% of ancestry from an unknown “ghost” archaic. The underlying model of “ghost” admixture remains unclear, as current methods rely on archaic genomes or an outgroup, neither of which is available. With the availability of new methods to reconstruct ancestral recombination graphs (ARGs), it is now possible to reconstruct the entire evolutionary history of a set of sequences. Here we introduce a new approach to characterize archaic ancestry using the ARG built upon modern human genomes alone. Our method leverages two key features: “long branches” (deep lineages in the coalescent tree) and “long haplotypes” (characteristic of recent gene flow). By conducting simulations under different demographic scenarios, we show the precision and recall of our method using true ARGs as well as ARGs inferred using Relate and ARGweaver. Our method outperforms other available methods for detecting archaic admixture when true ARGs are used. However, simulations show that inferred ARGs have low sensitivity: only a small fraction of archaic segments can be reliably recovered. These results highlight the power and limitations of current ARG-based methods to recover archaic ancestry in modern humans, but provide a positive outlook for detecting archaic admixture as ARG inference methods improve.
P65 distAngsd: Fast and accurate inference of genetic distances for Next Generation Sequencing data

Lei Zhao [1], Rasmus Nielsen [2], Thorfinn Sand Korneliussen [1]
Affiliations: [1] Section for Geogenetics, Globe Institute, University of Copenhagen, [2] Departments of Integrative Biology and Statistics, UC Berkeley
Topics: Population genetics, Phylogenetics
Keywords: Phylogeny Reconstruction, Genotype Likelihood, Genetic Distance, High Throughput Sequencing, Next Generation Sequencing, Molecular Evolution
Commonly used methods for inferring phylogenies were designed before the emergence of high throughput sequencing and generally cannot accommodate the challenges associated with noisy, diploid sequencing data. In many applications, diploid genomes are still treated as haploid through the use of ambiguity characters, while the uncertainty in genotype calling that arises as a consequence of the sequencing technology is ignored. To address this problem, we describe two new probabilistic approaches for estimating genetic distances, distAngsd-geno and distAngsd-nuc, both implemented in a software suite named distAngsd. These methods are specifically designed for next generation sequencing data, utilize the full information from the data, and take uncertainty in genotype calling into account. Through extensive simulations, we show that these new methods are markedly more accurate and have more stable statistical behavior than other currently available methods for estimating genetic distances, even for very low depth data with high error rates.
P66 A tree-based algorithm for identifying mutator alleles from genetic variation data

Luke Zhu [1], Kelley Harris [2]
Affiliations: [1] University of Washington, Department of Bioengineering, Seattle, WA, [2] University of Washington, Department of Genome Sciences, Seattle, WA
Topics: Population genetics, Quantitative genetics
Keywords: Ancestral Recombination Graph
Germline mutator alleles are genetic variants that increase the overall germline mutation rate and have been implicated in driving the differences in germline mutation rate and spectra between species. However, they have long eluded detection. In this work, we propose a tree-based algorithm for identifying their presence from genetic variation data. At the mutator locus, edges within the subtree that subtends the mutator allele (the mutator subtree) should carry a signal of elevated mutation rate. Given that adjacent trees in the tree sequence are highly correlated, we anticipate that a subset of the edges originally within the mutator subtree will stay under the influence of the mutator as we move away from the focal tree. We therefore developed an algorithm to identify all such edges from the tree sequence surrounding the mutator allele locus. Through simulation in msprime, we found that we consistently obtained 10x as much edge length with our approach as with analysis of the focal tree alone, potentially giving us more statistical power to identify mutator alleles. To verify our approach, we performed a mutator simulation in SLiM and recorded the tree sequence. Additionally, we labelled all the edges in the SLiM tree sequence that remained in linkage with the mutator. We found that the set of SLiM edges showed good agreement with the set of edges obtained with our tree-based algorithm.
P67 Effective interpretation of GWAS based on sequence-conserved regulatory elements

Xiang Zhu
Affiliations: The Pennsylvania State University
Topics: Population genetics, Quantitative genetics, Methods for GWAS, Functional genomics
Most findings from genome-wide association studies (GWAS) in humans map to noncoding regions. Approximately 3.5% of noncoding regions are conserved between human and mouse, suggesting a new direction to interpret GWAS results. However, a comprehensive characterization of conserved noncoding sequences in GWAS is still lacking. Here we report significant GWAS signals residing in active enhancer-like elements (AELEs) with high sequence conservation between human and mouse. We developed a simple method to identify conserved AELEs from epigenomes of 105 human cell types and tissues. Heritability enrichment analyses of 468 GWAS showed that conserved AELEs were significantly enriched in heritability, conditioning on all AELEs and 96 annotations. We observed stronger enrichments when restricting to conserved AELEs derived from trait-relevant contexts only. Similarly, integration with fine-mapping studies showed concentration of causal variants in context-specific conserved AELEs. We also prioritized variants in conserved AELEs and identified many trait-associated variants that were not identified in standard GWAS or prioritization of all AELEs. De novo motif analyses at the prioritized variants further identified trait-relevant transcription factors, revealing novel regulatory networks at conserved AELEs. In summary, this study demonstrates an effective interpretation of GWAS based on sequence-conserved regulatory elements and establishes a new resource in diverse cell types.
P68 Co-evolution based machine-learning for predicting functional interactions between human genes

Or Zuk, Doron Stupp, Elad Sharon, Idit Bloch, Marinka Zitnik, Yuval Tabach
Affiliations: Dept. of Statistics and Data Science, The Hebrew University of Jerusalem
Topics: Applications to cancer and other diseases, Functional genomics, Phylogenetics
Keywords: Gradient Boosting, Phylogeny, Protein Function Predictions, Phylogenetic Profiling, Co-evolution
Over the next decade, more than a million eukaryotic species are expected to be fully sequenced. This has the potential to improve our understanding of genotype and phenotype crosstalk, gene function and interactions, and to answer evolutionary questions. Here, we develop a machine-learning approach for utilizing phylogenetic profiles across 1154 eukaryotic species. This method integrates co-evolution across eukaryotic clades to predict functional interactions between human genes and the context for these interactions. We benchmark our approach, showing a 14% performance increase (auROC) compared to previous methods. Using this approach, we predict functional annotations for less-studied genes. We focus on DNA repair and verify that 9 of the top 50 predicted genes have been identified elsewhere, with others previously prioritized by high-throughput screens. Overall, our approach enables better annotation of function and functional interactions and facilitates the understanding of evolutionary processes underlying co-evolution.
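For readers unfamiliar with phylogenetic profiling, the sketch below shows the basic signal such methods build on: genes whose presence or absence pattern across species is correlated are candidates for a functional interaction. It is only a baseline illustration, not the machine-learning method described in the abstract; the gene labels, the six-species profiles, and the function names profile_correlation and rank_partners are hypothetical.

```python
from math import sqrt


def profile_correlation(a, b):
    """Pearson correlation between two equal-length phylogenetic profiles."""
    if len(a) != len(b) or not a:
        raise ValueError("profiles must be non-empty and of equal length")
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant profile carries no co-evolution signal
    return cov / sqrt(var_a * var_b)


def rank_partners(profiles, query):
    """Rank all other genes by profile correlation with the query gene."""
    scores = {
        gene: profile_correlation(profiles[query], profile)
        for gene, profile in profiles.items()
        if gene != query
    }
    return sorted(scores, key=scores.get, reverse=True)


# Toy presence (1) / absence (0) profiles across six hypothetical species.
profiles = {
    "GENE_A": [1, 1, 0, 0, 1, 0],
    "GENE_B": [1, 1, 0, 0, 1, 0],  # same pattern as GENE_A
    "GENE_C": [0, 0, 1, 1, 0, 1],  # opposite pattern
    "GENE_D": [1, 0, 1, 0, 1, 0],  # partly overlapping pattern
}
ranking = rank_partners(profiles, "GENE_A")
```

A supervised approach would treat scores like these, computed per clade, as features for a classifier rather than as the final prediction.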
P69 Polygenic embryo screening for multiple continuous and disease traits

Or Zuk, Shai Carmi, Daniel Backenroth, Ehud Karavani, Todd Lencz
Affiliations: Dept. of Statistics and Data Science, The Hebrew University of Jerusalem, Israel
Topics: Population genetics, Quantitative genetics, Applications to cancer and other diseases
Keywords: Polygenic Scores, Multivariate Gaussian, Embryo Selection, Complex Traits, Extreme Value Theory
The increasing proportion of variance in human complex traits explained by polygenic scores, along with progress in preimplantation genetic diagnosis, suggests the possibility of screening embryos for continuous traits such as height or for genetic liability to adult diseases. However, comprehensive modeling of expected benefits and risks is lacking. We jointly model the polygenic risk of multiple diseases and traits for multiple embryos, using a multivariate Gaussian model for the risk score, together with the liability threshold model for diseases. Using this model, we develop numeric computations and asymptotic formulas for the expected gains when screening for multiple traits simultaneously as a function of the genetic variance explained and genetic correlations between traits, the number of embryos screened, and the parents' risk score and disease status. Based on joint work with Shai Carmi, Daniel Backenroth, Ehud Karavani, Todd Lencz and others.
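To give a feel for how the expected gain depends on the number of embryos and the variance explained, the sketch below simulates the simplest special case: a single continuous trait under a Gaussian score model. It is not the multi-trait model of the abstract. It assumes that the scores of sibling embryos vary around the parental mean with variance r2/2 in trait-variance units, where r2 is the proportion of trait variance explained by the score; that assumption and the function name simulate_expected_gain are introduced here for illustration.

```python
import random
import statistics


def simulate_expected_gain(n_embryos, r2, n_families=20000, seed=0):
    """Monte Carlo estimate of the mean gain, in trait SD units, from
    choosing the top-scoring embryo instead of an average one."""
    rng = random.Random(seed)
    # Assumption: within-family score variance is r2 / 2.
    sd_within = (r2 / 2) ** 0.5
    gains = []
    for _ in range(n_families):
        scores = [rng.gauss(0.0, sd_within) for _ in range(n_embryos)]
        gains.append(max(scores) - statistics.fmean(scores))
    return statistics.fmean(gains)


# Gain for 2, 5 and 10 embryos when the score explains 25% of the variance.
gains_by_n = {n: simulate_expected_gain(n, r2=0.25) for n in (2, 5, 10)}
```

Under these assumptions the gain grows slowly with the number of embryos, because it scales with the expected maximum of n standard normal draws, and it scales with the square root of r2.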

Presenter index

A

Patrick K. Albers, Luke Anderson-Trocmé, Ipsita Agarwal, Siddharth Avadhanam

B

Débora Y. C. Brandt, Jennifer Blanc, Tom Bartlett, Franz Baumdicker, Ronja Jessica Billenstein, Marilou Boddé, Rui Borges

C

David Castellano, Ryan Christ, Claudia Serrano Colome, Trevor Cousins

D

Yun Deng

F

Lino A.F. Ferreira, María Inés Fariello, Martina Favero, Romain Fournier, Andreas Futschik

G

Léa Guyon, Genís Garcia-Erill, Alicia Gill, Jeremy Guez, Ryan Gutenkunst

H

Margaux L.A. Hujoel, Kristian Hanghoej, Xin Huang, Xin Huang, Yilei Huang

I

Anastasia Ignatieva

J

Ben Jeffery, Kang Jin

K

Carolin Kosiol, Jere Koskela, Ava Khamseh, Georgios Kalantzis, Sakshi Khaiwal

L

Ragnhild Laursen, Brieuc Lehmann, Qunhua Li, Zilong Li

M

Marc de Manuel, Jilong Ma, Moisès Coll Macià, Rahul Mehta, Caetano Souto Maior Mendes, Ziyi Mo, Alan Murphy

N

Ekaterina Noskova

O

Louis Ollivier

P

Alice Pearson, Verónica Miró Pina, Pier Palamara, Ferdinand Petit, Martin Petr, Nathaniel S. Pope

R

Iker Rivas-González, Peter Ralph, Guillaume P. Ramstein, Malthe Sebro Rasmussen, Elena Rivas

S

Armin Scheben, Théophile Sanchez, Erik Fogh Sørensen, Michal Sadowski, Jens Sannerud, Jaromir Sant, Regev Schweiger, Sinan Shi, Casper Siu, Leo Speidel, Miriam Stricker

T

Hélène Tonnelé, Linh N. Tran, Zoi Tsangalidou

U

Per Unneberg

V

Andrew Vaughn

W

April Wei, Yan Wong, Yuexuan Wang, Logan Whitehouse

X

Alexander Xue, Yuehao Xu

Z

Shadi Zabad, Shing Hei Zhan, Yulin Zhang, Lei Zhao, Luke Zhu, Xiang Zhu, Or Zuk, Or Zuk