Phylogenetic trees define a metric space over their vertices, an observation that underlies distance-based phylogenetic inference. Several authors, including Layer and Rhodes (2017), have noted that the leaves of a phylogenetic tree can be embedded into high-dimensional Euclidean spaces in a way that minimizes (or avoids) the distortion of the tree distances. In particular, Jiang et al. (2021) used a deep learning approach to build a mapping from the space of sequences to Euclidean space such that the mapped sequences accurately preserve the leaf distances on the reference tree. Their tool, DEPP, uses this map to place a new query sequence onto the tree by first embedding it, an idea that has proved particularly promising for updating a species tree given data from a single gene despite the potential discordance between the gene tree and the species tree. In focusing on Euclidean spaces, these recent papers have ignored strong theory suggesting that hyperbolic spaces are more appropriate for embedding the vertices of a tree. In this paper, we show that by moving to hyperbolic spaces, we can train neural network models that achieve the same distortion of distances with fewer dimensions. The distortion of distances obtained using hyperbolic embeddings is lower than that of Euclidean embeddings with the same number of dimensions, both in training (backbone) and in testing (query). The low-distortion embedding of distances results in better topological accuracy than the Euclidean counterpart when updating species trees using a single gene. For high-dimensional embeddings, however, the two approaches achieve similar accuracy in placing queries. This may be due to the resilience of distance-based placement to noise, the challenge of hyperparameter tuning, and the nonlinear (and unstable) functions involved in training hyperbolic neural networks.
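The geometric intuition can be illustrated with the Poincaré-ball metric commonly used for hyperbolic embeddings. The sketch below is not the DEPP training code; the function name and toy points are our own, and the only claim is the standard closed form of the Poincaré distance:

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball.

    Distances grow rapidly near the boundary, which is what lets
    hyperbolic space accommodate the exponential growth of tree metrics
    with fewer dimensions than Euclidean space requires.
    """
    sq = lambda x: sum(xi * xi for xi in x)
    diff = sq([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq(u)) * (1.0 - sq(v))
    return math.acosh(1.0 + 2.0 * diff / denom)

# Points near the boundary are much farther apart than their
# Euclidean separation alone suggests.
near_origin = poincare_distance([0.1, 0.0], [-0.1, 0.0])
near_boundary = poincare_distance([0.99, 0.0], [-0.99, 0.0])
```

Embedding tree leaves near the boundary is how a low-dimensional hyperbolic space can mimic the additive distances of a large tree.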
Partitioning genomes into syntenic blocks has many uses in comparative genomics, such as inferring phylogenies or ancestral gene order. These blocks are usually required to contain enough genes to qualify as syntenic. This leads to the problem of finding a common partition of the genomes in which the sizes of the blocks are above a certain threshold (usually at least two). When this is not feasible, one can ask to remove a minimum number of “noisy” genes so that such a partition exists. This is known as the Strip Recovery problem and is similar to the well-known Minimum Common String Partition problem, but also quite different, since the latter has no restriction on the block sizes. The algorithmic aspects of Strip Recovery are not well understood, especially in the presence of duplicated genes. In this work, we present several new complexity results. First, we solve an open problem posed by Bulteau and Weller in 2019, who asked whether one can decide in polynomial time if a common partition with block sizes at least two can be achieved without deleting any genes. We show that the problem is actually NP-hard for any fixed list of allowed block sizes, unless all block sizes are a multiple of the minimum allowed size. The problem is also hard on fixed alphabets if this list is part of the input. However, the minimum number of required gene deletions can be found in polynomial time if both the allowed block sizes and the alphabet are fixed.
A fundamental step in any comparative whole-genome analysis is the annotation of homology relationships between segments of the genomes. Traditionally, this annotation has been based on coding segments: orthologous genes are inferred, and then syntenic blocks are computed that agglomerate sets of homologous genes into homologous regions. Today, whole genomes, including intergenic regions, are being aligned de novo as whole-genome alignments (WGA). In this article we develop a test to measure to what extent the sets of homology relationships given by two different software tools are hierarchically related to one another, where matched segments from one tool may contain matched segments from the other and vice versa. Such a test can serve as a sanity check for an agglomerative syntenic block tool, and provides a mapping between the blocks that can be used for further downstream analyses. We show that, in practice, it is rare for two collections of homology relationships to be perfectly hierarchically related. We therefore present an optimization problem to measure how far they are from being so. We show that this problem, which is a generalization of the assignment problem, is NP-hard, and give a heuristic solution and implementation. We apply our distance measure to data from the Alignathon competition, as well as to Mycobacterium tuberculosis, showing that many factors affect how hierarchically related two collections are, including sensitivity to guide trees and the use or omission of an outgroup. These findings inform practitioners of the pitfalls of homology relationship inference, and can inform the development of more robust inference tools.
Placing a new sequence onto an existing phylogenetic tree is increasingly used in downstream applications ranging from microbiome analyses to epidemic tracking. Most such applications deal with noisy data, incomplete references, and model misspecifications, all of which make the correct placement uncertain. While recent placement methods have increasingly enabled placement on ultra-large backbone trees with tens to hundreds of thousands of species, they have mostly ignored the issue of uncertainty. Here, we build on the recently developed distance-based phylogenetic placement methodology and show how the distribution of placements can be estimated per input sequence. We compare parametric and non-parametric sampling methods, showing that non-parametric bootstrapping is far more accurate in estimating uncertainty. Finally, we design a linear algebraic formulation of bootstrapping that makes it faster, and we incorporate the computation of support values as a new feature in the APPLES software.
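The idea of non-parametric bootstrapping for distance-based placement can be sketched in miniature. This is not the APPLES implementation: the Hamming distance, the data, and the "closest reference" placement rule are illustrative stand-ins for the actual distances and least-squares placement criterion:

```python
import random

def hamming(a, b):
    """Number of mismatching positions between two equal-length sequences."""
    return sum(x != y for x, y in zip(a, b))

def bootstrap_placements(query, references, n_boot=200, seed=0):
    """Non-parametric bootstrap of a distance-based placement.

    Resample alignment columns with replacement, recompute the
    query-to-reference distances on each replicate, and record which
    reference is closest, yielding a distribution over placements
    instead of a single point estimate.
    """
    rng = random.Random(seed)
    ncols = len(query)
    counts = {name: 0 for name in references}
    for _ in range(n_boot):
        cols = [rng.randrange(ncols) for _ in range(ncols)]
        q = "".join(query[c] for c in cols)
        best = min(
            references,
            key=lambda n: hamming(q, "".join(references[n][c] for c in cols)),
        )
        counts[best] += 1
    # Normalized counts act as support values for each candidate placement.
    return {name: c / n_boot for name, c in counts.items()}

refs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTTTGGGG"}
support = bootstrap_placements("ACGTACGT", refs)
```

Because each replicate only re-selects columns, the resampled distances can be expressed as matrix products over a column-count vector, which is the kind of linear algebraic reformulation that makes bootstrapping fast.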
Dissimilarity measures for phylogenetic trees have long been used for analyzing inferred trees and understanding the performance of phylogenetic methods. Given their importance, a wide array of such measures have been developed, some based on tree topologies alone, and others that also take branch lengths into account. Similarly, a number of dissimilarity measures for phylogenetic networks have been developed in the last two decades. However, to the best of our knowledge, all these measures are based solely on the topologies of phylogenetic networks and ignore branch lengths. In this paper, we propose two phylogenetic network dissimilarity measures that take both topology and branch lengths into account. We demonstrate the behavior of these two measures on pairs of related networks. Furthermore, we show how these measures can be used to cluster a set of phylogenetic networks obtained by an inference method, illustrating this application on a posterior sample of phylogenetic networks. Both measures are implemented in the publicly available software package PhyloNet.
Chromothripsis is a mutational phenomenon representing a unique type of tremendously complex genomic structural variation. It was initially described in, and has been broadly observed across, cancer genomes, and occurs with lower frequencies in other genetic disorders. Chromothripsis manifests as massive genomic structural alterations arising during a single catastrophic event in the cell. It is characterized by the simultaneous shattering of chromosomes followed by random reassembly of the DNA fragments, ultimately resulting in newly formed, mosaic derivative chromosomes with the potential for a drastic oncogenic transformation. Here we ask whether the genomic locations involved in chromothripsis rearrangements are randomly distributed in 3D or exhibit some spatial organizational predisposition. To that end, we investigated the structural variations (SVs) observed in previously sequenced cancer genomes by juxtaposing the involved breakpoints onto the Hi-C contact map of normal tissue. We found that the average Hi-C contact score for SV breakpoints on the same chromosome (cis-SVs) in an individual patient is significantly higher than the average Hi-C matrix signal, indicating that SVs tend to involve spatially proximal regions of the chromosome. Furthermore, we overlaid the chromothripsis annotation of groups of SV breakpoints and demonstrated that the Hi-C signals for both chromothripsis breakpoint regions and regular SV breakpoints are statistically significantly higher than a random control, suggesting that chromothripsis cis-SVs have the same tendency to rearrange spatially proximal sites in the 3D genome. Last, but not least, our analysis revealed a statistically higher Hi-C score for all pairwise combinations of breakpoints involved in a chromothripsis event when compared to both the background Hi-C signal and combinations of non-chromothripsis breakend pairs.
This observation suggests that the breakpoints of a given chromothripsis event form a proximal bundle in 3D genome space. These results provide valuable new insights into the spatial relationships of the SV loci for both chromothripsis and regular genomic alterations, laying the foundation for the development of more precise methods for chromothripsis identification and annotation.
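The enrichment comparison described above can be illustrated on a toy contact matrix. This is a hedged sketch only: the matrix values, bin indices, and function names are invented for illustration, and the actual analysis additionally requires proper randomized controls and significance testing:

```python
def mean_contact(hic, pairs):
    """Average Hi-C contact score over a set of breakpoint-pair bins."""
    return sum(hic[i][j] for i, j in pairs) / len(pairs)

def background_mean(hic):
    """Average signal over the whole contact matrix, used as background."""
    n = len(hic)
    return sum(hic[i][j] for i in range(n) for j in range(n)) / (n * n)

# Toy symmetric contact matrix: bins 0 and 1 are spatially proximal.
hic = [
    [9.0, 8.0, 1.0, 1.0],
    [8.0, 9.0, 1.0, 1.0],
    [1.0, 1.0, 9.0, 2.0],
    [1.0, 1.0, 2.0, 9.0],
]
sv_pairs = [(0, 1), (1, 0)]  # SV breakpoints joining the proximal bins
enrichment = mean_contact(hic, sv_pairs) / background_mean(hic)
```

An enrichment ratio above 1 corresponds to the finding that SV breakpoint pairs sit in higher-contact (spatially closer) bins than average.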
The microbiome is an interconnected network of microorganisms, which exist in, and influence, a wide array of natural and synthetic environments. Genetic information is constantly spread across the members of a microbial community through a process called horizontal gene transfer (HGT), exposing all members of the community to genetic alterations and modifications. In order to accurately and effectively engineer microbiomes, genetic modifications must be introduced only into certain species, as selectivity is a key factor in creating and fixing functional abilities within microbial environments. Moreover, introduction of genes into unwanted hosts may cause unprecedented ecological impacts, posing a major biosafety issue. Technologies in the field are usually developed experimentally for a specific host or environment, and the lack of automation and generalization limits them to a specific microbiome. Additionally, they deal at best with the transformation process itself and do not modulate the different elements of the genetic material, neglecting considerations related to the interactions between the new genetic material and the population. This work presents a set of computational models that automatically design a microbiome-specific plasmid that is selectively expressed in certain parts of the bacterial population. The underlying algorithm fine-tunes the genetic information to be optimally expressed in the wanted hosts of the plasmid, while simultaneously impairing expression in unwanted hosts. We take into account and selectively optimize the main elements linked to gene expression and heredity. In addition, we provide both in-silico and in-vitro analyses supporting our claims. This study was part of the TAU iGEM 2021 project (https://2021.igem.org/Team:TAU_Israel).
The problem of analysing bacterial isolates in order to detect plasmids has been widely studied. With the development of Whole Genome Sequencing (WGS) technologies, several approaches have been proposed to bin contigs into putative plasmids. Reference-based approaches bin contigs by mapping or comparing their sequences against databases of previously identified plasmids or plasmid genes. On the other hand, de novo approaches use contig features such as read coverage and length for plasmid binning. Hybrid approaches that combine both strategies have also been proposed recently. We present PlasBin, a mixed integer linear programming (MILP) based hybrid approach for plasmid binning, and evaluate the performance of several binning methods on a real data set of bacterial samples.
Intronless genes (IGs), or single-exon genes lacking introns, are found across most eukaryotes. Notably, IGs display a higher transcriptional fidelity as they are not regulated through alternative splicing, suggesting they may serve as more predictable biomarkers and more easily regulated therapeutic targets. Cancer is a complex disease that relies on progressive uncontrolled cell division linked with multiple dysfunctional biological processes, and tumor heterogeneity remains the most challenging feature in cancer diagnosis and treatment. Given the clinical relevance of IGs, we aim to identify their unique expression profiles and interactome, which may act as functional signatures across eight different cancers. We identified 940 protein-coding IGs in the human genome, of which about 35% were differentially expressed across the analyzed cancer datasets. Specifically, ~78% of the differentially expressed IGs were undergoing transcriptional reprogramming with elevated expression in tumor cells. Remarkably, all the studied tumors showed a highly conserved induction of a group of histone deacetylase genes located in a region of chromosome 6 enriched in nucleosome and chromatin condensation processes. This study highlights that differentially expressed human intronless genes across cancer types are prevalent in epigenetic regulatory roles, participating in specific PPI networks for ESCA, GBM, and LUAD tumors. We determine that IGs play a key role in the tumor phenotype at the transcriptional and post-transcriptional levels, through important mechanisms such as interactome rewiring.
A phylogenetic network is a simplex (or 1-component tree-child) network if the child of every reticulation node is a network leaf and each tree node has at most one reticulation node as a child. Simplex networks form a superclass of phylogenetic trees and a subclass of tree-child networks. Generalizing the Sackin index to phylogenetic networks, we prove an asymptotic expression for the expected Sackin index of a random simplex network under the uniform model.
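For a rooted tree, the Sackin index is simply the sum of leaf depths; a minimal sketch follows (the tree encoding and names are our own, and extending the index to networks requires a convention for reticulation-node depths, which is what the generalization above formalizes):

```python
from collections import deque

def sackin_index(children, root):
    """Sackin index of a rooted tree: the sum over all leaves of their
    depth from the root. `children` maps each internal node to its list
    of children; leaves are the nodes with no entry in the map."""
    total = 0
    queue = deque([(root, 0)])
    while queue:
        node, depth = queue.popleft()
        kids = children.get(node, [])
        if not kids:
            total += depth
        for k in kids:
            queue.append((k, depth + 1))
    return total

# Balanced 4-leaf tree: every leaf at depth 2 -> Sackin index 8.
balanced = {"r": ["a", "b"], "a": ["l1", "l2"], "b": ["l3", "l4"]}
# Caterpillar 4-leaf tree: leaf depths 1, 2, 3, 3 -> Sackin index 9.
caterpillar = {"r": ["l1", "x"], "x": ["l2", "y"], "y": ["l3", "l4"]}
```

Higher values indicate less balanced shapes, which is why the expected index under a random model characterizes the typical balance of that class of networks.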
Single-cell RNA sequencing (scRNA-seq) technology has enabled the biological research community to explore gene expression at single-cell resolution. By studying differences in gene expression, it is possible to differentiate cell clusters and types within tissues. One of the major challenges in an scRNA-seq study is feature selection in high-dimensional data. Several statistical and machine learning algorithms are available to solve this problem, but their performance across data sets lacks systematic comparison. In this research, we benchmark different penalized regression methods suitable for scRNA-seq data. Results from four scRNA-seq data sets show that Sparse Group Lasso (SGL), implemented by the SGL package in R, performs better than the other methods in terms of area under the receiver operating characteristic curve (AUC). The computation time of the different algorithms varies between data sets, with SGL having the lowest average computation time. Based on our findings, we propose a new method that applies SGL to a smaller pre-selected subset of genes to select the differentially expressed genes in scRNA-seq data. The reduction in the number of genes before SGL reduces the hardware requirement from 32 GB of RAM to 8 GB. The proposed method also demonstrates a consistent improvement in AUC over SGL.
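The two-stage idea can be sketched with a simple variance filter standing in for the pre-selection step (a hedged illustration: the actual pre-selection criterion and the SGL fit itself, which would run on the reduced gene set, are not reproduced here):

```python
def variance(xs):
    """Population variance of a list of expression values."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def preselect_genes(expr, top_k):
    """Rank genes by expression variance across cells and keep the top_k.

    Shrinking the gene set before the (more expensive) penalized
    regression fit cuts its memory footprint, while highly variable
    genes are the ones most likely to separate cell types.
    """
    ranked = sorted(expr, key=lambda g: variance(expr[g]), reverse=True)
    return ranked[:top_k]

# Toy expression matrix: gene -> per-cell counts.
expr = {
    "g_flat": [5.0, 5.0, 5.0, 5.0],    # constant, carries no signal
    "g_marker": [0.0, 0.0, 9.0, 9.0],  # separates two cell groups
    "g_noise": [4.0, 6.0, 5.0, 5.0],   # small fluctuations
}
kept = preselect_genes(expr, top_k=2)
```

The penalized regression then only ever sees the `kept` genes, which is what lowers the hardware requirement.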
Horizontal transfer of genetic material between different organisms is one of the most important processes in microbial evolution. Such horizontal transfer events can result in the transfer of genomic fragments containing multiple complete genes, complete single genes, or partial genes. However, partial gene transfer (PGT) remains poorly understood and generally underappreciated. Indeed, existing phylogenetic approaches for studying microbial evolution and horizontal gene transfer largely ignore PGT, leading to potential biases and errors in downstream inferences. In this work, we (i) perform the first systematic study of the impact of PGT on the ability to correctly reconstruct the evolutionary histories of gene families (i.e., gene trees) and (ii) propose a simple yet effective approach, called trippd, to detect whether a given gene family has been affected by PGT. Our analysis, using simulated and real biological datasets, reveals many interesting insights into when and how PGT affects gene tree reconstruction and demonstrates how trippd can help to easily and accurately identify gene families that have or have not been significantly impacted by PGT.
We generalize the reconciliation approach, used for inferring the evolution of a single gene family given a species tree, to groups of co-localized genes, also called syntenies. More precisely, given a set X of syntenies in a set Σ of genomes, a tree T for X and a tree S for Σ, the problem is to find a most parsimonious history for X with respect to a given evolutionary model. We extend a previous model involving segmental duplications and losses to also include segmental horizontal gene transfers (HGTs) and gene gains, and present a polynomial-time dynamic programming algorithm to solve the problem. We apply it to CRISPR-associated (Cas) gene syntenies. These genes are part of CRISPR-Cas systems, one of whose members (CRISPR-Cas9) is well known as currently the most reliable and accurate molecular scissors technology for genome editing. The inferred evolutionary scenario is a plausible explanation of the diversification of this system into its different types. An implementation of the algorithm presented in this paper is available at: https://github.com/UdeM-LBIT/superrec2/releases/tag/rcg2022.
The production of cheese and other dairy products relies on the constant monitoring of viruses, called bacteriophages, that attack the organisms responsible for the fermentation process. Bacteriophage species are characterized by a stable core genome and a 'genetic reservoir' of gene variants that are exchanged through recombination. Phylogenetic analyses of phage populations are notably difficult due to extreme levels of horizontal exchange not only at the borders of functional modules, but also inside them. In this paper we present the first known attempt at directly modeling gene flux between phage populations, by introducing a new genome rearrangement operation that models the various assortments found in phage communities. This represents an important departure from gene-based alignment and phylogenetic reconstruction, shifting the focus to genetic reservoir-based evolutionary inference. We present a combinatorial framework for the comparison of bacteriophage populations, and use it to compute recombination scenarios that generate one population from another. We apply our heuristic, based on this framework, to four populations sampled from Dutch dairy factories by Murphy et al. (2016). We find that, far from being random, these scenarios are highly constrained. We use our method to test for factory-specific diversity, and find that there was likely a large amount of recombination in the ancestral population.
We compare several types of evolutionary divergence of synteny blocks: sequence-level divergence of the genes in a block; loss of genes, affecting the length and structure of the blocks; and the spatial position of a block in relation to the chromosomal centromere. We also suggest other dimensions, such as the predominant functional characteristic of the genes in a block. We focus on the evolutionary history of the allotetraploid Coffea arabica genome and its two progenitor genomes through three major genomic events spanning 120 million years.
Genome rearrangement events are widely used to estimate a minimum-size sequence of mutations capable of transforming one genome into another. The size of this sequence is called the distance, and determining it is the main objective in genome rearrangement distance problems. Problems in the genome rearrangement field differ in the set of rearrangement events allowed and in the genome representation. In this work, we consider the scenario where the genomes share the same set of genes, gene orientation is known, and intergenic regions (the structures between pairs of genes and at the extremities of the genome) are taken into account. We use two models: the first allows only conservative events (reversals and moves), while the second also includes non-conservative events (insertions and deletions) in the intergenic regions. We show that both models result in NP-hard problems, and we present algorithms with an approximation factor of 2 for each.
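The classic breakpoint argument underlying such 2-approximations can be sketched for plain unsigned permutations (a simplified illustration: it ignores signs and the intergenic region sizes that the models above additionally track):

```python
def breakpoint_count(perm):
    """Number of breakpoints of a permutation of 1..n, with the genome
    framed by 0 and n+1. A pair of neighbours forms a breakpoint when
    they are not consecutive integers; since one reversal can remove at
    most two breakpoints, half this count is a lower bound on the
    reversal distance, which is the bound 2-approximations exploit."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) != 1)

identity_bp = breakpoint_count([1, 2, 3, 4])   # sorted genome: no breakpoints
scrambled_bp = breakpoint_count([3, 1, 4, 2])
```

An algorithm that removes at least one breakpoint per event therefore uses at most twice the optimal number of events.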
Sorting by Genome Rearrangements is a classic problem in computational biology. Several models have been considered so far, each defining how a genome is modeled (for example, permutations when assuming no duplicated genes, strings if duplicated genes are allowed, and/or signed elements when gene orientation is known) and which rearrangements are allowed. Recently, a new problem, called Sorting by Multi-Cut Rearrangements, was proposed. It uses the k-cut rearrangement, which cuts a permutation (or a string) at k ≥ 2 places and rearranges the generated blocks to obtain a new permutation (or string) of the same size. This new rearrangement may model chromoanagenesis, a phenomenon consisting of massive simultaneous rearrangements. Like the Double-Cut-and-Join, this rearrangement also generalizes several genome rearrangements such as reversals, transpositions, revrevs, transreversals, and block-interchanges. In this paper, we extend previous work on unsigned permutations and strings to signed permutations. We determine the complexity of this problem for different values of k, show that the approximation algorithm proposed for unsigned permutations with any value of k can be adapted to signed permutations, and present a 1.5-approximation algorithm for the specific case k = 4.
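A single k-cut rearrangement on an unsigned permutation can be sketched directly (a toy model of the operation: `cuts` and `order` are our own encoding, and the block reversals needed for the signed variant are omitted):

```python
def apply_k_cut(seq, cuts, order):
    """Cut `seq` at the k positions in `cuts` and reassemble the
    resulting blocks in the given `order` (a permutation of block
    indices). With k cut points, the sequence splits into k + 1 blocks;
    transpositions and block-interchanges are special cases of this
    operation for small k.
    """
    bounds = [0] + sorted(cuts) + [len(seq)]
    blocks = [seq[bounds[i]:bounds[i + 1]] for i in range(len(bounds) - 1)]
    out = []
    for i in order:
        out.extend(blocks[i])
    return out

# Two cuts produce three blocks [3], [1, 2], [4]; swapping the first
# two blocks (a transposition) sorts the permutation in one operation.
result = apply_k_cut([3, 1, 2, 4], cuts=[1, 3], order=[1, 0, 2])
```

Letting k grow allows a single operation to scatter many blocks at once, which is what makes the model a plausible abstraction of chromoanagenesis.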
Metagenomics sequencing enables the direct study of microbial communities, revealing important information such as the taxonomy and relative abundance of species. Metagenomics binning facilitates the separation of this genetic material into different taxonomic groups. The move from second-generation to third-generation sequencing techniques enables the binning of reads before assembly, thanks to the increased read lengths. The few long-read binning tools that exist still suffer from unreliable coverage estimation for individual long reads and face challenges in recovering low-abundance species. In this paper, we present a novel approach to bin long reads using the read-overlap graph. The read-overlap graph (1) enables a fast and reliable estimation of the coverage of individual long reads; (2) allows us to incorporate the overlap information between reads into the binning process; and (3) facilitates a more uniform sampling of long reads across species of varying abundances. Experimental results show that our new binning approach produces better binning results for long reads and leads to better assemblies, especially for recovering low-abundance species. The source code and a functional Google Colab Notebook are available at https://www.github.com/anuradhawick/oblr.
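The coverage-estimation intuition can be illustrated with node degrees in a toy read-overlap graph (our own simplification, not the OBLR estimator itself: a read from an abundant species overlaps many other reads, so its degree serves as a fast proxy for coverage):

```python
def degree_coverage(overlaps, reads):
    """Per-read coverage proxy from the read-overlap graph: count, for
    each read, how many other reads it overlaps. Reads from abundant
    species accumulate high degrees; reads from rare species do not."""
    degree = {r: 0 for r in reads}
    for a, b in overlaps:
        degree[a] += 1
        degree[b] += 1
    return degree

reads = ["r1", "r2", "r3", "r4", "r5"]
# r1-r3 come from an abundant species and all overlap one another;
# r4 and r5 come from a rare species and only overlap each other.
overlaps = [("r1", "r2"), ("r1", "r3"), ("r2", "r3"), ("r4", "r5")]
cov = degree_coverage(overlaps, reads)
```

Separating reads by such coverage proxies (together with composition features) is what lets binning distinguish species of different abundances before assembly.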