Help

Foreword

Reliable orthology prediction is central to comparative genomics and the annotation of newly sequenced genomes. Since orthology and paralogy are both evolutionary concepts, phylogeny-based strategies are expected to provide the most accurate predictions. However, given the high computational cost associated with phylogenetic analyses, the majority of automated orthology prediction methods rely on faster but less accurate pairwise sequence comparisons. Only recently, thanks to the availability of faster computers and better algorithms, has it become feasible to use phylogeny-based orthology prediction at genomic scale.

 

Motivation

Recently, several projects have addressed the reconstruction of large collections of high-quality phylogenetic trees from which orthology can be inferred. This provides us with the opportunity to infer the evolutionary relationship of two genes from multiple, independent phylogenetic trees and to use the consistency across predictions as a reliability measure of an orthology assignment. Using phylogenetic trees available in the PhylomeDB, Ensembl, TreeFam and Fungal Orthogroups databases and those reconstructed for EggNOG, OrthoMCL, and COG, we predict orthology and paralogy relationships for millions of proteins in 829 fully-sequenced genomes and provide a reliability score for all of them, based on the number of independent trees and the consistency across predictions.

 

Methodology

In total, 705 123 phylogenetic trees have been processed in order to compute a comprehensive set of orthology and paralogy predictions. Resources from 7 popular databases have been used:

     Database                 Species   Trees      Homologs
  1  PhylomeDB 1)             717       459 447    1 269 M*
  2  Ensembl 59 2)            266       122 003    109 M
  3  EggNOG 2.0 3)            55        17 085'    30 M
  4  OrthoMCL 4.0 4)          138       76 673"    31 M
  5  COG 5)                   66        4 875"     3.4 M
  6  Fungal Orthogroups 6)    23        8 983      1.4 M
  7  TreeFam 8 7)             78        16 064     27 M

* Redundant predictions from all phylomes were counted separately.
' Trees were reconstructed for alignments of euNOG and KOG.
" Orthologous groups defined by the repository were used to reconstruct alignments and trees.

PhylomeDB

MetaPhOrs uses pre-computed phylogenetic trees from PhylomeDB (Sept 2010), a database of complete collections of gene phylogenies (phylomes) spanning a number of model species. Such a comprehensive collection of data can be used to infer high-quality, large-scale, phylogeny-based orthology and paralogy relationships. The computational cost of inferring relationships is limited to retrieving data from the source database, mapping evolutionary events onto the trees, and calculating the consistency score of each prediction.

In order to produce orthology and paralogy predictions for all sequences deposited in PhylomeDB, over 450 thousand phylogenetic trees covering over 600 species were processed as follows:

  1. Retrieve every tree in which the given protein pair is present
  2. Remove low-quality trees
    1. Likelihood filter - trees with a likelihood lower than three times the likelihood of the best tree for the given pair (a likelihood threshold of 3.0 was applied) were discarded from further analyses.
  3. Deposit information in the database
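
The likelihood filter in step 2 could look roughly like the sketch below. It assumes that the values compared are log-likelihoods (negative numbers), so that "three times the likelihood of the best tree" becomes a lower bound of threshold * best log-likelihood. The function name and the (tree id, log-likelihood) tuples are hypothetical and are not taken from the MetaPhOrs code.

```python
def filter_trees_by_likelihood(trees, threshold=3.0):
    """Keep trees whose log-likelihood is not lower than threshold * best lnL.

    `trees` is a list of (tree_id, log_likelihood) tuples for one protein pair.
    Assumption: log-likelihoods are negative, so multiplying the best (highest)
    value by the threshold gives a lower, more permissive bound.
    """
    if not trees:
        return []
    best_lnl = max(lnl for _, lnl in trees)
    cutoff = threshold * best_lnl
    return [(tree_id, lnl) for tree_id, lnl in trees if lnl >= cutoff]


# Made-up values for illustration only.
example_trees = [("t1", -1000.0), ("t2", -2500.0), ("t3", -3200.0)]
kept_trees = filter_trees_by_likelihood(example_trees)
```

With these made-up values the cut-off is -3000.0, so t1 and t2 are kept and t3 is discarded.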

PhylomeDB resources alone provided us with over 1 201 million redundant orthology and paralogy predictions.

Ensembl and EnsemblGenomes

Ensembl provides a phylogenetic tree for every vertebrate protein family (18.4k families). Recently, similar repositories were released for Bacteria (31.7k*), Fungi (7.8k), Metazoa (25.6k), Plants (27.8k) and Protists (9.4k).

All phylogenetic trees from Ensembl (ver. 59) and EnsemblGenomes (ver. 5) were downloaded. Speciations and duplications were mapped on each tree using the species overlap algorithm, and orthologs and paralogs, respectively, were retrieved on that basis. The vast majority of these predictions overlap with data already present in the database; in those cases the Ensembl signal was added to the existing prediction. Predictions unique to Ensembl were uploaded to MetaPhOrs.

* Sum of the families of all 10 EnsemblBacteria subgroups: Bacillus, Borrelia, Buchnera, Escherichia-Shigella, Mycobacterium, Neisseria, Pyrococcus, Staphylococcus, Streptococcus, Wolbachia.

EggNOG

EggNOG 2.0 provides ML trees for its families, but in most cases these contain several multifurcations, which cause problems for the species overlap algorithm. Therefore, ML trees derived with the same pipeline as implemented in PhylomeDB (kindly provided by Diego Kormes) have been used here. Note that branch support is calculated using aLRT statistics (instead of bootstrap). Only EggNOG trees for euKaryotic Orthologous Groups (KOG) and euNOG were used.

OrthoMCL

OrthoMCL 4.0 contains 116 536 groups of orthologous sequences. Notably, a significant subset of the OrthoMCL groups contains only 2 or 3 members (39 778 and 20 758 groups, respectively). Sets of homologous protein sequences containing at least 3 members were aligned using MUSCLE 3.6. Positions in the alignment with gaps in more than 10% of the sequences were eliminated before the phylogenetic analysis, unless this procedure removed more than one-third of the positions in the alignment. In such cases the percentage of sequences with gaps allowed was automatically increased until at least two-thirds of the initial positions were conserved.
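
The gap-trimming rule described above can be sketched as follows. This is an illustration only, not the pipeline's actual code: the function name is hypothetical, and the step of 5 percentage points by which the gap limit is relaxed is an assumption, since the text does not state the increment.

```python
def trim_gappy_columns(alignment, max_gap_pct=10, step_pct=5):
    """Drop alignment columns with too many gaps, relaxing the limit if needed.

    `alignment` is a list of equal-length aligned sequences using '-' for gaps.
    Returns (trimmed_alignment, gap_limit_used_in_percent).
    step_pct is an assumed increment; the source does not specify one.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    gap_counts = [sum(seq[i] == "-" for seq in alignment) for i in range(n_cols)]
    limit = max_gap_pct
    while True:
        # Keep a column if the share of gapped sequences does not exceed the limit.
        kept = [i for i, gaps in enumerate(gap_counts) if gaps * 100 <= limit * n_seqs]
        # Stop once at least two-thirds of the initial positions are conserved.
        if 3 * len(kept) >= 2 * n_cols or limit >= 100:
            break
        limit = min(100, limit + step_pct)
    trimmed = ["".join(seq[i] for i in kept) for seq in alignment]
    return trimmed, limit


toy_alignment = ["MK-LV-", "MKALV-", "MK-LVA", "M--LV-"]
trimmed_alignment, limit_used = trim_gappy_columns(toy_alignment)
```

In this toy alignment the 10% limit keeps only 3 of 6 columns, so the limit is relaxed until 4 columns (two-thirds) survive.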

NJ trees were derived using scoredist distances as implemented in BioNJ with 7 evolutionary models (JTT, WAG, MtREV, VT, LG, Blosum62, and Dayhoff). ML trees were derived from the alignments using PhyML_aLRT. The evolutionary model best fitting the data was determined by comparing the likelihoods of the models used according to the AIC criterion. In all cases a discrete gamma-distribution model with four rate categories plus invariant positions was used; the gamma parameter and the fraction of invariant positions were estimated from the data.
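
As a small illustration of the model choice, the sketch below ranks models by the standard AIC formula, AIC = 2k - 2 lnL, where k is the number of free parameters and lnL the log-likelihood. The function names and all numeric values are made up for the example and do not come from MetaPhOrs.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: lower values indicate a better fit."""
    return 2 * n_params - 2 * log_likelihood


def best_model(fits):
    """Return the model with the lowest AIC.

    `fits` maps a model name to a (log_likelihood, n_params) tuple.
    """
    return min(fits, key=lambda model: aic(*fits[model]))


# Hypothetical log-likelihoods for one alignment; parameter counts are equal here.
example_fits = {"JTT": (-1050.0, 25), "WAG": (-1042.5, 25), "LG": (-1040.0, 25)}
chosen = best_model(example_fits)
```

When all candidate models have the same number of free parameters, as in this example, ranking by AIC is equivalent to ranking by likelihood.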

The resulting trees were rooted at the midpoint, and the species overlap algorithm was applied in order to map speciations and duplications.

COG

Sets of homologous protein sequences underlying 4 875 Clusters of Orthologous Groups were aligned using 3 multiple sequence alignment programs (MUSCLE 3.6, MAFFT 6.7, and DIALIGN-TX 1.0), each in both forward and reverse direction. Consensus meta-alignments were created using M-Coffee. Positions in the alignment with gaps in more than 10% of the sequences were eliminated before the phylogenetic analysis, unless this procedure removed more than one-third of the positions in the alignment. In such cases the percentage of sequences with gaps allowed was automatically increased until at least two-thirds of the initial positions were conserved.

NJ trees were derived using scoredist distances as implemented in BioNJ with 7 evolutionary models (JTT, WAG, MtREV, VT, LG, Blosum62, and Dayhoff). ML trees were derived from the alignments using PhyML_aLRT. The evolutionary model best fitting the data was determined by comparing the likelihoods of the models used according to the AIC criterion. In all cases a discrete gamma-distribution model with four rate categories plus invariant positions was used; the gamma parameter and the fraction of invariant positions were estimated from the data.

The resulting trees were rooted at the midpoint, and the species overlap algorithm was applied in order to map speciations and duplications.

Fungal Orthogroups

Fungal Orthogroups provides a phylogenetic tree for every fungal orthologous group; almost 9 thousand trees are present. Fully resolved trees reconstructed with the Synergy algorithm were used. Trees were rooted using outgroup rooting. The rest of the methodology is the same as for Ensembl.

TreeFam

Like Ensembl, TreeFam provides rooted phylogenetic trees for multiple protein families. Previously, manually curated trees were used. However, due to the high number of errors and the extremely small size of this subset (merely 1.2k trees), we abandoned this approach. In the current release, so-called 'CLEAN trees' are used. The methodology is the same as for Ensembl.

 

Assigning orthology and paralogy relationships

Duplications and speciations were mapped using the species overlap algorithm as implemented in the Environment for Tree Exploration program (ETE).

 Species-overlap algorithm 

The species-overlap algorithm is an alternative approach to inferring evolutionary events from gene phylogenies. The only evolutionary information required by this algorithm is a rooted gene tree; the method requires neither a fully-resolved species phylogeny nor reconciliation steps. To decide whether a given node represents a speciation or a duplication event, the algorithm uses the level of overlap between the species represented under its two descendant nodes. In brief, a species-overlap score (SOS) is calculated for every node as the number of species shared between the child branches divided by the total number of species under the node. If the SOS is higher than a given threshold, the parental node is mapped as a duplication, otherwise as a speciation event. The best performance of the algorithm has been reported with an SOS threshold of 0.0, so that speciation is only assumed if no species overlap is detected between the descendant nodes 8).
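
The following is a minimal, self-contained sketch of the idea in plain Python. It is not the ETE implementation: the Node class, the function names and the toy tree are hypothetical, and the tree is assumed to be rooted and strictly binary. Pairs of sequences separated by a speciation node are reported as orthologs, and pairs separated by a duplication node as paralogs.

```python
class Node:
    """Minimal rooted tree node; leaves carry a sequence name and a species."""

    def __init__(self, name=None, species=None, children=()):
        self.name = name
        self.species = species
        self.children = list(children)
        self.event = None  # "S" (speciation) or "D" (duplication) for internal nodes

    def leaves(self):
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.leaves()]


def map_events(node, sos_threshold=0.0):
    """Label every internal node of a rooted binary tree by species overlap."""
    if not node.children:
        return
    left, right = node.children
    left_sp = {leaf.species for leaf in left.leaves()}
    right_sp = {leaf.species for leaf in right.leaves()}
    # SOS: shared species over all species found under this node.
    sos = len(left_sp & right_sp) / len(left_sp | right_sp)
    node.event = "D" if sos > sos_threshold else "S"
    for child in node.children:
        map_events(child, sos_threshold)


def classify_pairs(node, pairs=None):
    """Return {frozenset({seq_a, seq_b}): "orthologs" | "paralogs"}."""
    if pairs is None:
        pairs = {}
    if node.children:
        left, right = node.children
        relation = "paralogs" if node.event == "D" else "orthologs"
        for a in left.leaves():
            for b in right.leaves():
                pairs[frozenset((a.name, b.name))] = relation
        for child in node.children:
            classify_pairs(child, pairs)
    return pairs


# Toy tree ((Hsa_A, Mmu_A), (Hsa_B, Mmu_B)): a duplication predating speciation.
clade_a = Node(children=[Node("Hsa_A", "Hsa"), Node("Mmu_A", "Mmu")])
clade_b = Node(children=[Node("Hsa_B", "Hsa"), Node("Mmu_B", "Mmu")])
root = Node(children=[clade_a, clade_b])
map_events(root)
relations = classify_pairs(root)
```

In the toy tree both species occur on each side of the root, so the root is mapped as a duplication, while the two inner nodes share no species and are mapped as speciations.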

 

Reliability

MetaPhOrs reports several values describing the quality of its predictions: Consistency Score, Evidence Level, number of trees used, and number of trees rejected by the filter. Trees used for every single prediction are linked to their source databases.

 

Consistency score

Orthology/paralogy assignment in MetaPhOrs is based on the Consistency Score (CS), which ranges from 0 to 1. In brief, the closer the value of CS is to 1.0, the more confident the prediction.

The Consistency Score is the ratio of the number of trees confirming a given relationship to the total number of trees that were used to infer the relationship for a particular protein pair.

The orthology Consistency Score (CSo) is calculated for orthology searches and the paralogy Consistency Score (CSp) for paralogy queries, as follows:

  • CSo = To / (To + Tp)
  • CSp = Tp / (To + Tp)

where:

  • To is the number of trees confirming an orthology relationship
  • Tp is the number of trees confirming a paralogy relationship.

The recommended CSo threshold for orthology prediction is 0.5. The CS cut-off can be altered by the user in order to adjust the sensitivity and specificity of each query. All homology relationships are returned when a CS cut-off of 0.0 is applied, while a CS cut-off of 1.0 returns only fully consistent predictions.
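
As a worked example of the formulas above, the sketch below computes CSo and CSp from tree counts and applies a cut-off. The function names are hypothetical, and treating the cut-off as inclusive (score >= cut-off) is an assumption chosen so that a cut-off of 0.0 returns everything and a cut-off of 1.0 returns only fully consistent predictions.

```python
def consistency_scores(t_o, t_p):
    """Return (CSo, CSp) given the trees confirming orthology and paralogy."""
    total = t_o + t_p
    if total == 0:
        raise ValueError("at least one tree is needed to score a protein pair")
    return t_o / total, t_p / total


def passes_cutoff(score, cutoff=0.5):
    """Assumed inclusive comparison against the user-selected CS cut-off."""
    return score >= cutoff


# A pair found in 4 trees: 3 support orthology, 1 supports paralogy.
cs_o, cs_p = consistency_scores(3, 1)
is_ortholog = passes_cutoff(cs_o)   # 0.75 >= 0.5
is_paralog = passes_cutoff(cs_p)    # 0.25 <  0.5
```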

Evidence level

Evidence level is the number of independent sources (external repositories or phylomes) in which trees confirming a prediction have been found. In general, the higher the evidence level, the more reliable the prediction, as more sources were used to infer it.

Evidence level may vary from 1 to 10. The Evidence Level cut-off has to be altered with care, as external databases partially overlap, and for some pairs of species there is only one source of data (Evidence Level of 1). It is recommended to start queries with an Evidence Level cut-off of 1 and then increase the cut-off if needed.
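
Counting the evidence level for one prediction amounts to counting distinct sources among its confirming trees. The sketch below assumes each confirming tree is recorded as a (source, tree id) tuple; this representation, the function name and the example labels are hypothetical.

```python
def evidence_level(confirming_trees):
    """Number of distinct sources among the trees that confirm a prediction.

    `confirming_trees` is an iterable of (source, tree_id) tuples.
    """
    return len({source for source, _tree_id in confirming_trees})


# Two trees from the same phylome count as one source; Ensembl adds a second.
example_support = [
    ("phylome_1", "tree_a"),
    ("phylome_1", "tree_b"),
    ("Ensembl", "tree_c"),
]
level = evidence_level(example_support)
```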

 

Citation

Pryszcz, L.P., Huerta-Cepas, J., and Gabaldon, T. (2011) MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score. Nucleic Acids Res. 39: e32.

 

1) Huerta-Cepas J, et al: PhylomeDB: a database for genome-wide collections of gene phylogenies. Nucleic Acids Res 2008, 36:D491-6.
2) Flicek P, et al: Ensembl 2008. Nucleic Acids Res 2008, 36:D707-14.
3) Muller J, et al: eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations. Nucleic Acids Res 2009, 38:D190-5.
4) Chen F, et al: OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res 2006, 34:D363-D368.
5) Tatusov RL, et al: A genomic perspective on protein families. Science 1997, 278:631-7.
6) Wapinski I, et al: Automatic genome-wide reconstruction of phylogenetic gene trees. Bioinformatics 2007, 23: i549-i558.
7) Ruan J, et al: TreeFam: 2008 Update. Nucleic Acids Res 2008, 36:D735-40.
8) Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldón T: The human phylome. Genome Biol 2007, 8:R109.