MetaPhOrs is a public repository of phylogeny-based orthologs and paralogs computed from phylogenetic trees available in 13 public repositories. Currently, over 117,131,162 unique homologs are deposited in the MetaPhOrs database. These predictions were retrieved from 8,246,911 Maximum Likelihood trees for 4,094 species. For each prediction, MetaPhOrs provides a Consistency Score and an Evidence Level.
The latest MetaPhOrs paper:
MetaPhOrs 2.0: integrative, phylogeny-based inference of orthology and paralogy across the tree of life
Uciel Chorostecki, Manuel Molina, Leszek P Pryszcz, Toni Gabaldón
Nucleic Acids Research, gkaa282
The MetaPhOrs consistency-based algorithm is described here:
MetaPhOrs: orthology and paralogy predictions from multiple phylogenetic evidence using a consistency-based confidence score
LP Pryszcz, J Huerta-Cepas, T Gabaldon
Nucleic Acids Research 39 (5), e32
|Ensembl vertebrates||290,695||Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009 Feb;19(2):327-35. doi: 10.1101/gr.073585.107. Epub 2008 Nov 24. [Link]|
|Ensembl bacteria||37,806||Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009 Feb;19(2):327-35. doi: 10.1101/gr.073585.107. Epub 2008 Nov 24. [Link]|
|Ensembl fungi||117,579||Vilella AJ, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E. EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res. 2009 Feb;19(2):327-35. doi: 10.1101/gr.073585.107. Epub 2008 Nov 24. [Link]|
|PhylomeDB||5,221,848||Huerta-Cepas J, Capella-Gutiérrez S, Pryszcz LP, Marcet-Houben M, Gabaldón T. PhylomeDB v4: zooming into the plurality of evolutionary histories of a genome. Nucleic Acids Res. 2014;42(Database issue):D897–D902. doi:10.1093/nar/gkt1177 [Link]|
|Orthomcl||77,270||Chen F, Mackey AJ, Stoeckert CJ Jr, Roos DS. OrthoMCL-DB: querying a comprehensive multi-species collection of ortholog groups. Nucleic Acids Res. 2006;34(Database issue):D363–D368. doi:10.1093/nar/gkj123 [Link]|
|Eggnog||1,846,371||Huerta-Cepas J, Szklarczyk D, Forslund K, et al. eggNOG 4.5: a hierarchical orthology framework with improved functional annotations for eukaryotic, prokaryotic and viral sequences. Nucleic Acids Res. 2016;44(D1):D286–D293. doi:10.1093/nar/gkv1248 [Link]|
|Hogenom||185,176||Penel S, Arigon AM, Dufayard JF, et al. Databases of homologous gene families for comparative genomics. BMC Bioinformatics. 2009;10 Suppl 6(Suppl 6):S3. Published 2009 Jun 16. doi:10.1186/1471-2105-10-S6-S3 [Link]|
|Treefam||15,321||Schreiber F, Patricio M, Muffato M, Pignatelli M, Bateman A. TreeFam v9: a new website, more species and orthology-on-the-fly. Nucleic Acids Res. 2014;42(Database issue):D922–D925. doi:10.1093/nar/gkt1055 [Link]|
|EvolclustDB||60,955||Marcet-Houben M, Gabaldón T. EvolClust: automated inference of evolutionary conserved gene clusters in eukaryotes. Bioinformatics. 2020 Feb 15;36(4):1265-1266. doi: 10.1093/bioinformatics/btz706. [Link]|
Duplications and speciations are computed using the species-overlap algorithm.
(Huerta-Cepas J, Dopazo H, Dopazo J, Gabaldón T. The human phylome. Genome Biol. 2007;8(6):R109.)
The species-overlap algorithm is an alternative approach to inferring evolutionary events from gene phylogenies. The only evolutionary information it requires is a rooted gene tree; it needs neither a fully resolved species phylogeny nor a reconciliation step. To decide whether a given node represents a speciation or a duplication event, the algorithm uses the level of overlap between the species represented under the node's two descendant nodes. In brief, a species-overlap score (SOS) is calculated for every node as the proportion of species shared between the child branches over the total number of organisms under the node. If the SOS is higher than a given threshold, the parental node is mapped as a duplication; otherwise it is mapped as a speciation event. The best performance of the algorithm has been reported with an SOS threshold of 0.0, so speciation is assumed only if no species overlap is detected between the descendant nodes. This is the SOS threshold used in MetaPhOrs.
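The node classification described above can be sketched in a few lines of Python. This is only an illustration, not the MetaPhOrs implementation: the `Node` class and the function names are hypothetical, the tree is assumed to be rooted and binary, and "the total number of organisms under the node" is interpreted here as the number of distinct species under the node.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """Hypothetical rooted gene-tree node; leaves carry a species name."""
    species: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def leaf_species(node: Node) -> Set[str]:
    """Collect the set of species found on the leaves under a node."""
    if not node.children:
        return {node.species}
    found: Set[str] = set()
    for child in node.children:
        found |= leaf_species(child)
    return found


def species_overlap_score(node: Node) -> float:
    """Shared species between the two child branches over all species under the node."""
    left, right = (leaf_species(child) for child in node.children)
    return len(left & right) / len(left | right)


def classify(node: Node, threshold: float = 0.0) -> str:
    """Map an internal node to 'duplication' if its SOS exceeds the threshold."""
    if species_overlap_score(node) > threshold:
        return "duplication"
    return "speciation"


# Toy tree: ((human, mouse), (human, rat)). Human appears on both sides of the root.
root = Node(children=[
    Node(children=[Node("human"), Node("mouse")]),
    Node(children=[Node("human"), Node("rat")]),
])
print(classify(root))              # root has overlapping species
print(classify(root.children[0]))  # (human, mouse) has no overlap
```

With the default threshold of 0.0, any shared species between the two child branches is enough to call the node a duplication, which matches the setting described above.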
MetaPhOrs combines information from multiple strains into a single meta-proteome for each species. As a result, the phylogenetic signals from multiple strains of one species present in a given tree are counted multiple times, and the number of trees in the orthology tables may be slightly larger than the number of trees retrieved on the tree page.
Orthology/paralogy assignments from different trees are combined into a single orthology/paralogy prediction using a consistency-based approach. For this, a Consistency Score (CS) is computed. CS ranges from 0 to 1. In brief, the closer the value of CS is to 1.0, the more confident the prediction.
The Consistency Score is the ratio of the number of trees confirming a given relationship to the total number of trees used to infer the relationship between a particular protein pair. The orthology Consistency Score (CSo) is calculated for orthology searches and the paralogy Consistency Score (CSp) for paralogy queries, as follows, where To is the number of trees supporting orthology and Tp is the number of trees supporting paralogy for the pair:
CSo = To / (To + Tp)
CSp = Tp / (To + Tp)
The recommended CSo threshold for orthology prediction is 0.5. The user can change the CS cut-off to adjust the sensitivity/specificity of each query. A CS cut-off of 0.0 returns all homology relationships, while a CS cut-off of 1.0 returns only fully consistent predictions.
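As a worked illustration of the two formulas and the cut-off, the following Python sketch computes CSo and CSp from tree counts. The function names are hypothetical, and the comparison is assumed to be inclusive (CS >= cut-off), which is consistent with a cut-off of 0.0 returning every relationship and a cut-off of 1.0 returning only fully consistent ones.

```python
from typing import Tuple


def consistency_scores(trees_orthology: int, trees_paralogy: int) -> Tuple[float, float]:
    """Return (CSo, CSp) for one protein pair from the number of supporting trees."""
    total = trees_orthology + trees_paralogy
    if total == 0:
        raise ValueError("no trees cover this protein pair")
    return trees_orthology / total, trees_paralogy / total


def is_predicted_ortholog(trees_orthology: int, trees_paralogy: int,
                          cs_cutoff: float = 0.5) -> bool:
    """Assumed inclusive test: the pair is reported when CSo >= cut-off."""
    cso, _ = consistency_scores(trees_orthology, trees_paralogy)
    return cso >= cs_cutoff


# A pair seen in 4 trees: 3 support orthology, 1 supports paralogy.
cso, csp = consistency_scores(3, 1)
print(cso, csp)  # 0.75 0.25
print(is_predicted_ortholog(3, 1))                 # True at the recommended cut-off of 0.5
print(is_predicted_ortholog(3, 1, cs_cutoff=1.0))  # False: not fully consistent
```

By construction CSo and CSp sum to 1 for any pair, so raising the orthology cut-off above 0.5 means a pair can be reported as neither a confident ortholog nor a confident paralog.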
The Evidence Level is the number of independent sources (databases) that support the prediction. In general, the higher the Evidence Level, the more reliable the prediction, as more sources were used to infer it.
The Evidence Level may vary from 1 to 13 (as trees were retrieved from 13 databases). The Evidence Level cut-off has to be altered with care, as external databases overlap only partially, and for some pairs of species there is only one source of data (Evidence Level of 1). It is recommended to start queries with an Evidence Level cut-off of 1 and then increase the cut-off gradually if needed.
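The interplay of the two cut-offs can be illustrated with a short Python sketch. It is not the MetaPhOrs code: the function names and the example records are made up, and the Evidence Level is assumed here to be the number of distinct databases that contributed at least one tree for the pair.

```python
from typing import Iterable


def evidence_level(tree_sources: Iterable[str]) -> int:
    """Count distinct source databases among the trees covering a protein pair."""
    return len(set(tree_sources))


def passes_cutoffs(cs: float, el: int, cs_cutoff: float = 0.5, el_cutoff: int = 1) -> bool:
    """Keep a prediction only if both the Consistency Score and Evidence Level qualify."""
    return cs >= cs_cutoff and el >= el_cutoff


# Hypothetical pair covered by three trees from two databases.
sources = ["PhylomeDB", "PhylomeDB", "Eggnog"]
el = evidence_level(sources)
print(el)                                     # 2
print(passes_cutoffs(0.75, el))               # True with the default cut-offs
print(passes_cutoffs(0.75, el, el_cutoff=3))  # False: only two sources available
```

Starting with an Evidence Level cut-off of 1 keeps pairs that are covered by a single database, and stricter cut-offs can then be tried to see how many predictions remain.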