EFI-EST, a web tool for generating sequence similarity networks to visualize sequence-function relationships in protein families
As genome sequencing has become routine, the number of uncharacterized, unknown, or hypothetical proteins in the sequence databases has grown faster than the ability to assign their biological functions. Addressing this challenge requires tools that focus experimental efforts. A sequence similarity network (SSN) is one such tool: it enables facile visualization of sequence-function relationships in protein families, thereby focusing sequence-based functional annotation efforts.
The dendrograms and trees that have dominated similarity analyses can become unwieldy when dealing with tens of thousands of protein sequences. In contrast, SSNs are easy to both calculate and manipulate, thereby allowing analyses of sequence-function relationships for even large protein families. SSNs are graphical representations in which each sequence is a node, and two nodes are connected by a line (an edge) if the sequences share a degree of similarity above a user-defined threshold.
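The thresholding rule described above can be illustrated with a minimal Python sketch. It is not EFI-EST code: the function name `build_ssn`, the sequence identifiers, and the scores are hypothetical, and the sketch assumes that an edge is kept when the pairwise score is at or above the threshold.

```python
from collections import defaultdict

def build_ssn(pair_scores, threshold):
    """Build an SSN as adjacency sets from pairwise similarity scores.

    pair_scores maps (seq_a, seq_b) -> similarity score (hypothetical values).
    An edge is kept when score >= threshold (an assumption of this sketch).
    """
    network = defaultdict(set)
    for (a, b), score in pair_scores.items():
        # Register both sequences as nodes even if no edge survives.
        network[a]
        network[b]
        if a != b and score >= threshold:
            network[a].add(b)
            network[b].add(a)
    return dict(network)

# Hypothetical pairwise scores for four sequences.
pair_scores = {
    ("P1", "P2"): 80,
    ("P1", "P3"): 25,
    ("P2", "P3"): 30,
    ("P3", "P4"): 60,
}

ssn = build_ssn(pair_scores, threshold=50)
for node in sorted(ssn):
    print(node, sorted(ssn[node]))
```

Raising the threshold removes edges and splits the network into more, smaller clusters; lowering it merges clusters.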
The use of SSNs was pioneered by the Structure-Function Linkage Database (SFLD), an NIGMS-supported resource at the University of California-San Francisco. The SFLD provides manually curated SSNs for a modest number of functionally diverse protein families that can be visualized using Cytoscape. To equip the biochemical community with tools to address the “annotation problem”, the Enzyme Function Initiative (U54GM093342) provides “open access” to software and web tools that allow anyone to generate SSNs.
The EFI’s Enzyme Similarity Tool (EFI-EST) is a web-based resource developed and maintained at the Institute for Genomic Biology at the University of Illinois. EFI-EST allows a user to generate an SSN for any protein family. Sequences can be retrieved from the UniProt database by similarity to a user-supplied seed sequence; alternatively, all sequences in a Pfam or InterPro family can be used, or the user can upload a FASTA file of sequences. EFI-EST first performs an all-by-all BLAST to determine the pairwise sequence similarities and then generates the SSN. Representative node (metanode) networks are provided for visualization of very large families. For sequences obtained from UniProt, more than 30 node attributes drawn from various annotation sources are provided; these can give insights into sequence-function relationships.
For functional discovery, the SSN is most useful when filtered by a sequence identity threshold (alignment score) that achieves isofunctional fractionation – that is, all sequences within a cluster share the same function. At this level of clustering, the user can transfer function between sequences within the cluster. Although no universal alignment score achieves isofunctional fractionation for all protein families, an SSN can be filtered using one or more node attributes that can be associated with functional divergence. For example, the user can identify the nodes with experimentally characterized functions, as described by the SwissProt database, and choose an alignment score that separates these functions, thereby fractionating the SSN into clusters with known and unknown functions.
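The idea of choosing an alignment score that separates characterized functions can be sketched as follows. This is an illustrative Python outline, not the EFI-EST procedure: all function names, node identifiers, scores, and annotations are hypothetical. It treats clusters as connected components and accepts the lowest candidate threshold at which no cluster contains two different known functions. Passing this check shows only consistency with the available annotations, not that every cluster is truly isofunctional.

```python
def clusters_at(pair_scores, nodes, threshold):
    """Connected components when edges with score >= threshold are kept."""
    parent = {n: n for n in nodes}

    def find(n):
        # Union-find lookup with path halving.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for (a, b), score in pair_scores.items():
        if score >= threshold:
            parent[find(a)] = find(b)

    groups = {}
    for n in nodes:
        groups.setdefault(find(n), set()).add(n)
    return list(groups.values())

def separates_known_functions(clusters, known_function):
    """True if no cluster mixes two different annotated functions."""
    return all(
        len({known_function[n] for n in c if n in known_function}) <= 1
        for c in clusters
    )

def lowest_separating_threshold(pair_scores, nodes, known_function, candidates):
    """Scan candidate thresholds in ascending order; return the first that separates."""
    for t in sorted(candidates):
        if separates_known_functions(clusters_at(pair_scores, nodes, t), known_function):
            return t
    return None

# Hypothetical data: two characterized sequences (A, D) and three uncharacterized.
nodes = ["A", "B", "C", "D", "E"]
pair_scores = {("A", "B"): 90, ("B", "C"): 40, ("C", "D"): 85, ("D", "E"): 70}
known_function = {"A": "kinase", "D": "lyase"}

best = lowest_separating_threshold(pair_scores, nodes, known_function, [30, 50, 80])
print(best)
print(sorted(sorted(c) for c in clusters_at(pair_scores, nodes, best)))
```

Taking the lowest separating threshold keeps clusters as large as possible, so that each known function can be tentatively transferred to the most uncharacterized sequences; a more conservative analysis might prefer a higher value.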
EFI-EST is available online (http://efi.igb.illinois.edu/efi-est/) without charge for the generation of SSNs with 150,000 sequences or fewer. Generation of an SSN takes 6.5 hours on average, depending on the number of sequences. A detailed description of and tutorial for using EFI-EST, including example applications, was recently published in Biochimica et Biophysica Acta. In addition, a regularly updated online tutorial describing the use of EFI-EST is available (http://efi.igb.illinois.edu/efi-est/index.php).
Users interested in generating larger networks are welcome to contact email@example.com for access to the software on the cluster at the University of Illinois.
- H.J. Atkinson, J.H. Morris, T.E. Ferrin, and P.C. Babbitt, Using sequence similarity networks for visualization of relationships across diverse protein superfamilies. PLoS One 2009, 4, e4345. PMCID: PMC2631154
- E. Akiva, S. Brown, D.E. Almonacid, A.E. Barber, 2nd, A.F. Custer, M.A. Hicks, C.C. Huang, F. Lauck, S.T. Mashiyama, E.C. Meng, D. Mischel, J.H. Morris, S. Ojha, A.M. Schnoes, D. Stryke, J.M. Yunes, T.E. Ferrin, G.L. Holliday, and P.C. Babbitt, The Structure-Function Linkage Database. Nucleic Acids Res 2014, 42, D521-30. PMCID: PMC3965090
- J.A. Gerlt, J.T. Bouvier, D.B. Davidson, H.J. Imker, B. Sadkhin, D.R. Slater, and K.L. Whalen, Enzyme Function Initiative-Enzyme Similarity Tool (EFI-EST): A web tool for generating protein sequence similarity networks. Biochim Biophys Acta 2015, 1854, 1019-1037. PMCID: PMC4457552