COMER web server help

Table of contents

  1. Introduction
  2. Homology search
  3. Workflow
  4. Input
  5. Profile construction
  6. Profile databases
  7. Results and supported services
  8. Homology examples
  9. Execution times
  10. Limits
  11. References

Introduction

The COMER web server provides services for protein analysis by homology. It is powered by the COMER2 homology search engine. COMER2 is a sensitive and specific homology search method [1] that accelerates homology searches by harnessing the power of GPUs [2]. The COMER2 software architecture allows multiple independent instances of homology search to run simultaneously on the same GPU. This property allows the web server to exploit computational resources efficiently and distribute the workload across multiple dedicated GPUs.

Homology search

The server runs COMER2 searches and supports the analysis of their results. COMER2 is a profile-profile comparison and alignment method: a homology search compares the query profile (a model of a sequence family) with the entries (profiles) of the selected database(s) using a fast GPU-accelerated search engine.

Workflow

The workflow for using the COMER2 tool in the web server environment is shown in Fig. 1. The user enters queries, selects databases, adjusts parameters if desired, and clicks the Submit button. The progress of the job is then displayed, and the results appear once the job has finished. If the user provided an e-mail address, an e-mail with a link to the results is sent.

Figure 1. Flowchart for the COMER web server

The user can provide sequences (plain or in FASTA format), MSAs (in aligned FASTA, STOCKHOLM, or A3M format), and COMER2 profiles, all in the same input field. The server automatically determines the format of the input data.

The COMER web server can be instructed to build informative and diverse MSAs for user queries (profiles excluded). In that case, the server performs additional sequence searches with the user queries using HHblits [3], HMMER3 [4], or both, and builds MSAs from statistically significant hits. When both tools are used, the MSA for each query results from combining the sequence alignments that the two tools produce independently, which is useful for query sequences with low homology.

Each sequence and MSA corresponding to a user query is subjected to profile construction. COMER2 profiles include secondary structure (SS) predictions [5] made for query proteins.

For each user query, the results of a profile-profile search are presented as a list of pairwise alignments between the query and database entries, both represented by constructed profiles. The produced alignments are amenable to further analysis: by selecting different combinations of them for a query, the user can construct different MSAs and, when the detected proteins have known structures, generate 3D structural models by homology to those proteins [6] in bulk.

Input

The user can enter multiple queries in different formats. Processing user queries in bulk removes the limitation of focusing on a single protein of interest at a time and allows the server to utilize computing power efficiently. The server provides results for each query. The number of queries in the input field is limited to 100 for COMER2 searches. The maximum size of queries in the text field is 5 MB, and the maximum file size for uploaded queries is 50 MB.

Individual queries are separated by the line "//". The format of a query is recognized automatically. Currently, these formats are supported for sequences: plain sequence and FASTA. A multiple sequence alignment (MSA) can be provided in these formats: aligned FASTA (FASTA with gaps "-" allowed), STOCKHOLM, and A3M. Also, the server recognizes COMER2 profiles as individual queries.
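
The automatic format recognition described above can be illustrated with a small heuristic. This is only a sketch under stated assumptions, not the server's actual detection logic: the function name detect_format and the returned labels are hypothetical, the A3M test relies on the convention that insertions are written in lowercase, and COMER2 profiles are not handled (they fall through to "unknown").

```python
import re

def detect_format(query: str) -> str:
    """Heuristically guess the format of a single query record (illustrative only)."""
    lines = [ln.strip() for ln in query.splitlines() if ln.strip()]
    if not lines:
        return "empty"
    if lines[0].startswith("# STOCKHOLM"):
        return "stockholm"
    if not lines[0].startswith(">"):
        # No header line: accept only residue letters as a plain sequence.
        body = "".join(lines)
        return "plain" if re.fullmatch(r"[A-Za-z]+", body) else "unknown"
    # Gather the sequence text that follows each ">" header.
    seqs = []
    for ln in lines:
        if ln.startswith(">"):
            seqs.append("")
        else:
            seqs[-1] += ln
    if len(seqs) == 1:
        return "fasta"
    # Assumption: A3M writes insertions relative to the first sequence in lowercase.
    if any(re.search(r"[a-z]", s) for s in seqs[1:]):
        return "a3m"
    # Aligned FASTA rows all have the same length, gaps included.
    if len({len(s) for s in seqs}) == 1:
        return "aligned_fasta"
    return "unknown"
```

Under this heuristic, a multi-sequence record in which some rows contain lowercase residues is classified as A3M, while a record whose rows all have equal length and no lowercase residues is classified as aligned FASTA.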

An example of an input record containing multiple queries is shown below:

KAITKERDSDFVKELFHQPELMADEEELQRATEQVVEMIENFKSFSLHFPLFWYDERYCISNILELEDLDEVWHTLPNDESCVDLWNMYKN
//
>1BKV_C T3-785; COLLAGEN, HYDROXYPROLINE, HYDROGEN BONDING, TRIPLE; HET: HYP, ACY; 2.0A {N/A} SCOP: k.3.1.1
PPGPPGPPGITGARGLAGPPGPPGPPGPPG
>tr|H9FJ52|H9FJ52_MACMU Collagen alpha-1(XXIV) chain (Fragment) OS=Macaca mulatta GN=COL24A1 PE=2 SV=1
PRGEKGFRGETGPQGPRGQPGPPGPPGAPG
>tr|B4KBY1|B4KBY1_DROMO Uncharacterized protein OS=Drosophila mojavensis GN=Dmoj\GI10190 PE=4 SV=2
PPGPPGPPGppgPpgpPGPRGFTGATGPRGRDAPRT
//
>OsSUT1, RICE LEAVES, 538 residues|
MARGSGAGGGGGGGGGGLELSV-GVGG---GGG--------A-RGG----GG-G-E-A--AAAV----ETA---AP-I--S-LGR---L-I-L--SGM--V-A
>UniRef50_A0A444XBP0/16-506 (16-506) Uncharacterized protein n=2 Tax=Arachis hypogaea TaxID=3818 RepID=A0A444XBP0_ARAHY  Expect=3.8e-184  Expect=3.8e-184
---------------------------------------------------N-S-L-Q--LESG----SRE---QP-S--P-IWK---L-V-A--VAS--I-A
>UniRef50_A0A251VK29/1-393 (1-393) Putative sucrose/H+ symporter, plant, Major facilitator superfamily domain protein n=1 Tax=Helia  Expect=1e-129  Expect=1e-129
--------------------------------------------------------------------------------------------M--VSA--I-A
//

The last separator in the input field is optional. If only one MSA in aligned FASTA format is provided, the user can activate a checkbox to instruct the server to treat each sequence in the MSA as an individual query (gaps are ignored). In that case, the server will process multiple query sequences, corresponding to the number of sequences in the MSA, instead of one MSA.

For example, the following input

>OsSUT1, RICE LEAVES, 538 residues|
MARGSGAGGGGGGGGGGLELSV-GVGG---GGG--------A-RGG----GG-G-E-A--AAAV----ETA---AP-I--S-LGR---L-I-L--SGM--V-A
>UniRef50_A0A444XBP0/16-506 (16-506)
---------------------------------------------------N-S-L-Q--LESG----SRE---QP-S--P-IWK---L-V-A--VAS--I-A
>UniRef50_A0A251VK29/1-393 (1-393)
--------------------------------------------------------------------------------------------M--VSA--I-A

can be configured to represent three different queries (sequences in the MSA).
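
The two input conventions described in this section (the "//" separator line and the option to treat each MSA row as a separate query) can be sketched as follows. The function names split_queries and msa_to_queries are hypothetical and only illustrate the rules stated above; they are not the server's code.

```python
def split_queries(text: str) -> list[str]:
    """Split the input field into queries on lines that consist solely of '//'."""
    queries, current = [], []
    for line in text.splitlines():
        if line.strip() == "//":
            queries.append("\n".join(current).strip())
            current = []
        else:
            current.append(line)
    # The last separator is optional, so keep whatever follows the final '//'.
    queries.append("\n".join(current).strip())
    return [q for q in queries if q]

def msa_to_queries(aligned_fasta: str) -> list[tuple[str, str]]:
    """Turn each row of an aligned-FASTA MSA into a (header, ungapped sequence) pair."""
    records = []
    for line in aligned_fasta.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            # Gaps are ignored when rows become individual queries.
            records[-1][1] += line.replace("-", "")
    return [(header, seq) for header, seq in records]
```

Applied to the three-row MSA shown above, msa_to_queries would yield three (header, sequence) pairs with all "-" characters removed.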

Profile construction

The COMER2 profile is constructed from an MSA. A profile or an MSA can be provided directly (see Input). Otherwise, or optionally in the case of MSAs, the query is searched against sequence database(s) to build an MSA. The sequence search can be conducted using HHblits [3], HMMER [4], or both. In the last case, HHblits and HMMER are run independently, and their resulting pairwise sequence alignments are combined into a final MSA. Note that if an MSA provided as a query is itself used for a sequence search, it first undergoes (HHblits and/or HMMER) profile construction, followed by a profile-to-sequence search.

Profile databases

The following profile databases are available for COMER2 search:

Multiple databases can be selected (Ctrl + left mouse click) before submitting a job.

Results and supported services

The COMER2 results are presented as follows. First, a list of links to the results for each query is displayed. Clicking a link shows the detailed results for the corresponding query (Fig. 2).

Results show the list of pairwise profile-profile (COMER) alignments.

Figure 2. COMER results

A profile-profile alignment between a query and an identified protein is shown along with predicted secondary structures, which help visually assess how well structural features align.

Alignment statistics accompany each alignment:

  • The length and the effective number of observations (ENO) per position of the query and the identified protein represent the profile length and informativeness, respectively. ENO rarely exceeds 14, and a profile is considered informative if it is greater than 4.
  • E-value and P-value = 1 - exp(-(E-value)) are the statistical significance estimates of the produced alignment.
  • Score is the profile-profile alignment score upon which the statistical significance estimates depend.
  • Bit score given in parentheses is a database size-independent measure (as opposed to E-value and P-value) and can be used to compare alignment significance across different database searches.
  • Identities, Positives, and Gaps show the number of pairs of identical amino acids, positive profile-profile substitution scores, and gaps in the alignment, respectively.
  • Note that an aligned pair of identical amino acids may not imply a positive substitution score if the amino acid is not conserved and the amino acid distributions differ at the corresponding profile positions.
  • The table of the statistical parameters Lambda and K follows the profile-profile alignment. K and Lambda estimated for gapped alignments are used to calculate E-value and P-value, and their values depend on the values computed from the profile-profile score matrix (Computed ungapped). The computed value of Lambda is an indicator of compositional similarity between the profiles (sequence families) being compared. Compositional similarity increases with decreasing values of Lambda. The measure of compositional similarity is used when calculating E-value and P-value.
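
The relation between the two significance estimates listed above, P-value = 1 - exp(-(E-value)), can be evaluated as in the following sketch. The function name is hypothetical; math.expm1 is used only to keep the result accurate for very small E-values, where a direct 1 - exp(-E) would lose precision.

```python
import math

def p_value_from_e_value(e_value: float) -> float:
    """Compute P-value = 1 - exp(-E) from an E-value."""
    # expm1(x) returns exp(x) - 1 without cancellation error for small |x|.
    return -math.expm1(-e_value)

if __name__ == "__main__":
    for e in (1.6e-7, 0.0011, 1.0, 10.0):
        print(f"E-value = {e:g}  ->  P-value = {p_value_from_e_value(e):.6g}")
```

For small E-values the two estimates are nearly equal; they diverge as the E-value approaches and exceeds 1, because the P-value is bounded by 1.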

The original input, final MSA, profile, log files, and the result file in JSON format are all available for download for each query.

The COMER web server offers services at the sequence, structure, and function levels. At the sequence level, alignments produced by COMER2 can be selected individually or as a group for building an MSA based on accurate profile-profile alignments. The latter option is particularly useful when the number of produced alignments is large and only the subset with E-values within a given interval is of interest. Visualization of the resulting MSA allows for interactive analysis of the aligned sequences.
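
Selecting alignments by E-value interval, as described above, amounts to a simple filter. The sketch below is illustrative only: the record layout (dictionaries with "id" and "evalue" keys) and the function name are hypothetical and do not reflect the schema of the server's JSON result file.

```python
def select_by_evalue(hits, low=0.0, high=1e-3):
    """Return hits whose E-value lies in [low, high], most significant first."""
    chosen = [h for h in hits if low <= h["evalue"] <= high]
    return sorted(chosen, key=lambda h: h["evalue"])

if __name__ == "__main__":
    # Made-up hits, used only to demonstrate the filter.
    hits = [
        {"id": "hit_A", "evalue": 0.5},
        {"id": "hit_B", "evalue": 1e-7},
        {"id": "hit_C", "evalue": 1e-4},
    ]
    for hit in select_by_evalue(hits, high=1e-3):
        print(hit["id"], hit["evalue"])
```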

Structural analysis is possible through 3D structural model generation with Modeller [6] using the structures of identified proteins as templates and produced alignments as restraints. A multiple selection option allows the user to generate one model using multiple templates (up to 7) or multiple models (currently up to 15), one for each selected alignment, with one click of a button. Interactive visual analysis of the generated models is supported.

The functional inference is based on the results of searching annotated databases. For example, a statistically significant alignment between a query and a protein from the deeply annotated UniProtKB/Swiss-Prot database supplemented with SS predictions may provide strong evidence for a functional relationship. In general, simultaneous searching across all available profile databases enables the analysis of the proteins of interest at the sequence, structure, and function levels at the same time.

The services are activated by submitting a job. Once the results are ready, a visualization layer is added to depict results graphically.

Homology examples

PF11821

When the UniProtKB/Swiss-Prot90 (2021_03) database was searched with the Pfam DUF family PF11821, the most significant hit identified by COMER2 (E-value=1.6e-7; 6% sequence identity) was P23461, the Rhodobacter capsulatus protein PucD. The search took 37 sec to run on the COMER web server.

In Pfam 35.0, PF11821 was classified as the ActD subunit (PDB ID: 6btm_D) of the Alternative complex III (ACIII). The function of the ActD subunit is unknown, but some evidence suggests that it may play a role in stabilizing the ACIII complex, a key component of bacterial respiratory and/or photosynthetic electron transport chains.

PucD is a subunit of the LHII light-harvesting complex, involved in the electron transport chain. Although the function of PucD is unknown, experiments have shown PucD to be involved in stabilizing the LHII complex.

The AlphaFold2 model of PucD shows the same fold as ActD, where ActD has an insertion of two transmembrane helices (Fig. 3). The similarity in structure and molecular context suggests that the relationship between ActD (PF11821) and PucD (P23461) identified by COMER2 is homologous.

Figure 3. Structural similarity between ActD (PF11821; left) and the AlphaFold2 model of PucD (P23461; center and right). The segments aligned by COMER2 are colored blue to red from N- to C-terminus (left and center). The confidence of the AlphaFold2 model is shown on the right

PF09196

In Pfam 35.0, the PF09196 family consists of a single sequence, a Sulfolobus tokodaii maltooligosyl trehalose synthase domain (PDB ID: 3hje:642-704), with an unknown function (Fig. 4). A search with it against the UniProtKB/Swiss-Prot90 (2021_03) database produced no significant hits.

A more informative MSA (29 sequences) for PF09196 was obtained by conducting 3 HMMER iterations against the MGnify sequence database. The whole process, including COMER2 profile construction and searching the Swiss-Prot90 database, took 36 min to run on the COMER web server.

This time, the most significant alignment (E-value=0.0011; 24% sequence identity) was with the annotated entry P9WQ20, a Mycobacterium tuberculosis putative maltooligosyl trehalose synthase. COMER2 aligned the corresponding domain of P9WQ20, which shares the same fold with PF09196 (Fig. 4). The high overall structural similarity of 3hje and P9WQ20 (TM-score=0.947) suggests that the two share a common functional mechanism.

Figure 4. Structural similarity between PF09196 (colored; top) and the corresponding domain of the AlphaFold2 model of P9WQ20 (colored; bottom). The segments aligned by COMER2 are colored blue to red

Execution times

The two tables below show the execution times for various settings when conducting COMER2 searches with Swiss-Prot90 sequences against the UniProtKB/Swiss-Prot90 and PDB70 profile databases, respectively. These times were obtained using the server's API. Rendering the results in the graphical user interface would take roughly another dozen seconds.

The sequences were searched against the HHsuite databases UniRef30 [14] and BFD [15] using HHblits and against the sequence databases UniRef50 [16] and MGnify [17] using HMMER. MSAs obtained from the HHblits search against UniRef30 (MSA query) and COMER2 profiles constructed from these MSAs (Profile query) were used separately to query the server instructed to skip sequence searches. All other settings were set to default values.

UniProtKB/Swiss-Prot90 searches (execution time, sec)

Seq. ID   Length  Profile query  MSA query  HHblits+UniRef30  HHblits+BFD  HMMER+UniRef50  HMMER+MGnify
Q27YE2       101             17         19                73          121             217          1351
Q05239       101             18         20                74          125             217          1344
P29069       101             17         19                69          115             217          1356
A4PBQ0       206             16         20                82          138             233          1356
P19743       206             16         20                80          134             216          1358
Q86W67       206             16         20                80          136             217          1346
O94577       502             22         30               120          189             333          1600
Q6GZV8       502             24         30               118          196             228          1713
P52638       503             23         29               100          198             228          1338
P50535      1039             40         52               272          439            1221          2844
Q0WVX5      1040             45       216*               920         1612            1475          3967
P42835      1041             40         52               226          372             623          2168
Q6UDF2      2033             75         99               399          586             755          3865
Q54GV0      2036             73         97               525          864            5550         12240
Q92576      2039             78        219               797          766            3013          6190
A2AAE1      5005            163        447              1285         1456            5133         10240
Q9SRU2      5098            198        534              1617         1617            5373         13979
Q8SX83      5560            231        291              1086         1791           32106         39097
Q9N4M4      8545            394      1555*              5644         5252           21185         18931
W6RTA4      8922            352      1182*              4847         3407           28727         17752
Q8I3Z1#     9999            345        683              2452         4956           61335         58382

*MSA reduced to a maximum size of 50MB

#truncated sequence

PDB70 searches (execution time, sec)

Seq. ID   Length  Profile query  MSA query  HHblits+UniRef30  HHblits+BFD  HMMER+UniRef50  HMMER+MGnify
Q27YE2       101              9         12                66          113             209          1274
Q05239       101              9         12                66          118             210          1268
P29069       101              9         11                61          108             210          1280
A4PBQ0       206              9         13                75          130             225          1276
P19743       206              8         13                73          126             208          1279
Q86W67       206              8         12                72          128             208          1266
O94577       502              8         14               104          174             318          1513
Q6GZV8       502              8         14               102          181             213          1626
P52638       503              8         15                86          183             212          1250
P50535      1039             10         22               242          409            1189          2746
Q0WVX5      1040             11       178*               882         1578            1440          3870
P42835      1041             10         22               196          342             592          2070
Q6UDF2      2033             14         38               338          525             694          3773
Q54GV0      2036             14         38               466          805            5493         12149
Q92576      2039             15        159               737          703            2948          6099
A2AAE1      5005             26        326              1164         1317            4992         10103
Q9SRU2      5098             33        382              1465         1455            5202         13806
Q8SX83      5560             33         93               888         1582           31881         38869
Q9N4M4      8545             58      1206*              5295         4914           20828         18636
W6RTA4      8922             51       876*              4541         3104           28415         17445
Q8I3Z1#     9999             49        388              2157         4656           61031         58086

*MSA reduced to a maximum size of 50MB

#truncated sequence

Limits

  • Maximum query length = 9999.
  • Maximum number of queries in the input field for COMER searches = 100.
  • Maximum text size in the input field = 5MB.
  • Maximum file size of queries for upload = 50MB.
  • Maximum number of single-template 3D structural models generated simultaneously = 50.
  • Maximum number of structural templates for one 3D model = 7.

The maximum duration of a job is limited to 24 hours. A job will be canceled and the user will be notified when that time expires.

If some of a job's queries fail, the error log, which is visible to the user, will contain a short description of the problem. The results of the other, successfully completed queries of the same job will be available for analysis as usual.
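
The input-related limits listed above can be checked before submission with a few lines of code. This is a minimal sketch, not the server's validation routine: the function and constant names are hypothetical, 1 MB is assumed to mean 1024*1024 bytes, and each query is assumed to be given as an ungapped sequence whose length is its residue count.

```python
MAX_QUERY_LENGTH = 9999            # maximum query length
MAX_QUERIES = 100                  # maximum number of queries in the input field
MAX_TEXT_BYTES = 5 * 1024 * 1024   # 5 MB text limit (assumes 1 MB = 1024*1024 bytes)

def check_limits(text: str, sequences: list[str]) -> list[str]:
    """Return human-readable descriptions of every limit the input violates."""
    problems = []
    if len(text.encode("utf-8")) > MAX_TEXT_BYTES:
        problems.append("input text exceeds 5 MB")
    if len(sequences) > MAX_QUERIES:
        problems.append(f"{len(sequences)} queries exceed the limit of {MAX_QUERIES}")
    for index, seq in enumerate(sequences, start=1):
        if len(seq) > MAX_QUERY_LENGTH:
            problems.append(f"query {index} is longer than {MAX_QUERY_LENGTH} residues")
    return problems
```

An empty list means the input passes these particular checks; the 50 MB upload limit and the modeling limits are not covered by this sketch.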

References

  1. Margelevičius M. Bayesian nonparametrics in protein remote homology search. Bioinformatics. 2016;32(18):2744-52.
  2. Margelevičius M. COMER2: GPU-accelerated sensitive and specific homology searches. Bioinformatics. 2020;36(11):3570-3572.
  3. Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2011;9(2):173-5.
  4. Eddy SR. Accelerated Profile HMM Searches. PLoS Comput Biol. 2011;7(10):e1002195.
  5. Jones D. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999;292(2):195-202.
  6. Webb B, Sali A. Protein Structure Modeling with MODELLER. Methods Mol Biol. 2017;1654:39-54.
  7. Burley S, Bhikadiya C, Bi C, Bittrich S, Chen L, Crichlow G, Christie C, Dalenberg K, Di Costanzo L, Duarte J, et al. RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. Nucleic Acids Res. 2021;49(D1):D437-D451.
  8. Chandonia J, Fox N, Brenner S. SCOPe: classification of large macromolecular structures in the structural classification of proteins-extended database. Nucleic Acids Res. 2019;47(D1):D475-D481.
  9. Mistry J, Chuguransky S, Williams L, Qureshi M, Salazar G, Sonnhammer E, Tosatto S, Paladin L, Raj S, Richardson L, et al. Pfam: The protein families database in 2021. Nucleic Acids Res. 2021;49(D1):D412-D419.
  10. UniProt Consortium. UniProt: the universal protein knowledgebase in 2021. Nucleic Acids Res. 2021;49(D1):D480-D489.
  11. Schaeffer R, Liao Y, Cheng H, Grishin N. ECOD: new developments in the evolutionary classification of domains. Nucleic Acids Res. 2017;45(D1):D296-D302.
  12. Galperin M, Wolf Y, Makarova K, Vera Alvarez R, Landsman D, Koonin E. COG database update: focus on microbial diversity, model organisms, and widespread pathogens. Nucleic Acids Res. 2021;49(D1):D274-D281.
  13. Lu S, Wang J, Chitsaz F, Derbyshire M, Geer R, Gonzales N, Gwadz M, Hurwitz D, Marchler G, Song J, et al. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Res. 2020;48(D1):D265-D268.
  14. Mirdita M, von den Driesch L, Galiez C, Martin M, Söding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucleic Acids Res. 2017;45(D1):D170-D176.
  15. Steinegger M, Mirdita M, Söding J. Protein-level assembly increases protein sequence recovery from metagenomic samples manyfold. Nat Methods. 2019;16(7):603-606.
  16. Suzek B, Wang Y, Huang H, McGarvey P, Wu C, and the UniProt Consortium. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics. 2015;31(6):926-32.
  17. Mitchell A, Almeida A, Beracochea M, Boland M, Burgin J, Cochrane G, Crusoe M, Kale V, Potter S, Richardson L, et al. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Res. 2020;48(D1):D570-D578.