Prophane Documentation

Prophane Workflow Description

Prophane provides a tailored and fully automated workflow for metaproteomics analysis with a special focus on the taxonomic and functional annotation of metaproteins. For the annotations you can choose between different databases and algorithms (see Annotation Databases). In metaproteomics, protein identifications can be ambiguous when an identified peptide (or a subset of identified peptides) can be assigned to more than one protein. The most common approach is to group these ambiguous identifications into so-called protein groups (synonyms: metaprotein, protein family). Prophane provides consistent protein annotation based on sequence homology and condenses the annotations of each metaprotein using a lowest common ancestor (LCA) approach. Moreover, quantitative data is provided in a sample-specific manner based on normalized spectral abundance factors (NSAF). While the standard analysis needs minimal input, an expert mode allows detailed customization of each workflow step and the selection of various annotation databases and search algorithms.
Different input formats are supported, and results are output in two formats: a tab-separated summary.txt file and the standard mzTab format ("lca_summary.mztab" with LCA results per protein group and "protein_summary.mztab" with annotations per protein). Additionally, the summary results are visualized via Krona plots.
 Workflow

The minimum input for a Prophane analysis consists of a proteomic search result file (the "Report File") and the matching FASTA file.

Report files

The report file is the protein identification file of the mass spectrometry analysis and must contain information about protein groups (also known as metaproteins or protein families). Prophane supports various input formats, as listed below. If you plan to use another input format on a regular basis, please contact us (support@prophane.de).
So far, Prophane supports proteomic search result data provided in:

  • generic input format
  • mzTab format
  • mzIdentML 1.2 format
It also accommodates data produced by search software such as:

  • MetaProteome Analyzer in single or sample comparison format
  • Proteome Discoverer protein group output
  • Scaffold output

FASTA file

The FASTA file can be plain or gzipped and should contain sequences for all identified proteins. Ensure that the protein accessions or IDs in the FASTA file match those used in the protein identification report. You can use the same FASTA file you used for peptide-spectrum matching. However, an excessively large FASTA file can significantly increase Prophane's runtime; ideally, the file contains all sequences present in the identification file and no more. Avoid sequence IDs containing pipes ('|') to ensure consistent data processing by Prophane. Sequence IDs with two or more pipes, such as UniProt protein accessions, are shortened to the accession (e.g. from sp|O70558|SPR2G_MOUSE to O70558).
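If you want to check in advance how pipe-containing IDs would be shortened, the following minimal sketch mirrors the example above. It assumes that the accession is always the second pipe-delimited field; the function name shorten_pipe_id is hypothetical, and Prophane's actual handling of other ID layouts may differ.

```python
def shorten_pipe_id(seq_id: str) -> str:
    """Return the second pipe-delimited field for IDs with two or more pipes.

    Assumption: the accession sits in the second field, as in
    sp|O70558|SPR2G_MOUSE. IDs with fewer than two pipes are returned as-is.
    """
    fields = seq_id.split("|")
    # Two or more pipes produce at least three fields.
    if len(fields) >= 3:
        return fields[1]
    return seq_id


shortened = shorten_pipe_id("sp|O70558|SPR2G_MOUSE")
unchanged = shorten_pipe_id("WP_000665196.1")
```

Applying such a function to both the FASTA headers and the report accessions is one way to verify that they still match after shortening.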

Report Files Formats

Generic Format

A generic table format (tab separated values) that is designed to be easy to recreate. The column "protein accessions" contains all protein accessions of one protein group and the quantification is based on spectral counts per protein group. Example table:

sample category | sample name | protein accessions | spectrum count
sample A | replicate 1 | WP_003131952.1, NP_268346.1, KZK33282.1 | 1
sample A | replicate 1 | WP_000665196.1 | 2
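A table like this can be produced with a few lines of Python. The sketch below is only an illustration: the column headers and the ", " separator between accessions are copied from the example table, and the function name write_generic_report is hypothetical.

```python
import csv
import io


def write_generic_report(rows, handle):
    """Write protein groups as a tab-separated table in the generic layout.

    rows: iterable of (sample category, sample name, accessions, spectrum count).
    Assumption: accessions of one protein group are joined with ", ".
    """
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(
        ["sample category", "sample name", "protein accessions", "spectrum count"]
    )
    for category, name, accessions, count in rows:
        writer.writerow([category, name, ", ".join(accessions), count])


buffer = io.StringIO()
write_generic_report(
    [
        ("sample A", "replicate 1", ["WP_003131952.1", "NP_268346.1", "KZK33282.1"], 1),
        ("sample A", "replicate 1", ["WP_000665196.1"], 2),
    ],
    buffer,
)
report_text = buffer.getvalue()
```

To write a file instead of an in-memory buffer, pass a handle opened with open(path, "w", newline="").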

mzTab 1.0.0 exchange format

Data provided in the mzTab exchange format can be processed by Prophane. Protein group information must be provided in the ambiguity_members column, and the PSM section must be present. To distinguish between different samples, the entries in the spectra_ref column of the PSM section must have the following format: ms_run[1-n]:"SPECTRA_REFERENCE". This format is required even if only one MS run is analyzed. Quantification is performed by counting the spectra/PSMs assigned to a protein group.
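The required spectra_ref layout can be checked before uploading a file. The following sketch validates entries against the ms_run[n]:reference pattern and counts PSMs per MS run. The regular expression, the function name and the "index=..." references are illustrative assumptions, not Prophane's own parser.

```python
import re
from collections import Counter

# ms_run[<number>]:<spectrum reference>, e.g. ms_run[1]:index=5
SPECTRA_REF = re.compile(r"^ms_run\[(\d+)\]:(.+)$")


def count_psms_per_run(spectra_refs):
    """Count spectra_ref entries per ms_run and reject malformed entries."""
    counts = Counter()
    for ref in spectra_refs:
        match = SPECTRA_REF.match(ref)
        if match is None:
            raise ValueError(f"unexpected spectra_ref: {ref!r}")
        counts[f"ms_run[{match.group(1)}]"] += 1
    return dict(counts)


example_counts = count_psms_per_run(
    ["ms_run[1]:index=5", "ms_run[1]:index=9", "ms_run[2]:index=3"]
)
```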

mzIdentML 1.2.0 exchange format

Data provided in the mzIdentML 1.2.0 exchange format containing ProteinAmbiguityGroup elements can be processed by Prophane. Version 1.2.0 of mzIdentML allows similar proteins to be summarized in 'ProteinAmbiguityGroup' sections. These groups are essential for performing metaproteome analysis with Prophane. Since mzIdentML version 1.1.0 doesn't support protein groups, these files cannot be processed by Prophane. Quantification is performed by counting the spectra/PSMs assigned to a protein group.

MPA (single experiment)

Data provided by the MetaProteome Analyzer (MPA) can be used by exporting the metaprotein report (CSV).

MPA (multiple experiments)

Data provided by the MetaProteome Analyzer (MPA) can be used by exporting the metaprotein multisample report (CSV).

Proteome Discoverer protein group

To export protein group data from Proteome Discoverer:
1. Check the filters set under "Display Filters -> Proteins" and remove the filter for master proteins.
2. Go to the Protein Groups table and select "Show associated tables -> Proteins".
3. Export this record as follows: File -> Export -> to Microsoft Excel.

The abundance of the master protein is used as the quantification value of the protein group. If multiple abundance types are available, they are selected in the following order: 'Abundance: sample' (= raw abundance), 'Abundances (Scaled): sample', 'Abundances (Normalized): sample'. Protein groups whose master protein row contains no abundance values for any sample are ignored. If you use scaled or normalized abundance values, we strongly recommend selecting "Raw value (no normalization)" in the step "Quantification method".

Scaffold

Data provided by Scaffold can be used by exporting Scaffold's protein report (TSV/XLS).


Usage of sample groups

Sample groups are used to calculate mean quantification values across several samples. This allows different sample groups to be compared, e.g. untreated vs. treated samples. Mean quantification values per sample group are reported in the "summary.txt" and "lca_summary.mztab" files and displayed in the Krona plots. The quantification values of each individual sample are reported as well. If no groups are specified, quantification is carried out for each sample separately.

Sample name extraction of different report files by Prophane

The following table describes how Prophane extracts sample names from the different protein reports. Use the same sample names when defining sample groups.

GENERIC table
  • sample columns or elements: sample category, sample name
  • comment: the sample name is generated by fusing the sample category and sample name columns with :: as separator
  • example: sample1 | replicateA | acc1 | 2
  • extracted sample name: sample1::replicateA

mzTab
  • sample columns or elements: MTD section of mzTab; all ms_run[number] elements are extracted
  • comment: samples are identified in the MTD section if they match the pattern ms_run[number]
  • example: "MTD ms_run[1]-location null"
  • extracted sample name: ms_run[1]

mzIdentML 1.2.0
  • sample columns or elements: SpectrumIdentificationResult element, spectraData_ref attribute
  • comment: sample names are extracted from the attribute spectraData_ref of the XML element <SpectrumIdentificationResult>
  • example: <SpectrumIdentificationResult spectraData_ref="qExactive2.mgf" spectrumID="index=5708" id="SIR_1769">
  • extracted sample name: qExactive2.mgf

MPA's single sample Metaprotein Report
  • sample columns or elements: column 'sample' or None
  • comment: if the column 'sample' is present in the given report, the content of this column is used; otherwise the generic sample name `sample1` is used
  • example: no sample information provided
  • extracted sample name: sample1

MPA's multisample Metaprotein Report
  • sample columns or elements: column headers starting with EXP:sample_name
  • comment: sample names are extracted from column headers that start with "EXP:", with the string after the colon (':') being the sample name
  • example: ... | Protein Accessions | EXP:S13.mgf | EXP:S20.mgf | EXP:S23.mgf | ...
  • extracted sample names: S13.mgf, S20.mgf, S23.mgf

Proteome Discoverer's protein group report
  • sample columns or elements: e.g. Abundances (Scaled): F1: Sample
  • comment: sample names are extracted from the abundance columns in the protein group headers that follow the schema "Abundance (x): Fnumber"
  • example: ... | Abundances (Scaled): F1: Sample | Abundances (Scaled): F2: Sample | ...
  • extracted sample names: F1, F2

Scaffold's protein report
  • sample columns or elements: the columns Biological sample category and Biological sample name are combined
  • comment: the values of both columns are fused using :: as separator
  • example: ... | sample A | replicate 1 | ...
  • extracted sample name: sample A::replicate 1
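To predict which sample names to use in your sample group definition, the extraction rules can be imitated as follows. This is a rough sketch for three of the report types; the function names and the regular expression are assumptions based only on the examples in this section and cover only parenthesized abundance headers such as "Abundances (Scaled): F1: Sample".

```python
import re


def generic_sample_name(category, name):
    """Fuse sample category and sample name with '::' (generic and Scaffold rule)."""
    return f"{category}::{name}"


def mpa_multisample_names(headers):
    """Return the text after 'EXP:' for every column header with that prefix."""
    prefix = "EXP:"
    return [h[len(prefix):] for h in headers if h.startswith(prefix)]


# Matches e.g. "Abundances (Scaled): F1: Sample" and captures "F1".
PD_ABUNDANCE = re.compile(r"^Abundances? \([^)]*\): (F\d+)")


def proteome_discoverer_names(headers):
    """Collect the F<number> identifiers from abundance headers, in header order."""
    names = []
    for header in headers:
        match = PD_ABUNDANCE.match(header)
        if match and match.group(1) not in names:
            names.append(match.group(1))
    return names


mpa_names = mpa_multisample_names(
    ["Protein Accessions", "EXP:S13.mgf", "EXP:S20.mgf", "EXP:S23.mgf"]
)
pd_names = proteome_discoverer_names(
    ["Abundances (Scaled): F1: Sample", "Abundances (Scaled): F2: Sample"]
)
```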

Normalized Spectral Abundance Factor (NSAF) is a label-free quantification method that normalizes the number of identified spectra to the metaprotein size. As size, Prophane uses the sequence length of the longest or the shortest protein of the protein group, or the mean sequence length of all protein group members.

quant method | description
NSAF (normalized to longest metaprotein sequence) | NSAF are calculated. Normalization is based on the largest sequence length of the respective protein group.
NSAF (normalized to shortest metaprotein sequence) | NSAF are calculated. Normalization is based on the smallest sequence length of the respective protein group.
NSAF (normalized to mean metaprotein sequence) | NSAF are calculated. Normalization is based on the averaged sequence length of the respective protein group.
Raw value (no normalization) | Unprocessed quantification values are shown. No NSAF are calculated.
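The three NSAF variants differ only in the sequence length used for normalization. The sketch below follows the commonly used NSAF definition, in which the spectral count of each protein group is divided by a length and the result is divided by the sum of these ratios over all protein groups of the sample. Whether Prophane implements exactly this formula is an assumption, and the function and variable names are hypothetical.

```python
from statistics import mean

# Length used for normalization, per quantification method.
LENGTH_PICKERS = {"longest": max, "shortest": min, "mean": mean}


def nsaf(groups, mode="longest"):
    """Compute NSAF values for the protein groups of one sample.

    groups: {group id: (spectral count, [sequence lengths of group members])}
    Assumes at least one group has a non-zero spectral count.
    """
    pick_length = LENGTH_PICKERS[mode]
    # Spectral abundance factor: spectral count divided by the chosen length.
    saf = {gid: count / pick_length(lengths) for gid, (count, lengths) in groups.items()}
    total = sum(saf.values())
    return {gid: value / total for gid, value in saf.items()}


example_groups = {"pg1": (10, [100, 200]), "pg2": (30, [300])}
by_longest = nsaf(example_groups, "longest")
by_shortest = nsaf(example_groups, "shortest")
by_mean = nsaf(example_groups, "mean")
```

The NSAF values of one sample always sum to 1, so they express relative rather than absolute abundance.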

Taxonomic annotation searches the selected reference database for similar proteins, retrieves the associated taxon ID and specifies the entire taxonomic lineage of this protein. The search algorithm DIAMOND BLASTP searches for sequence homologies in the specified protein database.
Taxonomic Annotation with DIAMOND BLASTP

Prophane supports the following protein sequence databases:
  • NCBI protein nr
  • UniprotKB (Swiss-Prot & TrEMBL)
  • Swiss-Prot
  • TrEMBL
Prophane always reports the taxonomic lineage of the protein with the highest-scoring match. The "evalue" parameter defines how significant a match has to be in order to be reported: the smaller the e-value, the less likely it is that the match occurred by chance. For information regarding additional parameters, please check the DIAMOND documentation: https://github.com/bbuchfink/diamond/raw/master/diamond_manual.pdf

Prophane supports the following protein functional databases:
  • eggNOG
  • PFAMs
  • TIGRFAMs
  • FOAM
  • CAZy/dbCAN
  • ResFAMs (full)
  • ResFAMs (core)

Functional Annotation with Profile-HMM Databases

Prophane utilizes the HMMER3 algorithm to search functional profile-HMM databases, which include PFAM, TIGRFAM, Resfams, CAZy/dbCAN, and FOAM. You can choose between two search modes:

hmmscan:
This mode searches protein sequences against a profile HMM database.
hmmsearch:
In contrast, this mode searches profile HMMs against a sequence database.
The functional annotation results include a hierarchically ordered function description, the protein family name, and the match ID.

For additional details about the parameters used by hmmscan and hmmsearch, please refer to the HMMER user guide: https://eddylab.org/software/hmmer/Userguide.pdf
Functional Annotation via HMMER3

Functional Annotation with EggNOG Database and emapper

In Prophane, when utilizing the EggNOG database and emapper for functional annotation, the approach is based on precomputed orthology assignments rather than homology. Orthologs, also known as orthologous genes, are genes found in different species that share a common origin from a single gene in the last common ancestor. This method has been demonstrated to provide more accurate predictions compared to homology-based approaches.
The EggNOG database is built upon a carefully curated selection of representative species, encompassing a diverse range of organisms across the tree of life. The orthologs identified from this database are associated with common annotations, including GO terms, protein descriptions, KEGG annotations, PFAMs, and COG categories.
The "evalue" parameter defines how significant a match has to be in order to be reported: the smaller the e-value, the less likely it is that the match occurred by chance. For additional information about the parameters used for emapper and the eggNOG database, please refer to the eggNOG-mapper wiki: https://github.com/eggnogdb/eggnog-mapper/wiki

Functional Annotation via EMAPPER

Prophane allows users to upload custom maps for taxonomic or functional annotations. These user-specified custom maps enable the mapping of identified protein accessions to specific taxonomic or functional information. The "acc2annot_mapper" within Prophane matches annotations based on protein accessions.
One common use case for custom maps is assigning taxa to protein groups in experiments with a known sample composition. Custom maps containing protein accessions and species lineages allow for tailored annotation, even excluding species with similar sequences.
Annotations are extracted from a user-provided tab-separated (TSV) table. The names of the annotation levels are taken from the column names of this table. Avoid spaces, commas, tabs, and semicolons in the column names.

Example table for taxonomic annotation:
acc | superkingdom | phylum | genus
KZK33282.1 | Bacteria | Firmicutes | Lactococcus
PKC80086.1 | Bacteria | Actinobacteria | Bifidobacterium

Example table for functional annotation:
acc | level1 | level2 | level3
1000565.METUNv1_03812 | information storage and processing | Translation, ribosomal structure and biogenesis | ATPase
362663.ECP_0061 | information storage and processing | Replication, recombination and repair | DNA polymerase
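The accession-based lookup performed with a custom map can be pictured as a simple dictionary lookup. The following sketch reads a tab-separated map and returns the annotation levels together with the accession-to-lineage mapping. It assumes that the accession column is named "acc" as in the example tables; load_custom_map is a hypothetical helper and not the actual "acc2annot_mapper".

```python
import csv
import io


def load_custom_map(handle):
    """Read a tab-separated custom map.

    Returns (levels, mapping): the annotation level names taken from the
    column headers and a dict from accession to its list of annotations.
    Assumption: the accession column is called 'acc'.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    levels = [column for column in reader.fieldnames if column != "acc"]
    mapping = {row["acc"]: [row[level] for level in levels] for row in reader}
    return levels, mapping


example_map = io.StringIO(
    "acc\tsuperkingdom\tphylum\tgenus\n"
    "KZK33282.1\tBacteria\tFirmicutes\tLactococcus\n"
    "PKC80086.1\tBacteria\tActinobacteria\tBifidobacterium\n"
)
levels, mapping = load_custom_map(example_map)
```

Accessions that are missing from the map simply have no entry, e.g. mapping.get("unknown") returns None.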

We are very grateful to everyone who puts effort into developing innovative search algorithms and into curating and maintaining protein annotation databases. Prophane provides the following annotation combinations:

Annotation Databases

Database | Scope | Search Algorithm | Description | URL
NCBI nr | taxonomy | diamond blastp | full NCBI protein database | https://www.ncbi.nlm.nih.gov/protein, https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
UniprotKB | taxonomy | diamond blastp | full UniprotKB database (TrEMBL & Swiss-Prot) | https://www.uniprot.org/
Swiss-Prot | taxonomy | diamond blastp | small, expertly curated protein database | https://www.expasy.org/resources/uniprotkb-swiss-prot, ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
TrEMBL | taxonomy | diamond blastp | automatic annotation of Uniprot cds sequences | https://www.uniprot.org/, ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
eggNOG | function | eggnog-mapper | protein function database of 12535 organisms in 17M orthologous groups | http://eggnogdb.embl.de/, http://eggnog6.embl.de/download/emapperdb-5.0.2/eggnog.db.gz
PFAMs | function | hmmer3 | large protein families database with 20795 entries in 659 clans | https://pfam.xfam.org/, ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam36.0/Pfam-A.hmm.gz
TIGRFAMs | function | hmmer3 | bacterial protein function database with 4488 HMMs | http://tigrfams.jcvi.org/cgi-bin/index.cgi, https://ftp.ncbi.nlm.nih.gov/hmm/TIGRFAMs/release_15.0/TIGRFAMs_15.0_HMM.LIB.gz
FOAM | function | hmmer3 | protein function database "Functional Ontology Assignments for Metagenomes" | https://osf.io/5ba2v/, https://osf.io/download/bdpv5/, https://osf.io/download/muan4/
Resfams | function | hmmer3 | curated database of protein families confirmed for antibiotic resistance function | http://www.dantaslab.org/resfams, http://dantaslab.wustl.edu/resfams/Resfams-full.hmm.gz, http://dantaslab.wustl.edu/resfams/Resfams.hmm.gz
CAZy/dbCAN | function | hmmer3 | curated database of carbohydrate-active enzymes | http://bcb.unl.edu/dbCAN2/, http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-HMMdb-V8.txt, http://bcb.unl.edu/dbCAN2/download/Databases/CAZyDB.07302020.fam-activities.txt

Annotation Algorithms Parameters

Database(s) | Search Algorithm | mandatory parameter [default] | optional parameters [default]
UniprotKB (TrEMBL & Swiss-Prot), NCBI protein nr | diamond blastp | evalue [0.001] | algo [0], band [ ], block-size [2.0], comp-based-stats [1], dbsize [40000000], frameshift [0], freq-sd [ ], gapextend [0], gapopen [0], gapped-xdrop [ ], hit-band [ ], hit-score [ ], id [0.0], id2 [ ], index-mode [0], masking [1], max-hsps [1], rank-ratio [ ], rank-ratio2 [ ], sensitive [ ], shape-mask [ ], matrix ['BLOSUM62'], strand [both], ungapped-score [ ], window [ ], shapes [0], xdrop [ ], query-cover [0], more-sensitive [ ], min-score [20]
PFAMs, TIGRFAMs, FOAM, dbCAN, Resfams | hmmscan OR hmmsearch | evalue [0.001] | E [10], T [0.0], domE [10], domT [0.0], incE [10], incT [0.0], incdomE [10], incdomT [0.0], cut_ga [ ], cut_nc [ ], cut_tc [ ], max [ ], F1 [0.02], F2 [0.001], F3 [0.00001], nobias [ ], nonull2 [ ], domZ [int], seed [42], Z [int]
eggNOG | emapper | m: diamond OR hmmer | gapextend [0], gapopen [0], go_evidence ['experimental'], guessdb [131567], hmm_maxhits [1], hmm_maxseqlen [5000], hmm_qcov [0], hmm_score [20.0], query-cover [0], seed_ortholog_score [60.0], subject-cover [0], target_orthologs [one2one], tax_scope [131567], Z [40000000]

For more information about the usage of the different parameters, see the DIAMOND, HMMER and eggNOG-mapper documentation linked in the respective annotation sections above.

The LCA approach searches hierarchical data for the lowest common node shared by all members of a group. Proteins within a protein group (metaprotein) often have different annotations. To obtain a single representative for all members, Prophane determines the LCA of all protein group members for every functional or taxonomic annotation at each annotation level. If this is not possible, the assigned LCA value is "various".
Two different methods and two advanced options are available for LCA determination:

LCA method: LCA per group

The "LCA per group" method determines the LCA based on the annotations of the proteins within a protein group. Users can set a threshold value ranging from 0 to 1, which defines the fraction of annotations within a group that must agree.
If multiple annotations pass the threshold, the annotation with the highest support is chosen. If no annotation passes the threshold, Prophane returns the value "various".
A threshold value of 1 returns an LCA only if all protein group members share the same annotation; otherwise the LCA is reported as "various". With a threshold value of 0.51, Prophane returns an LCA if more than half of the annotations are the same.
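The threshold logic for a single annotation level can be sketched as follows. This is a simplified illustration and not Prophane's implementation: it assumes that support is the fraction of group members carrying an annotation, that the threshold is inclusive, and that ties are resolved in favour of the annotation seen first. The function name is hypothetical.

```python
from collections import Counter


def lca_per_group(annotations, threshold=1.0):
    """Pick the best-supported annotation of one protein group at one level.

    annotations: one annotation per protein group member.
    Returns "various" if no annotation reaches the threshold.
    """
    if not annotations:
        return "various"
    annotation, count = Counter(annotations).most_common(1)[0]
    support = count / len(annotations)
    return annotation if support >= threshold else "various"


majority = lca_per_group(["Lactococcus", "Lactococcus", "Bacillus"], threshold=0.51)
strict = lca_per_group(["Lactococcus", "Lactococcus", "Bacillus"], threshold=1.0)
```

With two of three members annotated as Lactococcus, the support is about 0.67, so the threshold of 0.51 is passed while the threshold of 1 is not.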

group LCA method

LCA method: democratic LCA

The "democratic LCA" method counts how often each annotation occurs across all protein groups and assigns to each protein group the one of its annotations that occurs most frequently overall. This method therefore yields the smallest variance in the results. The two LCA methods can produce different results; compare the figures of the group LCA method and the democratic LCA method.
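The idea behind the democratic LCA can be illustrated for a single annotation level. The sketch below is an assumption-laden simplification: it counts in how many protein groups each annotation occurs and then picks, for every group, the member annotation with the highest overall count. How Prophane weights annotations and resolves ties is not specified here, and the function name is hypothetical.

```python
from collections import Counter


def democratic_lca(groups):
    """Choose for each protein group its overall most frequent annotation.

    groups: {group id: [annotation of each member]} at one annotation level.
    Assumption: an annotation is counted once per protein group it occurs in.
    """
    overall = Counter(
        annotation for members in groups.values() for annotation in set(members)
    )
    result = {}
    for group_id, members in groups.items():
        if not members:
            result[group_id] = "various"
            continue
        # max() keeps the first member annotation in case of equal counts.
        result[group_id] = max(members, key=lambda annotation: overall[annotation])
    return result


example_lca = democratic_lca(
    {
        "pg1": ["Lactococcus", "Bacillus"],
        "pg2": ["Bacillus"],
        "pg3": ["Bacillus", "Listeria"],
    }
)
```

In this example Bacillus occurs in all three groups, so it is chosen for each of them, although pg1 and pg3 would be ambiguous when looked at in isolation.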

democratic LCA method

Advanced Option: ignore_unclassified

With this advanced option, only proteins for which an annotation was found are considered for LCA determination; "unclassified" annotations are excluded.

parameter ignore_unclassified

Advanced Option: minimum_number_of_annotations

The "minimum_number_of_annotations" advanced option lets users specify the minimum number of annotations an annotation lineage must contain to be taken into account by the LCA methods. This is useful for handling annotations that are extraneous or less informative.
For instance, vector annotations typically consist of a lineage with only one annotated level. This option allows users to filter out such annotations so that they do not affect the LCA determination.

parameter min_nb_of-annotations


Once an analysis is completed, you can download all results as a single archive in the Job Control section or, as an unregistered user, via the provided link.

Structure of the Result Folder

Location | Content | File Type
summary.txt | Main result file: summary table of all analyses | TSV
lca_summary.mztab | Main result file: summary table containing protein group information (LCA and quantification) | mzTab
protein_summary.mztab | Main result file: annotations on protein level | mzTab
plots/plot_of_{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.html | interactive Krona plots | HTML
pgs/protein_groups.sql | SQL database of protein groups | TSV
seqs/all.faa | all accessions from the proteomic search result and their respective protein sequences | FASTA
seqs/missing_taxa.faa | FASTA file containing sequences of taxonomically undefined proteins (those not covered by the optionally provided taxmap input file). In most cases, this file is the same as seqs/all.faa | FASTA
seqs/ambiguous_sequences.txt | Usually not created; accessions that could not be found in the provided input FASTA | TXT
seqs/missing_sequences.txt | Usually not created; accessions that could not be found in the provided input FASTA | TXT
seqs/pgs/pg.{n}.faa | sequence information of protein group n | FASTA
tasks/quant.tsv | quantification information for each taxonomic or functional lowest common ancestor (LCA) | TSV
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.best_hits | best hits extracted from raw annotation result | TXT
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.lca | LCA of each protein group | TSV
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.log | log of annotation command | TXT
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.map | respective accession-annotation linking | TSV
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.quant | LCA of each sample and replicate and their calculated quantification | TSV
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.result | raw annotation result | TXT/TSV
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.result-cmd.txt | annotation command, as executed in the shell | TXT
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.xml | input for creation of Krona plot | XML
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.yaml | annotation task parameters (full path to db, annotation algorithm, command line parameters, shortname of task, annotation type) | YAML
tax/taxmap.txt | taxonomic information extracted from user-defined taxmaps | TXT
tax/missing_taxa_map.txt | accessions without any taxonomic data/prediction | TXT
tax/ambiguous_taxa.txt | Usually not created; accessions with ambiguous taxonomies extracted from user-defined taxmaps | TXT

The summary.txt File


The summary.txt file is a tab-separated file that can be easily imported into Excel via the data panel for further analysis. The file begins with a header row, followed by information about protein groups. Protein groups consist of two types of rows: "group" and "member" rows and have a unique protein group number (column "#pg").

Group Rows (column "level" = group):
These rows summarize information about the protein group:

  • "members_count": The number of proteins in this protein group.
  • "members_identifier": Accessions of all protein group members
  • Task Results: Information about the lowest common ancestor (LCA) assigned to a group.
  • Quantification Results: Quantification values of samples and sample groups
  • LCA Support: LCA support values

Member Rows (column "level" = member):
These rows contain results of the analyses per protein. Multiple rows are possible for a protein if the task yields different results. If no results are obtained for a protein, it is marked as 'unclassified.'
  • "members_identifier": Accession of protein
  • Task Results: Functional or taxonomic annotation of this protein.

Task Results:
Columns display results of taxonomic and functional annotations using the naming schema:
"task_{task_number}::{task_name}::{task_level}"
Group rows contain the weighted LCA for each task, while member rows show potentially different annotation results.
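When processing summary.txt with a script, the task columns can be recognized by this naming schema. The sketch below splits a column name at "::" and is only an illustration; the example column names and the function name are hypothetical, and only the schema itself is taken from this documentation.

```python
def parse_task_column(column):
    """Split a task column name into (task number, task name, task level, is support).

    Returns None for columns that do not follow the task naming schema.
    The fourth element is True for '...::lca-support' columns.
    """
    parts = column.split("::")
    if len(parts) not in (3, 4) or not parts[0].startswith("task_"):
        return None
    is_support = len(parts) == 4
    if is_support and parts[3] != "lca-support":
        return None
    task_number = parts[0][len("task_"):]
    return task_number, parts[1], parts[2], is_support


annotation_column = parse_task_column("task_1::example_task::genus")
support_column = parse_task_column("task_1::example_task::genus::lca-support")
other_column = parse_task_column("quant::sample1::replicateA")
```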

Quantification Results:
Columns providing raw quantification values (e.g. spectral counts) and their standard deviation (sd) for samples and sample groups use the following naming schema:
  • raw_quant::group_name (mean)
  • raw_quant_sd::group_name (mean)
  • raw_quant::sample_name
  • raw_quant_sd::sample_name

Columns providing normalized quantification values and their standard deviation (sd) for samples and sample groups use the following naming schema:
  • quant::group_name(mean)
  • quant_sd::group_name(mean)
  • quant::sample_name
  • quant_sd::sample_name

Spectra columns:
If spectra information is available (possible for mzTab and mzIdentML 1.2), it is shown in the columns "{sample_name}:: spectra_IDs"

LCA Support Results:
Columns indicating LCA support use the following naming schema:
task_{task_nb}::{task}::{level}::lca-support
and the following value format:
number_supporting_proteins/number_all_protein_group_proteins (e.g. 150/150)
OR, if the threshold is not passed,
(highest_possible_number_supporting_proteins)/number_all_protein_group_proteins (e.g. (60)/150)

The first number indicates the number of proteins (or spectra if available) supporting the chosen LCA.
The second number represents the total count of identified proteins (or spectra if available) for the entire group.
Values in brackets, e.g. "(60)/150", indicate that no LCA with support above the given threshold could be identified.
A value of "0/0" is used when no lineage information is available, corresponding to 'unclassified' entries.
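For downstream filtering it can be handy to convert these support strings into numbers. The following sketch parses both value formats; the function name and the returned tuple layout are hypothetical choices for illustration.

```python
import re

# Either "150/150" or "(60)/150": optional brackets around the first number.
LCA_SUPPORT = re.compile(r"(\(\d+\)|\d+)/(\d+)")


def parse_lca_support(value):
    """Return (supporting, total, below_threshold) for an LCA support value.

    below_threshold is True when the first number is written in brackets,
    i.e. when no LCA passed the threshold.
    """
    match = LCA_SUPPORT.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"unexpected LCA support value: {value!r}")
    first = match.group(1)
    below_threshold = first.startswith("(")
    return int(first.strip("()")), int(match.group(2)), below_threshold


full_support = parse_lca_support("150/150")
weak_support = parse_lca_support("(60)/150")
unclassified = parse_lca_support("0/0")
```

Note that "0/0" has a total of zero, so guard against division by zero when computing a support fraction.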

Krona Plots

Krona plots are an interactive visualization that allows users to explore and interpret the taxonomic or functional abundances in metaproteomic data. They are generated automatically for each task in Prophane. The size of each field in the Krona plot corresponds to the normalized quantification values and thus represents the relative abundance. Krona plots can display the results for different sample groups and individual samples and provide an interactive zoom function.

Krona Plot

This project is funded by:
  • Deutsche Forschungsgemeinschaft (DFG), "Research Software Sustainability"
  • de.NBI, computational resources and support

In case of questions or problems, do not hesitate to contact us:

support@prophane.de