Prophane Documentation

Prophane Workflow Description

Prophane provides a tailored and fully automated workflow for metaproteomics analysis with a special focus on the taxonomic and functional annotation of metaproteins. For the annotations you can choose between different databases and algorithms (see the section "Annotation Databases & Algorithms & Parameters"). In metaproteomics, protein identifications can be ambiguous when an identified peptide (or a subset of identified peptides) can be assigned to more than one protein. The most common approach is to group these ambiguous identifications into so-called protein groups (synonyms: metaprotein, protein family). Prophane provides consistent protein annotation based on sequence homology and condenses the annotations of each metaprotein using a lowest common ancestor (LCA) approach. Moreover, quantitative data is provided in a sample-specific manner based on normalized spectral abundance factors (NSAF). While the standard analysis needs minimal input, an expert mode allows detailed customization of each workflow step and the selection of various annotation databases and search algorithms.
Different input formats are supported, and results are output in two formats: a tab-separated summary.txt file and the standard mzTab format ("lca_summary.mztab" and "protein_summary.mztab", with LCA results per protein group). Additionally, the summary results are visualized as Krona plots.

Input Files

The minimum input for a Prophane analysis consists of a proteomic search result file (the "Report File") and the matching FASTA file.

Report files

The report file is the protein identification file of the mass spectrometry analysis and must contain information about protein groups (also known as metaproteins or protein families). Prophane supports various input formats, as listed below. If you plan to use another input format on a regular basis, please contact us (support@prophane.de).
So far, Prophane supports proteomic search result data provided in:

generic input format
mzTab format
mzIdentML 1.2 format

It also accommodates data produced by search software such as:

MetaProteome Analyzer in single or sample comparison format
Proteome Discoverer protein group output
Scaffold output

FASTA file

The FASTA file can be plain or gzipped and should contain sequences for all identified proteins. Ensure that the protein accessions or IDs in the FASTA file match those used in the protein identification report. You can use the same FASTA file you used for peptide-spectrum matching. However, an excessively large FASTA file can significantly increase Prophane's runtime, so ideally the file contains all sequences present in the identification file and no more. Avoid sequence IDs containing pipes ('|') to ensure consistent data processing by Prophane. Sequence IDs with two or more pipes, such as UniProt protein accessions, are shortened to the accession (e.g. from sp|O70558|SPR2G_MOUSE to O70558).
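
The shortening of pipe-delimited IDs can be illustrated with a minimal sketch. It assumes that the accession is always the second pipe-separated field, as in the UniProt example; the function name shorten_accession is hypothetical and not part of Prophane.

```python
def shorten_accession(seq_id: str) -> str:
    """Return the accession field of a pipe-delimited sequence ID.

    Assumption: IDs with two or more pipes follow the UniProt layout
    db|accession|entry_name, so the second field is kept.
    IDs with fewer than two pipes are returned unchanged.
    """
    fields = seq_id.split("|")
    if len(fields) >= 3:
        return fields[1]
    return seq_id


print(shorten_accession("sp|O70558|SPR2G_MOUSE"))
print(shorten_accession("WP_000665196.1"))
```

Checking your FASTA headers this way before an analysis helps to confirm that the shortened IDs still match the accessions in the report file.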

Report Files Formats

Generic Format

The generic format is a tab-separated table that is designed to be easy to produce. The column "protein accessions" contains all protein accessions of one protein group, and the quantification is based on spectral counts per protein group. Example table:

sample category	sample name	protein accessions	spectrum count
sample A	replicate 1	WP_003131952.1, NP_268346.1, KZK33282.1	1
sample A	replicate 1	WP_000665196.1	2
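
A table like the one above can be produced with a few lines of Python. This is a minimal sketch using only the standard library; the function name write_generic_report and the in-memory row layout are illustrative assumptions, and the accessions of a group are joined with ", " as in the example table.

```python
import csv
import io

HEADER = ["sample category", "sample name", "protein accessions", "spectrum count"]

# One tuple per protein group and sample:
# (sample category, sample name, accessions of the group, spectrum count)
rows = [
    ("sample A", "replicate 1", ["WP_003131952.1", "NP_268346.1", "KZK33282.1"], 1),
    ("sample A", "replicate 1", ["WP_000665196.1"], 2),
]


def write_generic_report(rows, handle):
    """Write protein groups as a tab-separated generic report."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    for category, name, accessions, count in rows:
        writer.writerow([category, name, ", ".join(accessions), count])


buffer = io.StringIO()
write_generic_report(rows, buffer)
print(buffer.getvalue())
```

To write a file instead of an in-memory buffer, pass a handle opened with open(path, "w", newline="").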

mzTab 1.0.0 exchange format

Data provided in the mzTab exchange format can be processed by Prophane. Protein group information must be provided in the ambiguity_members column, and the PSM section must be present. To distinguish between different samples, the entries in the spectra_ref column of the PSM section must have the following format: ms_run[1-n]:"SPECTRA_REFERENCE". This format is required even if only one MS run is analyzed. Quantification is performed by counting the spectra/PSMs assigned to a protein group.
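
The required spectra_ref layout can be checked before submitting a file. The following is a minimal sketch, not Prophane's own parser: the regular expression and the function name count_psms_per_run are assumptions, and the example references are made up.

```python
import re
from collections import Counter

# ms_run[<number>]:<spectrum reference>
SPECTRA_REF = re.compile(r"^(ms_run\[\d+\]):(.+)$")


def count_psms_per_run(spectra_refs):
    """Count PSMs per ms_run and reject references without an ms_run prefix."""
    counts = Counter()
    for ref in spectra_refs:
        match = SPECTRA_REF.match(ref)
        if match is None:
            raise ValueError(f"unexpected spectra_ref: {ref!r}")
        counts[match.group(1)] += 1
    return counts


refs = ["ms_run[1]:index=5708", "ms_run[1]:index=5709", "ms_run[2]:index=12"]
print(count_psms_per_run(refs))
```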

mzIdentML 1.2.0 exchange format

Data provided in the mzIdentML 1.2.0 exchange format containing ProteinAmbiguityGroup elements can be processed by Prophane. Version 1.2.0 allows similar proteins to be summarized in 'ProteinAmbiguityGroup' sections. These groups are essential for performing metaproteome analysis with Prophane. Since mzIdentML version 1.1.0 doesn't support protein groups, such files cannot be processed by Prophane. Quantification is performed by counting the spectra/PSMs assigned to a protein group.

MPA (single experiment)

Data provided by the Metaproteome Analyzer (MPA) can be used by exporting the metaprotein report (CSV).

MPA (multiple experiments)

Data provided by the Metaproteome Analyzer (MPA) can be used by exporting the metaprotein multisample report (CSV).

Proteome Discoverer protein group

To export protein-group data from Proteome Discoverer:
1. Check for filters set under "Display Filters -> Proteins" and remove the filter for master proteins.
2. Go to the Protein Groups table and request "Show associated tables -> Proteins".
3. Export this record via File -> Export -> To Microsoft Excel.

The abundance of the master protein is used as the quantification value of the protein group. If multiple abundances are available, they are selected in the following order: 'Abundance: sample' (= raw abundance), 'Abundances (Scaled): sample', 'Abundances (Normalized): sample'. Protein groups whose master protein row contains no abundance values for any sample are ignored. If you use scaled or normalized abundance values, we strongly recommend selecting "Raw value (no normalization)" in the step "Quantification method".

Scaffold

Data provided by Scaffold can be used by exporting Scaffold's protein report (TSV/XLS).

Sample Groups

Usage of sample groups

Sample groups are used to obtain mean quantification values across several samples. This allows different sample groups to be compared, e.g. untreated vs. treated samples. Mean quantification values per sample group are reported in the "summary.txt" and "lca_summary.mztab" files and displayed in the Krona plots. The quantification values of each individual sample are reported as well. If no groups are specified, quantification is carried out for each sample separately.
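
The idea of a sample group mean can be sketched in a few lines. This is an illustration only: the sample names, group names and quantification values are invented, and the function group_means is hypothetical rather than part of Prophane.

```python
from statistics import mean


def group_means(sample_quant, sample_groups):
    """Average the quantification values of all samples in each group."""
    return {
        group: mean(sample_quant[sample] for sample in samples)
        for group, samples in sample_groups.items()
    }


# Quantification values of one protein group in three samples (made up).
sample_quant = {
    "untreated::rep1": 4.0,
    "untreated::rep2": 6.0,
    "treated::rep1": 1.0,
}
sample_groups = {
    "untreated": ["untreated::rep1", "untreated::rep2"],
    "treated": ["treated::rep1"],
}
print(group_means(sample_quant, sample_groups))
```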

Sample name extraction of different report files by Prophane

The following table describes how Prophane extracts sample names from the different protein reports. Use the same sample names in the sample group definition.

input file	sample columns or elements	comment	example	extracted sample name
GENERIC table	sample category, sample name	sample name is generated by fusing the sample category and sample name columns with :: as separator	sample1 | replicateA | acc1 | 2	sample1::replicateA
mzTab	MTD section of mzTab. All ms_run[number] elements are extracted.	Samples are identified in the MTD section if they match the following pattern: ms_run[number].	"MTD ms_run[1]-location null"	ms_run[1]
mzIdentML 1.2.0	SpectrumIdentificationResult element, spectraData_ref attribute	sample names are extracted from the spectraData_ref attribute of the XML element <SpectrumIdentificationResult>	<SpectrumIdentificationResult spectraData_ref="qExactive2.mgf" spectrumID="index=5708" id="SIR_1769">	qExactive2.mgf
MPA's single sample Metaprotein Report	column 'sample' or None	if the column 'sample' is present in the given report, the content of this column is used. Otherwise the generic sample name `sample1` is used.	no sample information provided	sample1
MPA's multisample Metaprotein Report	column headers starting with EXP:sample_name	sample names are extracted from column headers that start with "EXP:", with the string after the colon (':') being the sample name	... | Protein Accessions | EXP:S13.mgf | EXP:S20.mgf | EXP:S23.mgf | ...	S13.mgf, S20.mgf, S23.mgf
Proteome Discoverer's protein group report	e.g. Abundances (Scaled): F1: Sample	Sample names are extracted from the abundance columns in protein group headers that follow the schema "Abundance (x): Fnumber"	... | Abundances (Scaled): F1: Sample | Abundances (Scaled): F2: Sample | ...	F1, F2
Scaffold's protein report	both columns: Biological sample category AND Biological sample name are combined.	values of both columns are fused using :: as separator	... | sample A | replicate 1 | ...	sample A::replicate 1
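
Two of the extraction rules in the table can be reproduced as a quick check of the names to use in a sample group definition. This is a minimal sketch based only on the rules stated above; the function names are hypothetical and do not correspond to Prophane's internal code.

```python
def fused_sample_name(category: str, name: str) -> str:
    """Generic and Scaffold reports: fuse category and name with '::'."""
    return f"{category}::{name}"


def mpa_multisample_names(headers):
    """MPA multisample report: keep the text after 'EXP:' in matching headers."""
    return [h.split(":", 1)[1] for h in headers if h.startswith("EXP:")]


print(fused_sample_name("sample1", "replicateA"))
print(mpa_multisample_names(["Protein Accessions", "EXP:S13.mgf", "EXP:S20.mgf"]))
```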

Quantification

The Normalized Spectral Abundance Factor (NSAF) is a label-free quantification method that normalizes the number of identified spectra to the metaprotein size. The size can be the length of the longest or the shortest protein of the protein group, or the mean length of all protein group members.

quant method	description
NSAF (normalized to longest metaprotein sequence)	NSAF are calculated. Normalization is based on the largest sequence length of the respective protein group.
NSAF (normalized to shortest metaprotein sequence)	NSAF are calculated. Normalization is based on the smallest sequence length of the respective protein group.
NSAF (normalized to mean metaprotein sequence)	NSAF are calculated. Normalization is based on the averaged sequence length of the respective protein group.
Raw value (no normalization)	Unprocessed quantification values are shown. No NSAF are calculated.
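
The three NSAF variants differ only in the sequence length used for normalization. The sketch below follows the commonly used NSAF definition, in which the spectral count of each protein group is divided by a length and the result is divided by the sum of these ratios over all protein groups of the sample. It is an assumption that Prophane computes NSAF in exactly this way; the function name nsaf and the example numbers are illustrative.

```python
from statistics import mean

# How the normalization length is picked from the member lengths of a group.
LENGTH_MODES = {"longest": max, "shortest": min, "mean": mean}


def nsaf(spectral_counts, member_lengths, mode="longest"):
    """Compute NSAF values for all protein groups of one sample.

    spectral_counts: {group_id: spectral count in this sample}
    member_lengths:  {group_id: [sequence length of each group member]}
    """
    pick = LENGTH_MODES[mode]
    # Spectral abundance factor: count divided by the chosen length.
    saf = {pg: count / pick(member_lengths[pg]) for pg, count in spectral_counts.items()}
    total = sum(saf.values())
    # Normalizing by the sample total makes the values of a sample sum to 1.
    return {pg: value / total for pg, value in saf.items()}


counts = {"pg1": 10, "pg2": 30}
lengths = {"pg1": [100, 200], "pg2": [300]}
for mode in LENGTH_MODES:
    print(mode, nsaf(counts, lengths, mode))
```

In this example the choice of length changes the share of pg1 noticeably, because its two members differ in length by a factor of two.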

Taxonomic Annotation

Taxonomic annotation searches the selected reference database for similar proteins, retrieves the associated taxon ID and specifies the entire taxonomic lineage of this protein. The search algorithm DIAMOND BLASTP searches for sequence homologies in the specified protein database.

Prophane supports the following protein sequence databases:

NCBI protein nr

UniprotKB (Swiss-Prot & TrEMBL)

Swiss-Prot

TrEMBL

Prophane always reports the taxonomic lineage of the protein with the highest-scoring match. The evalue parameter defines how similar a match has to be in order to be reported: the smaller the e-value, the more similar the match. For information regarding additional parameters, please check the DIAMOND documentation: https://github.com/bbuchfink/diamond/raw/master/diamond_manual.pdf

Functional Annotation

Prophane supports the following protein functional databases:

EggNog

PFAMs

TIGRFAMs

FOAM

CAzY/dbCAN

ResFAMs (full)

ResFAMs (core)

Functional Annotation with Profile-HMM Databases

Prophane utilizes the HMMER3 algorithm to search functional profile-HMM databases, which include PFAM, TIGRFAM, Resfams, CAzY/dbCAN, and FOAM. You have the option to choose between two search modes:

hmmscan: This mode searches protein sequences against a profile-HMM database.
hmmsearch: In contrast, this mode searches profile HMMs against a sequence database.

The functional annotation results include a hierarchically ordered function description, the protein family name, and the match ID.

For additional details about the parameters used by hmmscan and hmmsearch, please refer to the HMMER user guide: https://eddylab.org/software/hmmer/Userguide.pdf

Functional Annotation with EggNOG Database and emapper

In Prophane, when utilizing the EggNOG database and emapper for functional annotation, the approach is based on precomputed orthology assignments rather than homology. Orthologs, also known as orthologous genes, are genes found in different species that share a common origin from a single gene in the last common ancestor. This method has been demonstrated to provide more accurate predictions compared to homology-based approaches.
The EggNOG database is built upon a carefully curated selection of representative species, encompassing a diverse range of organisms across the tree of life. The orthologs identified from this database are associated with common annotations, including GO terms, protein descriptions, KEGG annotations, PFAMs, and COG categories.
The evalue parameter defines how similar a match has to be in order to be reported: the smaller the e-value, the more similar the match. For additional information about the parameters used for emapper and the EggNOG database, please refer to the eggnog-mapper wiki: https://github.com/eggnogdb/eggnog-mapper/wiki

Custom Map

Prophane allows users to upload custom maps for taxonomic or functional annotations. These user-specified custom maps enable the mapping of identified protein accessions to specific taxonomic or functional information. The "acc2annot_mapper" within Prophane matches annotations based on protein accessions.
One common use case for custom maps is assigning taxa to protein groups in experiments with a known sample composition. Custom maps containing protein accessions and species lineages allow for tailored annotation, even excluding species with similar sequences.
Annotations are extracted from a user-provided tab-separated table (TSV) file. The names of the annotation levels are taken from the column names of this table. Do not use spaces, commas, tabs or semicolons in the column names.

Example Table taxonomic annotation:

acc	superkingdom	phylum	genus
KZK33282.1	Bacteria	Firmicutes	Lactococcus
PKC80086.1	Bacteria	Actinobacteria	Bifidobacterium

Example Table functional annotation:

acc	level1	level2	level3
1000565.METUNv1_03812	information storage and processing	Translation, ribosomal structure and biogenesis	ATPase
362663.ECP_0061	information storage and processing	Replication, recombination and repair	DNA polymerase
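
A custom map can be checked before upload with a short script. This is a minimal sketch, assuming that the first column holds the accession (as in the example tables) and that all further columns are annotation levels; the function name read_custom_map is hypothetical and unrelated to the "acc2annot_mapper".

```python
import csv
import io

# Characters that must not appear in column names.
FORBIDDEN = set(" ,;\t")


def read_custom_map(handle):
    """Read a custom map TSV into {accession: {level: annotation}}."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    bad = [name for name in header if FORBIDDEN & set(name)]
    if bad:
        raise ValueError(f"invalid column names: {bad}")
    levels = header[1:]
    return {row[0]: dict(zip(levels, row[1:])) for row in reader if row}


example = (
    "acc\tsuperkingdom\tphylum\tgenus\n"
    "KZK33282.1\tBacteria\tFirmicutes\tLactococcus\n"
    "PKC80086.1\tBacteria\tActinobacteria\tBifidobacterium\n"
)
custom_map = read_custom_map(io.StringIO(example))
print(custom_map["KZK33282.1"])
```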

Annotation Databases & Algorithms & Parameters

We are very grateful to everyone who puts effort into developing innovative search algorithms and into curating and maintaining protein annotation databases. Prophane provides the following annotation combinations:

Annotation Databases

Database	Scope	Search Algorithm	Description	url
NCBI nr	taxonomy	diamond blastp	full NCBI protein database	https://www.ncbi.nlm.nih.gov/protein https://ftp.ncbi.nlm.nih.gov/pub/taxonomy/taxdump.tar.gz
UniprotKB	taxonomy	diamond blastp	full UniprotKB database (TrEMBL & Swiss-Prot)	https://www.uniprot.org/
Swiss-Prot	taxonomy	diamond blastp	small, expertly curated protein database	https://www.expasy.org/resources/uniprotkb-swiss-prot ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.fasta.gz
TrEMBL	taxonomy	diamond blastp	automatic annotation of Uniprot cds sequences	https://www.uniprot.org/ ftp://ftp.expasy.org/databases/uniprot/current_release/knowledgebase/complete/uniprot_trembl.fasta.gz
eggNOG	function	eggnog-mapper	protein function database of 12535 organisms in 17M orthologous groups	http://eggnogdb.embl.de/ http://eggnog6.embl.de/download/emapperdb-5.0.2/eggnog.db.gz
PFAMs	function	hmmer3	large protein families database with 20795 entries in 659 clans	https://pfam.xfam.org/ ftp://ftp.ebi.ac.uk/pub/databases/Pfam/releases/Pfam36.0/Pfam-A.hmm.gz
TIGRFAMs	function	hmmer3	bacterial protein function database with 4488 HMMs	http://tigrfams.jcvi.org/cgi-bin/index.cgi https://ftp.ncbi.nlm.nih.gov/hmm/TIGRFAMs/release_15.0/TIGRFAMs_15.0_HMM.LIB.gz
FOAM	function	hmmer3	protein function database "Functional Ontology Assignments for Metagenomes"	https://osf.io/5ba2v/ https://osf.io/download/bdpv5/ https://osf.io/download/muan4/
Resfams	function	hmmer3	curated database of protein families confirmed for antibiotic resistance function	http://www.dantaslab.org/resfams http://dantaslab.wustl.edu/resfams/Resfams-full.hmm.gz http://dantaslab.wustl.edu/resfams/Resfams.hmm.gz
CAzY/dbCAN	function	hmmer3	curated database of Carbohydrate-active enzymes	http://bcb.unl.edu/dbCAN2/ http://bcb.unl.edu/dbCAN2/download/Databases/dbCAN-HMMdb-V8.txt http://bcb.unl.edu/dbCAN2/download/Databases/CAZyDB.07302020.fam-activities.txt

Annotation Algorithms Parameters

Database(s)	Search Algorithm	mandatory parameter [default]	optional parameters [default]
UniprotKB (TrEMBL & Swiss-Prot), NCBI protein nr	diamond blastp	evalue [0.001]	algo [0], band [ ], block-size [2.0], comp-based-stats [1], dbsize [40000000], frameshift [0], freq-sd [ ], gapextend [0], gapopen [0], gapped-xdrop [ ], hit-band [ ], hit-score [ ], id [0.0], id2 [ ], index-mode [0], masking [1], max-hsps [1], rank-ratio [ ], rank-ratio2 [ ], sensitive [ ], shape-mask [ ], matrix ['BLOSUM62'], strand [both], ungapped-score [ ], window [ ], shapes [0], xdrop [ ], query-cover [0], more-sensitive [ ], min-score [20]
PFAMs, TIGRFAMs, FOAM, CAzY/dbCAN, Resfams	hmmscan OR hmmsearch	evalue [0.001]	E [10], T [0.0], domE [10], domT [0.0], incE [10], incT [0.0], incdomE [10], incdomT [0.0], cut_ga [ ], cut_nc [ ], cut_tc [ ], max [ ], F1 [0.02], F2 [0.001], F3 [0.00001], nobias [ ], nonull2 [ ], domZ [int], seed [42], Z [int]
EggNOG	emapper	m: diamond OR hmmer	gapextend [0], gapopen [0], go_evidence ['experimental'], guessdb [131567], hmm_maxhits [1], hmm_maxseqlen [5000], hmm_qcov [0], hmm_score [20.0], query-cover [0], seed_ortholog_score [60.0], subject-cover [0], target_orthologs [one2one], tax_scope [131567], Z [40000000]

For more information about the usage of the individual parameters, see the DIAMOND manual, the HMMER user guide and the eggnog-mapper wiki linked in the annotation sections above.

Lowest Common Ancestor (LCA)

The LCA approach searches hierarchical data to find the lowest common node shared by all members of a group. For every functional or taxonomical annotation at each annotation level, Prophane determines an LCA to represent all protein group members. Often, proteins within a group or metaproteins have different annotations. To obtain a unified representative for all members, Prophane determines the lowest common ancestor among them. If this is not possible the assigned LCA-value is referred to as "various".
Two different methods and two advanced options are available for LCA determination:

LCA method: LCA per group

The "LCA per group" method determines the LCA based on annotations of proteins within a protein group. It allows users to set a threshold value for each group, ranging from 0 to 1.
If multiple annotations pass the threshold, the annotation with the highest support is chosen. If no annotation passes the threshold, Prophane returns the value "various".
A threshold value of 1 returns an LCA only if all annotations of all protein group members share the same annotation; otherwise the LCA is reported as "various". With a threshold value of 0.51, Prophane returns an LCA if more than half of the annotations are the same.
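
The threshold logic can be sketched for a single annotation level. This is a simplified illustration, not Prophane's implementation: it treats the annotations of the group members as a flat list, computes the support of the most frequent annotation and compares it with the threshold using ">=". Both the function name lca_per_group and the exact comparison are assumptions.

```python
from collections import Counter


def lca_per_group(annotations, threshold=1.0):
    """Return the best-supported annotation of one level, or 'various'.

    annotations: annotations of all protein group members at one level
    threshold:   minimum fraction of members that must share the annotation
    """
    if not annotations:
        return "various"
    label, support = Counter(annotations).most_common(1)[0]
    if support / len(annotations) >= threshold:
        return label
    return "various"


genus = ["Lactococcus", "Lactococcus", "Streptococcus"]
print(lca_per_group(genus, threshold=1.0))
print(lca_per_group(genus, threshold=0.51))
```

With a threshold of 1 the three members disagree and the result is "various"; with 0.51 the annotation shared by two of the three members is returned.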

LCA method: democratic LCA

The "democratic LCA" method selects, for each protein group, the annotation of its members that occurs most frequently across all protein groups. Because the most widely identified annotation is always preferred, this method yields the smallest variance in the results. The two LCA methods can give different results; compare Pictures 1 and 2.
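
One possible reading of the democratic LCA is sketched below for a single annotation level: annotations are first counted over all protein groups, and each group then receives the annotation of its members with the highest overall count. The function name democratic_lca, the alphabetical tie-breaking and the assumption that every group has at least one annotation are illustrative choices, not documented Prophane behaviour.

```python
from collections import Counter


def democratic_lca(groups):
    """Pick, per protein group, the member annotation that is most
    frequent across all protein groups.

    groups: {group_id: [annotation of each member]}, lists must be non-empty
    """
    global_counts = Counter(a for members in groups.values() for a in members)
    return {
        # Ties are broken by annotation name to keep the result deterministic.
        pg: max(set(members), key=lambda a: (global_counts[a], a))
        for pg, members in groups.items()
    }


groups = {
    "pg1": ["Lactococcus", "Streptococcus"],
    "pg2": ["Streptococcus"],
    "pg3": ["Streptococcus", "Bifidobacterium"],
}
print(democratic_lca(groups))
```

In this example all three groups are assigned "Streptococcus", which shows how the method reduces the number of distinct annotations in the result.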

Advanced Option: ignore_unclassified

With this advanced option, only proteins for which an annotation was found are considered for LCA determination; "unclassified" annotations are excluded.

Advanced Option: minimum_number_of_annotations

The "minimum_number_of_annotations" advanced option sets the minimum number of annotated levels an annotation lineage must have to be taken into account by the LCA methods. This is useful for excluding annotations that are extraneous or less informative.
For instance, vector annotations typically consist of lineages with only one annotated level; with this option such annotations can be filtered out, giving a more meaningful LCA.

Output

Once an analysis is completed you can download all results as a single archive in the Job Control section or via link for unregistered users.

Structure of the Result Folder

Location	Content	File Type
summary.txt	Main result file: summary table of all analyses	TSV
lca_summary.mztab	Main result file: summary table containing protein group information (LCA and Quantification)	mztab
protein_summary.mztab	Main result file: annotations on protein level	mztab
plots/plot_of_{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.html	interactive Krona plots	HTML
pgs/protein_groups.sql	SQL database of protein groups	TSV
seqs/all.faa	all accessions from the proteomic search result and their respective protein sequences	FASTA
seqs/missing_taxa.faa	FASTA file containing sequences of taxonomically undefined proteins (those not covered by the optionally provided taxmap input file). In most cases, this file is the same as seqs/all.faa.	FASTA
seqs/ambiguous_sequences.txt	Usually not created; accessions that could not be found in the provided input FASTA.	TXT
seqs/missing_sequences.txt	Usually not created; accessions that could not be found in the provided input FASTA.	TXT
seqs/pgs/pg.{n}.faa	sequence information of protein group n	FASTA
tasks/quant.tsv	quantification information for each taxonomic or functional lowest common ancestor (LCA)	TSV
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.best_hits	best hits extracted from raw annotation result	TXT
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.lca	LCA of each protein group	TSV
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.log	log of annotation command	TXT
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.map	respective accession-annotation linking	TSV
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.quant	LCA of each sample and replicate and their calculated quantification	TSV
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.result	raw annotation result	TXT/TSV
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.result-cmd.txt	annotation command, as executed in the shell	TXT
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.xml	input for creation of krona plot	XML
tasks/{annot_type}_annot_by_{tool}_on_{db_type}.v{db_version}.task{taskid}.yaml	annotation task parameters (full path to db, annotation algorithm, command line parameters, shortname of task, annotation type)	YAML
tax/taxmap.txt	taxonomic information extracted from user-defined taxmaps	TXT
tax/missing_taxa_map.txt	accessions without any taxonomic data/prediction	TXT
tax/ambiguous_taxa.txt	Usually not created; accessions with ambiguous taxonomies extracted from user-defined taxmaps	TXT

The summary.txt File

The summary.txt file is a tab-separated file that can be easily imported into Excel via the data panel for further analysis. The file begins with a header row, followed by information about protein groups. Protein groups consist of two types of rows: "group" and "member" rows and have a unique protein group number (column "#pg").

Group Rows (column "level" = group):
These rows summarize information about the protein group:

"members_count": The number of proteins in this protein group.
"members_identifier": Accessions of all protein group members.
Task Results: Information about the lowest common ancestor (LCA) assigned to the group.
Quantification Results: Quantification values of samples and sample groups.
LCA Support: LCA support values.

Member Rows (column "level" = member):
These rows contain results of the analyses per protein. Multiple rows are possible for a protein if the task yields different results. If no results are obtained for a protein, it is marked as 'unclassified.'

"members_identifier": Accession of protein
Task Results: Functional or taxonomic annotation of this protein.

Task Results:
Columns display results of taxonomic and functional annotations using the naming schema:
"task_{task_number}::{task_name}::{task_level}"
Group rows contain the weighted LCA for each task, while member rows show potentially different annotation results.
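
Outside of Excel, the group and member rows can be separated with the Python standard library. The sketch below relies only on the columns "#pg" and "level" described above; the function name read_summary, the column order and the content of the example rows are assumptions made for illustration.

```python
import csv
import io
from collections import defaultdict


def read_summary(handle):
    """Split summary.txt rows into group rows and member rows per '#pg'."""
    groups = {}
    members = defaultdict(list)
    for row in csv.DictReader(handle, delimiter="\t"):
        if row["level"] == "group":
            groups[row["#pg"]] = row
        elif row["level"] == "member":
            members[row["#pg"]].append(row)
    return groups, members


# Made-up excerpt with only a few of the real columns.
example = (
    "#pg\tlevel\tmembers_count\tmembers_identifier\n"
    "0\tgroup\t2\tacc1;acc2\n"
    "0\tmember\t\tacc1\n"
    "0\tmember\t\tacc2\n"
)
groups, members = read_summary(io.StringIO(example))
print(groups["0"]["members_count"], len(members["0"]))
```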

Quantification Results:
Columns providing raw quantification values (e.g. spectral counts) and standard deviations (sd) of samples and sample groups use the following naming schema:

raw_quant::group_name (mean)
raw_quant_sd::group_name (mean)
raw_quant::sample_name
raw_quant_sd::sample_name

Columns providing normalized quantification values and standard deviations (sd) of samples and sample groups use the following naming schema:

quant::group_name(mean)
quant_sd::group_name(mean)
quant::sample_name
quant_sd::sample_name

Spectra columns:
If spectra information is available (possible for mzTab and mzIdentML 1.2), it is shown in the columns "{sample_name}:: spectra_IDs".

LCA Support Results:
Columns indicating LCA support use the following naming schema:
task_{task_nb}::{task}::{level}::lca-support
and the following value format:
number_supporting_proteins/number_all_protein_group_proteins (e.g. 150/150)
OR if threshold is not passed
(highest_possible_number_supporting_proteins)/number_all_protein_group_proteins (e.g. (60)/150)

The first number indicates the number of proteins (or spectra if available) supporting the chosen LCA.
The second number represents the total count of identified proteins (or spectra if available) for the entire group.
Values in brackets, e.g. "(60)/150", indicate that no LCA with support above the given threshold could be identified.
A value of "0/0" is used when no lineage information is available, corresponding to 'unclassified' entries.
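
For downstream filtering, the support values can be parsed into numbers. This is a minimal sketch based on the value format described above; the regular expression and the function name parse_lca_support are assumptions, not part of Prophane.

```python
import re

# Either "(n)/total" (threshold not passed) or "n/total" (threshold passed).
SUPPORT = re.compile(r"^(?:\((?P<failed>\d+)\)|(?P<passed>\d+))/(?P<total>\d+)$")


def parse_lca_support(value):
    """Return (supporting, total, threshold_passed) for an lca-support value."""
    match = SUPPORT.match(value.strip())
    if match is None:
        raise ValueError(f"unexpected lca-support value: {value!r}")
    passed = match.group("failed") is None
    supporting = match.group("passed") if passed else match.group("failed")
    return int(supporting), int(match.group("total")), passed


for value in ["150/150", "(60)/150", "0/0"]:
    print(value, parse_lca_support(value))
```

Note that "0/0" marks unclassified entries, so dividing the two numbers to obtain a support fraction requires a check for a total of zero.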

Krona Plots

Krona plots are an interactive visualization that lets users explore and interpret the taxonomic or functional abundances in metaproteomic data. Prophane automatically generates a Krona plot for each task. The size of each field in the plot corresponds to the normalized quantification values and thus represents the relative abundance. Krona plots support the viewing of results for different sample groups and individual samples and provide an interactive zoom function.

Funding and Support

This project is funded by:

Deutsche Forschungsgemeinschaft (DFG), "Research Software Sustainability"
de.NBI, computational resources and support

Contact

In case of questions or problems, do not hesitate to contact us:

support@prophane.de