pygenesig core¶
Abstract classes and cross validation (pygenesig.validation
)¶
pygenesig’s main module: Contains the abstract base classes for signature generation and testing and provides tools for cross-validation.
- class pygenesig.validation.SignatureGenerator(expr, target)[source]¶
Bases:
object
Abstract base-class for generating gene signatures. Child classes build gene signatures from a gene expression matrix and target labels.
When initializing
SignatureGenerator
the input data is checked for consistency. You can override the__init__
method in your implementation ofSignatureGenerator
to pass additional parameters, however, make always sure to call this method in your__init__
function to take advantage of the consistency check:def __init__(self, expr, target, your_param): super(YourSignatureGenerator, self).__init__(expr, target) # your code goes here
- Parameters
expr (np.ndarray) – m x n matrix with m samples and n genes
target (array-like) – m-vector with true tissue for each sample
- abstract _mk_signatures(expr, target)[source]¶
Implement this method. Generates signatures for a list of tissues given a gene expression matrix. This method is called internally by the public method
mk_signatures
, which takes care of subsetting the expression data.- Parameters
expr – m x n matrix with m samples and n genes
target – vector with true tissue for each sample
- Returns
tissue_name -> [list, of, enriched, genes]. The indices correspond to expr.index.
- Return type
dict
Note
When implementing this method, make sure, that every tissue in
this.target
has an entry in the dictionary, even when the signature does not contain any genes.
- mk_signatures(subset=None)[source]¶
Make gene signatures on a subset of the samples in the gene expression matrix.
- Parameters
subset – Sample indices (columns of expression matrix) to use for signature generation. Useful for cross-validation. If omitted, all samples will be used.
- Returns
signature dictionary. Maps target labels to a list of row-indices of the gene expression matrix.
Example:
signatures = { 'adipose tissue' : [ 42, 124, 256, 1038, ... ], 'skeletal muscle' : [ 52, 679, ... ], ... }
- Return type
dict
- class pygenesig.validation.SignatureTester(expr, target)[source]¶
Bases:
object
Abstract base-class for testing gene signatures. Child classes test if signatures are able to identify their respective samples, given an expression matrix and a list of the actual tissues.
When initializing
SignatureTester
the input data is checked for consistency. You can override the__init__
method in your implementation ofSignatureTester
to pass additional parameters, however, make always sure to call this method in your__init__
function to take advantage of the consistency check:def __init__(self, expr, target, your_param): super(YourSignatureTester, self).__init__(expr, target) # your code goes here
- Parameters
expr (np.ndarray) – m x n gene expression matrix with m samples and n genes
target (array-like) – m-vector with true tissue for each sample (
ground truth
)
- abstract _score_signatures(expr, signatures)[source]¶
Implement this method.
Generates a score for each sample and signature.
- Parameters
expr (np.ndarray) – m x n gene expression matrix with m genes and n samples
signatures (dict of list) – Signatures dictionary with j signatures. Dictionary: tissue_name -> [list, of, enriched, genes].
- Returns
j x n score matrix with j signatures and n samples containing the scores generated by this method for each sample and signature. The rows are sorted by
SignatureGenerator.sort_signatures
.- Return type
np.ndarray
Important
Make sure the rows in the result matrix are sorted by
SignatureGenerator.sort_signatures
. Make sure that a row for each signature insignatures
is included in the matrix, even if the signatures is empty (number of genes = 0).
- classify(signatures, score_matrix, subset=None)[source]¶
Test signatures based on the expression matrix. Predicts the class labels using the signatures. A sample is predicted as the label associated with the highest scoring signature.
- Parameters
signatures (dict[str, list]) – Signatures dictionary returned by
SignatureGenerator.mk_signature
.score_matrix – j x n score matrix with j signatures and n samples generated by
SignatureTester.score_signatures
subset – sample indices (columns of expression matrix) to use for testing. Useful for crossvalidation.
- Returns
list of actual labels and predicted labels.
- Return type
(list, list)
- confusion_matrix(signatures, actual, predicted)[source]¶
Make a confusion matrix.
This is a wrapper for the
sklearn.metrics.confusion_matrix
. Makes the matrix contain all labels available in the signature.Note
The rows and columns are sorted with
SignatureTester.sort_signatures
. You can therefore usesort_signatures
to label you confusion matrix.- Parameters
signatures – dict of signatures that was used to predict the labels
actual – list of actual labels
predicted – list of predicted labels
- Returns
confusion matrix
- Return type
numpy.matrix
- score_signatures(signatures, subset=None)[source]¶
Generates a score for each sample and signature. A high score represents a high representation of the signature in the sample.
- Parameters
signatures (dict of list) – Signatures dictionary with j signatures. Dictionary: tissue_name -> [list, of, enriched, genes].
subset – sample indices (columns of expression matrix) to use for testing. Useful for crossvalidation.
- Returns
j x n score matrix with j signatures and n samples containing the scores generated by this method for each sample and signature. The rows are sorted by
SignatureGenerator.sort_signatures
.- Return type
np.ndarray
- static sort_signatures(signatures)[source]¶
Retrieve the signatures in a consistent order.
The signature dictionary is a python default dict, which is unordered. However, when displaying results, it is desirable to display the signatures in a consistent order.
You can use this method to iterate over the signature dictionary:
for tissue in sort_signatures(signatures): print(tissue, signatures[tissue])
Confusion matrices generated with
SignatureTester.confusion_matrix
are also sorted withsort_signatures
.- Parameters
signatures – signature dictionary.
- Returns
sorted keys of the signature dictionary.
- Return type
list
- pygenesig.validation.cv_score(expr_file, target_file, signature_generator, signature_tester, splitter=StratifiedKFold(n_splits=10, random_state=None, shuffle=False), sg_kwargs={}, st_kwargs={})[source]¶
Perform a crossvalidation by generating and testing signatures on a given expression matrix.
- Parameters
expr_file (str) – path to numpy.array (binary) file containing expression matrix saved with
numpy.save()
target_file (str) – path to plaintext file containing target classes, line-by-line
signature_generator (SignatureGenerator) –
SignatureGenerator
used to derive the signaturessignature_tester (SignatureTester) –
SignatureTester
used to check the quality of signaturessplitter (sklearn.model_selection._BaseKFold) – crossvalidation method from scikit-learn.
- Returns
list of signature dictionaries containing the signatures generated on each fold, list of confusion matrices containing the confusion matrices calculated on each fold, list of indices used for training for each fold, list of indices used for testing for each fold.
- Return type
(list, list, list, list)
- pygenesig.validation.filter_samples(exprs, target, n_splits)[source]¶
Remove all categories and the respective samples from the input data that have not at least n_splits samples per category.
This is useful when performing a cross-validation with n folds, as we need at least n samples per category.
- Parameters
exprs –
target –
n_splits –
- Returns
target:
- Return type
exprs
File input/output (pygenesig.file_formats
)¶
- pygenesig.file_formats.make_rosetta_dict(array, inverse=False)[source]¶
convert an array to a dictonary mapping the index to the corresponding array entry. Use
inverse
to reverse the mapping, i.e. the array entry to the index.- Parameters
array (array-like) –
inverse (boolean) –
- Returns
the map
- Return type
dict
- pygenesig.file_formats.read_expr(expr_file)[source]¶
Read a m x n gene expression matrix from a numpy object.
- pygenesig.file_formats.read_gct(file)[source]¶
Read a GCT file to a gene expression matrix.
- Parameters
file (str) – path to GCT file
- Returns
gene expression matrix.
- Return type
np.array
- pygenesig.file_formats.read_gmt(file)[source]¶
Read a GMT file into a signature dictionary.
- Parameters
file – path to GMT file
- Returns
signature dictionary
Example:
{ "tissue1" : [list, of, gene, ids], "tissue2" : [list, of, other, genes], ... }
- Return type
dict of list
- pygenesig.file_formats.read_rosetta(rosetta_file, as_dict=True, inverse=False)[source]¶
Given a m x n gene expression matrix with m genes and n samples. Read an m-array with one identifier for each gene.
If
as_dict
is True, this will be converted into a dictionary mapping the gene-index to the gene identifier. This can be used to map the index-based signatures back to gene symbols.If
inverse
is True, the mapping is inverse, and the gene-symbol will be mapped to the gene-index.- Parameters
rosetta_file (str) – path to file
as_dict (boolean) – If True, a dictionary will be returned. Else it will be a flat numpy.array.
inverse (boolean) – If true, map gene-symbol to index.
Important
Be carefule when using
inverse = True
when the list in rosetta_file is not unique. In that case only the last entry makes it into the list!- Returns
mapping or numpy array.
- Return type
dict or np.array
- pygenesig.file_formats.read_target(target_file)[source]¶
Given a m x n gene expression matrix with m genes and n samples. Read an n-array with one target annotation for each sample.
- pygenesig.file_formats.write_expr(expression_matrix, file)[source]¶
Store a m x n gene expression matrix as numpy object.
- pygenesig.file_formats.write_gct(file, exprs, samples=None, description=None, name=None)[source]¶
Write a gct file.
- Parameters
file (str) – path to output file
exprs (np.array) – m x n matrix with m genes and n samples
samples – n-array with the labels for the n samples
description – m array with a description for each gene (e.g. gene symbol)
name – m array with the name for each gene (e.g. gene index)
- pygenesig.file_formats.write_gmt(signatures, file, description='na')[source]¶
Writes signatures to a GMT file.
- Parameters
signatures (dict of list) – signature dictionary
file – path to output file
description – text to fill in the gmt description field.
Signature Generators¶
Gini Index (pygenesig.gini
)¶
The Gini-coefficient (gini index) measures the inequality among values of a frequency distribution. Originally applied to measure income inequality we can also use it on gene expression data to find genes that are over-represented in certain samples.
- class pygenesig.gini.GiniSignatureGenerator(expr, target, min_gini=0.7, max_rk=3, min_expr=1, max_rel_rk=0.33, aggregate_fun=<function median>)[source]¶
Bases:
pygenesig.validation.SignatureGenerator
Use gini index to generate gene signatures.
Genes, which are specific for a tissue result in a high gini index, whereas genes equally present in all tissues have a gini index close to zero.
The idea is, that genes with a high gini index will reliably identify their tissue of origin.
- Parameters
expr (np.ndarray) – m x n matrix with m samples and n genes
target (array-like) – m-vector with true tissue for each sample
min_gini (float) – gini cutoff, genes need to have a gini index larger than this value.
max_rk (int) – rank cutoff, include genes if they rank
<= max_rank
among all tissues.max_rel_rk (float) – rank cutoff, include genes if they rank <= max_rel_rk * m among all genes in the current sample. Default: .33, i.e. genes need to appear in the first third of the sample.
min_expr (float) – genes need to have at least an expression
>= min_expr
to be included.aggregate_fun (function) – function used to aggregate samples of the same tissue.
- pygenesig.gini.get_gini_signatures(df_aggr, min_gini=0.7, max_rk=3, max_rel_rk=None, min_expr=None)[source]¶
Generate gene signatures using gini index.
Finds over-represented genes for each sample group, given the specified thresholds.
- Parameters
df_aggr (pd.DataFrame) – m x n pandas DataFrame with m Genes and n tissues
min_gini (float) – gini cutoff, genes need to have a gini index larger than this value.
max_rk (int) – rank cutoff, include genes if they rank <= max_rank among all tissues.
max_rel_rk (float) – rank cutoff, include genes if they rank <= mal_rel_rk * m among all genes in the current sample.
min_expr (float) – minimal expression
- Returns
A signature dictionary.
Example:
{ "tissue1" : [list, of, gene, ids], "tissue2" : [list, of, other, genes], ... }
- Return type
dict of list
- pygenesig.gini.get_rogini_format(df_aggr, min_gini=0.7, max_rk=None, min_expr=1)[source]¶
Imitate the rogini output format.
- Parameters
df_aggr (pd.DataFrame) – m x n pandas DataFrame with m Genes and n tissues
min_gini (float) – gini cutoff, genes need to have a gini index larger than this value.
max_rk (int) – rank cutoff, include genes if they rank <= max_rank among all tissues.
min_expr (float) – minimal expression
- Returns
equal to rogini output.
Example:
GENEID CATEGORY VALUE RANKING GINI_IDX 54165 Whole blood (ribopure) 491.359000 1 0.441296 54165 CD34 cells differentiated to erythrocyte lineage 148.124000 2 0.441296 54165 Mast cell - stimulated 68.973000 3 0.441296 101927596 CD4+CD25-CD45RA+ naive conventional T cells 15.505000 1 0.948804 101927596 CD8+ T Cells (pluriselect) 15.376000 2 0.948804 101927596 CD4+CD25-CD45RA- memory conventional T cells 10.769000 3 0.948804
- Return type
pd.DataFrame
- pygenesig.gini.gini(array)[source]¶
Calculate the Gini coefficient of a numpy array.
Based on: https://github.com/oliviaguest/gini
- Parameters
array (array-like) – input array
- Returns
gini-index of
array
- Return type
float
>>> a = np.zeros((10000)) >>> a[0] = 1.0 >>> '%.3f' % gini(a) '1.000' >>> a = np.ones(100) >>> '%.3f' % gini(a) '0.000' >>> a = np.random.uniform(-1,0,1000000) >>> '%.2f' % gini(a) '0.33'
Differential Expression (pygenesig.limma
)¶
MCPCounter (pygenesig.mcp_counter
)¶
MCPCounter is a method described by Becht et al., Genome Biology (2015) for deconvolution of tissue-infiltrating immune and stromal cell populations.
Besides presenting a method for gene-expression based deconvolution they put a lot of effort into curating signatures.
This module
re-implements MCP counter in python as SignatureTester
implements a SignatureGenerator based on the MCP method for curating signatures.
- class pygenesig.mcp_counter.MCPSignatureGenerator(expr, target, min_fc=2, min_sfc=1.5, min_auc=0.97)[source]¶
Bases:
pygenesig.validation.SignatureGenerator
Implements the procedure described by Becht et al. for curating signatures. A gene is considered a valid
Transcription Marker
(TM) if it meets the following criteriafold change >=
min_fc
(default=2)specific fold change >=
min_sfc
(default=1.5)AUC ROC >=
min_auc
(default=0.97)
- Parameters
expr – m x n gene expression matrix with m genes and n samples.
target – m-vector with true tissue for each sample
min_fc – minimal fold change for a gene to be considered as valid TM
min_sfc – minimal specific fold change for a gene to be considered as valid TM
min_auc – minimal ROC AUC for a gene to be considered as valid TM
- pygenesig.mcp_counter.fold_change(expr, positive_mask)[source]¶
Compute the fold change between the positive and negative samples for a single gene
G
.According to Becht et al. the fold change is defined as
\[FC = X - \overline{X}\]where \(X\) is the mean of positive and \(\overline{X}\) is the mean of the negative samples.
- Parameters
expr (np.ndarray) – expression of
G
for each sample.positive_mask (np.ndarray, dtype=np.bool) – boolean mask for
expr
indicating which samples belong to the positive class.
- Returns
fold change
- Return type
float
>>> expr = np.array([[2, 3, 5, 4, 9, 15]]) >>> target = np.array(["A", "B", "C", "A", "B", "C"]) >>> fold_change(expr, target == "A") array([-5.])
- pygenesig.mcp_counter.roc_auc(expr, positive_mask)[source]¶
Compute the Receiver Operator Characteristics Area under the Curve (ROC AUC) for a single gene
G
. This tells how well the gene discriminates between the two classes.This is a wrapper for the scikit-learn roc_auc_score function.
- Parameters
expr (np.ndarray) – expression of
G
for each sample.positive_mask (np.ndarray, dtype=np.bool) – boolean mask for
expr
indicating which samples belong to the positive class.
- Returns
roc auc score
- Return type
float
>>> expr = np.array([2, 3, 5, 4, 9, 15]) >>> target = np.array(["A", "B", "C", "A", "B", "C"]) >>> roc_auc(expr, target == "A") 0.125
- pygenesig.mcp_counter.specific_fold_change(expr, positive_mask, negative_masks)[source]¶
Compute the specific fold change of the positive class with respect to all other classes for a single gene
G
.According to Becht et al. the specific fold change is defined as
\[sFC = (X - \overline{X}_{min})/(\overline{X}_{max} - \overline{X}_{min})\]where \(X\) is the mean of positive and \(\overline{X}_{max}\) is the maximum mean over all negative classes and \(\overline{X}_{min}\) is the minimal mean over all negative classes.
- Parameters
expr (np.ndarray) – expression of
G
for each sample.positive_mask (np.ndarray) – boolean mask for
expr
indicating which samples belong to the positive class.negative_masks (list of np.ndarray) – list of boolean masks for
expr
indicating which samples belong to the different negative classes.
- Returns
specific fold change
- Return type
float
>>> expr = np.array([[2, 3, 5, 4, 9, 15], [2, 3, 5, 4, 9, 15]]) >>> target = np.array(["A", "B", "C", "A", "B", "C"]) >>> specific_fold_change(expr, target == 'A', [target == "B", target == "C"]) array([-0.75, -0.75])
Signature Testers¶
BioQC (pygenesig.bioqc
)¶
MCPCounter (pygenesig.mcp_counter
)¶
MCPCounter is a method described by Becht et al., Genome Biology (2015) for deconvolution of tissue-infiltrating immune and stromal cell populations.
Besides presenting a method for gene-expression based deconvolution they put a lot of effort into curating signatures.
This module
re-implements MCP counter in python as SignatureTester
implements a SignatureGenerator based on the MCP method for curating signatures.
- class pygenesig.mcp_counter.MCPSignatureTester(expr, target)[source]¶
Implements the MCPCounter described by Becht et al. in python.
The principle is super-simple: take the mean of all marker genes as indicator. Also see their R script.
Auxilary functions (pygenesig.tools
)¶
- pygenesig.tools.collapse_matrix(mat, group_by, axis=0, aggregate_fun=<function median>)[source]¶
Aggregate expression by annotation (collapse samples of the same tissue)
- Parameters
mat (np.array) – m x n gene expression matrix with m genes and n samples.
group_by (list-like) – list of length m (if axis=0) or list of length n (if axis=1)
axis (int) – 0 for rows, 1 for columns
aggregate_fun (function) – aggregate to apply, defaults to
numpy.median
- Returns
collapsed matrix with annotation from
group_by
.- Return type
pd.DataFrame
- pygenesig.tools.combine_signatures(*args, function=<method 'intersection' of 'set' objects>)[source]¶
Combine signatures (e.g. by taking the intersection) :param *args: list of signature dictonaries :param function: set operation to combine the signatures
- Returns
combined signature dictionary.
- Return type
dict of set
>>> s1 = {"A": {1, 2, 3}, "B": {2, 3, 4}} >>> s2 = {"A": {1, 3, 4, 5}, "B": {42}} >>> pprint(combine_signatures(s1, s2)) {'A': {1, 3}, 'B': set()}
- pygenesig.tools.jaccard_ind(set1, set2, *args)[source]¶
Computes the Jaccard-Index of two or more sets.
- Parameters
set1 (list-like) –
set2 (list-like) –
*args – arbitrary number of more sets.
- Returns
jaccard index of all sets
- Return type
float
- pygenesig.tools.jaccard_mat(sigs1, sigs2, colname1='set_1', colname2='set_2', as_matrix=False)[source]¶
Compute a matrix of jaccard indices to compute the overlap of two signature sets.
- Parameters
sigs1 – signature dictionary
sigs2 – signature dictionary
colname1 – Name of the column for sigs1 in the dataframe
colname2 – Name of the column for sigs2 in the dataframe
as_matrix – if False, a long-form dataframe will be returned, if True, a 2d matrix will be returned instead.
- Returns
Matrix of Jaccard indices in long format
- Return type
pd.DataFrame
Plot the overlap of signatures: >>> import seaborn as sns >>> signatures = load_gmt(TESTDATA / “bioqc/test_bioqc_log_pvalue.gmt”) >>> df = jaccard_mat(signatures, signatures) >>> sns.heatmap(df.pivot(*df.columns)) # doctest: +ELLIPSIS <AxesSubplot:…>
- pygenesig.tools.pairwise_jaccard_ind(list_of_signatures)[source]¶
Compute the pairwise jaccard index for a list of signature sets. Useful for calculating the pariwise overlap between the different crossvalidation folds.
- Parameters
list_of_signatures – list of signature dicts.
- Returns
signature_name -> [list, of, jaccard, indices]
- Return type
dict
Note
takes the signature names from the first dict in list_of_signatures to build the output dictionary.
- pygenesig.tools.performance_per_tissue(list_of_confusion_matrices, sig_labels, perf_fun)[source]¶
Compute per-tissue performance measures from all-against-all confusion matrices.
- Parameters
list_of_confusion_matrices (list of np.array) – list of confusion matrices
sig_labels (array-like) – list of signatures in the same order as in the confusion matrices.
perf_fun (function) –
(TP, FN, TP, TN)
computing a performance measure from the binary confusion matrix. Seeperfmeasures
module.
- Returns
signature_name -> list of performance meausures for each confusion matrix provided.
- Return type
dict
- pygenesig.tools.translate_signatures(signatures, rosetta, ignore_missing=False)[source]¶
Translate gene identifiers in a signature dictionary.
- Parameters
signatures (dict of list) – signature dictionary
rosetta (dict) – translation table mapping one gene identifier to another
ignore_missing (boolean) – If true, no error will be raised if an identifier is not in the translation dictionary. Respective entries will be skipped.
- Returns
translated signature dictionary
- Return type
dict of list
- Raises
KeyError – if a gene is not in the rosetta dictionary unless ignore_missing is specified
Collection of functions to compute various performance measures from a 2x2 confusion matrix. As an introduction to evaluating classificators, we recommend reading this paper about ROC analysis.
TP = true positive
FP = false positive
TN = true negative
FN = false negative