pygenesig core

Abstract classes and cross validation (pygenesig.validation)

pygenesig’s main module: Contains the abstract base classes for signature generation and testing and provides tools for cross-validation.

class pygenesig.validation.SignatureGenerator(expr, target)[source]

Bases: object

Abstract base-class for generating gene signatures. Child classes build gene signatures from a gene expression matrix and target labels.

When initializing SignatureGenerator the input data is checked for consistency. You can override the __init__ method in your implementation of SignatureGenerator to pass additional parameters, however, make always sure to call this method in your __init__ function to take advantage of the consistency check:

def __init__(self, expr, target, your_param):
    super(YourSignatureGenerator, self).__init__(expr, target)
    # your code goes here
Parameters
  • expr (np.ndarray) – m x n matrix with m samples and n genes

  • target (array-like) – m-vector with true tissue for each sample

abstract _mk_signatures(expr, target)[source]

Implement this method. Generates signatures for a list of tissues given a gene expression matrix. This method is called internally by the public method mk_signatures, which takes care of subsetting the expression data.

Parameters
  • expr – m x n matrix with m samples and n genes

  • target – vector with true tissue for each sample

Returns

tissue_name -> [list, of, enriched, genes]. The indices correspond to expr.index.

Return type

dict

Note

When implementing this method, make sure, that every tissue in this.target has an entry in the dictionary, even when the signature does not contain any genes.

mk_signatures(subset=None)[source]

Make gene signatures on a subset of the samples in the gene expression matrix.

Parameters

subset – Sample indices (columns of expression matrix) to use for signature generation. Useful for cross-validation. If omitted, all samples will be used.

Returns

signature dictionary. Maps target labels to a list of row-indices of the gene expression matrix.

Example:

signatures = {
    'adipose tissue' : [
        42,
        124,
        256,
        1038,
        ... ],
    'skeletal muscle' : [
        52,
        679,
        ... ],
    ...
}

Return type

dict

class pygenesig.validation.SignatureTester(expr, target)[source]

Bases: object

Abstract base-class for testing gene signatures. Child classes test if signatures are able to identify their respective samples, given an expression matrix and a list of the actual tissues.

When initializing SignatureTester the input data is checked for consistency. You can override the __init__ method in your implementation of SignatureTester to pass additional parameters, however, make always sure to call this method in your __init__ function to take advantage of the consistency check:

def __init__(self, expr, target, your_param):
    super(YourSignatureTester, self).__init__(expr, target)
    # your code goes here
Parameters
  • expr (np.ndarray) – m x n gene expression matrix with m samples and n genes

  • target (array-like) – m-vector with true tissue for each sample (ground truth)

abstract _score_signatures(expr, signatures)[source]

Implement this method.

Generates a score for each sample and signature.

Parameters
  • expr (np.ndarray) – m x n gene expression matrix with m genes and n samples

  • signatures (dict of list) – Signatures dictionary with j signatures. Dictionary: tissue_name -> [list, of, enriched, genes].

Returns

j x n score matrix with j signatures and n samples containing the scores generated by this method for each sample and signature. The rows are sorted by SignatureGenerator.sort_signatures.

Return type

np.ndarray

Important

Make sure the rows in the result matrix are sorted by SignatureGenerator.sort_signatures. Make sure that a row for each signature in signatures is included in the matrix, even if the signatures is empty (number of genes = 0).

classify(signatures, score_matrix, subset=None)[source]

Test signatures based on the expression matrix. Predicts the class labels using the signatures. A sample is predicted as the label associated with the highest scoring signature.

Parameters
  • signatures (dict[str, list]) – Signatures dictionary returned by SignatureGenerator.mk_signature.

  • score_matrix – j x n score matrix with j signatures and n samples generated by SignatureTester.score_signatures

  • subset – sample indices (columns of expression matrix) to use for testing. Useful for crossvalidation.

Returns

list of actual labels and predicted labels.

Return type

(list, list)

confusion_matrix(signatures, actual, predicted)[source]

Make a confusion matrix.

This is a wrapper for the sklearn.metrics.confusion_matrix. Makes the matrix contain all labels available in the signature.

Note

The rows and columns are sorted with SignatureTester.sort_signatures. You can therefore use sort_signatures to label you confusion matrix.

Parameters
  • signatures – dict of signatures that was used to predict the labels

  • actual – list of actual labels

  • predicted – list of predicted labels

Returns

confusion matrix

Return type

numpy.matrix

score_signatures(signatures, subset=None)[source]

Generates a score for each sample and signature. A high score represents a high representation of the signature in the sample.

Parameters
  • signatures (dict of list) – Signatures dictionary with j signatures. Dictionary: tissue_name -> [list, of, enriched, genes].

  • subset – sample indices (columns of expression matrix) to use for testing. Useful for crossvalidation.

Returns

j x n score matrix with j signatures and n samples containing the scores generated by this method for each sample and signature. The rows are sorted by SignatureGenerator.sort_signatures.

Return type

np.ndarray

static sort_signatures(signatures)[source]

Retrieve the signatures in a consistent order.

The signature dictionary is a python default dict, which is unordered. However, when displaying results, it is desirable to display the signatures in a consistent order.

You can use this method to iterate over the signature dictionary:

for tissue in sort_signatures(signatures):
    print(tissue, signatures[tissue])

Confusion matrices generated with SignatureTester.confusion_matrix are also sorted with sort_signatures.

Parameters

signatures – signature dictionary.

Returns

sorted keys of the signature dictionary.

Return type

list

exception pygenesig.validation.SignatureTesterException[source]

Bases: Exception

pygenesig.validation.cv_score(expr_file, target_file, signature_generator, signature_tester, splitter=StratifiedKFold(n_splits=10, random_state=None, shuffle=False), sg_kwargs={}, st_kwargs={})[source]

Perform a crossvalidation by generating and testing signatures on a given expression matrix.

Parameters
  • expr_file (str) – path to numpy.array (binary) file containing expression matrix saved with numpy.save()

  • target_file (str) – path to plaintext file containing target classes, line-by-line

  • signature_generator (SignatureGenerator) – SignatureGenerator used to derive the signatures

  • signature_tester (SignatureTester) – SignatureTester used to check the quality of signatures

  • splitter (sklearn.model_selection._BaseKFold) – crossvalidation method from scikit-learn.

Returns

list of signature dictionaries containing the signatures generated on each fold, list of confusion matrices containing the confusion matrices calculated on each fold, list of indices used for training for each fold, list of indices used for testing for each fold.

Return type

(list, list, list, list)

pygenesig.validation.filter_samples(exprs, target, n_splits)[source]

Remove all categories and the respective samples from the input data that have not at least n_splits samples per category.

This is useful when performing a cross-validation with n folds, as we need at least n samples per category.

Parameters
  • exprs

  • target

  • n_splits

Returns

target:

Return type

exprs

File input/output (pygenesig.file_formats)

pygenesig.file_formats.load_gmt(file)[source]

Deprecated. Alias for read_gmt.

pygenesig.file_formats.make_rosetta_dict(array, inverse=False)[source]

convert an array to a dictonary mapping the index to the corresponding array entry. Use inverse to reverse the mapping, i.e. the array entry to the index.

Parameters
  • array (array-like) –

  • inverse (boolean) –

Returns

the map

Return type

dict

pygenesig.file_formats.read_expr(expr_file)[source]

Read a m x n gene expression matrix from a numpy object.

pygenesig.file_formats.read_gct(file)[source]

Read a GCT file to a gene expression matrix.

Parameters

file (str) – path to GCT file

Returns

gene expression matrix.

Return type

np.array

pygenesig.file_formats.read_gmt(file)[source]

Read a GMT file into a signature dictionary.

Parameters

file – path to GMT file

Returns

signature dictionary

Example:

{
    "tissue1" : [list, of, gene, ids],
    "tissue2" : [list, of, other, genes],
    ...
}

Return type

dict of list

pygenesig.file_formats.read_rosetta(rosetta_file, as_dict=True, inverse=False)[source]

Given a m x n gene expression matrix with m genes and n samples. Read an m-array with one identifier for each gene.

If as_dict is True, this will be converted into a dictionary mapping the gene-index to the gene identifier. This can be used to map the index-based signatures back to gene symbols.

If inverse is True, the mapping is inverse, and the gene-symbol will be mapped to the gene-index.

Parameters
  • rosetta_file (str) – path to file

  • as_dict (boolean) – If True, a dictionary will be returned. Else it will be a flat numpy.array.

  • inverse (boolean) – If true, map gene-symbol to index.

Important

Be carefule when using inverse = True when the list in rosetta_file is not unique. In that case only the last entry makes it into the list!

Returns

mapping or numpy array.

Return type

dict or np.array

pygenesig.file_formats.read_target(target_file)[source]

Given a m x n gene expression matrix with m genes and n samples. Read an n-array with one target annotation for each sample.

pygenesig.file_formats.write_expr(expression_matrix, file)[source]

Store a m x n gene expression matrix as numpy object.

pygenesig.file_formats.write_gct(file, exprs, samples=None, description=None, name=None)[source]

Write a gct file.

Parameters
  • file (str) – path to output file

  • exprs (np.array) – m x n matrix with m genes and n samples

  • samples – n-array with the labels for the n samples

  • description – m array with a description for each gene (e.g. gene symbol)

  • name – m array with the name for each gene (e.g. gene index)

pygenesig.file_formats.write_gmt(signatures, file, description='na')[source]

Writes signatures to a GMT file.

Parameters
  • signatures (dict of list) – signature dictionary

  • file – path to output file

  • description – text to fill in the gmt description field.

pygenesig.file_formats.write_rosetta(rosetta_array, rosetta_file)[source]

Alias for write_feature.

Given a m x n gene expression matrix with m genes and n samples. Write a m-array with one identifier for each gene.

This can be used to map the index-based signature back to gene symbols.

pygenesig.file_formats.write_target(target_array, file)[source]

Given a m x n gene expression matrix with m genes and n samples. Write an n-array with one target annotation for each sample.

Signature Generators

Gini Index (pygenesig.gini)

The Gini-coefficient (gini index) measures the inequality among values of a frequency distribution. Originally applied to measure income inequality we can also use it on gene expression data to find genes that are over-represented in certain samples.

class pygenesig.gini.GiniSignatureGenerator(expr, target, min_gini=0.7, max_rk=3, min_expr=1, max_rel_rk=0.33, aggregate_fun=<function median>)[source]

Bases: pygenesig.validation.SignatureGenerator

Use gini index to generate gene signatures.

Genes, which are specific for a tissue result in a high gini index, whereas genes equally present in all tissues have a gini index close to zero.

The idea is, that genes with a high gini index will reliably identify their tissue of origin.

Parameters
  • expr (np.ndarray) – m x n matrix with m samples and n genes

  • target (array-like) – m-vector with true tissue for each sample

  • min_gini (float) – gini cutoff, genes need to have a gini index larger than this value.

  • max_rk (int) – rank cutoff, include genes if they rank <= max_rank among all tissues.

  • max_rel_rk (float) – rank cutoff, include genes if they rank <= max_rel_rk * m among all genes in the current sample. Default: .33, i.e. genes need to appear in the first third of the sample.

  • min_expr (float) – genes need to have at least an expression >= min_expr to be included.

  • aggregate_fun (function) – function used to aggregate samples of the same tissue.

get_rogini_format()[source]
pygenesig.gini.get_gini_signatures(df_aggr, min_gini=0.7, max_rk=3, max_rel_rk=None, min_expr=None)[source]

Generate gene signatures using gini index.

Finds over-represented genes for each sample group, given the specified thresholds.

Parameters
  • df_aggr (pd.DataFrame) – m x n pandas DataFrame with m Genes and n tissues

  • min_gini (float) – gini cutoff, genes need to have a gini index larger than this value.

  • max_rk (int) – rank cutoff, include genes if they rank <= max_rank among all tissues.

  • max_rel_rk (float) – rank cutoff, include genes if they rank <= mal_rel_rk * m among all genes in the current sample.

  • min_expr (float) – minimal expression

Returns

A signature dictionary.

Example:

{
    "tissue1" : [list, of, gene, ids],
    "tissue2" : [list, of, other, genes],
    ...
}

Return type

dict of list

pygenesig.gini.get_rogini_format(df_aggr, min_gini=0.7, max_rk=None, min_expr=1)[source]

Imitate the rogini output format.

Parameters
  • df_aggr (pd.DataFrame) – m x n pandas DataFrame with m Genes and n tissues

  • min_gini (float) – gini cutoff, genes need to have a gini index larger than this value.

  • max_rk (int) – rank cutoff, include genes if they rank <= max_rank among all tissues.

  • min_expr (float) – minimal expression

Returns

equal to rogini output.

Example:

GENEID  CATEGORY        VALUE   RANKING GINI_IDX
54165   Whole blood (ribopure)  491.359000      1       0.441296
54165   CD34 cells differentiated to erythrocyte lineage        148.124000      2       0.441296
54165   Mast cell - stimulated  68.973000       3       0.441296
101927596       CD4+CD25-CD45RA+ naive conventional T cells     15.505000       1       0.948804
101927596       CD8+ T Cells (pluriselect)      15.376000       2       0.948804
101927596       CD4+CD25-CD45RA- memory conventional T cells    10.769000       3       0.948804

Return type

pd.DataFrame

pygenesig.gini.gini(array)[source]

Calculate the Gini coefficient of a numpy array.

Based on: https://github.com/oliviaguest/gini

Parameters

array (array-like) – input array

Returns

gini-index of array

Return type

float

>>> a = np.zeros((10000))
>>> a[0] = 1.0
>>> '%.3f' % gini(a)
'1.000'
>>> a = np.ones(100)
>>> '%.3f' % gini(a)
'0.000'
>>> a = np.random.uniform(-1,0,1000000)
>>> '%.2f' % gini(a)
'0.33'

Differential Expression (pygenesig.limma)

MCPCounter (pygenesig.mcp_counter)

MCPCounter is a method described by Becht et al., Genome Biology (2015) for deconvolution of tissue-infiltrating immune and stromal cell populations.

Besides presenting a method for gene-expression based deconvolution they put a lot of effort into curating signatures.

This module

  • re-implements MCP counter in python as SignatureTester

  • implements a SignatureGenerator based on the MCP method for curating signatures.

class pygenesig.mcp_counter.MCPSignatureGenerator(expr, target, min_fc=2, min_sfc=1.5, min_auc=0.97)[source]

Bases: pygenesig.validation.SignatureGenerator

Implements the procedure described by Becht et al. for curating signatures. A gene is considered a valid Transcription Marker (TM) if it meets the following criteria

  • fold change >= min_fc (default=2)

  • specific fold change >= min_sfc (default=1.5)

  • AUC ROC >= min_auc (default=0.97)

Parameters
  • expr – m x n gene expression matrix with m genes and n samples.

  • target – m-vector with true tissue for each sample

  • min_fc – minimal fold change for a gene to be considered as valid TM

  • min_sfc – minimal specific fold change for a gene to be considered as valid TM

  • min_auc – minimal ROC AUC for a gene to be considered as valid TM

pygenesig.mcp_counter.fold_change(expr, positive_mask)[source]

Compute the fold change between the positive and negative samples for a single gene G.

According to Becht et al. the fold change is defined as

\[FC = X - \overline{X}\]

where \(X\) is the mean of positive and \(\overline{X}\) is the mean of the negative samples.

Parameters
  • expr (np.ndarray) – expression of G for each sample.

  • positive_mask (np.ndarray, dtype=np.bool) – boolean mask for expr indicating which samples belong to the positive class.

Returns

fold change

Return type

float

>>> expr = np.array([[2, 3, 5, 4, 9, 15]])
>>> target = np.array(["A", "B", "C", "A", "B", "C"])
>>> fold_change(expr, target == "A")
array([-5.])
pygenesig.mcp_counter.roc_auc(expr, positive_mask)[source]

Compute the Receiver Operator Characteristics Area under the Curve (ROC AUC) for a single gene G. This tells how well the gene discriminates between the two classes.

This is a wrapper for the scikit-learn roc_auc_score function.

Parameters
  • expr (np.ndarray) – expression of G for each sample.

  • positive_mask (np.ndarray, dtype=np.bool) – boolean mask for expr indicating which samples belong to the positive class.

Returns

roc auc score

Return type

float

>>> expr = np.array([2, 3, 5, 4, 9, 15])
>>> target = np.array(["A", "B", "C", "A", "B", "C"])
>>> roc_auc(expr, target == "A")
0.125
pygenesig.mcp_counter.specific_fold_change(expr, positive_mask, negative_masks)[source]

Compute the specific fold change of the positive class with respect to all other classes for a single gene G.

According to Becht et al. the specific fold change is defined as

\[sFC = (X - \overline{X}_{min})/(\overline{X}_{max} - \overline{X}_{min})\]

where \(X\) is the mean of positive and \(\overline{X}_{max}\) is the maximum mean over all negative classes and \(\overline{X}_{min}\) is the minimal mean over all negative classes.

Parameters
  • expr (np.ndarray) – expression of G for each sample.

  • positive_mask (np.ndarray) – boolean mask for expr indicating which samples belong to the positive class.

  • negative_masks (list of np.ndarray) – list of boolean masks for expr indicating which samples belong to the different negative classes.

Returns

specific fold change

Return type

float

>>> expr = np.array([[2, 3, 5, 4, 9, 15], [2, 3, 5, 4, 9, 15]])
>>> target = np.array(["A", "B", "C", "A", "B", "C"])
>>> specific_fold_change(expr, target == 'A', [target == "B", target == "C"])
array([-0.75, -0.75])

Signature Testers

BioQC (pygenesig.bioqc)

MCPCounter (pygenesig.mcp_counter)

MCPCounter is a method described by Becht et al., Genome Biology (2015) for deconvolution of tissue-infiltrating immune and stromal cell populations.

Besides presenting a method for gene-expression based deconvolution they put a lot of effort into curating signatures.

This module

  • re-implements MCP counter in python as SignatureTester

  • implements a SignatureGenerator based on the MCP method for curating signatures.

class pygenesig.mcp_counter.MCPSignatureTester(expr, target)[source]

Implements the MCPCounter described by Becht et al. in python.

The principle is super-simple: take the mean of all marker genes as indicator. Also see their R script.

Auxilary functions (pygenesig.tools)

pygenesig.tools.collapse_matrix(mat, group_by, axis=0, aggregate_fun=<function median>)[source]

Aggregate expression by annotation (collapse samples of the same tissue)

Parameters
  • mat (np.array) – m x n gene expression matrix with m genes and n samples.

  • group_by (list-like) – list of length m (if axis=0) or list of length n (if axis=1)

  • axis (int) – 0 for rows, 1 for columns

  • aggregate_fun (function) – aggregate to apply, defaults to numpy.median

Returns

collapsed matrix with annotation from group_by.

Return type

pd.DataFrame

pygenesig.tools.combine_signatures(*args, function=<method 'intersection' of 'set' objects>)[source]

Combine signatures (e.g. by taking the intersection) :param *args: list of signature dictonaries :param function: set operation to combine the signatures

Returns

combined signature dictionary.

Return type

dict of set

>>> s1 = {"A": {1, 2, 3}, "B": {2, 3, 4}}
>>> s2 = {"A": {1, 3, 4, 5}, "B": {42}}
>>> pprint(combine_signatures(s1, s2))
{'A': {1, 3}, 'B': set()}
pygenesig.tools.jaccard_ind(set1, set2, *args)[source]

Computes the Jaccard-Index of two or more sets.

Parameters
  • set1 (list-like) –

  • set2 (list-like) –

  • *args – arbitrary number of more sets.

Returns

jaccard index of all sets

Return type

float

pygenesig.tools.jaccard_mat(sigs1, sigs2, colname1='set_1', colname2='set_2', as_matrix=False)[source]

Compute a matrix of jaccard indices to compute the overlap of two signature sets.

Parameters
  • sigs1 – signature dictionary

  • sigs2 – signature dictionary

  • colname1 – Name of the column for sigs1 in the dataframe

  • colname2 – Name of the column for sigs2 in the dataframe

  • as_matrix – if False, a long-form dataframe will be returned, if True, a 2d matrix will be returned instead.

Returns

Matrix of Jaccard indices in long format

Return type

pd.DataFrame

Plot the overlap of signatures: >>> import seaborn as sns >>> signatures = load_gmt(TESTDATA / “bioqc/test_bioqc_log_pvalue.gmt”) >>> df = jaccard_mat(signatures, signatures) >>> sns.heatmap(df.pivot(*df.columns)) # doctest: +ELLIPSIS <AxesSubplot:…>

pygenesig.tools.normalize(array)[source]

normalize a vector to values between 0 and 1

pygenesig.tools.normalize_sum(array)[source]

normalize a vector to sum to 1

pygenesig.tools.pairwise_jaccard_ind(list_of_signatures)[source]

Compute the pairwise jaccard index for a list of signature sets. Useful for calculating the pariwise overlap between the different crossvalidation folds.

Parameters

list_of_signatures – list of signature dicts.

Returns

signature_name -> [list, of, jaccard, indices]

Return type

dict

Note

takes the signature names from the first dict in list_of_signatures to build the output dictionary.

pygenesig.tools.performance_per_tissue(list_of_confusion_matrices, sig_labels, perf_fun)[source]

Compute per-tissue performance measures from all-against-all confusion matrices.

Parameters
  • list_of_confusion_matrices (list of np.array) – list of confusion matrices

  • sig_labels (array-like) – list of signatures in the same order as in the confusion matrices.

  • perf_fun (function) – (TP, FN, TP, TN) computing a performance measure from the binary confusion matrix. See perfmeasures module.

Returns

signature_name -> list of performance meausures for each confusion matrix provided.

Return type

dict

pygenesig.tools.translate_signatures(signatures, rosetta, ignore_missing=False)[source]

Translate gene identifiers in a signature dictionary.

Parameters
  • signatures (dict of list) – signature dictionary

  • rosetta (dict) – translation table mapping one gene identifier to another

  • ignore_missing (boolean) – If true, no error will be raised if an identifier is not in the translation dictionary. Respective entries will be skipped.

Returns

translated signature dictionary

Return type

dict of list

Raises

KeyError – if a gene is not in the rosetta dictionary unless ignore_missing is specified

Collection of functions to compute various performance measures from a 2x2 confusion matrix. As an introduction to evaluating classificators, we recommend reading this paper about ROC analysis.

TP = true positive

FP = false positive

TN = true negative

FN = false negative

pygenesig.perfmeasures.f1_neg(TP, FN, FP, TN)[source]

f1-measure on negative instances.

pygenesig.perfmeasures.f1_pos(TP, FN, FP, TN)[source]

f1-measure on positive instances

pygenesig.perfmeasures.mcc(TP, FN, FP, TN)[source]

Matthews Correlation Coefficient

pygenesig.perfmeasures.prec_neg(TP, FN, FP, TN)[source]

Negative Precision

pygenesig.perfmeasures.prec_pos(TP, FN, FP, TN)[source]

Posivitve Precision

pygenesig.perfmeasures.recall_neg(TP, FN, FP, TN)[source]

Negative recall

pygenesig.perfmeasures.recall_pos(TP, FN, FP, TN)[source]

Positive recall

pygenesig.perfmeasures.sens(TP, FN, FP, TN)[source]

Sensitivity

pygenesig.perfmeasures.spec(TP, FN, FP, TN)[source]

Specificity