kindred.EntityRecognizer

class kindred.EntityRecognizer(lookup, detectFusionGenes=False, detectMicroRNA=False, acronymDetectionForAmbiguity=False, mergeTerms=False, detectVariants=False, variantStopwords=None, detectPolymorphisms=False, removePathways=False)[source]

Annotates entities in a Corpus using exact dictionary matching with additional heuristics. These heuristics include detecting fusion gene mentions and microRNAs, identifying acronyms to reduce ambiguity, identifying variants, and more. All options are set through the constructor of this class.

Variables:
  • lookup – Used for the dictionary matching. A dictionary mapping each term (a tuple of parsed words) to a list of (entityType, externalID) tuples.
  • detectFusionGenes – Whether it will try to identify fusion gene terms (e.g. BCR-ABL1). Lookup must contain terms of type ‘gene’
  • detectMicroRNA – Whether it will identify microRNA terms (added as ‘gene’ entities)
  • acronymDetectionForAmbiguity – Whether it will try to identify acronyms and use this to deal with ambiguity (by removing incorrect matches to acronyms or the longer terms)
  • mergeTerms – Whether it will merge neighbouring terms that refer to the same external entity (e.g. HER2/neu as one term instead of two)
  • detectVariants – Whether it will identify a variant (e.g. V600E) and create an entity of type ‘variant’
  • variantStopwords – Variant terms to be ignored (e.g. S100P) if detectVariants is used
  • detectPolymorphisms – Whether it will identify a SNP (using a dbSNP ID) and create an entity of type ‘variant’
  • removePathways – Whether it will remove genes that are actually naming a signalling pathway (e.g. MTOR pathway)
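The shape of lookup and the idea of exact dictionary matching can be sketched with a toy matcher. This is an illustration only, not kindred's implementation: the function match_terms, the lowercasing of tokens, the longest-match-first strategy, and the placeholder IDs are all assumptions made for the example.

```python
# Toy illustration of exact dictionary matching over token tuples.
# Not kindred's code: match_terms, the lowercasing and the IDs are assumptions.

def match_terms(tokens, lookup):
    """Return (start, end, term, candidates) for each longest dictionary hit."""
    max_len = max((len(term) for term in lookup), default=0)
    matches = []
    i = 0
    while i < len(tokens):
        hit = None
        # Try the longest window first so multi-word terms win over shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            window = tuple(tok.lower() for tok in tokens[i:i + n])
            if window in lookup:
                hit = (i, i + n, window, lookup[window])
                break
        if hit:
            matches.append(hit)
            i = hit[1]  # continue after the matched span
        else:
            i += 1
    return matches

# Keys are tuples of words; values are lists of (entityType, externalID).
lookup = {
    ("epidermal", "growth", "factor", "receptor"): [("gene", "ID_EGFR")],
    ("egfr",): [("gene", "ID_EGFR")],
    ("lung", "cancer"): [("disease", "ID_LUNG_CANCER")],
}

tokens = "EGFR is mutated in lung cancer".split()
matches = match_terms(tokens, lookup)
for start, end, term, candidates in matches:
    print(start, end, " ".join(term), candidates)
```

A term can map to several (entityType, externalID) pairs, which is why each value is a list; options such as acronymDetectionForAmbiguity exist to help narrow such cases.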

Methods

__init__(lookup, detectFusionGenes=False, detectMicroRNA=False, acronymDetectionForAmbiguity=False, mergeTerms=False, detectVariants=False, variantStopwords=None, detectPolymorphisms=False, removePathways=False)[source]

Create an EntityRecognizer and provide the lookup table for terms and additional flags for what to identify in text

Parameters:
  • lookup (dict) – A dictionary mapping each term (a tuple of parsed words) to a list of (entityType, externalID) tuples.
  • detectFusionGenes (bool) – Whether to try to identify fusion gene terms (e.g. BCR-ABL1). Lookup must contain terms of type ‘gene’
  • detectMicroRNA (bool) – Whether to identify microRNA terms (added as ‘gene’ entities)
  • acronymDetectionForAmbiguity (bool) – Whether to try to identify acronyms and use this to deal with ambiguity (by removing incorrect matches to acronyms or the longer terms)
  • mergeTerms (bool) – Whether to merge neighbouring terms that refer to the same external entity (e.g. HER2/neu as one term instead of two)
  • detectVariants (bool) – Whether to identify a variant (e.g. V600E) and create an entity of type ‘variant’
  • variantStopwords (list) – Variant terms to be ignored (e.g. S100P) if detectVariants is used
  • detectPolymorphisms (bool) – Whether to identify a SNP (using a dbSNP ID) and create an entity of type ‘variant’
  • removePathways (bool) – Remove genes that are actually naming a signalling pathway (e.g. MTOR pathway)
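How detectVariants, variantStopwords and detectPolymorphisms interact can be sketched as follows. The regular expressions and the function find_variants are hypothetical stand-ins chosen for this example; the patterns kindred actually uses may differ.

```python
import re

# Hypothetical patterns for illustration; kindred's own rules may differ.
PROTEIN_VARIANT = re.compile(r"^[A-Z]\d+[A-Z]$")  # e.g. V600E
DBSNP_ID = re.compile(r"^rs\d+$")                 # e.g. rs12345

def find_variants(tokens, stopwords=None, detect_polymorphisms=False):
    """Return (index, token, 'variant') for tokens that look like variants."""
    stopwords = set(stopwords or [])
    found = []
    for index, token in enumerate(tokens):
        if token in stopwords:
            continue  # e.g. S100P is a gene name, not a variant
        if PROTEIN_VARIANT.match(token):
            found.append((index, token, "variant"))
        elif detect_polymorphisms and DBSNP_ID.match(token):
            found.append((index, token, "variant"))
    return found

tokens = ["BRAF", "V600E", "and", "S100P", "with", "rs12345"]
variants = find_variants(tokens, stopwords=["S100P"], detect_polymorphisms=True)
print(variants)
```

Without the stopword list, S100P would also be reported as a variant, which is the kind of false positive variantStopwords is meant to suppress.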

annotate(corpus)[source]

Annotate a parsed corpus using the wordlist lookup and any additional entity types enabled in the constructor.

Parameters: corpus (kindred.Corpus) – Corpus to annotate

static loadWordlists(entityTypesWithFilenames, idColumn=0, termsColumn=1, columnSeparator='\t', termSeparator='|')[source]

Load a wordlist from multiple files. By default, each file should be tab-delimited, with the ID in the first column and all the terms, separated by ‘|’, in the second column. These defaults can be changed with the parameters.

Because each term is parsed, this can take a long time. It is recommended to run this once, save the output as a Python pickle file, and load that file on later runs.
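One way to follow this recommendation is a small caching helper built on the standard pickle module. The helper name load_or_build and the cache path are assumptions made for this sketch; it works with any zero-argument function that builds the lookup.

```python
import os
import pickle

def load_or_build(cache_path, build):
    """Return the pickled object at cache_path, or build it and cache it."""
    if os.path.exists(cache_path):
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)
    result = build()  # the slow step, e.g. loading and parsing wordlists
    with open(cache_path, "wb") as handle:
        pickle.dump(result, handle)
    return result
```

With kindred installed, build could be a function that calls kindred.EntityRecognizer.loadWordlists with a dictionary of entity types to filenames (the filenames are whatever wordlist files you have). Only load pickle files you created yourself, since unpickling untrusted data can execute arbitrary code.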

Parameters:
  • entityTypesWithFilenames (dict) – Dictionary of entityType => filename
  • idColumn (int) – The column containing the ID for the term (starts from 0)
  • termsColumn (int) – The column containing the list of terms (starts from 0)
  • columnSeparator (str) – The column separator for the file (default is a tab)
  • termSeparator (str) – The separator for the list of terms (default is a ‘|’)
Returns: Dictionary of lookup values

Return type: dict
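To make the file layout and the column parameters concrete, the sketch below reads a wordlist in the default format with a toy reader. It is not kindred's loader: read_wordlist is a hypothetical function, the IDs are placeholders, and lowercasing plus whitespace splitting stand in for kindred's own parsing of each term.

```python
import io

def read_wordlist(handle, entity_type, idColumn=0, termsColumn=1,
                  columnSeparator="\t", termSeparator="|"):
    """Toy reader: map each term (tuple of words) to [(entityType, externalID)]."""
    lookup = {}
    for line in handle:
        line = line.rstrip("\n")
        if not line:
            continue
        columns = line.split(columnSeparator)
        external_id = columns[idColumn]
        for term in columns[termsColumn].split(termSeparator):
            # Assumption: lowercase + whitespace split replaces real parsing.
            key = tuple(term.lower().split())
            if not key:
                continue
            entry = (entity_type, external_id)
            if entry not in lookup.setdefault(key, []):
                lookup[key].append(entry)
    return lookup

# Default layout: ID in column 0, '|'-separated terms in column 1, tab between.
sample = (
    "ID_EGFR\tEGFR|epidermal growth factor receptor\n"
    "ID_BRAF\tBRAF|B-Raf\n"
)
lookup = read_wordlist(io.StringIO(sample), "gene")
for term, candidates in lookup.items():
    print(term, candidates)
```

The resulting dictionary has the same shape as the lookup expected by the EntityRecognizer constructor, although the keys produced by kindred's parser may be tokenized differently.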