Welcome to the Kindred documentation!

File Formats

Kindred can load several different file formats that contain text and the associated annotations. Below are examples of the different file formats, with code for loading them.

BioNLP Shared Task format

This format, used in BioNLP Shared Tasks, is a standoff format. This means that the text is stored in one file and the annotations in other files. The text is stored in the .txt file, the entity annotations in the .a1 file and the relations in the .a2 file. For a project, you may have a directory with many .txt files, perhaps one per document or one per sentence. Each .txt file then has its corresponding annotation files. If no relation annotations exist, the .a2 file may be missing.

Example file: example.txt

The colorectal cancer was caused by mutations in APC

Example file: example.a1

T1   disease 4 21    colorectal cancer
T2   gene 49 52      APC

Example file: example.a2

E1   causes subj:T2 obj:T1

The .txt file contains Unicode text and no annotations. The .a1 file contains the entity annotations. Each line is a separate annotation and contains three tab-delimited columns. The first column is a unique identifier, which is a T followed by a number. The second column contains the entity type and the start and end positions in the text, separated by spaces. The third column contains a copy of the text for this entity.

The .a2 file contains the relation annotations and also uses tab-delimited columns. The first column is a unique identifier for the relation. The second column gives the relation type followed by the arguments of the relation, each in the form name:entityid, where the entity identifier corresponds to an identifier in the .a1 file. Kindred supports relations with two or more arguments.

The identifiers for an entity annotation (in the .a1 file) must start with a T. The T stands for trigger. The identifiers for a relation annotation (in the .a2 file) must start with an E or R; for Kindred, these are synonymous. Note that Kindred doesn’t support “complex” relations, where one of the arguments is itself another relation. All relations must be between entities.
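
To make the column layout concrete, here is a small standalone sketch (plain Python, not part of Kindred's API) that parses the example .a1 file above into its pieces. Note that the three columns are tab-delimited, even though the example above is displayed with aligned spaces.

# Illustration only: parse an .a1 entity annotation file.
# Columns are tab-delimited: identifier, "type start end", and the entity text.
with open('example.a1') as f:
    for line in f:
        identifier, typeAndOffsets, entityText = line.rstrip('\n').split('\t')
        entityType, start, end = typeAndOffsets.split()
        print(identifier, entityType, int(start), int(end), entityText)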

The following code would load these files to create a kindred.Corpus with a single document.

corpus = kindred.load('standoff','example.txt')

Perhaps more usefully, to load a whole corpus with multiple files in this format, use the following code, assuming that the files are in the example directory. This will create a kindred.Corpus object.

corpus = kindred.load('standoff','example')

JSON format

This format, used by PubAnnotation and PubTator, stores the text and annotation data together in a single file. Furthermore, multiple documents can be stored in a single file.

The format is standard JSON and is either a dictionary (for a single document) or a list of dictionaries (for multiple documents). Each dictionary needs to have three fields: text, denotations, and relations. The text is the text of the document. The denotations are the entity annotations and provide the unique identifier, entity type and location (span) in the text. The relations are the relation annotations.

Example file: example.json

{
  "text": "The colorectal cancer was caused by mutations in APC",
  "denotations":
    [{"id":"T1", "obj":"disease",
      "span":{"begin":4,"end":21}},
     {"id":"T2", "obj":"gene",
      "span":{"begin":49,"end":52}}],
  "relations":
    [{"id":"R1","pred":"causes",
      "subj":"T2", "obj":"T1"}]
}
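
As an illustration of the structure (plain Python, not Kindred's API), the same single-document file could be constructed and written out with the standard json module:

import json

# Illustration only: build the document described above and write it to disk.
doc = {
    "text": "The colorectal cancer was caused by mutations in APC",
    "denotations": [
        {"id": "T1", "obj": "disease", "span": {"begin": 4, "end": 21}},
        {"id": "T2", "obj": "gene", "span": {"begin": 49, "end": 52}},
    ],
    "relations": [
        {"id": "R1", "pred": "causes", "subj": "T2", "obj": "T1"},
    ],
}

with open('example.json', 'w') as f:
    json.dump(doc, f, indent=2)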

To load a whole corpus with multiple files in this format, use the following code, assuming that the files are in the example directory. This will create a kindred.Corpus object.

corpus = kindred.load('json','example')

BioC XML format

The BioC XML format contains text and annotations together in a single file. Furthermore, it is designed to store more than one document. It stores each document as a “document” within a larger “collection”. Each document contains passages (e.g. sections of a paper), which in turn contain the text, entity annotations and relations. When loading this format, Kindred turns each passage into a single kindred.Document. An example of the format is outlined below.

<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE collection SYSTEM 'BioC.dtd'>
<collection>
  <source></source>
  <date></date>
  <key></key>
  <document>
    <id></id>
    <passage>
      <offset>0</offset>
      <text>The colorectal cancer was caused by mutations in APC</text>
      <annotation id="T1">
        <infon key="type">disease</infon>
        <location offset="4" length="17"/>
        <text>colorectal cancer</text>
      </annotation>
      <annotation id="T2">
        <infon key="type">gene</infon>
        <location offset="49" length="3"/>
        <text>APC</text>
      </annotation>
      <relation id="R1">
        <infon key="type">causes</infon>
        <node refid="T2" role="subj"/>
        <node refid="T1" role="obj"/>
      </relation>
    </passage>
  </document>
</collection>
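
If you just want to peek at the annotations in such a file without Kindred, the sketch below walks the collection/document/passage structure with the standard library. It is an illustration only, and the file name example.bioc.xml is an assumption.

import xml.etree.ElementTree as ET

# Illustration only: list the text, entities and relations in each passage.
tree = ET.parse('example.bioc.xml')
for passage in tree.getroot().iter('passage'):
    print(passage.findtext('text'))
    for annotation in passage.iter('annotation'):
        location = annotation.find('location')
        print(annotation.get('id'),
              annotation.findtext('infon'),   # the entity type
              location.get('offset'),
              location.get('length'),
              annotation.findtext('text'))
    for relation in passage.iter('relation'):
        roles = {node.get('role'): node.get('refid') for node in relation.iter('node')}
        print(relation.get('id'), relation.findtext('infon'), roles)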

To load a whole directory of BioC XML files, use the code below. This will create a single kindred.Corpus object, with each passage found in the XML files in the directory turned into a kindred.Document.

corpus = kindred.load('biocxml','example')

Simple Tag format

This format is not designed for production use but for illustration and testing purposes. It is Kindred-specific. It is an XML-based format that keeps all annotations inline, to make it easier to see which entities are annotated. A relation tag provides a relation annotation and must have a type attribute. All other attributes are assumed to be relation arguments. Any non-relation tag is assumed to be an entity annotation and must wrap around text. It must also have an id attribute.

Example file: example.simple

The <disease id="T1">colorectal cancer</disease> was caused by mutations in <gene id="T2">APC</gene>
<relation type="causes" subj="T2" obj="T1" />
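
Since the format has no single root element, a quick way to see how the inline tags map to entities and relations (outside of Kindred) is to wrap the snippet in a dummy root and walk it with the standard library. This is an illustration only, not how Kindred itself parses the format:

import xml.etree.ElementTree as ET

simpleTagText = '''The <disease id="T1">colorectal cancer</disease> was caused by mutations in <gene id="T2">APC</gene>
<relation type="causes" subj="T2" obj="T1" />'''

# Illustration only: wrap in a dummy root so the snippet is well-formed XML.
root = ET.fromstring('<doc>%s</doc>' % simpleTagText)
for element in root:
    if element.tag == 'relation':
        # a relation tag: the type attribute plus the relation arguments
        print('relation:', dict(element.attrib))
    else:
        # any other tag is an entity annotation wrapping its text
        print('entity:', element.tag, element.get('id'), element.text)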

It is most useful for quickly creating examples for testing. For example, the code below creates a kindred.Corpus with a single document containing the associated text and annotations.

text = '<drug id="1">Erlotinib</drug> is a common treatment for <cancer id="2">NSCLC</cancer>. <drug id="3">Aspirin</drug> is the main cause of <disease id="4">boneitis</disease>. <relation type="treats" subj="1" obj="2" />'

corpus = kindred.Corpus(text,loadFromSimpleTag=True)

If you do need to load a directory of these files (with the suffix .simple), the following command will load them into a kindred.Corpus object.

corpus = kindred.load('simpletag','example')

Streaming

Some corpora are too large to load into memory in one go. Kindred supports streaming a corpus in chunks when it is in the BioC XML format. The code below uses an iterator to load smaller kindred.Corpus objects that each contain a subset of the documents.

for corpus in kindred.iterLoad('biocxml','example.bioc.xml',corpusSizeCutoff=3):
        pass
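
For example, you could keep a running total of the documents processed across all chunks. The documents attribute on a kindred.Corpus is an assumption here (it is not shown elsewhere on this page), so treat this as a sketch:

docCount = 0
for corpus in kindred.iterLoad('biocxml','example.bioc.xml',corpusSizeCutoff=3):
        # process each smaller corpus here; this sketch just counts its documents
        docCount += len(corpus.documents)
print(docCount)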

Overview

Kindred is a Python package specifically designed for binary relation extraction from biomedical texts (e.g. PubMed abstracts). It takes a supervised learning approach, and therefore requires training data in order to build a model.

Kindred can do simple dictionary-based entity extraction. It also integrates with PubTator, to automatically pull in PubMed abstracts with a number of entity types already tagged, and with PubAnnotation, so that annotation data can be loaded easily.

Installation

Kindred is distributed through PyPI. Hence you should be able to install it with the shell command below.

pip install kindred

If you need to upgrade to a newer release, use the following shell command.

pip install --upgrade kindred

And if you want to install directly from source, use this shell command.

python setup.py install

Once it is installed, Kindred can be imported in Python with:

>>> import kindred

Installing a Spacy language model

As of v2, Kindred uses the Spacy Python package for parsing. A language model needs to be installed for the corresponding language, using a command similar to the one below.

python -m spacy download en_core_web_sm

Tutorial with a mini annotation problem

There is a tutorial with sample code that steps through a small annotation task for extracting capital cities from text. It’s on GitHub and may give you an understanding of the annotations that Kindred needs and how you might go about getting them. Once you’ve understood the input data, you might want to dive deeper into the code, and the examples below will give you some ideas.

Getting started with code

Let’s walk through a basic example for the BioNLP Shared Task. This will involve loading a corpus of data to train a classifier and another corpus to make predictions on and evaluate against. We will then train the classifier, make the predictions and evaluate how we did. The smaller steps (parsing, candidate building & vectorizing) are done behind the scenes.

First, we need to load the data. We want the training and development corpora, which we load with the commands below.

>>> trainCorpus = kindred.bionlpst.load('2016-SeeDev-binary-train')
>>> devCorpus = kindred.bionlpst.load('2016-SeeDev-binary-dev')

We’re going to build a model for the relations in the training corpus and make predictions on the development corpus. We are going to keep the devCorpus object to make comparisons against, but we need a copy of it that doesn’t have any relations attached. Hence we will clone it and remove the relations. The clone will contain all the same text and entity annotations as devCorpus, but no relations.

>>> predictionCorpus = devCorpus.clone()
>>> predictionCorpus.removeRelations()

Now we’re going to build the model on the training data with default settings.

>>> classifier = kindred.RelationClassifier()
>>> classifier.train(trainCorpus)

Now we will use this classifier to predict relations in the predictionCorpus object. These new relations will be added to the corpus.

>>> classifier.predict(predictionCorpus)

Lastly, we will evaluate how well we have done. The common measure is F1-score.

>>> f1score = kindred.evaluate(devCorpus, predictionCorpus, metric='f1score')

Specific Examples

Here we will show some of the individual steps that might be needed.

Loading data from files

To load a corpus from a directory, you can use the load function, providing the format of the data.

>>> corpus = kindred.load('biocxml','/home/user/data/')

If the data is in another format, you change the dataFormat parameter. Options include: ‘standoff’ for the standoff format used in the BioNLP Shared Tasks, ‘biocxml’ for BioC XML files, ‘json’ for the JSON format used by PubAnnotation, and ‘simpletag’ for a set of SimpleTag XML files. Note that we only use SimpleTag for generating easy test data and not for any large problems.

Loading data from online resources

Kindred integrates with several online resources to make it easy to import data. For BioNLP Shared Tasks, you can use the command below:

>>> corpus = kindred.bionlpst.load('2016-SeeDev-binary-train')

You can currently import data from the ‘2016-SeeDev-binary’ shared task, as the files for ‘2016-BB3-event’ are no longer available. Append ‘-train’, ‘-dev’ or ‘-test’ to the task name to select a corpus. The ‘train’ and ‘dev’ corpora contain relations, while the ‘test’ corpus does not.

You can import PubMed abstracts annotated by PubTator with a list of PubMed IDs (or PMIDs for short). These will contain entity annotations but no relations. The command below will import the two articles with those PMIDs.

>>> corpus = kindred.pubtator.load([19894120,19894121])

You can also import text and annotation data from PubAnnotation. In this case, you provide the project name and Kindred will download all the annotations and associated text. For the ‘bionlp-st-gro-2013-development’ project, the command to import is below. These annotations may include relation information.

>>> corpus = kindred.pubannotation.load('bionlp-st-gro-2013-development')

Parsing

If you want to parse a corpus, you use a Parser object.

>>> parser = kindred.Parser()
>>> parser.parse(corpus)

Candidate Building

Given a corpus with annotated entities, one may want to generate the set of all candidate relations between two entities within the same sentence. This can be done with the commands below. Each Sentence object within the corpus will then have a set of candidate relations attached to it.

>>> candidateBuilder = kindred.CandidateBuilder()
>>> candidateBuilder.fit_transform(corpus)

You can easily extract all the candidate relations using the command below:

>>> candidateRelations = corpus.getCandidateRelations()

The corpus also keeps a list of the relation types contained within it.

>>> print(corpus.relationTypes)

If the corpus contains annotated relations, candidate relations that match an annotated relation will be assigned a non-zero class index. Hence a candidate relation with class 0 has not been annotated, while a candidate relation with class 1 is of the first relation type in corpus.relationTypes.
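
Putting the steps above together, the short sketch below parses a corpus (in case it has not been parsed already), builds the candidate relations and prints a summary. It only uses the calls shown on this page:

>>> parser = kindred.Parser()
>>> parser.parse(corpus)
>>> candidateBuilder = kindred.CandidateBuilder()
>>> candidateBuilder.fit_transform(corpus)
>>> candidateRelations = corpus.getCandidateRelations()
>>> print(len(candidateRelations))
>>> print(corpus.relationTypes)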

Vectorizing

You may want to generate vectors for each candidate relation. The command below will produce the vectorized matrix with the default set of feature types.

>>> vectorizer = kindred.Vectorizer()
>>> trainMatrix = vectorizer.fit_transform(trainCorpus)

Once you’ve fit the vectorizer to the training set, remember to only use transform for the test set.

>>> testMatrix = vectorizer.transform(testCorpus)

Want to use only specific feature types (of which the options are: entityTypes, unigramsBetweenEntities, bigrams, dependencyPathEdges, dependencyPathEdgesNearEntities)? Use a command like the one below:

>>> vectorizer = kindred.Vectorizer(featureChoice=['entityTypes','bigrams'])
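
As a quick sanity check, the returned matrices are scipy sparse matrices, so you can inspect their dimensions (rows are candidate relations, columns are features). The sketch below simply combines the calls above with a chosen set of feature types:

>>> vectorizer = kindred.Vectorizer(featureChoice=['entityTypes','bigrams'])
>>> trainMatrix = vectorizer.fit_transform(trainCorpus)
>>> testMatrix = vectorizer.transform(testCorpus)
>>> print(trainMatrix.shape)
>>> print(testMatrix.shape)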

Frequently Asked Questions

Does Kindred handle multiple relations that contain the same entities?

At the moment, no. Kindred will only use the first annotation of a relation.

Release Notes

Version 2.8.0

  • Updates for newer version of bioc library
  • Dealing with BioNLP 2016 task files that are unavailable

Version 2.7.0

  • Added support to save to PubAnnotation format

Version 2.6.0

  • Release v2.6.1 is the final Python2 compatible version
  • Added option for metadata associated with entities
  • Added option to load any Spacy model for parsing

Version 2.5.0

  • Added MultiLabelClassifier and changed behaviour when multiple relation types are present. They are now predicted independently using separate classifiers. This allows overlapping relations (where the same entities are part of multiple relations).

Version 2.4.0

  • Updates to the loading and saving functionality so that everything is done through kindred.load or kindred.save
  • Changed EntityRecognizer logic to use token boundaries and exact string matching instead of matching tokenization (for faster wordlist loading)

Version 2.3.0

  • Add manuallyAnnotate for a simple mechanism to annotate candidate relations
  • Add splitIntoSentences for a parsed corpus/document

Version 2.2.0

  • Add CandidateRelation class to distinguish from Relation
  • Reworking of API so that Candidate Relations are no longer stored in corpus. Changes across API that will break backwards compatibility
  • Fixes to PubTator input

Version 2.1.0

  • Added EntityRecognizer for basic entity extraction
  • Relations can now be n-ary, not just binary

Version 2.0.0

  • Large overhaul to replace CoreNLP with Spacy package for easier integration and installation
  • Simplified relation classifier functionality by removing feature building and multiclassifier options
  • Add functionality for streaming BioC files

Version 1.1.0

  • Upgraded to new version of Stanford CoreNLP (3.8.0) and added code to manage upgrade
  • Changed dependency parsing to use standard CoreNLP dep parser (instead of constituency with a conversion).
  • Changed evaluation function to not output specific details by default
  • You can now parse with every language in CoreNLP (arabic,chinese,english,french,german,spanish)
  • Improved error display for CoreNLP failures

Version 1.0.0

  • Original release (corresponding to original paper)

Citing

If your work makes use of Kindred, it’d be really nice if you cited us.

@article{lever2017painless,
         title={Painless {R}elation {E}xtraction with {K}indred},
         author={Lever, Jake and Jones, Steven JM},
         journal={Bio{NLP} 2017},
         pages={176},
         year={2017}
         }

Reference

Main components

EntityRecognizer Annotates entities in a Corpus using an exact-dictionary matching scheme with additional heuristics.
CandidateBuilder Generates the set of all possible relations in a corpus.
Parser Runs Spacy on a corpus to get sentences and associated tokens.
RelationClassifier Manages binary classifier(s) for relation classification.
Vectorizer Vectorizes a set of candidate relations into a scipy sparse matrix.

Data types

CandidateRelation Describes a candidate relation between entities (i.e. a possible relation that has not necessarily been annotated or predicted).
Corpus Collection of text documents.
Document Span of text with associated tagged entities and relations between entities.
Entity Biomedical entity with information about its location in the text.
Relation Describes a relationship between entities (including the relation type and argument names if applicable).
Sentence Set of tokens for a sentence after parsing.
Token Individual word with lemma, part-of-speech and location in text.

Machine Learning Components

LogisticRegressionWithThreshold A modified Logistic Regression classifier that will filter calls by a custom threshold, instead of the default 0.5.
MultiLabelClassifier Wrapper for a set of classifiers that can behave as a multi-label classifier.

Data sources

bionlpst Importer for BioNLP Shared Task data
pubannotation Importer for PubAnnotation data
pubtator Importer for PubTator data

Essential functions

load Load a corpus from a variety of formats.
iterLoad Iteratively load sections of a (presumably large) corpus.
save Save a corpus to a directory.
evaluate Compares the gold corpus with the test corpus and calculates appropriate metrics.
manuallyAnnotate Provides a method for basic manual annotation of a series of candidate relations.