Welcome to kindred documentation!¶
File Formats¶
Kindred can load several different file formats that contain text and associated annotations. Below are examples of the different file formats with code for loading them.
JSON format¶
This format, used by PubAnnotation and PubTator, stores the text and annotation data together in a single file. Furthermore, multiple documents can be stored in a single file.
The format is standard JSON and is either a dictionary (for a single document) or a list of dictionaries (for multiple documents). Each dictionary needs to have three fields: text, denotations, and relations. The text is the text of the document. The denotations are the entity annotations and provide the unique identifier, entity type and location (span) in the text. The relations are the relation annotations.
Example file: example.json
{
"text": "The colorectal cancer was caused by mutations in APC",
"denotations":
[{"id":"T1", "obj":"disease",
"span":{"begin":4,"end":21}},
{"id":"T2", "obj":"gene",
"span":{"begin":49,"end":52}}],
"relations":
[{"id":"R1","pred":"causes",
"subj":"T2", "obj":"T1"}]
}
To load a whole corpus of multiple files in this format, use the following code, assuming that the files are in the example directory. This will create a kindred.Corpus object.
corpus = kindred.load('json','example')
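To see how the span offsets map onto the text, you can inspect the example above using nothing but the standard library (this uses plain json, not Kindred):

```python
import json

# The example document from above, as a JSON string.
doc = json.loads("""
{
 "text": "The colorectal cancer was caused by mutations in APC",
 "denotations":
   [{"id":"T1", "obj":"disease", "span":{"begin":4,"end":21}},
    {"id":"T2", "obj":"gene", "span":{"begin":49,"end":52}}],
 "relations":
   [{"id":"R1","pred":"causes", "subj":"T2", "obj":"T1"}]
}
""")

# Spans are zero-based character offsets; text[begin:end] recovers the mention.
for d in doc["denotations"]:
    span = d["span"]
    print(d["id"], repr(doc["text"][span["begin"]:span["end"]]))
```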
BioC XML format¶
The BioC XML format contains text and annotations together in a single file. Furthermore, it is designed to store more than one document. It stores each document as a “document” within a larger “collection”. Each document contains passages (e.g. sections of a paper) which then contain the text, entity annotations, and relations. In loading this, each passage is turned into a single kindred.Document. An example of the format is outlined below.
<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE collection SYSTEM 'BioC.dtd'>
<collection>
<source></source>
<date></date>
<key></key>
<document>
<id></id>
<passage>
<offset>0</offset>
<text>The colorectal cancer was caused by mutations in APC</text>
<annotation id="T1">
<infon key="type">disease</infon>
<location offset="4" length="17"/>
<text>colorectal cancer</text>
</annotation>
<annotation id="T2">
<infon key="type">gene</infon>
<location offset="49" length="3"/>
<text>APC</text>
</annotation>
<relation id="R1">
<infon key="type">causes</infon>
<node refid="T2" role="subj"/>
<node refid="T1" role="obj"/>
</relation>
</passage>
</document>
</collection>
To load a whole directory of BioC XML files, use the code below. This will create a single kindred.Corpus object, with each passage found in all XML files in the directory turned into a kindred.Document object.
corpus = kindred.load('biocxml','example')
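Independent of Kindred, the offset/length convention of BioC annotations can be checked with the standard library's XML parser. This is a sketch using the example above; note that BioC offsets are relative to the start of the document, which is 0 for this passage:

```python
import xml.etree.ElementTree as ET

bioc_xml = """<collection><source/><date/><key/>
<document><id/><passage>
  <offset>0</offset>
  <text>The colorectal cancer was caused by mutations in APC</text>
  <annotation id="T1"><infon key="type">disease</infon>
    <location offset="4" length="17"/><text>colorectal cancer</text></annotation>
  <annotation id="T2"><infon key="type">gene</infon>
    <location offset="49" length="3"/><text>APC</text></annotation>
</passage></document></collection>"""

passage = ET.fromstring(bioc_xml).find("document/passage")
text = passage.find("text").text
mentions = {}
for ann in passage.findall("annotation"):
    loc = ann.find("location")
    start = int(loc.get("offset"))
    end = start + int(loc.get("length"))
    # text[offset:offset+length] recovers the annotated mention.
    mentions[ann.get("id")] = text[start:end]
print(mentions)
```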
Simple Tag format¶
This format is not designed for production use but for illustration and testing purposes. It is Kindred-specific. It is an XML-based format that keeps all annotations inline, making it easier to see which entities are annotated. A relation tag provides a relation annotation and must have a type attribute. All other attributes are assumed to be relation arguments. Any non-relation tag is assumed to be an entity annotation and must wrap around text. It must also have an id attribute.
Example file: example.simple
The <disease id="T1">colorectal cancer</disease> was caused by mutations in <gene id="T2">APC</gene>
<relation type="causes" subj="T2" obj="T1" />
It is most useful for quickly creating examples for testing. For example, the code below creates a kindred.Corpus with a single document containing the associated text and annotations.
text = '<drug id="1">Erlotinib</drug> is a common treatment for <cancer id="2">NSCLC</cancer>. <drug id="3">Aspirin</drug> is the main cause of <disease id="4">boneitis</disease>. <relation type="treats" subj="1" obj="2" />'
corpus = kindred.Corpus(text,loadFromSimpleTag=True)
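Since SimpleTag is inline XML, the example can also be inspected directly with the standard library by wrapping it in a dummy root element. This is an illustration of the format itself, not of how Kindred parses it:

```python
import xml.etree.ElementTree as ET

text = ('<drug id="1">Erlotinib</drug> is a common treatment for '
        '<cancer id="2">NSCLC</cancer>. <drug id="3">Aspirin</drug> is the '
        'main cause of <disease id="4">boneitis</disease>. '
        '<relation type="treats" subj="1" obj="2" />')

# Wrap in a dummy root so the fragment parses as well-formed XML.
root = ET.fromstring('<doc>' + text + '</doc>')

# Non-relation tags are entity annotations wrapping their mention text;
# relation tags carry the type plus argument attributes.
entities = [(el.tag, el.get('id'), el.text) for el in root if el.tag != 'relation']
relations = [dict(el.attrib) for el in root if el.tag == 'relation']
```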
If you do need to load a directory of these files (with suffix: .simple), the following command will load them into a kindred.Corpus object.
corpus = kindred.load('simpletag','example')
Streaming¶
Some corpora are too large to load into memory in a single go. Kindred supports streaming a BioC-format corpus in chunks. The code below uses an iterator to load smaller kindred.Corpus objects, each containing a subset of the documents.
for corpus in kindred.iterLoad('example.bioc.xml',corpusSizeCutoff=3):
pass
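Each iteration yields a corpus with at most corpusSizeCutoff documents. The grouping behaviour can be pictured with a simple chunking sketch; this is only an illustration of the idea, not Kindred's internal implementation:

```python
def iter_chunks(documents, size):
    """Yield successive lists of at most `size` documents, mimicking how
    iterLoad groups documents into smaller corpora."""
    for i in range(0, len(documents), size):
        yield documents[i:i + size]

# Seven documents with a cutoff of three give chunks of 3, 3 and 1.
chunks = list(iter_chunks(list(range(7)), 3))
```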
Overview¶
Kindred is a Python package specifically designed for binary relation extraction from biomedical texts (e.g. PubMed abstracts). It takes a supervised learning approach, and therefore requires training data in order to build a model.
Kindred can do simple dictionary-based entity extraction. It also integrates with PubTator, to automatically pull in PubMed abstracts with a number of entities tagged, and with PubAnnotation, from which it can easily load annotation data.
Installation¶
Kindred is distributed through PyPI. Hence you should be able to install it with the shell command below.
pip install kindred
If you need to upgrade to a newer release, use the following shell command.
pip install --upgrade kindred
And if you want to install directly from source, use this shell command.
python setup.py install
Once it is installed, Kindred can be imported in Python with:
>>> import kindred
Installing a Spacy language model¶
As of v2, Kindred uses the Spacy Python package for parsing. A language model needs to be installed for the corresponding language, using a command similar to the one below.
python -m spacy download en_core_web_sm
Tutorial with a mini annotation problem¶
There is a tutorial with sample code that steps through a small annotation task for extracting capital cities from text. It’s on Github and may give you an understanding of the annotations that Kindred needs and how you might go about getting them. Once you’ve understood the input data, you might want to dive more into the code and the below examples will give you some ideas.
Getting started with code¶
Let’s walk through a basic example for the BioNLP Shared Task. This will involve loading a corpus of data to train a classifier and a corpus to make predictions on and for evaluation. We will then train the classifier, make the predictions and evaluate how we did. The smaller steps (parsing, candidate building & vectorizing) are done behind the scenes.
First, we need to load the data. We want the training and development corpora, which we load with the commands below.
>>> trainCorpus = kindred.bionlpst.load('2016-SeeDev-binary-train')
>>> devCorpus = kindred.bionlpst.load('2016-SeeDev-binary-dev')
We’re going to build a model for the relations in the training corpus and make predictions on the development corpus. We are going to keep the devCorpus object to make comparisons against, but need a copy of it that doesn’t have any relations attached to it. Hence we will clone it and remove the relations. This will contain all the same text and entity annotations as the devCorpus, but no relations.
>>> predictionCorpus = devCorpus.clone()
>>> predictionCorpus.removeRelations()
Now we’re going to build the model on the training data with default settings.
>>> classifier = kindred.RelationClassifier()
>>> classifier.train(trainCorpus)
Now we will use this classifier to predict relations in the predictionCorpus object. These new relations will be added to the corpus.
>>> classifier.predict(predictionCorpus)
Lastly, we will evaluate how well we have done. The common measure is F1-score.
>>> f1score = kindred.evaluate(devCorpus, predictionCorpus, metric='f1score')
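For reference, the F1-score is the harmonic mean of precision and recall over the predicted relations. A minimal sketch of the metric follows; this illustrates the calculation, not kindred.evaluate's implementation:

```python
def f1_score(gold, predicted):
    """F1 over two sets of relations, each represented as a hashable tuple."""
    tp = len(gold & predicted)  # true positives: relations in both sets
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Hypothetical relation tuples of (type, subject ID, object ID).
gold = {("causes", "T2", "T1"), ("treats", "T3", "T4")}
pred = {("causes", "T2", "T1"), ("causes", "T5", "T6")}
print(f1_score(gold, pred))  # precision 0.5, recall 0.5 -> F1 0.5
```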
Specific Examples¶
Here we will show some of the individual steps that might be needed.
Loading data from files¶
To load a corpus from a directory, you can use the load function, providing the format of the data.
>>> corpus = kindred.load('biocxml','/home/user/data/')
If it is in another format, change the dataFormat parameter. Options include: ‘json’ for the JSON format used by PubAnnotation and PubTator, ‘standoff’ for the standoff format used in the BioNLP Shared Tasks, ‘biocxml’ for BioC XML files, and ‘simpletag’ for a set of SimpleTag XML files. Note that SimpleTag is only used for generating easy test data and not for any large problems.
Loading data from online resources¶
Kindred integrates with several online resources to make it easy to import data. For BioNLP Shared Tasks, you can use the command below:
>>> corpus = kindred.bionlpst.load('2016-SeeDev-binary-train')
You can currently import data from the ‘2016-SeeDev-binary’ shared task, as the files for ‘2016-BB3-event’ are no longer available. Append ‘-train’, ‘-dev’ or ‘-test’ to the name. The ‘train’ and ‘dev’ corpora contain relations while the ‘test’ corpus does not.
You can import PubMed abstracts annotated by Pubtator with a list of PubMed IDs (or PMIDs for short). These will contain entity annotations but no relations. The command below will import the two articles with those PMIDs.
>>> corpus = kindred.pubtator.load([19894120,19894121])
You can also import text and annotation data from PubAnnotation. In this case, you provide the project name and Kindred will download all the annotations and associated text. For the ‘bionlp-st-gro-2013-development’ project, the command to import is below. These annotations may include relation information.
>>> corpus = kindred.pubannotation.load('bionlp-st-gro-2013-development')
Parsing¶
If you want to parse a corpus, you use a Parser object.
>>> parser = kindred.Parser()
>>> parser.parse(corpus)
Candidate Building¶
Given a corpus with annotated entities, you may want to generate the set of all candidate relations between two entities within the same sentence. You can do this with the commands below. Each Sentence object within the corpus will then have a set of candidate relations attached to it.
>>> candidateBuilder = kindred.CandidateBuilder()
>>> candidateBuilder.fit_transform(corpus)
You can easily extract all the candidate relations using the command below:
>>> candidateRelations = corpus.getCandidateRelations()
The corpus contains a list of relation types contained within.
>>> print(corpus.relationTypes)
And if the corpus contains annotated relations, any candidate relation that matches an annotated relation will be assigned a non-zero class index. Hence a candidate relation with class 0 has not been annotated, while a candidate relation with class 1 is of the first relation type in corpus.relationTypes.
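The mapping between class indices and relation types can be pictured as follows, where relationTypes is a hypothetical example value of corpus.relationTypes:

```python
relationTypes = ["causes", "treats"]  # hypothetical corpus.relationTypes

def relation_type_for_class(class_index):
    """Class 0 means the candidate is not an annotated relation;
    class i (i >= 1) maps to relationTypes[i - 1]."""
    if class_index == 0:
        return None
    return relationTypes[class_index - 1]
```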
Vectorizing¶
You may want to generate vectors for each candidate relation. The command below will produce the vectorized matrix with the default set of feature types.
>>> vectorizer = kindred.Vectorizer()
>>> trainMatrix = vectorizer.fit_transform(trainCorpus)
Once you’ve fit the vectorizer to the training set, remember to only use transform for the test set.
>>> testMatrix = vectorizer.transform(testCorpus)
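The reason to only call transform on the test set is that the columns of the matrix are fixed by the vocabulary learned during fitting. A toy sketch of that fit/transform contract, not Kindred's actual Vectorizer:

```python
class ToyVectorizer:
    """Minimal bag-of-words vectorizer illustrating why the test set
    must reuse the vocabulary fitted on the training set."""

    def fit_transform(self, docs):
        # Learn the vocabulary from the training documents, then vectorize.
        self.vocab = sorted({w for doc in docs for w in doc.split()})
        return self.transform(docs)

    def transform(self, docs):
        # Unseen words are dropped, keeping the column space fixed.
        return [[doc.split().count(w) for w in self.vocab] for doc in docs]

v = ToyVectorizer()
train = v.fit_transform(["gene causes disease", "drug treats disease"])
test = v.transform(["drug causes disease unknownword"])  # "unknownword" ignored
```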
Want to use only specific feature types (of which the options are: entityTypes, unigramsBetweenEntities, bigrams, dependencyPathEdges, dependencyPathEdgesNearEntities)? Use a command like below:
>>> vectorizer = kindred.Vectorizer(featureChoice=['entityTypes','bigrams'])
Frequently Asked Questions¶
Does Kindred handle multiple relations that contain the same entities?
At the moment, no. Kindred will only use the first annotation of a relation.
Release Notes¶
Version 2.8.0¶
- Updates for newer version of bioc library
- Dealing with BioNLP 2016 task files that are unavailable
Version 2.7.0¶
- Added support to save to PubAnnotation format
Version 2.6.0¶
- Release v2.6.1 is the final Python2 compatible version
- Added option for metadata associated with entities
- Added option to load any Spacy model for parsing
Version 2.5.0¶
- Added MultiLabelClassifier and changed behaviour when multiple relation types are present. They are now predicted independently using separate classifiers. This allows overlapping relations (where the same entities are part of multiple relations).
Version 2.4.0¶
- Updates to the loading and saving functionality so that everything is done through kindred.load or kindred.save
- Changed EntityRecognizer logic to use token boundaries and exact string matching instead of matching tokenization (for faster wordlist loading)
Version 2.3.0¶
- Add manuallyAnnotate for a simple mechanism to annotate candidate relations
- Add splitIntoSentences for a parsed corpus/document
Version 2.2.0¶
- Add CandidateRelation class to distinguish from Relation
- Reworking of API so that Candidate Relations are no longer stored in corpus. Changes across API that will break backwards compatibility
- Fixes to PubTator input
Version 2.1.0¶
- Added EntityRecognizer for basic entity extraction
- Relations can now be n-ary, not just binary
Version 2.0.0¶
- Large overhaul to replace CoreNLP with Spacy package for easier integration and installation
- Simplified relation classifier functionality by removing feature building and multiclassifier options
- Add functionality for streaming BioC files
Version 1.1.0¶
- Upgraded to new version of Stanford CoreNLP (3.8.0) and added code to manage upgrade
- Changed dependency parsing to use standard CoreNLP dep parser (instead of constituency with a conversion).
- Changed evaluation function to not output specific details by default
- You can now parse with every language in CoreNLP (arabic,chinese,english,french,german,spanish)
- Improved error display for CoreNLP failures
Version 1.0.0¶
- Original release (corresponding to original paper)
Citing¶
If your work makes use of Kindred, it’d be really nice if you cited us.
@article{lever2017painless,
title={Painless {R}elation {E}xtraction with {K}indred},
author={Lever, Jake and Jones, Steven JM},
journal={Bio{NLP} 2017},
pages={176},
year={2017}
}
Reference¶
Main components¶
EntityRecognizer: Annotates entities in a Corpus using an exact-dictionary matching scheme with additional heuristics.
CandidateBuilder: Generates the set of all possible relations in a corpus.
Parser: Runs Spacy on a corpus to get sentences and associated tokens.
RelationClassifier: Manages binary classifier(s) for relation classification.
Vectorizer: Vectorizes a set of candidate relations into a scipy sparse matrix.
Data types¶
CandidateRelation: Describes a candidate relation between entities.
Corpus: Collection of text documents.
Document: Span of text with associated tagged entities and relations between entities.
Entity: Biomedical entity with information on its location in the text.
Relation: Describes a relationship between entities (including relation type and argument names if applicable).
Sentence: Set of tokens for a sentence after parsing.
Token: Individual word with lemma, part-of-speech and location in text.
Machine Learning Components¶
LogisticRegressionWithThreshold: A modified logistic regression classifier that filters calls by a custom threshold instead of the default 0.5.
MultiLabelClassifier: Wrapper for a set of classifiers that can behave as a multi-label classifier.
Data sources¶
bionlpst: Importer for BioNLP Shared Task data.
pubannotation: Importer for PubAnnotation data.
pubtator: Importer for PubTator data.
Essential functions¶
load: Load a corpus from a variety of formats.
iterLoad: Iteratively load sections of a (presumably large) corpus.
save: Save a corpus to a directory.
evaluate: Compares the gold corpus with the test corpus and calculates appropriate metrics.
manuallyAnnotate: Provides a method for basic manual annotation of a series of candidate relations.