kindred.Corpus

class kindred.Corpus(text=None, loadFromSimpleTag=False)[source]

Collection of text documents.

Variables:
  • documents – List of kindred.Document
  • parsed – Boolean indicating whether the corpus has been parsed yet. A kindred.parser can parse it.

Methods

__init__(text=None, loadFromSimpleTag=False)[source]

Create an empty corpus with no documents, or quickly load one containing a single document built from optional SimpleTag text.

Parameters:
  • text (String (with SimpleTag format XML)) – Optional SimpleTag text to initialize a single document
  • loadFromSimpleTag (bool) – If text is provided, whether it is in the SimpleTag format; if so, entities and relations are extracted from it accordingly
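
To make the idea of SimpleTag text concrete, here is a toy parser written with the standard library only. It is not kindred's implementation. The tag layout it assumes (inline entity tags carrying an id attribute, plus a self-closing relation tag with type, subj and obj attributes) is an assumption for illustration, and parse_simpletag is a hypothetical helper name. With kindred installed, the corresponding call would presumably be kindred.Corpus(text, loadFromSimpleTag=True).

```python
import xml.etree.ElementTree as ET


def parse_simpletag(text):
    """Toy parser (hypothetical helper, not part of kindred).

    Returns (plain_text, entities, relations), where entities are
    (type, id, start, end) tuples and relations are (type, subj, obj).
    """
    # Wrap the fragment in a root element so it is well-formed XML.
    root = ET.fromstring("<doc>" + text + "</doc>")
    plain = root.text or ""
    entities, relations = [], []
    for el in root:
        if el.tag == "relation":
            # Assumed layout: <relation type="..." subj="..." obj="..." />
            relations.append((el.get("type"), el.get("subj"), el.get("obj")))
        else:
            # Any other tag is treated as an entity; record its character span.
            start = len(plain)
            plain += el.text or ""
            entities.append((el.tag, el.get("id"), start, len(plain)))
        # Text following the tag belongs to the plain document text.
        plain += el.tail or ""
    return plain, entities, relations


# Example text in the assumed SimpleTag layout.
text = ('<drug id="1">Erlotinib</drug> is a common treatment for '
        '<cancer id="2">NSCLC</cancer>. '
        '<relation type="treats" subj="1" obj="2" />')
plain, entities, relations = parse_simpletag(text)
```

The entity spans index into the tag-free text, so plain[start:end] recovers each entity's surface form.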
addDocument(doc)[source]

Add a single document to the corpus

Parameters:doc (kindred.Document) – Document to add
clone()[source]

Clone the corpus

Returns:Clone of the corpus
Return type:kindred.Corpus
getRelations()[source]

Get all relations in this corpus

Returns:List of relations
Return type:list
nfold_split(folds)[source]

Split the corpus repeatedly for n-fold cross-validation. This method is a generator: each iteration yields the training and test set for one fold.

Parameters:folds (int) – Number of folds to create
Returns:Tuple of training and test corpus, yielded once per fold (folds iterations in total)
Return type:(kindred.Corpus,kindred.Corpus)
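
The generator pattern described for nfold_split can be sketched on plain lists. This is a minimal illustration, not kindred's code: the function below works on a list of stand-in documents rather than a kindred.Corpus, and the shuffling, the seed parameter and the strided fold assignment are all assumptions.

```python
import random


def nfold_split(documents, folds, seed=0):
    """Yield (train, test) lists, one pair per fold (illustrative sketch)."""
    indices = list(range(len(documents)))
    # Assumption: documents are shuffled before being assigned to folds.
    random.Random(seed).shuffle(indices)
    for f in range(folds):
        # Every folds-th shuffled index goes into this fold's test set.
        test_idx = set(indices[f::folds])
        train = [d for i, d in enumerate(documents) if i not in test_idx]
        test = [d for i, d in enumerate(documents) if i in test_idx]
        yield train, test


# Ten stand-in documents split into five folds.
docs = ["doc%d" % i for i in range(10)]
splits = list(nfold_split(docs, folds=5))
```

In this sketch each document lands in exactly one test set across the folds, which is the property n-fold cross-validation relies on.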
removeEntities()[source]

Remove all entities in this corpus

removeRelations()[source]

Remove all relations in this corpus

split(trainFraction)[source]

Randomly split the corpus into two corpora for use as a training set and a test set

Parameters:trainFraction (float) – Fraction of documents to use in training set
Returns:Tuple of training and test corpus
Return type:(kindred.Corpus,kindred.Corpus)
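
A random train/test split by fraction can be sketched as follows. Again this is an illustration on plain lists, not kindred's implementation: the rounding rule and the seed parameter are assumptions, and the real method returns two kindred.Corpus objects rather than lists.

```python
import random


def split(documents, train_fraction, seed=None):
    """Return (train, test) lists after a random shuffle (illustrative sketch)."""
    shuffled = list(documents)  # copy so the caller's list is left untouched
    random.Random(seed).shuffle(shuffled)
    # Assumption: the training set size is the rounded fraction of documents.
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Ten stand-in documents, 80% of them used for training.
docs = ["doc%d" % i for i in range(10)]
train, test = split(docs, 0.8, seed=1)
```

Fixing the seed makes the split reproducible, which helps when comparing models across runs.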
splitIntoSentences()[source]

Create a new corpus with one document for each sentence in this corpus.

Returns:Corpus with one document per sentence
Return type:kindred.Corpus