kindred.Corpus
class kindred.Corpus(text=None, loadFromSimpleTag=False)
Collection of text documents.
Variables:
- documents – List of kindred.Document
- parsed – Boolean indicating whether the corpus has been parsed yet. A kindred.parser can parse it.
Methods
- __init__(text=None, loadFromSimpleTag=False)
Create an empty corpus with no documents, or quickly load one with a single document, optionally given in the SimpleTag format.
Parameters:
- text (String (with SimpleTag format XML)) – Optional SimpleTag text to initialize a single document
- loadFromSimpleTag (bool) – If text is provided, whether the text parameter is in the SimpleTag format, in which case entities and relations are extracted from it accordingly
- addDocument(doc)
Add a single document to the corpus.
Parameters: doc (kindred.Document) – Document to add
- clone()
Clone the corpus.
Returns: Clone of the corpus
Return type: kindred.Corpus
- nfold_split(folds)
Generator that splits the corpus repeatedly for n-fold cross-validation. Each iteration yields the training and test set for that fold.
Parameters: folds (int) – Number of folds to create
Returns: Tuple of training and test corpus (one tuple per fold)
Return type: (kindred.Corpus, kindred.Corpus)
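The idea behind an n-fold split can be sketched in plain Python. This is only an illustration on a list of stand-in documents, not kindred's own implementation: the function name nfold_split_sketch, the seed parameter, and the round-robin assignment of documents to folds are all assumptions made for the example.

```python
import random

def nfold_split_sketch(documents, folds, seed=0):
    """Yield one (train, test) pair of lists per fold.

    Hypothetical helper for illustration only; kindred may shuffle
    and chunk documents differently.
    """
    if folds < 2 or folds > len(documents):
        raise ValueError("folds must be between 2 and the number of documents")

    # Shuffle the indices once so every fold is drawn from the same ordering
    indices = list(range(len(documents)))
    random.Random(seed).shuffle(indices)

    for fold in range(folds):
        # Every folds-th shuffled index, starting at this fold, is held out
        test_idx = set(indices[fold::folds])
        train = [documents[i] for i in range(len(documents)) if i not in test_idx]
        test = [documents[i] for i in sorted(test_idx)]
        yield train, test

docs = ["doc%d" % i for i in range(10)]
splits = list(nfold_split_sketch(docs, folds=5))
```

Each document lands in exactly one test set across the folds, which is the property cross-validation relies on.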
- split(trainFraction)
Randomly split the corpus into two corpora for use as a training and test set.
Parameters: trainFraction (float) – Fraction of documents to use in training set
Returns: Tuple of training and test corpus
Return type: (kindred.Corpus, kindred.Corpus)
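A random split by fraction can be sketched as follows. This is a minimal illustration on a plain list, not kindred's actual code: split_sketch, the seed parameter, and the rounding of the cut point are assumptions chosen for the example.

```python
import random

def split_sketch(documents, train_fraction, seed=0):
    """Return (train, test) lists from a shuffled copy of documents.

    Hypothetical helper for illustration only; kindred's rounding
    and randomisation may differ.
    """
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")

    # Shuffle a copy so the caller's list is left untouched
    shuffled = list(documents)
    random.Random(seed).shuffle(shuffled)

    # The first part becomes the training set, the rest the test set
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

docs = ["doc%d" % i for i in range(10)]
train_docs, test_docs = split_sketch(docs, 0.8)
```

Fixing the seed makes the split reproducible, which helps when comparing models on the same held-out documents.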
- splitIntoSentences()
Create a new corpus with one document for each sentence in this corpus.
Returns: Corpus with one document per sentence
Return type: kindred.Corpus