kindred.Vectorizer

class kindred.Vectorizer(entityCount=2, featureChoice=None, tfidf=True)

Vectorizes a set of candidate relations into a SciPy sparse matrix.

Variables:

entityCount – Number of entities in candidate relations to vectorize
featureChoice – List of features to use (any subset of the following: ‘entityTypes’, ‘unigramsBetweenEntities’, ‘bigrams’, ‘dependencyPathEdges’, ‘dependencyPathEdgesNearEntities’). Set to None to use all of them.
tfidf – Whether to normalize n-gram-based features using term frequency-inverse document frequency (TF-IDF)
fitted – Whether the vectorizer has been fit on data (required before transforming)
dictVectorizers – Dictionary vectorizers used for each feature
tfidfTransformers – TF-IDF transformers used for each feature (where applicable and enabled)

Methods

__init__(entityCount=2, featureChoice=None, tfidf=True)

Constructor for the vectorizer class, with options for which features to use and whether to normalize using TF-IDF.

Parameters:

entityCount (int) – Number of entities in candidate relations to vectorize
featureChoice (list of str) – List of features to use (any subset of the following: ‘entityTypes’, ‘unigramsBetweenEntities’, ‘bigrams’, ‘dependencyPathEdges’, ‘dependencyPathEdgesNearEntities’). Set to None to use all of them.
tfidf (bool) – Whether to normalize n-gram-based features using term frequency-inverse document frequency (TF-IDF)
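
As a minimal sketch of preparing the featureChoice argument, the snippet below checks a list of names against the five features listed above before it is handed to the constructor. The helper check_feature_choice and the constant DOCUMENTED_FEATURES are hypothetical names introduced for illustration only; whether the library performs a similar check itself is not stated here.

```python
# Feature names exactly as listed in the parameter description above.
DOCUMENTED_FEATURES = (
    "entityTypes",
    "unigramsBetweenEntities",
    "bigrams",
    "dependencyPathEdges",
    "dependencyPathEdgesNearEntities",
)


def check_feature_choice(feature_choice):
    """Hypothetical helper: return the list of features that would be requested."""
    if feature_choice is None:
        # None is documented to mean "use all of them".
        return list(DOCUMENTED_FEATURES)
    unknown = [name for name in feature_choice if name not in DOCUMENTED_FEATURES]
    if unknown:
        raise ValueError("Unknown feature(s): " + ", ".join(unknown))
    return list(feature_choice)


chosen = check_feature_choice(["entityTypes", "bigrams"])
# With kindred installed, the checked list could then be passed on, e.g.:
#   vectorizer = kindred.Vectorizer(entityCount=2, featureChoice=chosen, tfidf=True)
```

Checking the names up front turns a misspelled feature into an immediate, readable error rather than a failure later in the pipeline.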

fit_transform(candidates)

Fit the vectorizer to a list of candidate relations found in a corpus and vectorize them to generate the feature matrix.

Parameters:	candidates (list of kindred.CandidateRelation) – Relation candidates to vectorize
Returns:	Feature matrix (# rows = number of candidate relations, # cols = number of features)
Return type:	scipy.sparse.csr.csr_matrix
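
To illustrate what fitting means in general terms, here is a toy, standard-library-only sketch of dictionary vectorization. It is not kindred's implementation: the class ToyVectorizer, the feature strings, and the dict-of-columns row format are all assumptions made for this example. The idea shown is that fitting learns one column per distinct feature seen in the training data, and each candidate becomes one sparse row.

```python
class ToyVectorizer:
    """Illustrative stand-in only; not how kindred.Vectorizer is implemented."""

    def __init__(self):
        self.fitted = False
        self.columns = {}  # feature name -> column index

    def fit_transform(self, feature_dicts):
        # Learn one column per distinct feature name, in sorted order.
        names = sorted({name for features in feature_dicts for name in features})
        self.columns = {name: index for index, name in enumerate(names)}
        self.fitted = True
        return self.transform(feature_dicts)

    def transform(self, feature_dicts):
        if not self.fitted:
            raise RuntimeError("fit_transform() must be called first")
        rows = []
        for features in feature_dicts:
            # Keep only non-zero entries as {column index: value};
            # features never seen during fitting are dropped.
            row = {
                self.columns[name]: value
                for name, value in features.items()
                if name in self.columns
            }
            rows.append(row)
        return rows

    def getFeatureNames(self):
        # Names ordered by their column index.
        return sorted(self.columns, key=self.columns.get)


# Hypothetical features extracted from two candidate relations.
train = [
    {"entityTypes=drug,disease": 1, "unigram=treats": 1},
    {"entityTypes=drug,disease": 1, "unigram=causes": 1},
]
toy = ToyVectorizer()
matrix = toy.fit_transform(train)  # one row per candidate
```

In this toy version the number of rows equals the number of candidates and the number of columns equals the number of distinct features seen while fitting, mirroring the shape described in the Returns entry.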

getFeatureNames()

Get the names for each feature (i.e. each column in the matrix generated by the fit_transform() and transform() functions). fit_transform() must have already been called, i.e. the vectorizer needs to have been fit to training data.

Returns:	List of names for each feature (column of the vectorized data)
Return type:	list of str
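
Because each name corresponds to one matrix column, the list can be used to label the non-zero entries of a row. The sketch below assumes the column indices and values of a single row are already available as plain sequences; for a SciPy CSR matrix, matrix.getrow(i) exposes them as the .indices and .data attributes. The helper name_nonzero_features and the sample feature names are hypothetical, and the real strings returned by getFeatureNames() may look different.

```python
def name_nonzero_features(feature_names, column_indices, values):
    """Pair each non-zero entry of one matrix row with its feature name."""
    return {feature_names[col]: val for col, val in zip(column_indices, values)}


# Hypothetical stand-ins for getFeatureNames() output and one CSR row.
feature_names = ["feature_a", "feature_b", "feature_c", "feature_d"]
row_indices = [0, 3]     # columns holding non-zero values in this row
row_values = [1.0, 0.5]  # the values stored in those columns

labelled = name_nonzero_features(feature_names, row_indices, row_values)
```

A mapping like this is handy when inspecting why a particular candidate relation received its classification.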

transform(candidates)

Vectorize the candidate relations to generate the feature matrix. The vectorizer must already have been fit (see fit_transform()).

Parameters:	candidates (list of kindred.CandidateRelation) – Relation candidates to vectorize
Returns:	Feature matrix (# rows = number of candidate relations, # cols = number of features)
Return type:	scipy.sparse.csr.csr_matrix
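
A typical pattern is to fit on training candidates and then reuse the same vectorizer on unseen candidates. The sketch below relies only on the methods and the fitted attribute documented on this page; the helper names transform_if_fitted and vectorize_train_and_test are hypothetical, and the column-count check reflects an expectation (columns are fixed at fit time) rather than documented behaviour.

```python
def transform_if_fitted(vectorizer, candidates):
    """Vectorize candidates, failing early if the vectorizer has not been fit."""
    if not getattr(vectorizer, "fitted", False):
        raise RuntimeError(
            "Call fit_transform() on training candidates before transform()"
        )
    return vectorizer.transform(candidates)


def vectorize_train_and_test(vectorizer, train_candidates, test_candidates):
    """Fit on the training candidates, then vectorize the test candidates."""
    train_matrix = vectorizer.fit_transform(train_candidates)
    test_matrix = transform_if_fitted(vectorizer, test_candidates)
    # Assumption: both matrices should share the columns learned while fitting.
    if train_matrix.shape[1] != test_matrix.shape[1]:
        raise ValueError("Train and test matrices have different column counts")
    return train_matrix, test_matrix
```

With kindred installed, vectorizer would be a kindred.Vectorizer instance and both candidate lists would hold kindred.CandidateRelation objects. Calling fit_transform() again on the test candidates would learn a new set of columns, so transform() is the appropriate call for any data seen after fitting.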