kindred.Vectorizer

class kindred.Vectorizer(entityCount=2, featureChoice=None, tfidf=True)

Vectorizes a set of candidate relations into a SciPy sparse matrix.

Variables:
  • entityCount – Number of entities in candidate relations to vectorize
  • featureChoice – List of features to use (any subset of the following: ‘entityTypes’, ‘unigramsBetweenEntities’, ‘bigrams’, ‘dependencyPathEdges’, ‘dependencyPathEdgesNearEntities’). Set to None to use all of them.
  • tfidf – Whether to normalize n-gram-based features using term frequency-inverse document frequency (TF-IDF)
  • fitted – Whether the vectorizer has been fit on data (required before transforming)
  • dictVectorizers – Dictionary vectorizers used for each feature
  • tfidfTransformers – TF-IDF transformers used for each feature (if appropriate and selected)

Methods

__init__(entityCount=2, featureChoice=None, tfidf=True)

Constructor for the vectorizer class, with options for which features to use and whether to normalize using TF-IDF.

Parameters:
  • entityCount (int) – Number of entities in candidate relations to vectorize
  • featureChoice (list of str) – List of features to use (any subset of the following: ‘entityTypes’, ‘unigramsBetweenEntities’, ‘bigrams’, ‘dependencyPathEdges’, ‘dependencyPathEdgesNearEntities’). Set to None to use all of them.
  • tfidf (bool) – Whether to normalize n-gram-based features using term frequency-inverse document frequency (TF-IDF)
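
To make the tfidf option concrete, here is a minimal pure-Python sketch of one common TF-IDF weighting (raw count multiplied by a smoothed inverse document frequency). It is an illustration only: the function name tfidf_weight and the sample counts are hypothetical, and the exact formula and normalization that kindred applies may differ.

```python
import math


def tfidf_weight(rows):
    """Reweight per-candidate term counts by smoothed inverse document frequency.

    rows: list of dicts mapping term -> raw count, one dict per candidate.
    Assumption: smoothed IDF, log((1 + n) / (1 + df)) + 1, with no length normalization.
    """
    n_rows = len(rows)
    # Document frequency: the number of rows in which each term appears
    document_frequency = {}
    for row in rows:
        for term in row:
            document_frequency[term] = document_frequency.get(term, 0) + 1
    idf = {
        term: math.log((1 + n_rows) / (1 + df)) + 1.0
        for term, df in document_frequency.items()
    }
    return [{term: count * idf[term] for term, count in row.items()} for row in rows]


# Hypothetical unigram counts for three candidate relations
rows = [{"inhibits": 1, "the": 2}, {"the": 1}, {"binds": 1, "the": 1}]
weighted = tfidf_weight(rows)
```

A term that occurs in every row ("the") keeps a multiplier of 1, while rarer terms ("inhibits", "binds") are scaled up, so common words contribute less relative weight to the feature matrix.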

fit_transform(candidates)

Fit the vectorizer to a list of candidate relations found in a corpus and vectorize them to generate the feature matrix.

Parameters: candidates (list of kindred.CandidateRelation) – Relation candidates to vectorize
Returns: Feature matrix (# rows = number of candidate relations, # cols = number of features)
Return type: scipy.sparse.csr.csr_matrix

getFeatureNames()

Get the names for each feature (i.e. each column in the matrix generated by the fit_transform() and transform() functions). fit_transform() must have already been used, i.e. the vectorizer needs to have been fit to training data.

Returns: List of names for each feature (column of the vectorized data)
Return type: list of str
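
Because the returned names line up one-to-one with the matrix columns, they can be used to inspect which features are active for a given candidate. The sketch below shows that pairing on a plain Python list; the helper describe_row and the feature names are hypothetical, and the naming scheme kindred actually uses is not specified here.

```python
def describe_row(feature_names, row):
    """Pair each non-zero value in a dense feature row with its column name."""
    if len(feature_names) != len(row):
        raise ValueError("expected exactly one name per column")
    return [(name, value) for name, value in zip(feature_names, row) if value != 0]


# Hypothetical names and values, for illustration only
feature_names = ["type=Drug", "type=Gene", "unigram=inhibits"]
row = [1.0, 0.0, 0.5]
active = describe_row(feature_names, row)
```

With a real csr_matrix, a dense row for candidate i could be obtained with something like matrix[i].toarray()[0] before passing it to a helper of this kind.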

transform(candidates)

Vectorize the candidate relations to generate the feature matrix. The vectorizer must already have been fit.

Parameters: candidates (list of kindred.CandidateRelation) – Relation candidates to vectorize
Returns: Feature matrix (# rows = number of candidate relations, # cols = number of features)
Return type: scipy.sparse.csr.csr_matrix
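
The relationship between fit_transform() and transform() can be sketched with a toy dictionary vectorizer. This is a stand-in written for illustration, not kindred's implementation: the class ToyVectorizer, the feature names, the error type, and the choice to drop features unseen during fitting (which is how scikit-learn's DictVectorizer behaves) are all assumptions.

```python
class ToyVectorizer:
    """Toy stand-in for the fit/transform contract; not kindred's implementation."""

    def __init__(self):
        self.fitted = False
        self.vocabulary = {}  # feature name -> column index

    def fit_transform(self, rows):
        # Learn one column per distinct feature name, in sorted order
        names = sorted({name for row in rows for name in row})
        self.vocabulary = {name: index for index, name in enumerate(names)}
        self.fitted = True
        return self.transform(rows)

    def transform(self, rows):
        if not self.fitted:
            raise RuntimeError("fit_transform() must be called before transform()")
        matrix = []
        for row in rows:
            vector = [0.0] * len(self.vocabulary)
            for name, value in row.items():
                column = self.vocabulary.get(name)
                if column is not None:  # features unseen during fitting are dropped
                    vector[column] = float(value)
            matrix.append(vector)
        return matrix


# Hypothetical feature dictionaries, one per candidate relation
train = [{"bigram=binds to": 1, "type=Gene": 1}, {"type=Drug": 1}]
test = [{"type=Gene": 1, "bigram=never seen": 1}]

vectorizer = ToyVectorizer()
train_matrix = vectorizer.fit_transform(train)
test_matrix = vectorizer.transform(test)
```

The point of the split is that the column layout is fixed at fit time, so the training and test matrices have the same number of columns and can be passed to the same classifier.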