kindred.Vectorizer
class kindred.Vectorizer(entityCount=2, featureChoice=None, tfidf=True)
Vectorizes a set of candidate relations into a scipy sparse matrix.
Variables:
- entityCount – Number of entities in the candidate relations to vectorize
- featureChoice – List of features to use (one or more of the following: ‘entityTypes’, ‘unigramsBetweenEntities’, ‘bigrams’, ‘dependencyPathEdges’, ‘dependencyPathEdgesNearEntities’). Set to None to use all of them.
- tfidf – Whether to normalize n-gram based features using term frequency-inverse document frequency (TF-IDF)
- fitted – Whether the vectorizer has been fit on data (required before transforming)
- dictVectorizers – Dictionary vectorizers used for each feature
- tfidfTransformers – TF-IDF transformers used for each feature (if appropriate and selected)
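To make the tfidf option more concrete, here is a minimal pure-Python sketch of TF-IDF weighting on a tiny count matrix. It assumes the smoothed IDF and L2 row normalization that scikit-learn's TfidfTransformer applies by default; whether kindred's tfidfTransformers are configured exactly this way is an assumption, and the function name tfidf_rows is hypothetical.

```python
import math


def tfidf_rows(counts):
    """Apply smoothed TF-IDF with L2 row normalization to a dense count matrix.

    counts: list of rows, each a list of non-negative term counts.
    """
    n_rows = len(counts)
    n_cols = len(counts[0]) if counts else 0
    # Document frequency: number of rows in which each column is non-zero.
    df = [sum(1 for row in counts if row[j] > 0) for j in range(n_cols)]
    # Smoothed inverse document frequency (assumed formula, see the text above).
    idf = [math.log((1 + n_rows) / (1 + d)) + 1.0 for d in df]
    weighted = []
    for row in counts:
        scaled = [c * w for c, w in zip(row, idf)]
        norm = math.sqrt(sum(v * v for v in scaled))
        # Rows that are all zeros are left unchanged to avoid dividing by zero.
        weighted.append([v / norm for v in scaled] if norm > 0 else scaled)
    return weighted


# Column 0 appears in every row, columns 1 and 2 in one row each.
counts = [
    [1, 1, 0],
    [1, 0, 1],
    [1, 0, 0],
]
weights = tfidf_rows(counts)
```

In this toy matrix, the feature that occurs in every row ends up with a lower weight than a feature that occurs in only one row, which is the intended effect of the normalization.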
Methods
__init__(entityCount=2, featureChoice=None, tfidf=True)
Constructor for the vectorizer class, with options for which features to use and whether to normalize using TF-IDF.
Parameters:
- entityCount (int) – Number of entities in the candidate relations to vectorize
- featureChoice (list of str) – List of features to use (one or more of the following: ‘entityTypes’, ‘unigramsBetweenEntities’, ‘bigrams’, ‘dependencyPathEdges’, ‘dependencyPathEdgesNearEntities’). Set to None to use all of them.
- tfidf (bool) – Whether to normalize n-gram based features using term frequency-inverse document frequency (TF-IDF)
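The featureChoice rule described above (None selects every feature, otherwise a subset of the five documented names) can be sketched as a small check. The helper resolve_feature_choice and the constant KNOWN_FEATURES are hypothetical and not part of kindred; kindred's own validation may behave differently.

```python
# Feature names as listed in the documentation above.
KNOWN_FEATURES = [
    "entityTypes",
    "unigramsBetweenEntities",
    "bigrams",
    "dependencyPathEdges",
    "dependencyPathEdgesNearEntities",
]


def resolve_feature_choice(featureChoice=None):
    """Return the features to use; None means all documented features.

    Hypothetical helper for illustration only, not a kindred function.
    """
    if featureChoice is None:
        return list(KNOWN_FEATURES)
    unknown = [name for name in featureChoice if name not in KNOWN_FEATURES]
    if unknown:
        raise ValueError("Unknown feature(s): " + ", ".join(unknown))
    return list(featureChoice)
```

With kindred itself, the same kind of selection would be passed straight to the constructor, for example kindred.Vectorizer(featureChoice=['entityTypes', 'bigrams']).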
fit_transform(candidates)
Fit the vectorizer to a list of candidate relations found in a corpus and vectorize them to generate the feature matrix.
Parameters: candidates (list of kindred.CandidateRelation) – Relation candidates to vectorize
Returns: Feature matrix (# rows = number of candidate relations, # cols = number of features)
Return type: scipy.sparse.csr.csr_matrix
getFeatureNames()
Get the names for each feature (i.e. each column in the matrix generated by the fit_transform() and transform() functions). fit_transform() must already have been called, i.e. the vectorizer needs to have been fit to training data.
Returns: List of names for each feature (column of the vectorized data)
Return type: list of str
transform(candidates)
Vectorize the candidate relations to generate the feature matrix. The vectorizer must already have been fit.
Parameters: candidates (list of kindred.CandidateRelation) – Relation candidates to vectorize
Returns: Feature matrix (# rows = number of candidate relations, # cols = number of features)
Return type: scipy.sparse.csr.csr_matrix
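The fit-then-transform contract described for fit_transform(), transform() and getFeatureNames() can be illustrated with a toy stand-in. The ToyVectorizer class below is hypothetical and pure Python: it takes plain feature dictionaries instead of kindred.CandidateRelation objects and returns dense lists instead of a scipy sparse matrix, so it sketches the calling pattern only, not kindred's implementation. The feature names in the example data are invented for illustration.

```python
class ToyVectorizer:
    """Hypothetical stand-in that mimics the fit/transform contract."""

    def __init__(self):
        self.fitted = False
        self.featureNames = []
        self.columnOf = {}

    def fit_transform(self, examples):
        # Learn one column per distinct feature name seen in the training data.
        names = sorted({name for ex in examples for name in ex})
        self.featureNames = names
        self.columnOf = {name: j for j, name in enumerate(names)}
        self.fitted = True
        return self.transform(examples)

    def transform(self, examples):
        if not self.fitted:
            raise RuntimeError("Vectorizer must be fit before transform()")
        matrix = []
        for ex in examples:
            row = [0] * len(self.featureNames)
            for name, value in ex.items():
                # Features unseen during fitting have no column and are dropped.
                if name in self.columnOf:
                    row[self.columnOf[name]] = value
            matrix.append(row)
        return matrix

    def getFeatureNames(self):
        if not self.fitted:
            raise RuntimeError("Vectorizer must be fit before getFeatureNames()")
        return list(self.featureNames)


# Invented feature dictionaries standing in for candidate relations.
train = [
    {"bigrams_treats_with": 1, "entityTypes_drug": 1},
    {"entityTypes_disease": 1},
]
test = [{"entityTypes_drug": 1, "bigrams_never_seen": 1}]

vectorizer = ToyVectorizer()
train_matrix = vectorizer.fit_transform(train)  # fit on training data once
test_matrix = vectorizer.transform(test)        # reuse the learned columns
```

The point of the pattern is that training and test matrices share one column layout, so a classifier trained on the output of fit_transform() can be applied directly to the output of transform().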