kindred.iterLoad

kindred.iterLoad(dataFormat, path, corpusSizeCutoff=500)

Iteratively load sections of a (presumably large) corpus. This creates a generator that yields kindred.Corpus objects, each a subset of the larger corpus. Use it to lower memory requirements, so that the entire file doesn’t need to be loaded into memory at once.

Parameters:
  • dataFormat (str) – Format of the data files to load (only ‘biocxml’ is currently supported)
  • path (str) – Path to the data. Can be a directory or an individual file (for bioc, json or simpletag)
  • corpusSizeCutoff (int) – Approximate maximum number of documents in each corpus subset
Returns:

Subsets of the BioC file

Return type:

A kindred.Corpus generator
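
Example:

The following is a minimal usage sketch, not an official recipe. It assumes that each yielded kindred.Corpus exposes its documents as a `documents` list. The helper names `summarize_batches` and `summarize_bioc_file` are hypothetical and exist only for illustration. The sketch walks the generator one subset at a time and keeps only running totals, so no more than one subset needs to be held in memory.

```python
def summarize_batches(corpus_batches):
    """Consume corpus subsets one at a time and return running totals.

    `corpus_batches` is any iterable of objects with a `documents` list
    (assumed here to be the case for kindred.Corpus).
    Returns a (number_of_subsets, number_of_documents) tuple.
    """
    batch_count = 0
    document_count = 0
    for corpus in corpus_batches:
        batch_count += 1
        document_count += len(corpus.documents)
        # Only the totals are kept. The subset itself can be freed once the
        # loop moves on, provided nothing else holds a reference to it.
    return batch_count, document_count


def summarize_bioc_file(path, corpus_size_cutoff=500):
    """Hypothetical wrapper: stream a BioC XML corpus through iterLoad."""
    # Imported lazily so summarize_batches can be used without kindred.
    import kindred

    batches = kindred.iterLoad('biocxml', path,
                               corpusSizeCutoff=corpus_size_cutoff)
    return summarize_batches(batches)
```

Because corpusSizeCutoff is described as an approximate maximum, the sketch does not assume every subset has exactly that many documents. It simply counts whatever each subset contains.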