Model Creation Helpers
webstruct.model contains convenience wrappers for creating NER models.
class webstruct.model.NER(model, loader=None, html_tokenizer=None, entity_colors=None)
Class for extracting named entities from HTML.
Initialize it with a trained model. model must have a predict method that accepts lists of HtmlToken sequences and returns lists of predicted IOB2 tags. The create_wapiti_pipeline() function returns such a model.
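The predict contract can be illustrated with a toy stand-in model. This is only a sketch under stated assumptions, not a trained model: Token is a hypothetical substitute for HtmlToken that carries nothing but a token string, and CapitalizedWordModel is an invented name used for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for webstruct's HtmlToken; only the token text is modelled.
Token = namedtuple("Token", ["token"])


class CapitalizedWordModel:
    """Toy model: tags every run of capitalized words as one ORG entity."""

    def predict(self, X):
        # X is a list of token sequences; return one IOB2 tag list per sequence.
        results = []
        for seq in X:
            tags = []
            inside = False
            for tok in seq:
                if tok.token[:1].isupper():
                    tags.append("I-ORG" if inside else "B-ORG")
                    inside = True
                else:
                    tags.append("O")
                    inside = False
            results.append(tags)
        return results


tokens = [Token(t) for t in "contact Acme Corp today".split()]
predicted = CapitalizedWordModel().predict([tokens])
```

Any object exposing a predict method of this shape should be usable in place of the pipeline returned by create_wapiti_pipeline(), as far as the description above goes.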
extract(bytes_data)
Extract named entities from binary HTML data bytes_data. Return a list of (entity_text, entity_type) tuples.
extract_from_url(url)
A convenience wrapper for the extract() method that downloads input data from a remote URL.
extract_raw(bytes_data)
Extract named entities from binary HTML data bytes_data. Return a list of (html_token, iob2_tag) tuples.
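To show how (token, IOB2 tag) pairs relate to (entity_text, entity_type) tuples, here is a minimal sketch of IOB2 decoding. It is not webstruct's own implementation: the function name iob2_to_entities is hypothetical, tokens are plain strings rather than HtmlToken objects, and words are joined with a single space instead of the library's joining logic.

```python
def iob2_to_entities(tagged_tokens):
    """Group (token_text, iob2_tag) pairs into (entity_text, entity_type) tuples."""
    entities = []
    words, etype = [], None
    for text, tag in tagged_tokens:
        # An I- tag extends the open entity only if the types match.
        if tag.startswith("I-") and tag[2:] == etype:
            words.append(text)
            continue
        # Any other tag closes the entity that was being built.
        if words:
            entities.append((" ".join(words), etype))
        if tag.startswith("B-"):
            words, etype = [text], tag[2:]
        else:
            # "O" tags and stray I- tags without a matching B- start nothing.
            words, etype = [], None
    if words:
        entities.append((" ".join(words), etype))
    return entities


raw = [
    ("Call", "O"),
    ("Acme", "B-ORG"),
    ("Corp", "I-ORG"),
    ("in", "O"),
    ("Berlin", "B-CITY"),
]
entities = iob2_to_entities(raw)
```

With this input, entities holds one ORG entity built from two tokens and one single-token CITY entity.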
extract_groups(bytes_data, dont_penalize=None)
Extract groups of named entities from binary HTML data bytes_data. Return a list of lists of (entity_text, entity_type) tuples. Entities are grouped using the algorithm from webstruct.grouping.
extract_groups_from_url(url, dont_penalize=None)
A convenience wrapper for the extract_groups() method that downloads input data from a remote URL.
build_entity(html_tokens)
Join tokens to an entity. Return the entity as text. By default this function uses webstruct.utils.smart_join(). Override it to customize the results of extract(), extract_from_url() and extract_groups(). If this function returns an empty string or None, the entity is dropped.
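The override-and-drop behaviour can be sketched with a stand-in class. Everything here is an assumption made for illustration: SketchNER is not webstruct.model.NER, it joins tokens with a plain space rather than smart_join(), and entities_from_groups is a hypothetical helper that mimics only the "empty string or None drops the entity" rule.

```python
class SketchNER:
    """Stand-in for the real NER class; models only the build_entity hook."""

    def build_entity(self, html_tokens):
        # Simplified default: join token texts with a space.
        return " ".join(html_tokens)

    def entities_from_groups(self, token_groups):
        # Hypothetical helper: call the hook and skip falsy results.
        result = []
        for tokens, entity_type in token_groups:
            text = self.build_entity(tokens)
            if text:  # an empty string or None drops the entity
                result.append((text, entity_type))
        return result


class MinLengthNER(SketchNER):
    """Override that discards entities shorter than three characters."""

    def build_entity(self, html_tokens):
        text = super().build_entity(html_tokens)
        return text if len(text) >= 3 else None


groups = [(["Acme", "Corp"], "ORG"), (["a"], "ORG")]
kept = MinLengthNER().entities_from_groups(groups)
```

With the real class, the same pattern would be to subclass NER and override build_entity only, leaving the extraction methods untouched.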