Model Creation Helpers
webstruct.model contains convenience wrappers for creating NER models.

class webstruct.model.NER(model, loader=None, html_tokenizer=None, entity_colors=None)

Class for extracting named entities from HTML.

Initialize it with a trained model. model must have a predict method that accepts lists of HtmlToken sequences and returns lists of predicted IOB2 tags. The create_wapiti_pipeline() function returns such a model.

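The predict contract described above can be illustrated with a toy model. This is only a sketch of the expected input and output shapes, not a real webstruct model: FakeToken is a hypothetical stand-in for HtmlToken (only a token text field is assumed), and the capitalization rule is invented for illustration.

```python
from collections import namedtuple

# Hypothetical stand-in for webstruct's HtmlToken; only a `token` text
# field is assumed here. The real class carries more information.
FakeToken = namedtuple("FakeToken", ["token"])


class CapitalizedWordModel:
    """Toy model: tags each run of capitalized words as one ORG entity."""

    def predict(self, token_sequences):
        # Input: a list of token sequences (one sequence per document).
        # Output: a list of IOB2 tag lists, aligned with the input.
        all_tags = []
        for tokens in token_sequences:
            tags = []
            previous_was_entity = False
            for tok in tokens:
                if tok.token[:1].isupper():
                    # B- opens an entity, I- continues the previous one.
                    tags.append("I-ORG" if previous_was_entity else "B-ORG")
                    previous_was_entity = True
                else:
                    tags.append("O")
                    previous_was_entity = False
            all_tags.append(tags)
        return all_tags


sequence = [FakeToken(t) for t in ["contact", "Acme", "Widgets", "today"]]
predicted = CapitalizedWordModel().predict([sequence])
```

Any object with a predict method of this shape should satisfy the requirement stated above; a trained pipeline would replace the toy rule.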
extract(bytes_data)
Extract named entities from binary HTML data bytes_data. Return a list of (entity_text, entity_type) tuples.

extract_from_url(url)
A convenience wrapper for the extract() method that downloads input data from a remote URL.

extract_raw(bytes_data)
Extract named entities from binary HTML data bytes_data. Return a list of (html_token, iob2_tag) tuples.

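The relationship between the (html_token, iob2_tag) pairs of extract_raw() and the (entity_text, entity_type) tuples of extract() can be sketched as a plain IOB2 decoding step. This is not webstruct's implementation: the hypothetical helper iob2_to_entities below works on plain strings instead of HtmlToken objects and joins words with single spaces, whereas the library builds entity text through build_entity().

```python
def iob2_to_entities(tagged_tokens):
    """Collapse (token_text, iob2_tag) pairs into (entity_text, entity_type)."""
    entities = []
    current_words, current_type = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_words:
            entities.append((" ".join(current_words), current_type))

    for text, tag in tagged_tokens:
        if tag.startswith("B-"):
            # B- always starts a new entity, closing the previous one.
            flush()
            current_words, current_type = [text], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            # I- continues the open entity of the same type.
            current_words.append(text)
        else:
            # "O", or a stray I- tag with no matching open entity
            # (treated like "O" in this sketch).
            flush()
            current_words, current_type = [], None
    flush()
    return entities


tagged = [
    ("Call", "O"),
    ("Acme", "B-ORG"),
    ("Widgets", "I-ORG"),
    ("in", "O"),
    ("Boston", "B-CITY"),
]
entities = iob2_to_entities(tagged)
# entities == [("Acme Widgets", "ORG"), ("Boston", "CITY")]
```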
extract_groups(bytes_data, dont_penalize=None)
Extract groups of named entities from binary HTML data bytes_data. Return a list of lists of (entity_text, entity_type) tuples. Entities are grouped using the algorithm from webstruct.grouping.

extract_groups_from_url(url, dont_penalize=None)
A convenience wrapper for the extract_groups() method that downloads input data from a remote URL.

build_entity(html_tokens)
Join tokens to an entity. Return the entity as text. By default this function uses webstruct.utils.smart_join(). Override it to customize the results of extract(), extract_from_url() and extract_groups(). If this function returns an empty string or None, the entity is dropped.

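The override-and-drop behaviour of build_entity() can be sketched without the library. EntityBuilderSketch below is a hypothetical stand-in that mimics only the documented contract (join tokens, drop the entity when the result is an empty string or None); it takes plain strings rather than HtmlToken objects and joins them with spaces instead of smart_join(). With the real class, the same pattern would presumably be a subclass of NER that overrides build_entity().

```python
class EntityBuilderSketch:
    """Hypothetical stand-in that mimics only the documented contract."""

    def build_entity(self, tokens):
        # Stand-in for the default joining; the library uses smart_join().
        return " ".join(tokens)

    def collect(self, grouped_tokens):
        # grouped_tokens: list of (token_texts, entity_type) pairs.
        results = []
        for tokens, entity_type in grouped_tokens:
            text = self.build_entity(tokens)
            if text:  # an empty string or None drops the entity
                results.append((text, entity_type))
        return results


class NoShortEntities(EntityBuilderSketch):
    """Strip surrounding punctuation and drop entities under 3 characters."""

    def build_entity(self, tokens):
        text = super().build_entity(tokens).strip(" ,.;:")
        return text if len(text) >= 3 else None


groups = [(["Acme", "Widgets,"], "ORG"), ([".", "a"], "ORG")]
default_results = EntityBuilderSketch().collect(groups)
filtered_results = NoShortEntities().collect(groups)
# default_results keeps both entities; filtered_results keeps only
# ("Acme Widgets", "ORG"), because the second one is reduced to "a".
```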