Model Creation Helpers

webstruct.model contains convenience wrappers for creating NER models.

class webstruct.model.NER(model, loader=None, html_tokenizer=None)

Class for extracting named entities from HTML.

Initialize it with a trained model. model must have a transform method that accepts lists of HtmlToken sequences and returns lists of predicted IOB2 tags. The create_wapiti_pipeline() function returns such a model.
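The transform contract described above can be illustrated with a toy stand-in. This is only a sketch: ToyTagger is a hypothetical class, not part of webstruct, and it works on plain strings instead of real HtmlToken objects. It shows just the expected shape, one list of IOB2 tags per input sequence.

```python
class ToyTagger:
    """Hypothetical stand-in for a trained model (not part of webstruct)."""

    def transform(self, token_sequences):
        # Return one list of IOB2 tags per input sequence,
        # with exactly one tag per token.
        tagged = []
        for tokens in token_sequences:
            tags = []
            inside_entity = False
            for tok in tokens:
                # Assumption for illustration: plain strings stand in
                # for HtmlToken objects, and title-cased words are names.
                if tok.istitle():
                    tags.append("I-PER" if inside_entity else "B-PER")
                    inside_entity = True
                else:
                    tags.append("O")
                    inside_entity = False
            tagged.append(tags)
        return tagged


tags = ToyTagger().transform([["call", "John", "Smith", "today"]])
```

Any object exposing a compatible transform method should be usable in place of the trained pipeline, which can be handy in tests.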

extract(bytes_data)

Extract named entities from binary HTML data bytes_data. Return a list of (entity_text, entity_type) tuples.

extract_from_url(url)

A convenience wrapper for the extract() method that downloads input data from a remote URL.

extract_raw(bytes_data)

Extract named entities from binary HTML data bytes_data. Return a list of (html_token, iob2_tag) tuples.
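To show how (html_token, iob2_tag) pairs relate to the (entity_text, entity_type) tuples returned by extract(), here is a minimal sketch of IOB2 decoding. It is not webstruct's implementation: iob2_to_entities is a hypothetical helper, tokens are plain strings, text is joined with single spaces rather than smart_join(), and a stray I- tag is assumed to start a new entity.

```python
def iob2_to_entities(tagged_tokens):
    """Group (token, iob2_tag) pairs into (entity_text, entity_type) tuples."""
    entities = []
    current_tokens, current_type = [], None

    def flush():
        # Emit the entity collected so far, if any.
        if current_tokens:
            entities.append((" ".join(current_tokens), current_type))

    for token, tag in tagged_tokens:
        if tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != current_type
        ):
            # A B- tag (or, by assumption, an I- tag of a different type)
            # closes the previous entity and opens a new one.
            flush()
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-"):
            # Continuation of the current entity.
            current_tokens.append(token)
        else:
            # "O" tag: the token is outside any entity.
            flush()
            current_tokens, current_type = [], None
    flush()
    return entities


raw = [
    ("call", "O"),
    ("John", "B-PER"),
    ("Smith", "I-PER"),
    ("at", "O"),
    ("Acme", "B-ORG"),
]
entities = iob2_to_entities(raw)
```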

extract_groups(bytes_data, dont_penalize=None)

Extract groups of named entities from binary HTML data bytes_data. Return a list of lists of (entity_text, entity_type) tuples.

Entities are grouped using the algorithm from webstruct.grouping.

build_entity(html_tokens, tag)

Join tokens into an entity. Return the entity as text. By default this function uses webstruct.utils.smart_join().

Override it to customize the results of extract(), extract_from_url() and extract_groups(). If this function returns an empty string or None, the entity is dropped.
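The drop rule can be sketched without the library. The code below is illustrative only: build_entity here is a free function rather than a method override, collect_entities is a hypothetical helper that mimics how a caller might filter results, and tokens are plain strings joined with spaces. In real code you would subclass NER and override build_entity(self, html_tokens, tag) instead.

```python
def build_entity(tokens, tag):
    """Join tokens into text; return None to drop the entity."""
    text = " ".join(tokens).strip()
    # Example customization (an assumption, not library behavior):
    # discard person names shorter than three characters.
    if tag == "PER" and len(text) < 3:
        return None
    return text


def collect_entities(grouped, build=build_entity):
    """Apply `build` to each (tokens, tag) pair, skipping dropped entities."""
    results = []
    for tokens, tag in grouped:
        text = build(tokens, tag)
        if text:  # an empty string or None means "drop this entity"
            results.append((text, tag))
    return results


kept = collect_entities([
    (["John", "Smith"], "PER"),
    (["J"], "PER"),   # too short, build_entity returns None
    ([""], "ORG"),    # joins to an empty string, so it is dropped
])
```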