Model Creation Helpers
webstruct.model contains convenience wrappers for creating NER models.
- class webstruct.model.NER(model, loader=None, html_tokenizer=None)
Class for extracting named entities from HTML.
Initialize it with a trained model. model must have a transform method that accepts lists of HtmlToken sequences and returns lists of predicted IOB2 tags. The create_wapiti_pipeline() function returns such a model.
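The interface expected of model can be illustrated with a toy stand-in. The sketch below is not part of webstruct: DummyModel is a hypothetical class, and plain strings stand in for HtmlToken objects. It only shows the shape of the data, assuming one tag sequence is returned per token sequence and one IOB2 tag per token.

```python
class DummyModel:
    """Hypothetical stand-in showing the ``transform`` contract.

    Plain strings are used instead of HtmlToken objects (an assumption
    made to keep the sketch self-contained).
    """

    def transform(self, token_sequences):
        # One list of tags per input sequence, one tag per token.
        return [self._tag_sequence(tokens) for tokens in token_sequences]

    def _tag_sequence(self, tokens):
        tags = []
        prev_was_entity = False
        for token in tokens:
            if token.istitle():
                # The first capitalized token opens an entity (B-),
                # directly following ones continue it (I-).
                tags.append("I-PER" if prev_was_entity else "B-PER")
                prev_was_entity = True
            else:
                tags.append("O")
                prev_was_entity = False
        return tags


model = DummyModel()
predicted = model.transform([["contact", "John", "Smith", "today"]])
```

A real model would be trained rather than rule-based. The capitalization rule here exists only to produce well-formed IOB2 output.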
- extract(bytes_data)
Extract named entities from binary HTML data bytes_data. Return a list of (entity_text, entity_type) tuples.
- extract_from_url(url)
A convenience wrapper for the extract() method that downloads input data from a remote URL.
- extract_raw(bytes_data)
Extract named entities from binary HTML data bytes_data. Return a list of (html_token, iob2_tag) tuples.
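To show how the (html_token, iob2_tag) pairs relate to the (entity_text, entity_type) tuples returned by extract(), here is a minimal sketch of IOB2 decoding. It is an illustration, not webstruct's implementation: iob2_to_entities is a hypothetical helper, tokens are plain strings, and they are joined with a single space rather than by build_entity().

```python
def iob2_to_entities(tagged_tokens):
    """Collapse (token, iob2_tag) pairs into (entity_text, entity_type) tuples."""
    entities = []
    tokens, entity_type = [], None
    for token, tag in tagged_tokens:
        continues = tag.startswith("I-") and tag[2:] == entity_type
        if continues:
            tokens.append(token)
            continue
        # Any other tag closes the entity collected so far.
        if tokens:
            entities.append((" ".join(tokens), entity_type))
        if tag.startswith("B-"):
            tokens, entity_type = [token], tag[2:]
        else:
            # "O" tags and stray "I-" tags do not start an entity.
            tokens, entity_type = [], None
    if tokens:
        entities.append((" ".join(tokens), entity_type))
    return entities


raw = [
    ("Contact", "O"),
    ("John", "B-PER"),
    ("Smith", "I-PER"),
    ("at", "O"),
    ("Acme", "B-ORG"),
    ("Berlin", "B-CITY"),
]
entities = iob2_to_entities(raw)
```

In this sketch an I- tag that does not continue an open entity of the same type is ignored. That is one possible policy and may differ from how webstruct handles such tags.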
- extract_groups(bytes_data, dont_penalize=None)
Extract groups of named entities from binary HTML data bytes_data. Return a list of lists of (entity_text, entity_type) tuples.
Entities are grouped using the algorithm from webstruct.grouping.
- build_entity(html_tokens, tag)
Join tokens into an entity and return it as text. By default this function uses webstruct.utils.smart_join().
Override it to customize the results of extract(), extract_from_url() and extract_groups(). If this function returns an empty string or None, the entity is dropped.
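The override pattern can be sketched as follows. To keep the example runnable without webstruct or a trained model, NERStandIn is a hypothetical stand-in for webstruct.model.NER: its collect() method and the space-based join are assumptions that only mimic the documented rule that empty or None results are dropped.

```python
class NERStandIn:
    """Hypothetical stand-in for webstruct.model.NER (not the real class)."""

    def build_entity(self, html_tokens, tag):
        # Stand-in for the default join; the real class uses smart_join().
        return " ".join(html_tokens)

    def collect(self, grouped):
        # Mimics the documented rule: empty or None results are dropped.
        results = []
        for html_tokens, tag in grouped:
            text = self.build_entity(html_tokens, tag)
            if text:
                results.append((text, tag))
        return results


class ShortEntityFilter(NERStandIn):
    """Drops entities whose text is shorter than three characters."""

    def build_entity(self, html_tokens, tag):
        text = super().build_entity(html_tokens, tag).strip()
        # Returning None signals that the entity should be dropped.
        return text if len(text) >= 3 else None


ner = ShortEntityFilter()
kept = ner.collect([(["John", "Smith"], "PER"), (["X"], "ORG")])
```

With the real library, the same idea would be to subclass webstruct.model.NER and override build_entity(html_tokens, tag). In that case html_tokens would hold HtmlToken objects rather than plain strings.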