Feature Extraction

HTML Tokenization

webstruct.html_tokenizer contains the HtmlTokenizer class, which extracts text from a web page and tokenizes it while preserving each token’s position in the HTML tree (a token plus its tree position is an HtmlToken). HtmlTokenizer can also extract annotations from the tree (if present) and separate them from regular text tokens.

class webstruct.html_tokenizer.HtmlToken[source]

HTML token info.


  • index is a token index (in the tokens list)
  • tokens is a list of all tokens in current html block
  • elem is the current html block (as lxml’s Element) - most likely you want parent instead of it
  • is_tail flag indicates that token belongs to element tail
  • position is the logical position (in letters or codepoints) of the token start in the parent text
  • length is the logical length (in letters or codepoints) of the token in the parent text

Computed properties:

  • token is the current token (as text);
  • parent is token’s parent HTML element (as lxml’s Element);
  • root is an ElementTree this token belongs to.

class webstruct.html_tokenizer.HtmlTokenizer(tagset=None, sequence_encoder=None, text_tokenize_func=None, kill_html_tags=None, replace_html_tags=None, ignore_html_tags=None)[source]

Class for converting HTML trees (returned by one of the webstruct.loaders) into lists of HtmlToken instances and associated tags. Also, it can do the reverse conversion.

Use tokenize_single() to convert a single tree and tokenize() to convert multiple trees.

Use detokenize_single() to get an annotated tree out of a list of HtmlToken instances and a list of tags.

tagset : set, optional

A set of entity types to keep. If not passed, all entity types are kept. Use this argument to discard some entity types from training data.

sequence_encoder : object, optional

Sequence encoder object. If not passed, an IobEncoder instance is created.

text_tokenize_func : callable, optional

Function used for tokenizing text inside HTML elements. By default, HtmlTokenizer uses webstruct.text_tokenizers.tokenize().

kill_html_tags: set, optional

A set of HTML tags that should be removed. The contents of removed tags are kept. See webstruct.utils.kill_html_tags()

replace_html_tags: dict, optional

A mapping {'old_tagname': 'new_tagname'}. It defines how tags should be renamed. See webstruct.utils.replace_html_tags()

ignore_html_tags: set, optional

A set of HTML tags which won’t produce HtmlToken instances, but will be kept in a tree. Default is {'script', 'style'}.
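The effect of replace_html_tags and ignore_html_tags can be pictured with a small standard-library sketch. It uses xml.etree.ElementTree rather than lxml, and the helpers rename_tags and iter_text are hypothetical names introduced only for illustration; this is not how webstruct.utils implements these steps.

```python
import xml.etree.ElementTree as ET

def rename_tags(root, mapping):
    """Rename elements in place according to {'old_tagname': 'new_tagname'}."""
    for elem in root.iter():
        if elem.tag in mapping:
            elem.tag = mapping[elem.tag]
    return root

def iter_text(elem, ignore=("script", "style")):
    """Yield text chunks in document order, skipping ignored elements.

    Ignored elements stay in the tree; only their own contents are skipped.
    The tail text after an ignored element belongs to its parent, so it is kept.
    """
    if elem.tag not in ignore:
        if elem.text:
            yield elem.text
        for child in elem:
            yield from iter_text(child, ignore)
            if child.tail:
                yield child.tail

root = ET.fromstring(
    "<div><script>var x = 1;</script><p>hello <b>Doe</b> said</p></div>"
)
rename_tags(root, {"b": "strong"})
chunks = list(iter_text(root))
renamed_tags = [elem.tag for elem in root.iter()]
```

After the call, the b element is reported as strong, the script element is still present in the tree, and its contents do not appear among the text chunks.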

detokenize_single(html_tokens, tags)[source]

Build an annotated lxml.etree.ElementTree from html_tokens (a list of HtmlToken instances) and tags (a list of their tags). ATTENTION: html_tokens should be produced by tokenizing a tree without tags.

Annotations are encoded as __START_TAG__ and __END_TAG__ text tokens (this is the format webstruct.loaders use).


tokenize_single(tree)

Return two lists:

  • a list of HtmlToken tokens;
  • a list of associated tags.

For unannotated HTML all tags will be “O” - they may be ignored.
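To make the relationship between annotation marker tokens and the returned tags concrete, here is a minimal sketch on plain strings. It assumes markers of the form __START_PER__ and __END_PER__ (TAG replaced by the entity type), and encode_iob is a hypothetical helper, not the library’s IobEncoder.

```python
def encode_iob(raw_tokens):
    """Split marker tokens from regular tokens and assign IOB tags."""
    tokens, tags = [], []
    current = None    # entity type we are currently inside, if any
    at_start = False  # True until the first token of the entity is seen
    for tok in raw_tokens:
        if tok.startswith("__START_") and tok.endswith("__"):
            current = tok[len("__START_"):-2]
            at_start = True
        elif tok.startswith("__END_") and tok.endswith("__"):
            current = None
        else:
            if current is None:
                tags.append("O")
            else:
                tags.append(("B-" if at_start else "I-") + current)
                at_start = False
            tokens.append(tok)
    return tokens, tags

raw = ["hello", "__START_PER__", "John", "Doe", "__END_PER__",
       "__START_PER__", "Mary", "__END_PER__", "said"]
tokens, tags = encode_iob(raw)
```

The marker tokens are dropped from the token list, and two adjacent entities of the same type stay distinguishable because each one starts with its own B- tag.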


>>> from webstruct import GateLoader, HtmlTokenizer
>>> loader = GateLoader(known_entities={'PER'})
>>> html_tokenizer = HtmlTokenizer(replace_html_tags={'b': 'strong'})
>>> tree = loader.loadbytes(b"<p>hello, <PER>John <b>Doe</b></PER> <br> <PER>Mary</PER> said</p>")
>>> html_tokens, tags = html_tokenizer.tokenize_single(tree)
>>> html_tokens
[HtmlToken(token='hello', parent=<Element p at ...>, index=0, ...), HtmlToken...]
>>> tags
['O', 'B-PER', 'I-PER', 'B-PER', 'O']
>>> for tok, iob_tag in zip(html_tokens, tags):
...     print("%5s" % iob_tag, tok.token, tok.elem.tag, tok.parent.tag)
    O hello p p
B-PER John p p
I-PER Doe strong strong
B-PER Mary br p
    O said br p

For HTML without text it returns empty lists:

>>> html_tokenizer.tokenize_single(loader.loadbytes(b'<p></p>'))
([], [])

Feature Extraction Utilities

webstruct.feature_extraction contains classes that help with:

  • converting HTML pages into lists of feature dicts and
  • extracting annotations.

Usually, the approach is the following:

  1. Convert a web page to a list of HtmlToken instances and a list of annotation tags (if present). HtmlTokenizer is used for that.

  2. Run a number of “token feature functions” that return bits of information about each token: token text, token shape (uppercased/lowercased/…), whether token is in <a> HTML element, etc. For each token, this information is combined into a single feature dictionary.

    Use HtmlFeatureExtractor at this stage. There are a number of predefined token feature functions in webstruct.features.

  3. Run a number of “global feature functions” that can modify token feature dicts in place (insert new features, change or remove existing ones) using “global” information: information about all other tokens in a document and their existing token-level feature dicts. Global feature functions are applied sequentially: subsequent global feature functions get feature dicts updated by previous feature functions.

    This is also done by HtmlFeatureExtractor.

    LongestMatchGlobalFeature can be used to create features that capture multi-token patterns. Some predefined global feature functions can be found in webstruct.gazetteers.
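Steps 2 and 3 can be sketched in a few lines of plain Python. This is only an illustration of the data flow, not HtmlFeatureExtractor itself: the function extract, the feature functions and the SimpleNamespace stand-ins for HtmlToken are all assumptions made for this example.

```python
from types import SimpleNamespace

def token_text(html_token):
    # Token feature function: looks at a single token only.
    return {"tok": html_token.token}

def token_shape(html_token):
    return {"is_title": html_token.token.istitle()}

def add_position(doc):
    # Global feature function: sees all (html_token, feature_dict) pairs
    # and modifies the feature dicts in place.
    last = len(doc) - 1
    for i, (html_token, feats) in enumerate(doc):
        feats["is_first"] = (i == 0)
        feats["is_last"] = (i == last)

def extract(html_tokens, token_features, global_features=()):
    feature_dicts = []
    for tok in html_tokens:
        feats = {}
        for func in token_features:
            feats.update(func(tok))  # merge dicts from all token functions
        feature_dicts.append(feats)
    doc = list(zip(html_tokens, feature_dicts))
    for func in global_features:     # applied in the order given
        func(doc)
    return feature_dicts

# Stand-ins for HtmlToken instances: only the .token attribute is used here.
tokens = [SimpleNamespace(token=t) for t in ["hello", "John", "Doe"]]
features = extract(tokens, [token_text, token_shape], [add_position])
```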

class webstruct.feature_extraction.HtmlFeatureExtractor(token_features, global_features=None, min_df=1)[source]

This class extracts features from lists of HtmlToken instances (HtmlTokenizer can be used to create such lists).

The fit() / transform() / fit_transform() interface may look familiar if you have used scikit-learn: HtmlFeatureExtractor implements sklearn’s Transformer interface. There is one twist: in sequence labelling tasks, whole sequences are usually treated as observations. So in our case a single observation is a tokenized document (a list of tokens), not an individual token: the fit() / transform() / fit_transform() methods accept lists of documents (lists of lists of tokens) and return lists of documents’ feature dicts (lists of lists of feature dicts).

token_features : list of callables

List of “token” feature functions. Each function accepts a single html_token parameter and returns a dictionary which maps feature names to feature values. Dicts from all token feature functions are merged by HtmlFeatureExtractor. Example token feature (it just returns token text):

>>> def current_token(html_token):
...     return {'tok': html_token.token}

webstruct.features module provides some predefined feature functions, e.g. parent_tag which returns token’s parent tag.


>>> from webstruct import GateLoader, HtmlTokenizer, HtmlFeatureExtractor
>>> from webstruct.features import parent_tag

>>> loader = GateLoader(known_entities={'PER'})
>>> html_tokenizer = HtmlTokenizer()
>>> feature_extractor = HtmlFeatureExtractor(token_features=[parent_tag])

>>> tree = loader.loadbytes(b"<p>hello, <PER>John <b>Doe</b></PER> <br> <PER>Mary</PER> said</p>")
>>> html_tokens, tags = html_tokenizer.tokenize_single(tree)
>>> feature_dicts = feature_extractor.transform_single(html_tokens)
>>> for token, tag, feat in zip(html_tokens, tags, feature_dicts):
...     print("%s %s %s" % (token.token, tag, feat))
hello O {'parent_tag': 'p'}
John B-PER {'parent_tag': 'p'}
Doe I-PER {'parent_tag': 'b'}
Mary B-PER {'parent_tag': 'p'}
said O {'parent_tag': 'p'}

global_features : list of callables, optional

List of “global” feature functions. Each “global” feature function should accept a single argument - a list of (html_token, feature_dict) tuples. This list contains all tokens from the document and features extracted by previous feature functions.

“Global” feature functions are applied after “token” feature functions in the order they are passed.

They should modify each feature_dict in place rather than return a new one.
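A sketch of two global feature functions applied one after another may help. Both functions are hypothetical examples, and the html_token slot is filled with None placeholders because these particular functions only read the feature dicts.

```python
def add_neighbours(doc):
    """Copy the previous and next token's 'tok' feature into each dict."""
    for i, (html_token, feats) in enumerate(doc):
        if i > 0:
            feats["prev_tok"] = doc[i - 1][1]["tok"]
        if i < len(doc) - 1:
            feats["next_tok"] = doc[i + 1][1]["tok"]

def mark_after_comma(doc):
    """Relies on 'prev_tok', so it must run after add_neighbours."""
    for html_token, feats in doc:
        feats["after_comma"] = feats.get("prev_tok") == ","

# A document is a list of (html_token, feature_dict) tuples.
doc = [(None, {"tok": "hello"}), (None, {"tok": ","}), (None, {"tok": "John"})]
for func in (add_neighbours, mark_after_comma):
    func(doc)  # nothing is returned; the dicts are updated in place
```

The second function only works because the first one has already added prev_tok, which is why the order in which global feature functions are passed matters.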

min_df : integer or Mapping, optional

Feature values that have a document frequency strictly lower than the given threshold are removed. If min_df is integer, its value is used as threshold.

TODO: if min_df is a dictionary, it should map feature names to thresholds.
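One way to picture the integer form of min_df is the sketch below. It assumes that document frequency is counted per (feature name, feature value) pair and that feature values are hashable; filter_by_min_df is a hypothetical helper, and the library’s actual bookkeeping may differ.

```python
from collections import Counter

def filter_by_min_df(docs, min_df):
    """Drop (name, value) pairs that occur in fewer than min_df documents."""
    df = Counter()
    for doc in docs:
        seen = set()
        for feats in doc:
            seen.update(feats.items())
        df.update(seen)  # each pair counts at most once per document
    return [
        [{k: v for k, v in feats.items() if df[(k, v)] >= min_df}
         for feats in doc]
        for doc in docs
    ]

docs = [
    [{"tok": "john", "tag": "p"}, {"tok": "doe", "tag": "p"}],
    [{"tok": "john", "tag": "b"}],
]
filtered = filter_by_min_df(docs, min_df=2)
```

Here only the pair ('tok', 'john') appears in two documents, so every other feature value is removed. With min_df=1 nothing is removed, which matches the default in the class signature.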

fit(html_token_lists, y=None)[source]
fit_transform(html_token_lists, y=None, **fit_params)[source]

Predefined Feature Functions

class webstruct.features.token_features.PrefixFeatures(lenghts=(2, 3, 4), featname='prefix', lower=True)[source]
class webstruct.features.token_features.SuffixFeatures(lenghts=(2, 3, 4), featname='suffix', lower=True)[source]
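The two signatures above carry no description, so the following is only a guess at the idea based on the parameter names: for each requested length, take that many leading or trailing characters of the (optionally lowercased) token. The function affix_features and the feature key names are assumptions for illustration and may not match what the classes actually emit.

```python
def affix_features(token, lengths=(2, 3, 4), lower=True):
    """Return prefix and suffix features of a token for each length."""
    text = token.lower() if lower else token
    feats = {}
    for n in lengths:
        feats["prefix%d" % n] = text[:n]   # first n characters
        feats["suffix%d" % n] = text[-n:]  # last n characters
    return feats

feats = affix_features("Berlin")
```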
class webstruct.features.block_features.InsideTag(tagname)[source]
class webstruct.features.global_features.DAWGGlobalFeature(filename, featname, format=None)[source]

Global feature that matches longest entities from a lexicon stored either in a dawg.CompletionDAWG (if format is None) or in a dawg.RecordDAWG (if format is not None).
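The phrase “matches longest entities” can be illustrated with a greedy longest-match sketch. A plain Python set stands in for the DAWG here, and longest_matches, its max_len limit and the space-joined lookup keys are assumptions made for this example rather than the library’s algorithm.

```python
def longest_matches(tokens, lexicon, max_len=5):
    """Return (start, end, matched_text) for the longest lexicon entries.

    At each position the longest candidate span is tried first; after a
    match, scanning resumes right after the matched span.
    """
    matches = []
    i = 0
    while i < len(tokens):
        found = None
        for j in range(min(len(tokens), i + max_len), i, -1):
            candidate = " ".join(tokens[i:j])
            if candidate in lexicon:
                found = (i, j, candidate)
                break
        if found:
            matches.append(found)
            i = found[1]
        else:
            i += 1
    return matches

lexicon = {"New York", "New York City", "York"}
tokens = ["in", "New", "York", "City", "today"]
matches = longest_matches(tokens, lexicon)
```

Although “New York” and “York” are both in the lexicon, only the longest entry starting at “New” is reported, as a half-open token range plus the matched text.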

class webstruct.features.global_features.LongestMatchGlobalFeature(lookup_data, featname)[source]
process_range(doc, start, end, matched_text)[source]
class webstruct.features.global_features.Pattern(*lookups, **kwargs)[source]

Global feature that combines local features.

Gazetteer Support

class webstruct.gazetteers.features.MarisaGeonamesGlobalFeature(filename, featname, format=None)[source]

Global feature that matches longest entities from a lexicon extracted from geonames.org and stored in a MARISA Trie.


webstruct.gazetteers.geonames.read_geonames(filename)

Parse geonames file to a pandas.DataFrame. File may be downloaded from http://download.geonames.org/export/dump/; it should be unzipped and in a “geonames table” format.

webstruct.gazetteers.geonames.read_geonames_zipped(zip_filename, geonames_filename=None)[source]

Parse zipped geonames file.

webstruct.gazetteers.geonames.to_dawg(df, columns=None, format=None)[source]

Encode pandas.DataFrame with GeoNames data (loaded using read_geonames() and maybe filtered in some way) to dawg.DAWG or dawg.RecordDAWG. dawg.DAWG is created if columns and format are both None.

webstruct.gazetteers.geonames.to_marisa(df, columns=['country_code', 'feature_class', 'feature_code', 'admin1_code', 'admin2_code'], format='2s 1s 5s 2s 3s')[source]

Encode pandas.DataFrame with GeoNames data (loaded using read_geonames() and maybe filtered in some way) to a marisa.RecordTrie.
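The default format '2s 1s 5s 2s 3s' reads like a format string for Python’s struct module, with one fixed-width byte field per column. Assuming that is the convention in use, the sketch below shows how one record would be packed and unpacked; the sample field values are made up for illustration.

```python
import struct

# One fixed-width byte field per column: country_code, feature_class,
# feature_code, admin1_code, admin2_code.
RECORD_FORMAT = "2s 1s 5s 2s 3s"

# Illustrative values only; shorter values are padded with null bytes.
record = struct.pack(RECORD_FORMAT, b"US", b"P", b"PPL", b"NY", b"061")
size = struct.calcsize(RECORD_FORMAT)

# Unpacking returns the padded fields, so strip the padding again.
fields = [f.rstrip(b"\x00") for f in struct.unpack(RECORD_FORMAT, record)]
```

Under this reading, values longer than the declared width would be truncated, so the widths in format need to cover the longest value in each column.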