Feature Extraction
HTML Tokenization
webstruct.html_tokenizer contains the HtmlTokenizer class,
which extracts text from a web page and tokenizes it while preserving
information about each token's position in the HTML tree
(token + its tree position = HtmlToken). HtmlTokenizer
can also extract annotations from the tree (if present) and separate
them from regular text/tokens.
class webstruct.html_tokenizer.HtmlToken
HTML token info.
Attributes:
- index is a token index (in the tokens list)
- tokens is a list of all tokens in the current HTML block
- elem is the current HTML block (as lxml’s Element); most likely you want parent instead of it
- is_tail is a flag indicating that the token belongs to the element tail
- position is the logical position (in letters or codepoints) of the token start in the parent text
- length is the logical length (in letters or codepoints) of the token in the parent text
Computed properties:
- token is the current token (as text);
- parent is the token’s parent HTML element (as lxml’s Element);
- root is the ElementTree this token belongs to.
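The difference between elem and parent is easiest to see for tail text. The following is a minimal sketch, not webstruct code: it uses the standard library's xml.etree.ElementTree instead of lxml, and SketchToken, _split and sketch_tokenize are hypothetical names introduced only for illustration. It assumes a simple \w+ tokenizer.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass
class SketchToken:
    """Hypothetical stand-in for HtmlToken; not the webstruct class."""
    token: str       # token text
    elem_tag: str    # tag of the element whose text or tail holds the token
    parent_tag: str  # tag of the element that actually contains the token
    is_tail: bool    # True if the token came from the element's tail
    position: int    # offset of the token start in the text or tail string
    length: int      # token length in code points


def _split(text, elem, owner, is_tail):
    # Turn one text/tail string into tokens, recording where each one starts.
    if not text:
        return []
    return [
        SketchToken(m.group(), elem.tag, owner.tag, is_tail, m.start(), len(m.group()))
        for m in re.finditer(r"\w+", text)
    ]


def sketch_tokenize(elem, parent=None):
    # Document order: own text, then children, then own tail.
    tokens = _split(elem.text, elem, elem, False)
    for child in elem:
        tokens += sketch_tokenize(child, elem)
    # Tail text follows elem's closing tag, so it is contained in elem's parent.
    if parent is not None:
        tokens += _split(elem.tail, elem, parent, True)
    return tokens


sketch_tokens = sketch_tokenize(ET.fromstring("<p>hello <b>Doe</b> said</p>"))
for tok in sketch_tokens:
    print(tok.token, tok.elem_tag, tok.parent_tag, tok.is_tail, tok.position)
```

In this sketch the token "said" is stored in the tail of the b element, so its elem tag is b while its parent tag is p.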
class webstruct.html_tokenizer.HtmlTokenizer(tagset=None, sequence_encoder=None, text_tokenize_func=None, kill_html_tags=None, replace_html_tags=None, ignore_html_tags=None)
Class for converting HTML trees (returned by one of the webstruct.loaders) into lists of HtmlToken instances and associated tags. It can also do the reverse conversion.
Use tokenize_single() to convert a single tree and tokenize() to convert multiple trees.
Use detokenize_single() to get an annotated tree out of a list of HtmlToken instances and a list of tags.
Parameters:
- tagset : set, optional
A set of entity types to keep. If not passed, all entity types are kept. Use this argument to discard some entity types from training data.
- sequence_encoder : object, optional
Sequence encoder object. If not passed, an IobEncoder instance is created.
- text_tokenize_func : callable, optional
Function used for tokenizing text inside HTML elements. By default, HtmlTokenizer uses webstruct.text_tokenizers.tokenize().
- kill_html_tags : set, optional
A set of HTML tags which should be removed. The contents of removed tags are kept. See webstruct.utils.kill_html_tags().
- replace_html_tags : dict, optional
A mapping {'old_tagname': 'new_tagname'} that defines how tags should be renamed. See webstruct.utils.replace_html_tags().
- ignore_html_tags : set, optional
A set of HTML tags which won’t produce HtmlToken instances but will be kept in the tree. Default is {'script', 'style'}.
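As a rough illustration of two of these parameters, the sketch below renames tags with a {'old_tagname': 'new_tagname'} mapping and drops entity types that are not in a tagset. It is a standalone approximation built on the standard library, not the webstruct implementation: rename_tags and keep_entity_types are hypothetical helpers, and applying the tagset to finished IOB tags is an assumption made to keep the example short.

```python
import xml.etree.ElementTree as ET


def rename_tags(root, mapping):
    """Rename elements in place according to {'old_tagname': 'new_tagname'}."""
    for elem in root.iter():
        if elem.tag in mapping:
            elem.tag = mapping[elem.tag]
    return root


def keep_entity_types(tags, tagset):
    """Replace IOB tags whose entity type is not in tagset with 'O'."""
    kept = []
    for tag in tags:
        # 'B-PER' has entity type 'PER'; 'O' carries no entity type.
        etype = tag.split("-", 1)[1] if "-" in tag else None
        kept.append(tag if etype in tagset else "O")
    return kept


renamed = rename_tags(ET.fromstring("<p>hello <b>Doe</b></p>"), {"b": "strong"})
renamed_html = ET.tostring(renamed, encoding="unicode")
print(renamed_html)

filtered = keep_entity_types(["O", "B-PER", "I-PER", "B-ORG"], {"PER"})
print(filtered)
```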
detokenize_single(html_tokens, tags)
Build an annotated lxml.etree.ElementTree from html_tokens (a list of HtmlToken instances) and tags (a list of their tags).
ATTENTION: html_tokens should be tokenized from a tree without tags.
Annotations are encoded as __START_TAG__ and __END_TAG__ text tokens (this is the format webstruct.loaders use).
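The marker encoding can be sketched on plain token strings. This is an illustration only: add_markers is a hypothetical helper, it works on strings rather than on HtmlToken instances or a tree, and it assumes that TAG in __START_TAG__ stands for the entity type (for example __START_PER__). Malformed tag sequences, such as an I- tag with no preceding B- tag, are not handled.

```python
def add_markers(tokens, tags):
    """Interleave __START_X__ / __END_X__ markers according to IOB tags."""
    out = []
    open_type = None  # entity type of the span that is currently open, if any
    for token, tag in zip(tokens, tags):
        # An 'O' tag or a new 'B-' tag closes whatever span is open.
        if (tag == "O" or tag.startswith("B-")) and open_type is not None:
            out.append("__END_%s__" % open_type)
            open_type = None
        if tag.startswith("B-"):
            open_type = tag[2:]
            out.append("__START_%s__" % open_type)
        out.append(token)
    # Close a span that runs to the end of the sequence.
    if open_type is not None:
        out.append("__END_%s__" % open_type)
    return out


marked = add_markers(
    ["hello", "John", "Doe", "Mary", "said"],
    ["O", "B-PER", "I-PER", "B-PER", "O"],
)
print(" ".join(marked))
```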
tokenize_single(tree)
Return two lists:
- a list of HtmlToken tokens;
- a list of associated tags.
For unannotated HTML all tags will be “O” - they may be ignored.
Example:
>>> from webstruct import GateLoader, HtmlTokenizer
>>> loader = GateLoader(known_entities={'PER'})
>>> html_tokenizer = HtmlTokenizer(replace_html_tags={'b': 'strong'})
>>> tree = loader.loadbytes(b"<p>hello, <PER>John <b>Doe</b></PER> <br> <PER>Mary</PER> said</p>")
>>> html_tokens, tags = html_tokenizer.tokenize_single(tree)
>>> html_tokens
[HtmlToken(token='hello', parent=<Element p at ...>, index=0, ...), HtmlToken...]
>>> tags
['O', 'B-PER', 'I-PER', 'B-PER', 'O']
>>> for tok, iob_tag in zip(html_tokens, tags):
...     print("%5s" % iob_tag, tok.token, tok.elem.tag, tok.parent.tag)
    O hello p p
B-PER John p p
I-PER Doe strong strong
B-PER Mary br p
    O said br p
For HTML without text it returns empty lists:
>>> html_tokenizer.tokenize_single(loader.loadbytes(b'<p></p>'))
([], [])
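How the “O”, “B-” and “I-” tags relate to annotated spans can be sketched without webstruct. The function below is a hypothetical illustration, not the library's IobEncoder: it assumes the input is a flat list of strings in which annotations appear as __START_X__ and __END_X__ markers, with X being the entity type.

```python
import re

# Assumed marker shape: __START_PER__ / __END_PER__.
MARKER = re.compile(r"__(START|END)_(\w+?)__")


def iob_encode(stream):
    """Split a marker-annotated token stream into (tokens, tags)."""
    tokens, tags = [], []
    current = None  # entity type of the open span, or None outside spans
    first = False   # True until the first token of the open span is emitted
    for item in stream:
        match = MARKER.fullmatch(item)
        if match:
            kind, etype = match.groups()
            current = etype if kind == "START" else None
            first = kind == "START"
            continue  # markers themselves produce no token
        if current is None:
            tags.append("O")
        else:
            tags.append(("B-" if first else "I-") + current)
            first = False
        tokens.append(item)
    return tokens, tags


stream = ["hello", "__START_PER__", "John", "Doe", "__END_PER__",
          "__START_PER__", "Mary", "__END_PER__", "said"]
sketch_tokens, sketch_tags = iob_encode(stream)
print(sketch_tokens)
print(sketch_tags)
```

An empty stream yields two empty lists, which mirrors the behaviour documented for HTML without text.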
Feature Extraction Utilities
webstruct.feature_extraction contains classes that help
with:
- converting HTML pages into lists of feature dicts and
- extracting annotations.
Usually, the approach is the following:
1. Convert a web page to a list of HtmlToken instances and a list of annotation tags (if present). HtmlTokenizer is used for that.
2. Run a number of “token feature functions” that return bits of information about each token: token text, token shape (uppercased/lowercased/…), whether the token is inside an <a> HTML element, etc. For each token, this information is combined into a single feature dictionary.
Use HtmlFeatureExtractor at this stage. There are a number of predefined token feature functions in webstruct.features.
3. Run a number of “global feature functions” that can modify token feature dicts in place (insert new features, change or remove existing ones) using “global” information: information about all other tokens in a document and their existing token-level feature dicts. Global feature functions are applied sequentially: each one sees the feature dicts as updated by the functions before it.
This is also done by HtmlFeatureExtractor. LongestMatchGlobalFeature can be used to create features that capture multi-token patterns. Some predefined global feature functions can be found in webstruct.gazetteers.
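The flow of token features followed by global features can be sketched in a few lines. This is a simplified stand-in that works on plain strings instead of HtmlToken instances; every function name below (token_text, token_shape, add_prev_shape, add_shape_bigram, extract) is hypothetical and none of them is part of webstruct.

```python
def token_text(token):
    # Token feature: the token text itself.
    return {"tok": token}


def token_shape(token):
    # Token feature: a coarse description of the token's casing.
    if token.isupper():
        shape = "upper"
    elif token.istitle():
        shape = "title"
    elif token.islower():
        shape = "lower"
    else:
        shape = "other"
    return {"shape": shape}


def add_prev_shape(doc):
    # Global feature: doc is a list of (token, feature_dict) pairs.
    prev = "BOS"
    for _, feats in doc:
        feats["prev_shape"] = prev
        prev = feats["shape"]


def add_shape_bigram(doc):
    # Global feature that relies on 'prev_shape' added by the previous one.
    for _, feats in doc:
        feats["shape_bigram"] = feats["prev_shape"] + "|" + feats["shape"]


def extract(tokens, token_features, global_features):
    feature_dicts = []
    for token in tokens:
        feats = {}
        for func in token_features:
            feats.update(func(token))  # merge dicts from all token features
        feature_dicts.append(feats)
    doc = list(zip(tokens, feature_dicts))
    for func in global_features:
        func(doc)  # modifies the feature dicts in place
    return feature_dicts


feature_dicts = extract(
    ["hello", "John", "Doe"],
    token_features=[token_text, token_shape],
    global_features=[add_prev_shape, add_shape_bigram],
)
for feats in feature_dicts:
    print(feats)
```

The order of the global functions matters here: add_shape_bigram would raise a KeyError if it ran before add_prev_shape.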
class webstruct.feature_extraction.HtmlFeatureExtractor(token_features, global_features=None, min_df=1)
This class extracts features from lists of HtmlToken instances (HtmlTokenizer can be used to create such lists).
The fit() / transform() / fit_transform() interface may look familiar if you have ever used scikit-learn: HtmlFeatureExtractor implements sklearn’s Transformer interface. But there is one twist: for sequence labelling tasks, whole sequences are usually considered observations. So in our case a single observation is a tokenized document (a list of tokens), not an individual token: the fit() / transform() / fit_transform() methods accept lists of documents (lists of lists of tokens) and return lists of documents’ feature dicts (lists of lists of feature dicts).
Parameters:
- token_features : list of callables
List of “token” feature functions. Each function accepts a single html_token parameter and returns a dictionary which maps feature names to feature values. Dicts from all token feature functions are merged by HtmlFeatureExtractor. Example token feature (it just returns the token text):
>>> def current_token(html_token):
...     return {'tok': html_token.token}
The webstruct.features module provides some predefined feature functions, e.g. parent_tag, which returns the token’s parent tag.
Example:
>>> from webstruct import GateLoader, HtmlTokenizer, HtmlFeatureExtractor
>>> from webstruct.features import parent_tag
>>> loader = GateLoader(known_entities={'PER'})
>>> html_tokenizer = HtmlTokenizer()
>>> feature_extractor = HtmlFeatureExtractor(token_features=[parent_tag])
>>> tree = loader.loadbytes(b"<p>hello, <PER>John <b>Doe</b></PER> <br> <PER>Mary</PER> said</p>")
>>> html_tokens, tags = html_tokenizer.tokenize_single(tree)
>>> feature_dicts = feature_extractor.transform_single(html_tokens)
>>> for token, tag, feat in zip(html_tokens, tags, feature_dicts):
...     print("%s %s %s" % (token.token, tag, feat))
hello O {'parent_tag': 'p'}
John B-PER {'parent_tag': 'p'}
Doe I-PER {'parent_tag': 'b'}
Mary B-PER {'parent_tag': 'p'}
said O {'parent_tag': 'p'}
- global_features : list of callables, optional
List of “global” feature functions. Each “global” feature function should accept a single argument: a list of (html_token, feature_dict) tuples. This list contains all tokens from the document and the features extracted by previous feature functions.
“Global” feature functions are applied after “token” feature functions, in the order they are passed.
They should change the feature dicts (feature_dict) in place.
- min_df : integer or Mapping, optional
Feature values that have a document frequency strictly lower than the given threshold are removed. If min_df is an integer, its value is used as the threshold.
TODO: if min_df is a dictionary, it should map feature names to thresholds.
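The effect of an integer min_df can be illustrated with a small standalone function. This is a sketch under assumptions, not the HtmlFeatureExtractor implementation: prune_rare is a hypothetical name, document frequency is taken to mean the number of documents in which a (feature name, value) pair occurs at least once, and feature values are assumed to be hashable.

```python
from collections import Counter


def prune_rare(docs, min_df):
    """Drop (name, value) pairs seen in fewer than min_df documents.

    docs is a list of documents; each document is a list of feature dicts.
    """
    df = Counter()
    for doc in docs:
        # Count each pair at most once per document.
        seen = {item for feats in doc for item in feats.items()}
        df.update(seen)
    return [
        [{k: v for k, v in feats.items() if df[(k, v)] >= min_df} for feats in doc]
        for doc in docs
    ]


docs = [
    [{"tok": "hello"}, {"tok": "john"}],
    [{"tok": "hello"}, {"tok": "hello"}, {"tok": "mary"}],
]
pruned = prune_rare(docs, min_df=2)
print(pruned)
```

Note the shape of the data: both the input and the output are lists of documents, and each document is a list of feature dicts, one per token.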
Predefined Feature Functions
class webstruct.features.token_features.PrefixFeatures(lenghts=(2, 3, 4), featname='prefix', lower=True)

class webstruct.features.token_features.SuffixFeatures(lenghts=(2, 3, 4), featname='suffix', lower=True)
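Only the signatures of these two classes are given here, so the sketch below is a guess at the kind of feature they compute, inferred from the parameter names. PrefixSketch and SuffixSketch are hypothetical classes that take a plain string rather than an HtmlToken; the feature key format (featname followed by the length) and the choice to skip lengths longer than the token are assumptions. The parameter spelling lenghts is copied from the signatures above.

```python
class PrefixSketch:
    """Hypothetical prefix feature: first n characters for each n in lenghts."""

    def __init__(self, lenghts=(2, 3, 4), featname="prefix", lower=True):
        self.lenghts = lenghts
        self.featname = featname
        self.lower = lower

    def __call__(self, text):
        text = text.lower() if self.lower else text
        # Assumption: lengths longer than the token produce no feature.
        return {
            "%s%d" % (self.featname, n): text[:n]
            for n in self.lenghts if len(text) >= n
        }


class SuffixSketch(PrefixSketch):
    """Hypothetical suffix feature: last n characters for each n in lenghts."""

    def __init__(self, lenghts=(2, 3, 4), featname="suffix", lower=True):
        super().__init__(lenghts, featname, lower)

    def __call__(self, text):
        text = text.lower() if self.lower else text
        return {
            "%s%d" % (self.featname, n): text[-n:]
            for n in self.lenghts if len(text) >= n
        }


prefix_feats = PrefixSketch()("John")
suffix_feats = SuffixSketch()("Doe")
print(prefix_feats)
print(suffix_feats)
```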
class webstruct.features.global_features.DAWGGlobalFeature(filename, featname, format=None)
Global feature that matches longest entities from a lexicon stored either in a dawg.CompletionDAWG (if format is None) or in a dawg.RecordDAWG (if format is not None).
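Longest-match lookup against a lexicon can be sketched with a plain Python set in place of a DAWG. This is an illustration of the general technique, not the DAWGGlobalFeature code: longest_matches and mark_matches are hypothetical helpers, matching is greedy from left to right, tokens are joined with single spaces and lowercased, and the matched tokens simply get featname set to True (the value the real class stores is not stated here).

```python
def longest_matches(tokens, lexicon, max_len=4):
    """Return (start, end) spans of the longest lexicon entries, left to right."""
    spans = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, then shrink.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            if " ".join(tokens[i:i + n]).lower() in lexicon:
                spans.append((i, i + n))
                i += n
                break
        else:
            i += 1  # nothing starts at this token
    return spans


def mark_matches(doc, lexicon, featname):
    """Global feature sketch: doc is a list of (token_text, feature_dict) pairs."""
    tokens = [token for token, _ in doc]
    for start, end in longest_matches(tokens, lexicon):
        for _, feats in doc[start:end]:
            feats[featname] = True


lexicon = {"new york", "new york city", "york"}
doc = [(t, {}) for t in ["flights", "to", "New", "York", "City", "today"]]
mark_matches(doc, lexicon, "city")
spans = longest_matches([t for t, _ in doc], lexicon)
print(spans)
print([feats for _, feats in doc])
```

Because the longest candidate is tried first, "New York City" wins over the shorter entries "new york" and "york".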
Gazetteer Support
class webstruct.gazetteers.features.MarisaGeonamesGlobalFeature(filename, featname, format=None)
Global feature that matches longest entities from a lexicon extracted from geonames.org and stored in a MARISA Trie.

webstruct.gazetteers.geonames.read_geonames(filename)
Parse a geonames file into a pandas.DataFrame. The file may be downloaded from http://download.geonames.org/export/dump/; it should be unzipped and in the “geonames table” format.

webstruct.gazetteers.geonames.read_geonames_zipped(zip_filename, geonames_filename=None)
Parse a zipped geonames file.

webstruct.gazetteers.geonames.to_dawg(df, columns=None, format=None)
Encode a pandas.DataFrame with GeoNames data (loaded using read_geonames() and maybe filtered in some way) to a dawg.DAWG or dawg.RecordDAWG. A dawg.DAWG is created if columns and format are both None.

webstruct.gazetteers.geonames.to_marisa(df, columns=['country_code', 'feature_class', 'feature_code', 'admin1_code', 'admin2_code'], format='2s 1s 5s 2s 3s')
Encode a pandas.DataFrame with GeoNames data (loaded using read_geonames() and maybe filtered in some way) to a marisa.RecordTrie.
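The default format string '2s 1s 5s 2s 3s' has the shape of a Python struct format with one fixed-width byte field per column. Assuming it is interpreted that way, the sketch below shows how one record would be packed and unpacked with the standard struct module. The row values are made up for illustration and are not taken from GeoNames data; how to_marisa() actually encodes rows is not shown here.

```python
import struct

FMT = "2s 1s 5s 2s 3s"
COLUMNS = ["country_code", "feature_class", "feature_code",
           "admin1_code", "admin2_code"]

# Illustrative row only; values are invented for the example.
row = {
    "country_code": "US",
    "feature_class": "P",
    "feature_code": "PPL",
    "admin1_code": "NY",
    "admin2_code": "061",
}

# Each field is a fixed-width byte string; shorter values are padded with NULs.
packed = struct.pack(FMT, *(row[c].encode("ascii") for c in COLUMNS))
print(struct.calcsize(FMT), packed)

# Unpacking returns bytes, so strip the padding and decode.
unpacked = struct.unpack(FMT, packed)
decoded = {c: v.rstrip(b"\0").decode("ascii") for c, v in zip(COLUMNS, unpacked)}
print(decoded)
```

With this layout every record takes 13 bytes, and values longer than their field width would be silently truncated by struct.pack.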