Feature Extraction¶
Feature Extraction Utilities¶
webstruct.feature_extraction contains classes that help with:

- converting HTML pages into lists of feature dicts, and
- extracting annotations.

Usually, the approach is the following:

1. Extract text from the webpage and tokenize it, preserving information about token position in the original HTML tree (token + its tree position = HtmlToken). Information about annotations (if present) is split from the rest of the data at this stage. HtmlTokenizer is used for extracting HTML tokens and annotation tags.
2. Run a number of "token feature functions" that return bits of information about each token: token text, token shape (uppercased/lowercased/...), whether the token is in an <a> HTML element, etc. For each token, this information is combined into a single feature dictionary. Use HtmlFeatureExtractor at this stage. There are a number of predefined token feature functions in webstruct.features.
3. Run a number of "global feature functions" that can modify token feature dicts in place (insert new features, change or remove them) using "global" information: information about all other tokens in a document and their existing token-level feature dicts. Global feature functions are applied sequentially: subsequent global feature functions get feature dicts updated by previous feature functions. This is also done by HtmlFeatureExtractor. LongestMatchGlobalFeature can be used to create features that capture multi-token patterns. Some predefined global feature functions can be found in webstruct.gazetteers.
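A minimal end-to-end sketch of these three steps, using only the classes and methods documented below; the example markup is made up for illustration:

>>> from webstruct import GateLoader, HtmlTokenizer, HtmlFeatureExtractor
>>> from webstruct.features import parent_tag
>>> loader = GateLoader(known_entities={'PER'})
>>> tree = loader.loadbytes(b"<p>hello, <PER>John Doe</PER> said</p>")
>>> html_tokens, tags = HtmlTokenizer().tokenize_single(tree)          # step 1: HTML tokens + tags
>>> feature_extractor = HtmlFeatureExtractor(token_features=[parent_tag])
>>> feature_dicts = feature_extractor.transform_single(html_tokens)    # steps 2-3: feature dicts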
- class webstruct.feature_extraction.HtmlToken[source]¶

  HTML token info.

  Attributes:

  - index is a token index (in the tokens list);
  - tokens is a list of all tokens in the current HTML block;
  - elem is the current HTML block (as lxml's Element) - most likely you want parent instead of it;
  - is_tail flag indicates that the token belongs to the element tail.

  Computed properties:

  - token is the current token (as text);
  - parent is the token's parent HTML element (as lxml's Element);
  - root is the ElementTree this token belongs to.
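  For illustration, a small custom token feature function (not part of webstruct; the function and feature names are made up) that reads the HtmlToken attributes listed above:

  >>> def token_with_context(html_token):
  ...     return {
  ...         'tok': html_token.token.lower(),        # token text
  ...         'parent_tag': html_token.parent.tag,    # parent HTML element
  ...         'in_tail': html_token.is_tail,          # True if token comes from an element tail
  ...     }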
- class webstruct.feature_extraction.HtmlTokenizer(tagset=None, sequence_encoder=None, text_tokenize_func=None, kill_html_tags=None, replace_html_tags=None, ignore_html_tags=None)[source]¶

  Class for converting HTML trees (returned by one of the webstruct.loaders) into lists of HtmlToken instances and associated tags. It can also do the reverse conversion.

  Use tokenize_single() to convert a single tree and tokenize() to convert multiple trees.

  Use detokenize_single() to get an annotated tree out of a list of HtmlToken instances and a list of tags.

  Parameters:

  tagset : set, optional
      A set of entity types to keep. If not passed, all entity types are kept. Use this argument to discard some entity types from training data.

  sequence_encoder : object, optional
      Sequence encoder object. If not passed, an IobEncoder instance is created.

  text_tokenize_func : callable, optional
      Function used for tokenizing text inside HTML elements. By default, HtmlTokenizer uses webstruct.text_tokenizers.tokenize().

  kill_html_tags : set, optional
      A set of HTML tags which should be removed. Contents inside removed tags are not removed. See webstruct.utils.kill_html_tags().

  replace_html_tags : dict, optional
      A mapping {'old_tagname': 'new_tagname'}. It defines how tags should be renamed. See webstruct.utils.replace_html_tags().

  ignore_html_tags : set, optional
      A set of HTML tags which won't produce HtmlToken instances but will be kept in the tree. Default is {'script', 'style'}.

  - detokenize_single(html_tokens, tags)[source]¶

    Build an annotated lxml.etree.ElementTree from html_tokens (a list of HtmlToken instances) and tags (a list of their tags).

    Annotations are encoded as __START_TAG__ and __END_TAG__ text tokens (this is the format webstruct.loaders use).
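    A minimal round-trip sketch, assuming GATE-style markup like the examples below: tokenize an annotated tree, then rebuild an annotated tree from the tokens and tags.

    >>> from lxml import etree
    >>> from webstruct import GateLoader, HtmlTokenizer
    >>> loader = GateLoader(known_entities={'PER'})
    >>> tokenizer = HtmlTokenizer()
    >>> tree = loader.loadbytes(b"<p>hello, <PER>John Doe</PER> said</p>")
    >>> html_tokens, tags = tokenizer.tokenize_single(tree)
    >>> annotated = tokenizer.detokenize_single(html_tokens, tags)
    >>> markup = etree.tostring(annotated)  # contains __START_PER__ / __END_PER__ text tokens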
  - tokenize_single(tree)[source]¶

    Return two lists:

    - a list of HtmlToken tokens;
    - a list of associated tags.

    For unannotated HTML all tags will be "O" - they may be ignored.

    Example:

    >>> from webstruct import GateLoader, HtmlTokenizer
    >>> loader = GateLoader(known_entities={'PER'})
    >>> html_tokenizer = HtmlTokenizer(replace_html_tags={'b': 'strong'})
    >>> tree = loader.loadbytes(b"<p>hello, <PER>John <b>Doe</b></PER> <br> <PER>Mary</PER> said</p>")
    >>> html_tokens, tags = html_tokenizer.tokenize_single(tree)
    >>> html_tokens
    [HtmlToken(token='hello', parent=<Element p at ...>, index=0), HtmlToken...]
    >>> tags
    ['O', 'B-PER', 'I-PER', 'B-PER', 'O']
    >>> for tok, iob_tag in zip(html_tokens, tags):
    ...     print("%5s" % iob_tag, tok.token, tok.elem.tag, tok.parent.tag)
        O hello p p
    B-PER John p p
    I-PER Doe strong strong
    B-PER Mary br p
        O said br p

    For HTML without text it returns empty lists:

    >>> html_tokenizer.tokenize_single(loader.loadbytes(b'<p></p>'))
    ([], [])
- class webstruct.feature_extraction.HtmlFeatureExtractor(token_features, global_features=None, min_df=1)[source]¶

  This class extracts features from lists of HtmlToken instances (HtmlTokenizer can be used to create such lists).

  The fit() / transform() / fit_transform() interface may look familiar to you if you have ever used scikit-learn: HtmlFeatureExtractor implements sklearn's Transformer interface. But there is one twist: for sequence labelling tasks, whole sequences are usually considered observations. So in our case a single observation is a tokenized document (a list of tokens), not an individual token: fit() / transform() / fit_transform() methods accept lists of documents (lists of lists of tokens) and return lists of documents' feature dicts (lists of lists of feature dicts).

  Parameters:

  token_features : list of callables
      List of "token" feature functions. Each function accepts a single html_token parameter and returns a dictionary which maps feature names to feature values. Dicts from all token feature functions are merged by HtmlFeatureExtractor. Example token feature (it just returns the token text):

      >>> def current_token(html_token):
      ...     return {'tok': html_token.token}

      The webstruct.features module provides some predefined feature functions, e.g. parent_tag which returns the token's parent tag.

      Example:

      >>> from webstruct import GateLoader, HtmlTokenizer, HtmlFeatureExtractor
      >>> from webstruct.features import parent_tag
      >>> loader = GateLoader(known_entities={'PER'})
      >>> html_tokenizer = HtmlTokenizer()
      >>> feature_extractor = HtmlFeatureExtractor(token_features=[parent_tag])
      >>> tree = loader.loadbytes(b"<p>hello, <PER>John <b>Doe</b></PER> <br> <PER>Mary</PER> said</p>")
      >>> html_tokens, tags = html_tokenizer.tokenize_single(tree)
      >>> feature_dicts = feature_extractor.transform_single(html_tokens)
      >>> for token, tag, feat in zip(html_tokens, tags, feature_dicts):
      ...     print("%s %s %s" % (token.token, tag, feat))
      hello O {'parent_tag': 'p'}
      John B-PER {'parent_tag': 'p'}
      Doe I-PER {'parent_tag': 'b'}
      Mary B-PER {'parent_tag': 'p'}
      said O {'parent_tag': 'p'}

  global_features : list of callables, optional
      List of "global" feature functions. Each "global" feature function should accept a single argument: a list of (html_token, feature_dict) tuples. This list contains all tokens from the document and the features extracted by previous feature functions.

      "Global" feature functions are applied after "token" feature functions, in the order they are passed. They should change the feature dicts (feature_dict) in place; see the sketch after this class description.

  min_df : integer or Mapping, optional
      Feature values that have a document frequency strictly lower than the given threshold are removed. If min_df is an integer, its value is used as the threshold.

      TODO: if min_df is a dictionary, it should map feature names to thresholds.
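For illustration, a hand-written "global" feature function (not part of webstruct; the function and feature names are made up) that follows the contract above: it receives the list of (html_token, feature_dict) tuples and adds each token's lowercased text to the next token's feature dict, in place.

>>> from webstruct import HtmlFeatureExtractor
>>> from webstruct.features import parent_tag
>>> def prev_token_feature(html_tokens_and_dicts):
...     # html_tokens_and_dicts is a list of (html_token, feature_dict) tuples
...     pairs = zip(html_tokens_and_dicts, html_tokens_and_dicts[1:])
...     for (prev_token, _), (_, cur_features) in pairs:
...         cur_features['prev_tok'] = prev_token.token.lower()
>>> feature_extractor = HtmlFeatureExtractor(
...     token_features=[parent_tag],
...     global_features=[prev_token_feature],
... )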
Predefined Feature Functions¶
-
class
webstruct.features.token_features.
PrefixFeatures
(lenghts=(2, 3, 4), featname='prefix', lower=True)[source]¶
-
class
webstruct.features.token_features.
SuffixFeatures
(lenghts=(2, 3, 4), featname='suffix', lower=True)[source]¶
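A usage sketch, assuming instances of these classes are callables that can be passed as token feature functions (created here with their default prefix/suffix lengths):

>>> from webstruct import HtmlFeatureExtractor
>>> from webstruct.features.token_features import PrefixFeatures, SuffixFeatures
>>> feature_extractor = HtmlFeatureExtractor(
...     token_features=[PrefixFeatures(), SuffixFeatures()],
... )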
- class webstruct.features.global_features.DAWGGlobalFeature(filename, featname, format=None)[source]¶

  Global feature that matches longest entities from a lexicon stored either in a dawg.CompletionDAWG (if format is None) or in a dawg.RecordDAWG (if format is not None).
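  A sketch of one possible way to build and use such a lexicon. The phrases, the file name 'cities.dawg' and the feature name 'CITY' are made up; the lexicon is built with the dawg package's CompletionDAWG, matching the format=None case above.

  >>> import dawg
  >>> from webstruct import HtmlFeatureExtractor
  >>> from webstruct.features import parent_tag
  >>> from webstruct.features.global_features import DAWGGlobalFeature
  >>> dawg.CompletionDAWG(['new york', 'san francisco', 'london']).save('cities.dawg')
  >>> feature_extractor = HtmlFeatureExtractor(
  ...     token_features=[parent_tag],
  ...     global_features=[DAWGGlobalFeature('cities.dawg', 'CITY')],
  ... )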
Gazetteer Support¶

- class webstruct.gazetteers.features.MarisaGeonamesGlobalFeature(filename, featname, format=None)[source]¶

  Global feature that matches longest entities from a lexicon extracted from geonames.org and stored in a MARISA Trie.

- webstruct.gazetteers.geonames.read_geonames(filename)[source]¶

  Parse a GeoNames file into a pandas.DataFrame. The file may be downloaded from http://download.geonames.org/export/dump/; it should be unzipped and in the "geonames table" format.

- webstruct.gazetteers.geonames.read_geonames_zipped(zip_filename, geonames_filename=None)[source]¶

  Parse a zipped GeoNames file.

- webstruct.gazetteers.geonames.to_dawg(df, columns=None, format=None)[source]¶

  Encode a pandas.DataFrame with GeoNames data (loaded using read_geonames() and maybe filtered in some way) to a dawg.DAWG or dawg.RecordDAWG. A dawg.DAWG is created if columns and format are both None.

- webstruct.gazetteers.geonames.to_marisa(df, columns=['country_code', 'feature_class', 'feature_code', 'admin1_code', 'admin2_code'], format='2s 1s 5s 2s 3s')[source]¶

  Encode a pandas.DataFrame with GeoNames data (loaded using read_geonames() and maybe filtered in some way) to a marisa.RecordTrie.
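A hedged end-to-end sketch of the gazetteer workflow: parse a GeoNames dump, encode it into a MARISA trie, save it, and plug it into feature extraction. The local file names ('US.zip', 'geonames.marisa'), the feature name 'GEONAMES' and the idea of passing to_marisa's default record format to MarisaGeonamesGlobalFeature are assumptions for illustration, not part of the documented API.

>>> from webstruct import HtmlFeatureExtractor
>>> from webstruct.features import parent_tag
>>> from webstruct.gazetteers.geonames import read_geonames_zipped, to_marisa
>>> from webstruct.gazetteers.features import MarisaGeonamesGlobalFeature
>>> df = read_geonames_zipped('US.zip')    # pandas.DataFrame with GeoNames rows (hypothetical local dump)
>>> trie = to_marisa(df)                   # RecordTrie using the default columns/format
>>> trie.save('geonames.marisa')
>>> feature_extractor = HtmlFeatureExtractor(
...     token_features=[parent_tag],
...     global_features=[
...         MarisaGeonamesGlobalFeature('geonames.marisa', 'GEONAMES',
...                                     format='2s 1s 5s 2s 3s'),
...     ],
... )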