Wapiti Helpers

The webstruct.wapiti module provides utilities for easier creation of Wapiti models, templates and data files.

class webstruct.wapiti.WapitiCRF(model_filename, train_args=(), feature_template='# Label unigrams and bigrams:\n*\n', unigrams_scope='u', tempdir=None, unlink_temp=True, verbose=True, feature_encoder=None, dev_size=0)[source]

Class for training and applying Wapiti CRF models.

For training it relies on calling the original Wapiti binary (via subprocess), so the "wapiti" binary must be available if you need the fit() method.

The trained model is saved in an external file; its filename is the first parameter of the constructor. This file is created and overwritten by WapitiCRF.fit(); it must exist for WapitiCRF.transform() to work.

For prediction, WapitiCRF relies on the python-wapiti library.

WAPITI_CMD = 'wapiti'

Command used to start wapiti

fit(X, y, X_dev=None, y_dev=None, out_dev=None)[source]

Train a model.

Parameters:

X : list of lists of dicts

Feature dicts for several documents.

y : a list of lists of strings

Labels for several documents.

X_dev : (optional) list of lists of feature dicts

Data used for testing and as a stopping criterion.

y_dev : (optional) list of lists of labels

Labels corresponding to X_dev.

out_dev : (optional) string

Path to a file where tagged development data will be written.
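The parameter shapes above can be sketched in plain Python. The feature names and labels below are made up for illustration; they are not part of webstruct:

```python
# Hypothetical training data for WapitiCRF.fit(): X is a list of documents,
# each a sequence of per-token feature dicts; y holds the matching label
# sequences, one label string per token.
X = [
    [{'token': 'John', 'shape': 'Xx+'},
     {'token': 'Smith', 'shape': 'Xx+'},
     {'token': 'called', 'shape': 'x+'}],
]
y = [
    ['B-PER', 'I-PER', 'O'],
]

# Each document must have exactly one label per token.
assert len(X) == len(y)
assert all(len(xs) == len(ys) for xs, ys in zip(X, y))
```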

run_wapiti(args)[source]
score(X, y)[source]

Macro-averaged F1 score of BIO-encoded sequences: the labels predicted for X are compared against the true labels y.

A named entity in a predicted sequence is considered correct only if it is an exact match of the corresponding entity in the true sequence.

It requires https://github.com/larsmans/seqlearn to work.
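The exact-match rule can be illustrated with a small standalone sketch (this is not webstruct's implementation, which delegates scoring to seqlearn): an entity is a maximal B-/I- span of one type, and only predictions matching both span and type count as correct.

```python
def bio_entities(tags):
    """Extract (type, start, end) entity spans from a BIO tag sequence."""
    entities = []
    start = etype = None
    for i, tag in enumerate(tags + ['O']):  # 'O' sentinel flushes the last span
        inside = tag.startswith('I-') and tag[2:] == etype
        if not inside:
            if start is not None:
                entities.append((etype, start, i))
            if tag.startswith('B-'):
                start, etype = i, tag[2:]
            else:
                start = etype = None
    return entities

# Only exact matches of both span and type are counted as correct:
y_true = ['B-PER', 'I-PER', 'O', 'B-ORG']
y_pred = ['B-PER', 'I-PER', 'O', 'B-PER']
correct = set(bio_entities(y_true)) & set(bio_entities(y_pred))
```

Here `correct` contains only the PER entity; the last token is a span match but a type mismatch, so it counts as both a false positive and a false negative.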

transform(X)[source]

Make a prediction.

Parameters:

X : list of lists

feature dicts

Returns:

y : list of lists

predicted labels

class webstruct.wapiti.WapitiFeatureEncoder(move_to_front=('token', ))[source]

Utility class for preparing Wapiti templates and converting sequences of dicts with features to the format Wapiti understands.

fit(X, y=None)[source]

X should be a list of lists of dicts with features. It can be obtained, for example, using HtmlFeatureExtractor.

partial_fit(X, y=None)[source]
prepare_template(template)[source]

Prepare Wapiti template by replacing feature names with feature column indices inside %x[row,col] macros. Indices are compatible with WapitiFeatureEncoder.transform() output.

>>> we = WapitiFeatureEncoder(['token', 'tag'])
>>> seq_features = [{'token': 'the', 'tag': 'DT'}, {'token': 'dog', 'tag': 'NN'}]
>>> we.fit([seq_features])
WapitiFeatureEncoder(move_to_front=('token', 'tag'))
>>> we.prepare_template('*:Pos-1 L=%x[-1, tag]\n*:Suf-2 X=%m[ 0,token,".?.?$"]')
'*:Pos-1 L=%x[-1,1]\n*:Suf-2 X=%m[0,0,".?.?$"]'

See the Wapiti manual for more information about the template format.

reset()[source]
transform(X)[source]
transform_single(feature_dicts)[source]

Transform a sequence of dicts feature_dicts to a list of Wapiti data file lines.
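For intuition, Wapiti's data format is one token per line with feature values in whitespace-separated columns; a blank line ends a sequence. A rough standalone sketch of that formatting follows (not the actual WapitiFeatureEncoder code, which also handles the column order learned during fit()):

```python
def to_wapiti_lines(feature_dicts, columns):
    """Format a token sequence as Wapiti data file lines: one token per
    line, one whitespace-separated column per feature, in a fixed order."""
    return [' '.join(str(d.get(col, '')) for col in columns)
            for d in feature_dicts]

seq = [{'token': 'the', 'tag': 'DT'}, {'token': 'dog', 'tag': 'NN'}]
lines = to_wapiti_lines(seq, columns=['token', 'tag'])
```

With this input, `lines` is `['the DT', 'dog NN']`; the `%x[row,col]` macros in a template then address these columns by index.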

unigram_features_template(scope='*')[source]

Return a Wapiti template with unigram features for each of the known features.

>>> we = WapitiFeatureEncoder(['token', 'tag'])
>>> seq_features = [{'token': 'the', 'tag': 'DT'}, {'token': 'dog', 'tag': 'NN'}]
>>> we.fit([seq_features])
WapitiFeatureEncoder(move_to_front=('token', 'tag'))
>>> print(we.unigram_features_template())

# Unigrams for all custom features
*feat:token=%x[0,0]
*feat:tag=%x[0,1]

>>> print(we.unigram_features_template('u'))

# Unigrams for all custom features
ufeat:token=%x[0,0]
ufeat:tag=%x[0,1]

webstruct.wapiti.create_wapiti_pipeline(model_filename, token_features=None, global_features=None, train_args=None, feature_template=None, min_df=1, **wapiti_kwargs)[source]

Create a scikit-learn Pipeline for HTML tagging using Wapiti. This pipeline expects data produced by HtmlTokenizer as input and produces sequences of IOB2 tags as output.

Example of training with all default parameters:

>>> import webstruct
>>> trees = webstruct.load_trees([
...    ("train/*.html", webstruct.WebAnnotatorLoader())
... ])  
>>> X, y = webstruct.HtmlTokenizer().tokenize(trees)  
>>> model = webstruct.create_wapiti_pipeline('model.wapiti')  
>>> model.fit(X, y)  

webstruct.wapiti.prepare_wapiti_template(template, vocabulary)[source]

Prepare Wapiti template by replacing feature names with feature column indices inside %x[row,col] macros:

>>> vocab = {'token': 0, 'tag': 1}
>>> prepare_wapiti_template('*:Pos-1 L=%x[-1, tag]\n*:Suf-2 X=%m[ 0,token,".?.?$"]', vocab)
'*:Pos-1 L=%x[-1,1]\n*:Suf-2 X=%m[0,0,".?.?$"]'

It understands which lines are comments:

>>> prepare_wapiti_template('*:Pos-1 L=%x[-1, tag]\n# *:Suf-2 X=%m[ 0,token,".?.?$"]', vocab)
'*:Pos-1 L=%x[-1,1]\n# *:Suf-2 X=%m[ 0,token,".?.?$"]'

See the Wapiti manual for more information about the template format.
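The Wapiti template format in brief: each non-comment line defines a feature pattern; a leading u, b or * marks it as a unigram, bigram or both, and %x[row,col] reads the feature in column col at a relative row offset. A small illustrative template (pattern names and columns are hypothetical):

```
# current token (unigram feature)
u1:tok=%x[0,0]
# previous token's tag (unigram feature)
u2:ptag=%x[-1,1]
# current token, as both a unigram and a bigram feature
*3:tok=%x[0,0]
```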