Wapiti Helpers

The webstruct.wapiti module provides utilities that simplify the creation of Wapiti models, templates and data files.

class webstruct.wapiti.WapitiCRF(model_filename=None, train_args=None, feature_template='# Label unigrams and bigrams:\n*\n', unigrams_scope='u', tempdir=None, unlink_temp=True, verbose=True, feature_encoder=None, dev_size=0, top_n=1)

Bases: webstruct.base.BaseSequenceClassifier

Class for training and applying Wapiti CRF models.

Training relies on calling the original Wapiti binary (via subprocess), so the "wapiti" binary must be available if you need the "fit" method.

The trained model is saved in an external file; its filename is the first parameter of the constructor. This file is created and overwritten by WapitiCRF.fit(); it must exist for WapitiCRF.transform() to work.

For prediction, WapitiCRF relies on the python-wapiti library.

WAPITI_CMD = 'wapiti'

Command used to start Wapiti.

fit(X, y, X_dev=None, y_dev=None, out_dev=None)

Train a model.

X : list of lists of dicts

Feature dicts for several documents.

y : a list of lists of strings

Labels for several documents.

X_dev : (optional) list of lists of feature dicts

Data used for testing and as a stopping criterion.

y_dev : (optional) list of lists of labels

Labels corresponding to X_dev.

out_dev : (optional) string

Path to a file where tagged development data will be written.
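
The shapes that fit() expects can be illustrated with toy data. The snippet below is only a sketch: the feature names, the labels and the check_aligned helper are hypothetical and are not part of webstruct.

```python
# Hypothetical toy data: two documents, one feature dict per token.
X = [
    [{"token": "John", "tag": "NNP"}, {"token": "Smith", "tag": "NNP"}],
    [{"token": "the", "tag": "DT"}, {"token": "dog", "tag": "NN"}],
]

# One label per token, one list of labels per document.
y = [
    ["B-PER", "I-PER"],
    ["O", "O"],
]


def check_aligned(X, y):
    """Return True if every feature dict in X has exactly one label in y."""
    return len(X) == len(y) and all(len(xs) == len(ys) for xs, ys in zip(X, y))


print(check_aligned(X, y))
```

X_dev and y_dev, when given, would follow the same list-of-lists layout.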


Make a prediction.

X : list of lists of dicts

Feature dicts for several documents.

y : list of lists of strings

Predicted labels for each document (the return value).


Run the wapiti binary in a subprocess.
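
As a rough illustration of how a training command could be assembled and run in a subprocess, consider the sketch below. It is not webstruct's implementation: build_train_command and run_command are hypothetical helpers, the file names are placeholders, and the command layout assumes Wapiti's "wapiti train [options] data model" form with a --pattern option.

```python
import shlex
import subprocess

WAPITI_CMD = "wapiti"  # same default as the WAPITI_CMD class attribute


def build_train_command(train_args, template_path, data_path, model_path):
    """Build an argument list for ``wapiti train`` (hypothetical helper)."""
    return (
        [WAPITI_CMD, "train"]
        + shlex.split(train_args)       # user options, e.g. "--algo l-bfgs"
        + ["--pattern", template_path]  # pattern (template) file
        + [data_path, model_path]       # input data file, output model file
    )


def run_command(cmd):
    """Run the command; raises CalledProcessError on a non-zero exit code."""
    subprocess.check_call(cmd)


cmd = build_train_command(
    "--algo l-bfgs --maxiter 50", "template.txt", "train.txt", "model.wapiti"
)
print(cmd)
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.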

score(X, y)

Macro-averaged F1 score of lists of BIO-encoded sequences y_true and y_pred.

A named entity in a sequence from y_pred is considered correct only if it exactly matches the corresponding entity in y_true.
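
The exact-match rule can be sketched in plain Python. This is an illustration only, not the code behind score(): it assumes that "macro-averaged" means averaging per-sequence F1 values, it returns 0.0 for a sequence with no true positives, and the helper names are hypothetical.

```python
def bio_entities(tags):
    """Return the set of (type, start, end) entities in a BIO-encoded sequence."""
    entities, start, kind = set(), None, None
    for i, tag in enumerate(list(tags) + ["O"]):  # sentinel closes the last entity
        continues = tag.startswith("I-") and kind == tag[2:]
        if start is not None and not continues:
            entities.add((kind, start, i))
            start, kind = None, None
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            start, kind = i, tag[2:]
    return entities


def sequence_f1(true_tags, pred_tags):
    """F1 of one sequence; an entity counts only if type and boundaries match."""
    true_e, pred_e = bio_entities(true_tags), bio_entities(pred_tags)
    tp = len(true_e & pred_e)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred_e), tp / len(true_e)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Average the per-sequence F1 values."""
    scores = [sequence_f1(t, p) for t, p in zip(y_true, y_pred)]
    return sum(scores) / len(scores)


y_true = [["B-PER", "I-PER", "O"], ["B-ORG", "O"]]
y_pred = [["B-PER", "O", "O"], ["B-ORG", "O"]]
print(macro_f1(y_true, y_pred))  # 0.5: the truncated PER entity does not count
```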

class webstruct.wapiti.WapitiFeatureEncoder(move_to_front=('token', ))

Bases: BaseEstimator, TransformerMixin

Utility class for preparing Wapiti templates and converting sequences of dicts with features to the format Wapiti understands.

fit(X, y=None)

X should be a list of lists of dicts with features. It can be obtained, for example, using HtmlFeatureExtractor.


Prepare Wapiti template by replacing feature names with feature column indices inside %x[row,col] macros. Indices are compatible with WapitiFeatureEncoder.transform() output.

>>> we = WapitiFeatureEncoder(['token', 'tag'])
>>> seq_features = [{'token': 'the', 'tag': 'DT'}, {'token': 'dog', 'tag': 'NN'}]
>>> we.fit([seq_features])
WapitiFeatureEncoder(move_to_front=('token', 'tag'))
>>> we.prepare_template('*:Pos-1 L=%x[-1, tag]\n*:Suf-2 X=%m[ 0,token,".?.?$"]')
'*:Pos-1 L=%x[-1,1]\n*:Suf-2 X=%m[0,0,".?.?$"]'

For more information about the template format, see the Wapiti documentation.


Transform a sequence of dicts feature_dicts to a list of Wapiti data file lines.
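
A minimal sketch of such a conversion follows. It assumes that each feature occupies one whitespace-separated column, that the move_to_front features come first and the remaining ones are sorted by name, and that a placeholder stands in for missing values. The helper names and the placeholder are hypothetical; the real encoder may differ in all of these details.

```python
def column_order(sequences, move_to_front=("token",)):
    """Return feature names in column order: move_to_front first, the rest sorted."""
    names = {name for seq in sequences for feats in seq for name in feats}
    front = [name for name in move_to_front if name in names]
    return front + sorted(names - set(front))


def to_wapiti_lines(feature_dicts, columns, missing="_"):
    """Render one line per token, one column per feature."""
    return [
        " ".join(str(feats.get(name, missing)) for name in columns)
        for feats in feature_dicts
    ]


seq_features = [{"token": "the", "tag": "DT"}, {"token": "dog", "tag": "NN"}]
columns = column_order([seq_features])
print(columns)                                 # ['token', 'tag']
print(to_wapiti_lines(seq_features, columns))  # ['the DT', 'dog NN']
```

The position of a feature name in columns is the column index that a %x[row,col] macro would refer to.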


Return Wapiti template with unigram features for each of known features.

>>> we = WapitiFeatureEncoder(['token', 'tag'])
>>> seq_features = [{'token': 'the', 'tag': 'DT'}, {'token': 'dog', 'tag': 'NN'}]
>>> we.fit([seq_features])
WapitiFeatureEncoder(move_to_front=('token', 'tag'))
>>> print(we.unigram_features_template())

# Unigrams for all custom features

>>> print(we.unigram_features_template('u'))

# Unigrams for all custom features

webstruct.wapiti.create_wapiti_pipeline(model_filename=None, token_features=None, global_features=None, min_df=1, **crf_kwargs)

Create a scikit-learn Pipeline for HTML tagging using Wapiti. This pipeline expects data produced by HtmlTokenizer as input and produces sequences of IOB2 tags as output.


import webstruct
from webstruct.features import EXAMPLE_TOKEN_FEATURES

# load train data
html_tokenizer = webstruct.HtmlTokenizer()
train_trees = webstruct.load_trees(...)  # arguments elided
X_train, y_train = html_tokenizer.tokenize(train_trees)

# train
model = webstruct.create_wapiti_pipeline(
    model_filename='model.wapiti',
    token_features=EXAMPLE_TOKEN_FEATURES,
    train_args='--algo l-bfgs --maxiter 50 --nthread 8 --jobsize 1 --stopwin 10',
)
model.fit(X_train, y_train)

# load test data
test_trees = webstruct.load_trees(...)  # arguments elided
X_test, y_test = html_tokenizer.tokenize(test_trees)

# do a prediction
y_pred = model.predict(X_test)

webstruct.wapiti.merge_top_n(chains)

Take the first (most probable) chain as the base for the resulting chain and merge the other N-1 chains into it one by one. Entities in a merged chain that overlap with any entity already in the resulting chain are ignored.

Non-overlapping entities:

>>> chains = [ ['B-PER', 'O'     ],
...            ['O'    , 'B-FUNC'] ]

>>> merge_top_n(chains)
['B-PER', 'B-FUNC']

Partially overlapping entities:

>>> chains = [ ['B-PER', 'I-PER', 'O'    ],
...            ['O'    , 'B-PER', 'I-PER'] ]

>>> merge_top_n(chains)
['B-PER', 'I-PER', 'O']

Fully overlapping entities:

>>> chains = [ ['B-PER', 'I-PER'],
...            ['B-ORG', 'I-ORG'] ]

>>> merge_top_n(chains)
['B-PER', 'I-PER']
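
The merging rule can be sketched as follows. This is an illustrative reimplementation under stated assumptions, not the library's code: entity_spans and merge_chains are hypothetical names, all chains are assumed to have the same length, and an entity is copied only when every position it covers is still 'O' in the result.

```python
def entity_spans(chain):
    """Yield (start, end) index pairs for each entity in an IOB2-encoded chain."""
    start = None
    for i, tag in enumerate(list(chain) + ["O"]):  # sentinel closes the last entity
        if start is not None and not tag.startswith("I-"):
            yield start, i
            start = None
        if tag.startswith("B-"):
            start = i


def merge_chains(chains):
    """Merge chains into the first one, skipping entities that overlap existing ones."""
    merged = list(chains[0])
    for chain in chains[1:]:
        for start, end in entity_spans(chain):
            # Copy the entity only if its whole span is still unlabelled.
            if all(tag == "O" for tag in merged[start:end]):
                merged[start:end] = chain[start:end]
    return merged


print(merge_chains([["B-PER", "O"], ["O", "B-FUNC"]]))
print(merge_chains([["B-PER", "I-PER", "O"], ["O", "B-PER", "I-PER"]]))
```

Because the first chain is the most probable one, its entities always win over conflicting entities from later chains.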

webstruct.wapiti.prepare_wapiti_template(template, vocabulary)

Prepare Wapiti template by replacing feature names with feature column indices inside %x[row,col] macros:

>>> vocab = {'token': 0, 'tag': 1}
>>> prepare_wapiti_template('*:Pos-1 L=%x[-1, tag]\n*:Suf-2 X=%m[ 0,token,".?.?$"]', vocab)
'*:Pos-1 L=%x[-1,1]\n*:Suf-2 X=%m[0,0,".?.?$"]'

It understands which lines are comments:

>>> prepare_wapiti_template('*:Pos-1 L=%x[-1, tag]\n# *:Suf-2 X=%m[ 0,token,".?.?$"]', vocab)
'*:Pos-1 L=%x[-1,1]\n# *:Suf-2 X=%m[ 0,token,".?.?$"]'
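
One way to get this behaviour is a regular-expression substitution applied line by line. The sketch below is an assumption about how such a replacement could work, not the function's actual source: replace_feature_names and MACRO_RE are hypothetical names, and only the %x, %t and %m macros (in either case) are handled.

```python
import re

# Matches the start of a macro, e.g. "%x[-1, tag" or "%m[ 0,token".
MACRO_RE = re.compile(r"(%[xtmXTM])\[\s*([-+]?\d+)\s*,\s*([^\],\s]+)\s*")


def replace_feature_names(template, vocabulary):
    """Replace feature names with column indices, leaving comment lines untouched."""

    def repl(match):
        macro, offset, column = match.groups()
        if not column.isdigit():  # keep columns that are already numeric
            column = vocabulary[column]
        return "%s[%s,%s" % (macro, offset, column)

    return "\n".join(
        line if line.lstrip().startswith("#") else MACRO_RE.sub(repl, line)
        for line in template.split("\n")
    )


vocab = {"token": 0, "tag": 1}
print(replace_feature_names('*:Pos-1 L=%x[-1, tag]\n# U:%x[0, token]', vocab))
```

A feature name that is missing from the vocabulary raises KeyError in this sketch, which surfaces typos in the template early.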

For more information about the template format, see the Wapiti documentation.