Wapiti Helpers

The webstruct.wapiti module provides utilities for easier creation of Wapiti models, templates and data files.
class webstruct.wapiti.WapitiCRF(model_filename=None, train_args=None, feature_template='# Label unigrams and bigrams:\n*\n', unigrams_scope='u', tempdir=None, unlink_temp=True, verbose=True, feature_encoder=None, dev_size=0, top_n=1)

    Bases: webstruct.base.BaseSequenceClassifier

    Class for training and applying Wapiti CRF models.

    For training it relies on calling the original Wapiti binary (via subprocess), so the "wapiti" binary must be available if you need the fit() method.

    The trained model is saved to an external file; its filename is the first constructor parameter. This file is created and overwritten by WapitiCRF.fit(); it must exist for WapitiCRF.transform() to work.

    For prediction, WapitiCRF relies on the python-wapiti library.
    WAPITI_CMD = 'wapiti'

        Command used to start wapiti.
    fit(X, y, X_dev=None, y_dev=None, out_dev=None)

        Train a model.

        Parameters:
            X : list of lists of dicts
                Feature dicts for several documents.
            y : list of lists of strings
                Labels for several documents.
            X_dev : list of lists of feature dicts, optional
                Data used for testing and as a stopping criterion.
            y_dev : list of lists of labels, optional
                Labels corresponding to X_dev.
            out_dev : string, optional
                Path to a file where tagged development data will be written.
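Since fit() shells out to the Wapiti binary, training reduces to assembling a command line from the data file, the model filename and train_args. A minimal sketch of how such a command could be built, under the assumption that the binary is on PATH; the helper name wapiti_train_cmd is hypothetical, not part of webstruct:

```python
import shlex

def wapiti_train_cmd(train_path, model_path,
                     train_args='--algo l-bfgs --maxiter 50'):
    # Wapiti's CLI syntax is: wapiti train [options] [input data] [model file].
    # Hypothetical helper; webstruct assembles its command internally.
    return ['wapiti', 'train'] + shlex.split(train_args) + [train_path, model_path]

# The resulting list can then be passed to subprocess.check_call:
cmd = wapiti_train_cmd('train.data', 'model.wapiti')
```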
    predict(X)

        Make a prediction.

        Parameters:
            X : list of lists of feature dicts

        Returns:
            y : list of lists of predicted labels
    score(X, y)

        Macro-averaged F1 score of lists of BIO-encoded sequences y_true and y_pred.

        A named entity in a sequence from y_pred is considered correct only if it is an exact match of the corresponding entity in y_true.
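The exact-match, entity-level scoring described above can be sketched as follows. The helpers bio_entities and macro_f1 are illustrative, not webstruct's actual implementation:

```python
from collections import defaultdict

def bio_entities(seq):
    """Extract (label, start, end) spans from a BIO-encoded sequence."""
    entities, label, start = [], None, None
    for i, tag in enumerate(list(seq) + ['O']):  # sentinel flushes the last span
        if tag == 'O' or tag.startswith('B-') or tag[2:] != label:
            if label is not None:
                entities.append((label, start, i))
                label = None
            if tag.startswith('B-'):
                label, start = tag[2:], i
    return entities

def macro_f1(y_true, y_pred):
    """Macro-averaged F1; a predicted entity counts only on an exact
    span-and-label match with the corresponding true entity."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t_seq, p_seq in zip(y_true, y_pred):
        true_spans = set(bio_entities(t_seq))
        pred_spans = set(bio_entities(p_seq))
        for label, _, _ in pred_spans & true_spans:
            tp[label] += 1
        for label, _, _ in pred_spans - true_spans:
            fp[label] += 1
        for label, _, _ in true_spans - pred_spans:
            fn[label] += 1
    f1s = []
    for label in set(tp) | set(fp) | set(fn):
        p = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        r = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1s.append(2 * p * r / (p + r) if p + r else 0.0)
    return sum(f1s) / len(f1s) if f1s else 0.0
```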
class webstruct.wapiti.WapitiFeatureEncoder(move_to_front=('token',))

    Bases: BaseEstimator, TransformerMixin

    Utility class for preparing Wapiti templates and converting sequences of dicts with features to the format Wapiti understands.
    fit(X, y=None)

        X should be a list of lists of dicts with features. It can be obtained, for example, using HtmlFeatureExtractor.
    prepare_template(template)

        Prepare a Wapiti template by replacing feature names with feature column indices inside %x[row,col] macros. Indices are compatible with WapitiFeatureEncoder.transform() output.

        >>> we = WapitiFeatureEncoder(['token', 'tag'])
        >>> seq_features = [{'token': 'the', 'tag': 'DT'}, {'token': 'dog', 'tag': 'NN'}]
        >>> we.fit([seq_features])
        WapitiFeatureEncoder(move_to_front=('token', 'tag'))
        >>> we.prepare_template('*:Pos-1 L=%x[-1, tag]\n*:Suf-2 X=%m[ 0,token,".?.?$"]')
        '*:Pos-1 L=%x[-1,1]\n*:Suf-2 X=%m[0,0,".?.?$"]'
        See the Wapiti documentation for more info about the template format.
    transform_single(feature_dicts)

        Transform a sequence of dicts feature_dicts to a list of Wapiti data file lines.
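The data-file format Wapiti expects is simple: one token per line, with feature values as space-separated columns in a fixed order. A minimal sketch of such a conversion, assuming a fixed feature_names column order; the helper to_wapiti_lines is illustrative, not webstruct's implementation:

```python
def to_wapiti_lines(feature_dicts, feature_names=('token', 'tag')):
    # One line per token; one space-separated column per feature,
    # in the same fixed order on every line.
    return [' '.join(str(d.get(name, '')) for name in feature_names)
            for d in feature_dicts]

lines = to_wapiti_lines([{'token': 'the', 'tag': 'DT'},
                         {'token': 'dog', 'tag': 'NN'}])
```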
    unigram_features_template(scope='*')

        Return a Wapiti template with unigram features for each of the known features.

        >>> we = WapitiFeatureEncoder(['token', 'tag'])
        >>> seq_features = [{'token': 'the', 'tag': 'DT'}, {'token': 'dog', 'tag': 'NN'}]
        >>> we.fit([seq_features])
        WapitiFeatureEncoder(move_to_front=('token', 'tag'))
        >>> print(we.unigram_features_template())
        # Unigrams for all custom features
        *feat:token=%x[0,0]
        *feat:tag=%x[0,1]
        >>> print(we.unigram_features_template('u'))
        # Unigrams for all custom features
        ufeat:token=%x[0,0]
        ufeat:tag=%x[0,1]
webstruct.wapiti.create_wapiti_pipeline(model_filename=None, token_features=None, global_features=None, min_df=1, **crf_kwargs)

    Create a scikit-learn Pipeline for HTML tagging using Wapiti. This pipeline expects data produced by HtmlTokenizer as input and produces sequences of IOB2 tags as output.

    Example:

        import webstruct
        from webstruct.features import EXAMPLE_TOKEN_FEATURES

        # load train data
        html_tokenizer = webstruct.HtmlTokenizer()
        train_trees = webstruct.load_trees(
            "train/*.html",
            webstruct.WebAnnotatorLoader()
        )
        X_train, y_train = html_tokenizer.tokenize(train_trees)

        # train
        model = webstruct.create_wapiti_pipeline(
            model_filename='model.wapiti',
            token_features=EXAMPLE_TOKEN_FEATURES,
            train_args='--algo l-bfgs --maxiter 50 --nthread 8 --jobsize 1 --stopwin 10',
        )
        model.fit(X_train, y_train)

        # load test data
        test_trees = webstruct.load_trees(
            "test/*.html",
            webstruct.WebAnnotatorLoader()
        )
        X_test, y_test = html_tokenizer.tokenize(test_trees)

        # do a prediction
        y_pred = model.predict(X_test)
webstruct.wapiti.merge_top_n(chains)

    Take the first (most probable) chain as the base for the resulting chain and merge the other N-1 chains into it one by one. Entities in a merged chain that overlap with entities already in the resulting chain are ignored.

    No overlap:

    >>> chains = [['B-PER', 'O'],
    ...           ['O',     'B-FUNC']]
    >>> merge_top_n(chains)
    ['B-PER', 'B-FUNC']

    Partial overlap:

    >>> chains = [['B-PER', 'I-PER', 'O'],
    ...           ['O',     'B-PER', 'I-PER']]
    >>> merge_top_n(chains)
    ['B-PER', 'I-PER', 'O']

    Full overlap:

    >>> chains = [['B-PER', 'I-PER'],
    ...           ['B-ORG', 'I-ORG']]
    >>> merge_top_n(chains)
    ['B-PER', 'I-PER']
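A sketch of the merging logic, consistent with the examples above; this is an illustrative reimplementation, not webstruct's actual code:

```python
def merge_top_n(chains):
    """Merge top-N BIO chains: start from the most probable chain and copy
    entities from the remaining chains only where they do not overlap
    anything already present in the result."""
    def entities(chain):
        # Collect (label, start, end) spans from a BIO-encoded chain.
        spans, label, start = [], None, None
        for i, tag in enumerate(list(chain) + ['O']):  # sentinel flushes last span
            if label is not None and not (tag.startswith('I-') and tag[2:] == label):
                spans.append((label, start, i))
                label = None
            if tag.startswith('B-'):
                label, start = tag[2:], i
        return spans

    result = list(chains[0])
    taken = [tag != 'O' for tag in result]  # positions already occupied
    for chain in chains[1:]:
        for label, start, end in entities(chain):
            if not any(taken[start:end]):   # skip entities overlapping the result
                result[start] = 'B-' + label
                for i in range(start + 1, end):
                    result[i] = 'I-' + label
                for i in range(start, end):
                    taken[i] = True
    return result
```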
webstruct.wapiti.prepare_wapiti_template(template, vocabulary)

    Prepare a Wapiti template by replacing feature names with feature column indices inside %x[row,col] macros:

    >>> vocab = {'token': 0, 'tag': 1}
    >>> prepare_wapiti_template('*:Pos-1 L=%x[-1, tag]\n*:Suf-2 X=%m[ 0,token,".?.?$"]', vocab)
    '*:Pos-1 L=%x[-1,1]\n*:Suf-2 X=%m[0,0,".?.?$"]'

    It understands which lines are comments:

    >>> prepare_wapiti_template('*:Pos-1 L=%x[-1, tag]\n# *:Suf-2 X=%m[ 0,token,".?.?$"]', vocab)
    '*:Pos-1 L=%x[-1,1]\n# *:Suf-2 X=%m[ 0,token,".?.?$"]'
    See the Wapiti documentation for more info about the template format.
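The substitution can be done with a single regular expression applied to non-comment lines. A sketch consistent with the doctests above; the substitute_columns helper and its regex are assumptions, not webstruct's actual code:

```python
import re

# Matches the start of a %x[row,col] / %m[row,col,regex] macro while the
# column is still a feature name (hypothetical pattern, not webstruct's own).
MACRO_RE = re.compile(r'%([xXmM])\[\s*(-?\d+)\s*,\s*(\w+)')

def substitute_columns(template, vocabulary):
    """Replace feature names with column indices in template macros,
    leaving comment lines (starting with '#') untouched."""
    def repl(match):
        macro, row, name = match.groups()
        return '%%%s[%s,%d' % (macro, row, vocabulary[name])
    lines = []
    for line in template.split('\n'):
        if line.lstrip().startswith('#'):
            lines.append(line)          # comments keep their original text
        else:
            lines.append(MACRO_RE.sub(repl, line))
    return '\n'.join(lines)
```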