CRFsuite Helpers

CRFsuite backend for webstruct based on python-crfsuite and sklearn-crfsuite.

class webstruct.crfsuite.CRFsuitePipeline(fe, crf)

Bases: Pipeline

A pipeline for HTML tagging using CRFsuite. It combines a feature extractor and a CRF, which are exposed as the fe and crf attributes for easier access.

This class also adds support for X_dev/y_dev arguments in the fit() and fit_transform() methods. They work as expected: the development data is transformed using the feature extractor.

webstruct.crfsuite.create_crfsuite_pipeline(token_features=None, global_features=None, min_df=1, **crf_kwargs)

Create a CRFsuitePipeline for HTML tagging using CRFsuite. The pipeline expects data produced by HtmlTokenizer as input and produces sequences of IOB2 tags as output.


import webstruct
from webstruct.features import EXAMPLE_TOKEN_FEATURES

# load train data
html_tokenizer = webstruct.HtmlTokenizer()
train_trees = webstruct.load_trees(...)  # arguments omitted here
X_train, y_train = html_tokenizer.tokenize(train_trees)

# train
model = webstruct.create_crfsuite_pipeline(
    token_features=EXAMPLE_TOKEN_FEATURES,
)
model.fit(X_train, y_train)

# load test data
test_trees = webstruct.load_trees(...)  # arguments omitted here
X_test, y_test = html_tokenizer.tokenize(test_trees)

# do a prediction
y_pred = model.predict(X_test)
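
Since the pipeline outputs IOB2 tag sequences, it may help to see how such a sequence maps to entity spans. The following is a minimal pure-Python sketch, not part of webstruct: the helper name iob2_to_spans and the ORG/CITY labels are hypothetical and chosen only for illustration. It assumes each tag is "O", "B-<label>" or "I-<label>", and it simply ignores a stray "I-" tag that does not continue an open span.

```python
def iob2_to_spans(tags):
    """Group an IOB2 tag sequence into (label, start, end) spans.

    `end` is exclusive. Hypothetical helper for illustration only;
    it is not part of the webstruct API.
    """
    spans = []
    label, start = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag with the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if label is not None and not continues:
            spans.append((label, start, i))
            label, start = None, None
        # A "B-" tag always opens a new span, even right after another one.
        if tag.startswith("B-"):
            label, start = tag[2:], i
    # Close a span that runs to the end of the sequence.
    if label is not None:
        spans.append((label, start, len(tags)))
    return spans


# Made-up tag sequence for a single document (not real model output).
example_tags = ["O", "B-ORG", "I-ORG", "O", "B-CITY", "B-CITY", "I-CITY"]
example_spans = iob2_to_spans(example_tags)
```

Here the two adjacent CITY entities stay separate because each starts with its own "B-" tag, which is the property that distinguishes IOB2 from plain IO tagging.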