CRFsuite Helpers

CRFsuite backend for webstruct based on python-crfsuite and sklearn-crfsuite.

class webstruct.crfsuite.CRFsuitePipeline(fe, crf)

Bases: Pipeline

A pipeline for HTML tagging using CRFsuite. It combines a feature extractor and a CRF; they are available as fe and crf attributes for easier access.

This class also adds support for X_dev and y_dev arguments in the fit() and fit_transform() methods. The development data is transformed with the feature extractor before it is passed to the CRF, so it can be supplied in the same form as the training data.
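The X_dev/y_dev behaviour can be illustrated with a simplified stand-in. The sketch below is not webstruct's code: ToyPipeline, UppercaseFeatures and RecordingCRF are hypothetical names invented for illustration, and the only assumption carried over from the description above is that development data goes through the feature extractor before reaching the CRF.

```python
class UppercaseFeatures:
    """Hypothetical feature extractor standing in for ``fe``."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # One feature dict per token, one list of dicts per sequence.
        return [
            [{"token": tok, "upper": tok.isupper()} for tok in seq]
            for seq in X
        ]


class RecordingCRF:
    """Hypothetical estimator standing in for ``crf``; it only records its inputs."""

    def fit(self, X, y, X_dev=None, y_dev=None):
        self.seen_train = X
        self.seen_dev = X_dev
        self.seen_y_dev = y_dev
        return self


class ToyPipeline:
    """Simplified stand-in for a two-step pipeline with dev-set support."""

    def __init__(self, fe, crf):
        # Both steps stay reachable as attributes, as described above.
        self.fe = fe
        self.crf = crf

    def fit(self, X, y, X_dev=None, y_dev=None):
        X_feats = self.fe.fit(X, y).transform(X)
        if X_dev is not None:
            # Dev sequences get the same feature extraction as training data.
            X_dev = self.fe.transform(X_dev)
        self.crf.fit(X_feats, y, X_dev=X_dev, y_dev=y_dev)
        return self


pipeline = ToyPipeline(UppercaseFeatures(), RecordingCRF())
pipeline.fit(
    [["Hello", "WORLD"]], [["O", "O"]],
    X_dev=[["OK"]], y_dev=[["O"]],
)
print(pipeline.crf.seen_dev)
```

With the real class, the equivalent call would presumably look like model.fit(X_train, y_train, X_dev=X_dev, y_dev=y_dev), with X_dev and y_dev produced by the same HtmlTokenizer as the training data.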

webstruct.crfsuite.create_crfsuite_pipeline(token_features=None, global_features=None, min_df=1, **crf_kwargs)

Create a CRFsuitePipeline for HTML tagging using CRFsuite. The pipeline expects data produced by HtmlTokenizer as input and produces sequences of IOB2 tags as output.

Example:

import webstruct
from webstruct.features import EXAMPLE_TOKEN_FEATURES

# load train data
html_tokenizer = webstruct.HtmlTokenizer()
train_trees = webstruct.load_trees(
    "train/*.html",
    webstruct.WebAnnotatorLoader()
)
X_train, y_train = html_tokenizer.tokenize(train_trees)

# train
model = webstruct.create_crfsuite_pipeline(
    token_features=EXAMPLE_TOKEN_FEATURES,
)
model.fit(X_train, y_train)

# load test data
test_trees = webstruct.load_trees(
    "test/*.html",
    webstruct.WebAnnotatorLoader()
)
X_test, y_test = html_tokenizer.tokenize(test_trees)

# do a prediction
y_pred = model.predict(X_test)
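
Because the pipeline outputs IOB2 tags, a prediction usually has to be grouped into entities before it is useful. The following is a minimal pure-Python sketch of that grouping step; iob2_to_entities is a hypothetical helper written for illustration and is not part of webstruct, and the token and tag values are made-up sample data. It assumes tags of the form "B-LABEL", "I-LABEL" and "O".

```python
def iob2_to_entities(tokens, tags):
    """Group parallel token/tag sequences into (label, tokens) pairs.

    An "I-" tag that does not continue an open entity with the same
    label is invalid IOB2; this sketch treats such a token as outside.
    """
    entities = []
    label, chunk = None, []

    def flush():
        # Close the entity currently being built, if any.
        if chunk:
            entities.append((label, list(chunk)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            label, chunk = tag[2:], [token]
        elif tag.startswith("I-") and chunk and tag[2:] == label:
            chunk.append(token)
        else:
            flush()
            label, chunk = None, []
    flush()
    return entities


# Made-up sample data for one sequence.
tokens = ["Contact", "John", "Smith", "at", "Acme", "Corp"]
tags = ["O", "B-PER", "I-PER", "O", "B-ORG", "I-ORG"]
print(iob2_to_entities(tokens, tags))
# [('PER', ['John', 'Smith']), ('ORG', ['Acme', 'Corp'])]
```

The helper works on one sequence at a time, so it would be applied to each predicted tag sequence together with the tokens of the corresponding document.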