CRFsuite Helpers
CRFsuite backend for webstruct, based on python-crfsuite.
class webstruct.crfsuite.CRFsuitePipeline(fe, crf)
Bases: Pipeline
A pipeline for HTML tagging using CRFsuite. It combines a feature extractor and a CRF; they are available as the fe and crf attributes for easier access.

In addition, this class adds support for X_dev/y_dev arguments to the fit() and fit_transform() methods. They work as expected, with the development data being transformed by the feature extractor.
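The X_dev/y_dev handling described above can be illustrated with a small stand-alone sketch. This is not webstruct's implementation: ToyFeatureExtractor, ToyCRF and ToyPipeline are hypothetical stand-ins written only for this example, and the sketch assumes that the development data goes through the same feature extractor as the training data before reaching the CRF.

```python
class ToyFeatureExtractor:
    """Hypothetical feature extractor: maps each token to a feature dict."""

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [
            [{"lower": tok.lower(), "istitle": tok.istitle()} for tok in seq]
            for seq in X
        ]


class ToyCRF:
    """Hypothetical CRF stand-in; it only records what it receives."""

    def fit(self, X, y, X_dev=None, y_dev=None):
        self.train_ = (X, y)
        self.dev_ = (X_dev, y_dev)
        return self


class ToyPipeline:
    """Hypothetical pipeline exposing fe and crf attributes."""

    def __init__(self, fe, crf):
        self.fe = fe
        self.crf = crf

    def fit(self, X, y, X_dev=None, y_dev=None):
        # Training data is used to fit the feature extractor.
        Xt = self.fe.fit(X, y).transform(X)
        # Development data is only transformed, never used for fitting.
        Xt_dev = self.fe.transform(X_dev) if X_dev is not None else None
        self.crf.fit(Xt, y, X_dev=Xt_dev, y_dev=y_dev)
        return self


pipe = ToyPipeline(ToyFeatureExtractor(), ToyCRF())
pipe.fit([["Hello", "world"]], [["O", "O"]], X_dev=[["Bye"]], y_dev=[["O"]])
dev_features, dev_labels = pipe.crf.dev_
```

Here the toy CRF sees feature dicts rather than raw tokens for both the training and the development set, while y_dev is passed through unchanged.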
webstruct.crfsuite.create_crfsuite_pipeline(token_features=None, global_features=None, min_df=1, **crf_kwargs)
Create a CRFsuitePipeline for HTML tagging using CRFsuite. This pipeline expects data produced by HtmlTokenizer as input and produces sequences of IOB2 tags as output.

Example:
    import webstruct
    from webstruct.features import EXAMPLE_TOKEN_FEATURES

    # load train data
    html_tokenizer = webstruct.HtmlTokenizer()
    train_trees = webstruct.load_trees(
        "train/*.html",
        webstruct.WebAnnotatorLoader()
    )
    X_train, y_train = html_tokenizer.tokenize(train_trees)

    # train
    model = webstruct.create_crfsuite_pipeline(
        token_features=EXAMPLE_TOKEN_FEATURES,
    )
    model.fit(X_train, y_train)

    # load test data
    test_trees = webstruct.load_trees(
        "test/*.html",
        webstruct.WebAnnotatorLoader()
    )
    X_test, y_test = html_tokenizer.tokenize(test_trees)

    # do a prediction
    y_pred = model.predict(X_test)
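Since the pipeline outputs IOB2 tag sequences, it may help to see how such a sequence maps back to entities. The following is a minimal sketch in plain Python; iob2_to_entities is a hypothetical helper written for this illustration and is not part of webstruct, and the PER/ORG labels and tokens are made-up sample data.

```python
def iob2_to_entities(tokens, tags):
    """Group tokens into (entity_type, tokens) pairs using IOB2 tags.

    "B-X" starts an entity of type X, "I-X" continues it, "O" is outside.
    An "I-X" tag that does not continue an open entity of the same type
    is treated as "O" in this sketch.
    """
    entities = []
    current_tokens, current_type = [], None

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # Close any open entity, then start a new one.
            if current_tokens:
                entities.append((current_type, current_tokens))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            if current_tokens:
                entities.append((current_type, current_tokens))
            current_tokens, current_type = [], None

    # Flush an entity that runs to the end of the sequence.
    if current_tokens:
        entities.append((current_type, current_tokens))
    return entities


tokens = ["Contact", "John", "Smith", "at", "Acme", "Corp"]
tags = ["O", "B-PER", "I-PER", "O", "B-ORG", "I-ORG"]
entities = iob2_to_entities(tokens, tags)
```

For the sample above, entities holds one PER entity covering "John Smith" and one ORG entity covering "Acme Corp".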