Tutorial

This tutorial assumes you are familiar with machine learning.

Get annotated data

First, you need training/development data. We suggest using the WebAnnotator Firefox extension to annotate HTML pages.

Recommended WebAnnotator options:

(screenshot: _images/wa-options.png)

Pro tip - enable WebAnnotator toolbar buttons:

(screenshot: _images/wa-buttons.png)

Follow the WebAnnotator manual to define named entities and annotate some web pages (nested WebAnnotator entities are not supported).

After that you can load annotated webpages as lxml trees:

import webstruct
trees = webstruct.load_trees([
    ("train/*.html", webstruct.WebAnnotatorLoader())
])

See HTML Loaders for more info. The GATE annotation format is also supported.
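
If your pages are annotated with GATE instead, a GateLoader can be used the same way. A minimal sketch, assuming the annotated pages live in gate/*.html and the annotation uses ORG and CITY entities (adjust both to your data):

import webstruct

# GateLoader needs to know which entity names appear in the annotation
trees = webstruct.load_trees([
    ("gate/*.html", webstruct.GateLoader(known_entities={'ORG', 'CITY'}))
])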

From HTML to Tokens

To convert HTML trees to a format suitable for a sequence prediction algorithm (such as CRF, MEMM or Structured Perceptron), the following approach is used:

  1. Text is extracted from HTML and split into tokens.
  2. For each token a special HtmlToken instance is created. It contains information not only about the text token itself, but also about its position in the HTML tree.

A single HTML page corresponds to a single input sequence (a list of HtmlTokens). For training/testing data (where webpages are already annotated) there is also a list of labels for each webpage, a label per HtmlToken.

To transform HTML trees into labels and HTML tokens use HtmlTokenizer.

html_tokenizer = webstruct.HtmlTokenizer()
X, y = html_tokenizer.tokenize(trees)

Input trees should be loaded by one of the WebStruct loaders. For consistency, for each tree (even if it is loaded from raw, unannotated HTML) HtmlTokenizer extracts two arrays: a list of HtmlToken instances and a list of tags encoded using IOB2 encoding (also known as BIO encoding). So in our example X is a list of lists of HtmlToken instances, and y is a list of lists of strings.
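
A quick way to see what these sequences contain (the attribute names are the ones used by the feature functions later in this tutorial; the example values are made up):

tok = X[0][0]      # first HtmlToken of the first page
tok.token          # token text, e.g. 'Contact'
tok.parent.tag     # tag of the parent HTML element, e.g. 'h1'
tok.index          # position of the token inside its HTML element
y[0][0]            # IOB2 label for this token, e.g. 'B-ORG' or 'O'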

Feature Extraction

For supervised machine learning algorithms to work, we need to extract features.

In WebStruct feature vectors are Python dicts {"feature_name": "feature_value"}; a dict is computed for each HTML token. How to convert these dicts into the representation required by a sequence labelling toolkit depends on the toolkit used; we will cover that later.

To compute feature dicts we’ll use HtmlFeatureExtractor.

First, define your feature functions. A feature function should take an HtmlToken instance and return a feature dict; feature dicts from individual feature functions will be merged into the final feature dict for a token. Feature functions can ask questions about the token itself, its neighbours (in the same HTML element) and its position in the HTML tree.

Note

WebStruct also supports feature functions that work on multiple tokens; we don’t cover them in this tutorial.

There are predefined feature functions in webstruct.features, but for this tutorial let’s create some functions ourselves:

def token_identity(html_token):
    return {'token': html_token.token}

def token_isupper(html_token):
    return {'isupper': html_token.token.isupper()}

def parent_tag(html_token):
    return {'parent_tag': html_token.parent.tag}

def border_at_left(html_token):
    return {'border_at_left': html_token.index == 0}

Next, create HtmlFeatureExtractor:

feature_extractor = webstruct.HtmlFeatureExtractor(
    token_features = [
        token_identity,
        token_isupper,
        parent_tag,
        border_at_left
    ]
)

and use it to extract feature dicts:

features = feature_extractor.fit_transform(X)
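
features mirrors the structure of X: a list of lists, with one merged feature dict per token. With the four feature functions above, a single dict looks roughly like this (the values are illustrative):

features[0][0]
# {'token': 'Contact', 'isupper': False, 'parent_tag': 'h1', 'border_at_left': True}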

See Feature Extraction for more info about HTML tokenization and feature extraction.

Using a Sequence Labelling Toolkit

WebStruct doesn’t provide a CRF or Structured Perceptron implementation; learning and prediction are supposed to be handled by an external sequence labelling toolkit such as Wapiti, CRFsuite or seqlearn.

Once feature dicts are extracted from HTML, you should convert them to the format required by your sequence labelling toolkit and use that toolkit to train a model and do the prediction. For example, you may use DictVectorizer from scikit-learn to convert feature dicts into seqlearn input format.
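
A rough sketch of the seqlearn route, reusing the features and y computed earlier (this is an illustration of the idea, not a WebStruct API; see the seqlearn docs for details):

from itertools import chain

from sklearn.feature_extraction import DictVectorizer
from seqlearn.perceptron import StructuredPerceptron

# seqlearn expects one flat feature matrix plus the length of each sequence
lengths = [len(doc) for doc in features]
X_flat = DictVectorizer().fit_transform(list(chain.from_iterable(features)))
y_flat = list(chain.from_iterable(y))

clf = StructuredPerceptron()
clf.fit(X_flat, y_flat, lengths)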

WebStruct provides some helpers for the Wapiti sequence labelling toolkit. To use Wapiti with WebStruct, you need:

  • for training: the Wapiti library itself, including the wapiti command-line utility (the python-wapiti wrapper is not necessary);
  • for prediction: the python-wapiti wrapper, github version (the Wapiti library is not necessary).

We’ll use Wapiti in this tutorial.

Defining a Model

The basic way to define a CRF model is the following:

model = webstruct.create_wapiti_pipeline('mymodel.wapiti',
    token_features = [token_identity, token_isupper, parent_tag, border_at_left],
    train_args = '--algo l-bfgs --maxiter 50 --compact'
)

The first create_wapiti_pipeline() argument is the file name the Wapiti model will be saved to after training. train_args is a string or a list with arguments passed to wapiti; check the Wapiti manual for available options.

Under the hood create_wapiti_pipeline() creates a sklearn.pipeline.Pipeline with an HtmlFeatureExtractor instance followed by a WapitiCRF instance. The example above is just a shortcut for the following:

from sklearn.pipeline import Pipeline
from webstruct import HtmlFeatureExtractor
from webstruct.wapiti import WapitiCRF

model = Pipeline([
    ('fe', HtmlFeatureExtractor(
        token_features = [
            token_identity,
            token_isupper,
            parent_tag,
            border_at_left,
        ]
    )),
    ('crf', WapitiCRF(
        'mymodel.wapiti',
        train_args = '--algo l-bfgs --maxiter 50 --compact',
    )),
])

Extracting Features using Wapiti Templates

Wapiti has “templates” support, which allows you to define richer features on top of the basic ones and to specify what to do with labels. The template format is described in the Wapiti manual; you may also check the CRF++ docs to get the idea behind templates - CRF++ and Wapiti template formats are very similar.

WebStruct allows you to use feature names instead of numbers in Wapiti templates.

Let’s define a template that will make Wapiti use first-order transition features, plus token text values in a ±2 window around the current token:

feature_template = '''
# Label unigram & bigram
*

# Nearby token unigrams
uLL:%x[-2,token]
u-L:%x[-1,token]
u-R:%x[ 1,token]
uRR:%x[ 2,token]
'''

Note

create_wapiti_pipeline() (via WapitiCRF) by default adds all features for the current token to the template. That’s why we haven’t defined them in our template, and that’s why we were fine without using a template at all. In our example the additional auto-generated lines would be:

ufeat:token=%x[0,token]
ufeat:isupper=%x[0,isupper]
ufeat:parent_tag=%x[0,parent_tag]
ufeat:border_at_left=%x[0,border_at_left]

To make Wapiti use this template, pass it as an argument to create_wapiti_pipeline() (or WapitiCRF, whichever you use):

model = webstruct.create_wapiti_pipeline('mymodel.wapiti',
    token_features = [token_identity, token_isupper, parent_tag, border_at_left],
    feature_template = feature_template,
    train_args = '--algo l-bfgs --maxiter 50 --compact'
)

Training

To train a model, use its fit method:

model.fit(X, y)

X and y are return values of HtmlTokenizer.tokenize() (a list of lists of HtmlToken instances and a list of lists of string IOB labels).

If you use WapitiCRF directly, train it using the WapitiCRF.fit() method. It accepts two lists: a list of lists of feature dicts and a list of lists of tags:

crf.fit(features, y)

Named Entity Recognition

Once you have a trained model, you can use it to extract entities from unseen (unannotated) web pages. First, get some binary HTML data:

>>> import urllib2
>>> html = urllib2.urlopen("http://scrapinghub.com/contact").read()

Then create a NER instance initialized with a trained model:

>>> ner = webstruct.NER(model)

The model must provide a transform method that extracts features from HTML tokens and predicts labels for these tokens. A pipeline created with the create_wapiti_pipeline() function fits this definition.

Finally, use the NER.extract() method to extract entities:

>>> ner.extract(html)
[('Scrapinghub', 'ORG'), ..., ('Iturriaga 3429 ap. 1', 'STREET'), ...]

Generally, the steps are as follows (a rough sketch in code follows the list):

  1. Load data using the HtmlLoader loader. If a custom HTML cleaner was used for loading the training data, make sure to apply it here as well.
  2. Use the same html_tokenizer as used for training to extract HTML tokens from the loaded trees. All labels will be “O” when using the HtmlLoader loader, so y can be discarded.
  3. Use the same feature_extractor as used for training to extract features.
  4. Run the your_crf.transform() method (e.g. WapitiCRF.transform()) on the features extracted in (3) to get the prediction - a list of IOB2-encoded tags for each input document.
  5. Build entities from input tokens based on predicted tags (check IobEncoder.group() and smart_join()).
  6. Split entities into groups (optional). One way to do it is to use webstruct.grouping.
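
A rough sketch of these steps, reusing the html_tokenizer, feature_extractor and crf objects created earlier in this tutorial (the unseen/*.html path is a placeholder, and feature_extractor.transform is assumed to behave the same way as it does inside the pipeline):

import webstruct

# 1. load raw, unannotated pages
trees = webstruct.load_trees([("unseen/*.html", webstruct.HtmlLoader())])

# 2. tokenize with the same tokenizer used for training; labels are all "O"
X_new, _ = html_tokenizer.tokenize(trees)

# 3. extract features with the same feature extractor used for training
features_new = feature_extractor.transform(X_new)

# 4. predict IOB2 tags for each document
tags = crf.transform(features_new)

# 5.-6. build entities from (token, tag) pairs and group them,
#       e.g. with IobEncoder.group(), smart_join() and webstruct.grouping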

The NER helper class combines HTML loading, HTML tokenization, feature extraction, the CRF model, entity building and grouping.

Entity Grouping

Detecting entities on their own is not always enough; in many cases what you really want is the relationships between them. For example, “street_name/STREET city_name/CITY zipcode_number/ZIPCODE form an address”, or “phone/TEL is a phone of person/PER”.

The first approximation is to say that all entities from a single webpage are related. For example, if we have extracted some organization/ORG and some phone/TEL from a single webpage, we may assume that the phone is a contact phone of the organization.

Sometimes there are several “entity groups” on a webpage. If a page contains contact phones of several persons or several business locations, it is better to split all entities into groups of related entities - “person name + his/her phone(s)” or “address”.

WebStruct provides an unsupervised algorithm for extracting such entity groups. The algorithm prefers to build large groups without entities of duplicate types; if a split is needed, it tries to split at points where the distance between entities is larger.

Use NER.extract_groups() to extract groups of entities:

>>> ner.extract_groups(html)
[[...], ... [('Iturriaga 3429 ap. 1', 'STREET'), ('Montevideo', 'CITY'), ...]]

Sometimes it is better to allow some entity types to appear multiple times in a group. For example, a person (PER entity) may have several contact phones and faxes (TEL and FAX entities) - we should penalize groups with multiple PERs, but multiple TELs and FAXes are fine. Use the dont_penalize argument if you want to allow some entity types to appear multiple times in a group:

ner.extract_groups(html, dont_penalize={'TEL', 'FAX'})

The simple algorithm WebStruct provides is by no means a general solution to relation detection, but give it a try - maybe it is enough for your task.

Model Development

To develop the model you need to choose the learning algorithm, features, hyperparameters, etc. To do that you need scoring metrics, cross-validation utilities and tools for debugging what the classifier has learned. WebStruct helps in the following ways:

  1. The pipeline created by create_wapiti_pipeline() is compatible with cross-validation and grid search utilities from scikit-learn; use them to select model parameters and check the quality (a minimal cross-validation sketch follows this list).

    One limitation of create_wapiti_pipeline() is that n_jobs in scikit-learn functions and classes should be 1, but other than that WebStruct objects should work fine with scikit-learn. Just keep in mind that for WebStruct an “observation” is a document, not an individual token, and a “label” is a sequence of labels for a document, not an individual IOB tag.

  2. There is a webstruct.metrics module with a couple of metrics useful for sequence classification. Currently they require seqlearn to be installed.
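
A minimal cross-validation sketch, under these assumptions: model, X and y are the objects created earlier in this tutorial, the pipeline’s transform output (predicted tags) is compared against the true tags, and seqlearn’s bio_f_score is used as the metric (build your own scorer, e.g. from webstruct.metrics, if you prefer):

from itertools import chain

from sklearn.cross_validation import KFold   # sklearn.model_selection in newer scikit-learn
from seqlearn.evaluation import bio_f_score

# each fold trains on some documents and evaluates on the held-out ones;
# an "observation" is a whole document, so we split documents, not tokens
scores = []
for train_idx, test_idx in KFold(len(X), n_folds=3):
    model.fit([X[i] for i in train_idx], [y[i] for i in train_idx])
    y_pred = model.transform([X[i] for i in test_idx])
    y_true = [y[i] for i in test_idx]
    scores.append(bio_f_score(list(chain.from_iterable(y_true)),
                              list(chain.from_iterable(y_pred))))
print(sum(scores) / len(scores))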

To debug what the CRF has learned, use methods specific to the labelling toolkit. With Wapiti this means the wapiti dump console command and some UNIX utilities. For example, if we’ve saved our model to the mymodel.wapiti file and we want to check the top positive features for the CITY entity, we can execute the following in a UNIX shell:

$ wapiti dump mymodel.wapiti | sort -nr -k4 | grep CITY | head -n 8

and get an output similar to this:

* Load model
* Dump model
*   B-CITY  I-CITY  2.74057
*   B-CITY  B-STATE 2.33235
*   I-STREET        B-CITY  1.98106
*   I-CITY  B-STATE 1.71408
u--L:street #       B-CITY  1.34199
u--L:west   #       I-CITY  1.32428
u--L:in     #       B-CITY  1.24937
u--L:-      #       B-CITY  1.11139