HTML Loaders¶

Webstruct supports WebAnnotator and GATE annotation formats out of box; WebAnnotator is recommended.

Both GATE and WebAnnotator embed annotations into HTML using special tags: GATE uses custom tags like <ORG> while WebAnnotator uses tags like <span wa-type="ORG">.

webstruct.loaders classes convert GATE and WebAnnotator tags into __START_TAGNAME__ and __END_TAGNAME__ tokens, clean the HTML and return the result as a tree parsed by lxml:

>>> from webstruct import WebAnnotatorLoader  
>>> loader = WebAnnotatorLoader()  
>>> loader.load('0.html')  
<Element html at ...>

Such trees can be processed with utilities from webstruct.feature_extraction.

API¶

class webstruct.loaders.WebAnnotatorLoader(encoding=None, cleaner=None, known_entities=None)[source]¶

Bases: webstruct.loaders.HtmlLoader

Class for loading HTML annotated using WebAnnotator.

Note

Use WebAnnotator’s “save format”, not “export format”.

load(filename)¶

loadbytes(data)[source]¶

class webstruct.loaders.GateLoader(encoding=None, cleaner=None, known_entities=None)[source]¶

Bases: webstruct.loaders.HtmlLoader

Class for loading HTML annotated using GATE

>>> import lxml.html
>>> from webstruct import GateLoader

>>> loader = GateLoader(known_entities={'ORG', 'CITY'})
>>> html = b"<html><body><p><ORG>Scrapinghub</ORG> has an <b>office</b> in <CITY>Montevideo</CITY></p></body></html>"
>>> tree = loader.loadbytes(html)
>>> lxml.html.tostring(tree).decode()
'<html><body><p> __START_ORG__ Scrapinghub __END_ORG__  has an <b>office</b> in  __START_CITY__ Montevideo __END_CITY__ </p></body></html>'

Note that you must specify known_entities when creating GateLoader. It should contain all entities which are present in data, even if you want to use only a subset of them for training. Use arguments of HtmlLoader to train a tagger which uses a subset of labels.

load(filename)¶

loadbytes(data)[source]¶

class webstruct.loaders.HtmlLoader(encoding=None, cleaner=None)[source]¶

Bases: object

Class for loading unannotated HTML files.

load(filename)[source]¶

loadbytes(data)[source]¶

webstruct.loaders.load_trees(pattern, loader, verbose=False)[source]¶

Load HTML data using loader loader from all files matched by pattern glob pattern.

Example:

>>> trees = load_trees('path/*.html', HtmlLoader())