HTML Loaders
Webstruct supports the WebAnnotator and GATE annotation formats out of the box; WebAnnotator is recommended.
Both GATE and WebAnnotator embed annotations into HTML using special tags: GATE uses custom tags like <ORG>, while WebAnnotator uses tags like <span wa-type="ORG">.
The webstruct.loaders classes convert GATE and WebAnnotator tags into __START_TAGNAME__ and __END_TAGNAME__ tokens, clean the HTML and return the result as a tree parsed by lxml:
>>> from webstruct import WebAnnotatorLoader
>>> loader = WebAnnotatorLoader()
>>> loader.load('0.html')
<Element html at ...>
Such trees can be processed with utilities from webstruct.feature_extraction.
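To make the tag-to-token conversion concrete, here is a minimal standard-library sketch of the idea for WebAnnotator-style spans. It is not webstruct's implementation: the SpanToTokens class and the annotate function are hypothetical names introduced for illustration, the sketch works on strings rather than lxml trees, and it does no HTML cleaning.

```python
from html.parser import HTMLParser


class SpanToTokens(HTMLParser):
    """Hypothetical helper: rewrite <span wa-type="X"> as __START_X__ / __END_X__."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []  # output fragments, joined at the end
        self.stack = []  # entity name (or None) for every open <span>

    def handle_starttag(self, tag, attrs):
        entity = dict(attrs).get("wa-type") if tag == "span" else None
        if tag == "span":
            self.stack.append(entity)
        if entity:
            # Annotated span: emit a start token instead of the tag.
            self.parts.append(" __START_%s__ " % entity)
        else:
            # Any other tag is copied through unchanged.
            self.parts.append(self.get_starttag_text())

    def handle_startendtag(self, tag, attrs):
        # Self-closing tags such as <br/> are copied through unchanged.
        self.parts.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        entity = self.stack.pop() if tag == "span" and self.stack else None
        if entity:
            self.parts.append(" __END_%s__ " % entity)
        else:
            self.parts.append("</%s>" % tag)

    def handle_data(self, data):
        self.parts.append(data)


def annotate(html):
    """Return html with annotated spans replaced by start/end tokens."""
    parser = SpanToTokens()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


print(annotate('<p><span wa-type="ORG">Scrapinghub</span> has an <b>office</b></p>'))
```

Spans without a wa-type attribute pass through untouched, so ordinary markup survives the rewrite. The real loaders additionally clean the document and return an lxml tree.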
API
class webstruct.loaders.WebAnnotatorLoader(encoding=None, cleaner=None, known_entities=None)
Bases: webstruct.loaders.HtmlLoader
Class for loading HTML annotated using WebAnnotator.
Note
Use WebAnnotator’s “save format”, not “export format”.
load(filename)
class webstruct.loaders.GateLoader(encoding=None, cleaner=None, known_entities=None)
Bases: webstruct.loaders.HtmlLoader
Class for loading HTML annotated using GATE.
>>> import lxml.html
>>> from webstruct import GateLoader
>>> loader = GateLoader(known_entities={'ORG', 'CITY'})
>>> html = b"<html><body><p><ORG>Scrapinghub</ORG> has an <b>office</b> in <CITY>Montevideo</CITY></p></body></html>"
>>> tree = loader.loadbytes(html)
>>> lxml.html.tostring(tree).decode()
'<html><body><p> __START_ORG__ Scrapinghub __END_ORG__ has an <b>office</b> in __START_CITY__ Montevideo __END_CITY__ </p></body></html>'
Note that you must specify known_entities when creating a GateLoader. It should contain all entities that are present in the data, even if you want to use only a subset of them for training. Use arguments of HtmlLoader to train a tagger which uses a subset of labels.
load(filename)
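One way to see why GateLoader needs known_entities: GATE annotations are plain custom element names, so a converter cannot tell them apart from other markup unless it is given the names. The sketch below illustrates this with a regular expression. It is an assumption-laden illustration, not GateLoader's code: gate_to_tokens is a hypothetical function, and what GateLoader itself does with tags missing from known_entities is not shown here.

```python
import re


def gate_to_tokens(html, known_entities):
    """Hypothetical helper: replace <NAME> and </NAME> for each NAME in known_entities."""
    if not known_entities:
        return html
    # Build one alternation over the entity names, e.g. "CITY|ORG".
    names = "|".join(re.escape(name) for name in sorted(known_entities))
    pattern = re.compile(r"<(/?)(%s)>" % names)

    def repl(match):
        kind = "END" if match.group(1) else "START"
        return " __%s_%s__ " % (kind, match.group(2))

    return pattern.sub(repl, html)


html = "<p><ORG>Scrapinghub</ORG> has an <b>office</b> in <CITY>Montevideo</CITY></p>"

# Both entity types are listed, so both are converted to tokens.
print(gate_to_tokens(html, {"ORG", "CITY"}))

# CITY is not listed, so in this sketch <CITY> stays in the text as ordinary markup.
print(gate_to_tokens(html, {"ORG"}))
```

In this sketch an unlisted entity silently remains in the document as a tag, which is the kind of leftover that listing every entity present in the data avoids.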