HTML Loaders
Webstruct supports WebAnnotator and GATE annotation formats out of the box; WebAnnotator is recommended.
Both GATE and WebAnnotator embed annotations into HTML using special tags: GATE uses custom tags like <ORG> while WebAnnotator uses tags like <span wa-type="ORG">.
webstruct.loaders classes convert GATE and WebAnnotator tags into __START_TAGNAME__ and __END_TAGNAME__ tokens, clean the HTML and return the result as a tree parsed by lxml:
>>> from webstruct import WebAnnotatorLoader
>>> loader = WebAnnotatorLoader()
>>> loader.load('0.html')
<Element html at ...>
Such trees can be processed with utilities from webstruct.feature_extraction.
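To make the tag-to-token conversion described above concrete, here is a minimal standard-library sketch. It is not webstruct's implementation: the helper names gate_to_tokens and webannotator_to_tokens are hypothetical, the code works on plain strings rather than lxml trees, it skips the HTML cleaning step, and it assumes annotations are not nested.

```python
import re


def gate_to_tokens(html, known_entities):
    """Replace GATE-style <ENTITY>...</ENTITY> tags with boundary tokens.

    Hypothetical helper for illustration; only tags whose names are listed
    in known_entities are rewritten, other markup is left untouched.
    """
    for name in known_entities:
        html = html.replace("<%s>" % name, " __START_%s__ " % name)
        html = html.replace("</%s>" % name, " __END_%s__ " % name)
    return html


# Matches <span ... wa-type="ORG" ...>text</span>.
# Assumption: annotated spans do not contain other <span> elements.
_WA_SPAN = re.compile(
    r'<span\b[^>]*\bwa-type="([^"]+)"[^>]*>(.*?)</span>',
    re.DOTALL,
)


def webannotator_to_tokens(html):
    """Replace WebAnnotator-style annotated spans with boundary tokens."""
    return _WA_SPAN.sub(r" __START_\1__ \2 __END_\1__ ", html)


if __name__ == "__main__":
    print(gate_to_tokens("<p><ORG>Scrapinghub</ORG> has an office</p>", {"ORG"}))
    print(webannotator_to_tokens('<p><span wa-type="ORG">Scrapinghub</span></p>'))
```

The tokens are padded with spaces so that they stay separate from adjacent words when the text is later split on whitespace. The real loaders may handle attributes, nesting, and whitespace differently.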
API
- class webstruct.loaders.WebAnnotatorLoader(encoding=None, cleaner=None, known_entities=None)
Bases: webstruct.loaders.HtmlLoader
Class for loading HTML annotated using WebAnnotator.
Note
Use WebAnnotator’s “save format”, not “export format”.
- load(filename)
- class webstruct.loaders.GateLoader(encoding=None, cleaner=None, known_entities=None)
Bases: webstruct.loaders.HtmlLoader
Class for loading HTML annotated using GATE.
>>> import lxml.html
>>> from webstruct import GateLoader
>>> loader = GateLoader(known_entities={'ORG', 'CITY'})
>>> html = b"<html><body><p><ORG>Scrapinghub</ORG> has an <b>office</b> in <CITY>Montevideo</CITY></p></body></html>"
>>> tree = loader.loadbytes(html)
>>> lxml.html.tostring(tree)
'<html><body><p> __START_ORG__ Scrapinghub __END_ORG__ has an <b>office</b> in __START_CITY__ Montevideo __END_CITY__ </p></body></html>'
- load(filename)
- class webstruct.loaders.HtmlLoader(encoding=None, cleaner=None)
Bases: object
Class for loading unannotated HTML files.
- webstruct.loaders.load_trees(patterns, verbose=False)
Load HTML data from several paths or glob patterns, possibly using a different loader for each. Return a list of lxml trees.
patterns should be a list of tuples (glob_pattern, loader).
Example:
>>> from webstruct.loaders import HtmlLoader, load_trees
>>> loader = HtmlLoader()
>>> patterns = [('path1/*.html', loader), ('path2/*.html', loader)]
>>> trees = load_trees(patterns)
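The pairing of glob patterns with loaders can be pictured with the following standard-library sketch. It is an illustration of the behaviour described above, not the library's code: load_trees_sketch and EchoLoader are hypothetical names, the stand-in loader returns file text instead of an lxml tree, and sorting the matched paths is an assumption made here for reproducible ordering.

```python
import glob


def load_trees_sketch(patterns):
    """Expand each glob pattern and load every match with its paired loader.

    patterns is a list of (glob_pattern, loader) tuples; each loader only
    needs a load(filename) method.
    """
    trees = []
    for pattern, loader in patterns:
        # Sorting is an assumption of this sketch, used for stable ordering.
        for path in sorted(glob.glob(pattern)):
            trees.append(loader.load(path))
    return trees


class EchoLoader:
    """Hypothetical stand-in loader that returns the file's text."""

    def load(self, filename):
        with open(filename, encoding="utf-8") as f:
            return f.read()
```

Because each pattern carries its own loader, annotated and unannotated files kept in different directories can be loaded in a single call.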