HTML Loaders

Webstruct supports the WebAnnotator and GATE annotation formats out of the box; WebAnnotator is recommended.

Both GATE and WebAnnotator embed annotations into HTML using special tags: GATE uses custom tags like <ORG> while WebAnnotator uses tags like <span wa-type="ORG">.

The classes in webstruct.loaders convert GATE and WebAnnotator tags into __START_TAGNAME__ and __END_TAGNAME__ tokens, clean the HTML, and return the result as a tree parsed by lxml:

>>> from webstruct import WebAnnotatorLoader  
>>> loader = WebAnnotatorLoader()  
>>> loader.load('0.html')  
<Element html at ...>

Such trees can be processed with utilities from webstruct.feature_extraction.
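The tag-to-token conversion described above can be illustrated without webstruct. The sketch below is not the library's implementation (the loaders work on parsed lxml trees and also clean the HTML); it uses a plain regular expression on a string, and the function name gate_tags_to_tokens is hypothetical, chosen only for this illustration.

```python
import re


def gate_tags_to_tokens(html, known_entities):
    """Replace GATE-style <TAG>/</TAG> pairs with __START_TAG__/__END_TAG__ markers.

    Illustration only: a regex over a string, not how webstruct does it.
    """
    if not known_entities:
        return html  # nothing to convert

    # One alternation over the entity names; sorted for a deterministic pattern.
    names = "|".join(re.escape(name) for name in sorted(known_entities))
    pattern = re.compile(r"<(/?)(%s)>" % names)

    def replace(match):
        closing, name = match.groups()
        prefix = "END" if closing else "START"
        # Pad with spaces so the marker is a separate whitespace-delimited token.
        return " __%s_%s__ " % (prefix, name)

    return pattern.sub(replace, html)


html = "<p><ORG>Scrapinghub</ORG> has an <b>office</b> in <CITY>Montevideo</CITY></p>"
converted = gate_tags_to_tokens(html, {"ORG", "CITY"})
print(converted)
```

Tags that are not listed in known_entities, such as <b>, are left untouched by this sketch.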

API

class webstruct.loaders.WebAnnotatorLoader(encoding=None, cleaner=None, known_entities=None)

Bases: webstruct.loaders.HtmlLoader

Class for loading HTML annotated using WebAnnotator.

Note

Use WebAnnotator’s “save format”, not “export format”.

load(filename)
loadbytes(data)
class webstruct.loaders.GateLoader(encoding=None, cleaner=None, known_entities=None)

Bases: webstruct.loaders.HtmlLoader

Class for loading HTML annotated using GATE.

>>> import lxml.html
>>> from webstruct import GateLoader
>>> loader = GateLoader(known_entities={'ORG', 'CITY'})
>>> html = b"<html><body><p><ORG>Scrapinghub</ORG> has an <b>office</b> in <CITY>Montevideo</CITY></p></body></html>"
>>> tree = loader.loadbytes(html)
>>> lxml.html.tostring(tree).decode()  # tostring() returns bytes on Python 3
'<html><body><p> __START_ORG__ Scrapinghub __END_ORG__  has an <b>office</b> in  __START_CITY__ Montevideo __END_CITY__ </p></body></html>'
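To make the token format concrete, here is a minimal sketch that recovers (entity, text) pairs from text containing __START_TAGNAME__ and __END_TAGNAME__ markers. It is an illustration only: webstruct provides its own utilities for working with loaded trees, and the names TOKEN_RE and extract_entities are hypothetical, not part of the library.

```python
import re

# Matches "__START_X__ ... __END_X__" where X is the same entity name on both sides.
TOKEN_RE = re.compile(r"__START_(\w+?)__(.*?)__END_\1__", re.DOTALL)


def extract_entities(text):
    """Return a list of (entity_name, entity_text) pairs found in text."""
    return [
        # Collapse the padding whitespace around and inside the entity text.
        (name, " ".join(body.split()))
        for name, body in TOKEN_RE.findall(text)
    ]


text = " __START_ORG__ Scrapinghub __END_ORG__  has an office in  __START_CITY__ Montevideo __END_CITY__ "
entities = extract_entities(text)
print(entities)
```

The sketch assumes the markers are well formed and not nested.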
load(filename)
loadbytes(data)
class webstruct.loaders.HtmlLoader(encoding=None, cleaner=None)

Bases: object

Class for loading unannotated HTML files.

load(filename)
loadbytes(data)
webstruct.loaders.load_trees(patterns, verbose=False)

Load HTML data from several paths or glob patterns, optionally using a different loader for each pattern. Return a list of lxml trees.

patterns should be a list of tuples (glob_pattern, loader).

Example:

>>> from webstruct.loaders import HtmlLoader, load_trees
>>> loader = HtmlLoader()
>>> patterns = [('path1/*.html', loader), ('path2/*.html', loader)]
>>> trees = load_trees(patterns)
webstruct.loaders.load_trees_from_files(pattern, loader, verbose=False)

Load HTML data from all files matching the glob pattern given in pattern, using the loader passed as loader.
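The relationship between the two functions can be sketched with the standard library alone. This is an assumption about the general shape, not webstruct's actual code: the *_sketch functions and the PathEchoLoader class are hypothetical, sorting the matched paths is a choice made here for reproducibility, and the stand-in loader returns file names instead of lxml trees so the example runs without webstruct installed.

```python
import glob
import os
import tempfile


def load_trees_from_files_sketch(pattern, loader, verbose=False):
    """Call loader.load() on every file matched by a single glob pattern."""
    trees = []
    for path in sorted(glob.glob(pattern)):  # sorted: an assumption of this sketch
        if verbose:
            print(path)
        trees.append(loader.load(path))
    return trees


def load_trees_sketch(patterns, verbose=False):
    """Process a list of (glob_pattern, loader) tuples and concatenate the results."""
    trees = []
    for pattern, loader in patterns:
        trees.extend(load_trees_from_files_sketch(pattern, loader, verbose))
    return trees


class PathEchoLoader:
    """Hypothetical stand-in for a loader: returns the base file name, not a tree."""

    def load(self, filename):
        return os.path.basename(filename)


with tempfile.TemporaryDirectory() as root:
    for name in ("a.html", "b.html", "notes.txt"):
        with open(os.path.join(root, name), "w") as f:
            f.write("<html></html>")
    loaded = load_trees_sketch([(os.path.join(root, "*.html"), PathEchoLoader())])

print(loaded)
```

Because each tuple carries its own loader, files annotated with different tools can be mixed in one call, for example one pattern paired with a GateLoader and another with a WebAnnotatorLoader.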