class webstruct.utils.BestMatch(known)[source]

Bases: object

Class for finding the best non-overlapping matches in a sequence of tokens. Override the get_sorted_ranges() method to define which results are best.

get_sorted_ranges(ranges, tokens)[source]
class webstruct.utils.LongestMatch(known)[source]

Bases: webstruct.utils.BestMatch

Class for finding the longest non-overlapping matches in a sequence of tokens.

>>> known = {'North Las', 'North Las Vegas', 'North Pole', 'Vegas USA', 'Las Vegas', 'USA', "Toronto"}
>>> lm = LongestMatch(known)
>>> lm.max_length
3
>>> tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
>>> for start, end, matched_text in lm.find_ranges(tokens):
...     print(start, end, tokens[start:end], matched_text)
0 1 ['Toronto'] Toronto
2 5 ['North', 'Las', 'Vegas'] North Las Vegas
5 6 ['USA'] USA

LongestMatch also accepts a dict instead of a list/set for the known argument. In this case, the dict keys are used:

>>> lm = LongestMatch({'North': 'direction', 'North Las Vegas': 'location'})
>>> tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
>>> for start, end, matched_text in lm.find_ranges(tokens):
...     print(start, end, tokens[start:end], matched_text)
2 5 ['North', 'Las', 'Vegas'] North Las Vegas
get_sorted_ranges(ranges, tokens)[source]
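As a rough illustration of what get_sorted_ranges() controls, the sketch below collects every candidate range whose joined text is known, sorts the candidates longest first, and keeps each one that does not overlap an already accepted range. It is a standalone approximation and not webstruct's implementation; find_longest_ranges and its arguments are hypothetical names.

```python
def find_longest_ranges(tokens, known, max_length):
    """Greedy longest-first selection of non-overlapping known phrases."""
    # Collect every (start, end, text) whose space-joined tokens are known.
    candidates = []
    for start in range(len(tokens)):
        last_end = min(start + max_length, len(tokens))
        for end in range(start + 1, last_end + 1):
            text = " ".join(tokens[start:end])
            if text in known:
                candidates.append((start, end, text))

    # Longest ranges first; ties are broken by the earlier start.
    candidates.sort(key=lambda r: (-(r[1] - r[0]), r[0]))

    taken = set()  # token indices already covered by an accepted range
    accepted = []
    for start, end, text in candidates:
        if any(i in taken for i in range(start, end)):
            continue
        taken.update(range(start, end))
        accepted.append((start, end, text))
    return sorted(accepted)


known = {"North Las", "North Las Vegas", "North Pole", "Vegas USA",
         "Las Vegas", "USA", "Toronto"}
tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
ranges = find_longest_ranges(tokens, known, max_length=3)
```

Changing the sort key is the part that corresponds to overriding get_sorted_ranges(): a different ordering of candidates gives a different notion of "best".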
webstruct.utils.flatten(sequence) → list[source]

Return a single, flat list which contains all elements retrieved from the sequence and all recursively contained sub-sequences (iterables).


>>> [1, 2, [3,4], (5,6)]
[1, 2, [3, 4], (5, 6)]
>>> flatten([[[1,2,3], (42,None)], [4,5], [6], 7, (8,9,10)])
[1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]
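A minimal recursive sketch of the same idea follows. It assumes that only lists and tuples count as sub-sequences, so strings are treated as single elements; flatten_sketch is a hypothetical name, and the library function may treat other iterables differently.

```python
def flatten_sketch(sequence):
    """Return a flat list of all elements in nested lists and tuples."""
    result = []
    for item in sequence:
        if isinstance(item, (list, tuple)):
            # Recurse into the sub-sequence and splice its elements in.
            result.extend(flatten_sketch(item))
        else:
            result.append(item)
    return result


flat = flatten_sketch([[[1, 2, 3], (42, None)], [4, 5], [6], 7, (8, 9, 10)])
```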
>>> sorted(get_combined_keys([{'foo': 'egg'}, {'bar': 'spam'}]))
['bar', 'foo']
>>> get_domain("http://example.com/path")
'example.com'
>>> get_domain("https://hello.example.com/foo/bar")
'example.com'
>>> get_domain("http://hello.example.co.uk/foo?bar=1")
'example.co.uk'
webstruct.utils.html_document_fromstring(data, encoding=None)[source]

Load an HTML document from a string using lxml.html.HTMLParser.

webstruct.utils.kill_html_tags(doc, tagnames, keep_child=True)[source]
>>> from lxml.html import fragment_fromstring, tostring
>>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
>>> kill_html_tags(root, ['h1'])
>>> tostring(root).decode()
'<div>head 1</div>'
>>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
>>> kill_html_tags(root, ['h1'], False)
>>> tostring(root).decode()
'<div></div>'
>>> sorted(merge_dicts({'foo': 'bar'}, {'bar': 'baz'}).items())
[('bar', 'baz'), ('foo', 'bar')]
webstruct.utils.replace_html_tags(root, tag_replaces)[source]

Replace the tags of lxml elements.

>>> from lxml.html import fragment_fromstring, document_fromstring, tostring
>>> root = fragment_fromstring('<h1>head 1</h1>')
>>> replace_html_tags(root, {'h1': 'strong'})
>>> tostring(root).decode()
'<strong>head 1</strong>'
>>> root = document_fromstring('<h1>head 1</h1> <H2>head 2</H2>')
>>> replace_html_tags(root, {'h1': 'strong', 'h2': 'strong', 'h3': 'strong', 'h4': 'strong'})
>>> tostring(root).decode()
'<html><body><strong>head 1</strong> <strong>head 2</strong></body></html>'
webstruct.utils.run_command(args, verbose=True)[source]

Execute a command in a subprocess, terminate it if an exception occurs, and raise a CalledProcessError exception if the command returns a non-zero exit code.

If verbose is True, print the output as it appears using print. Unlike subprocess.check_call, this function doesn't assume that stdout has a file descriptor, which allows printing to work in an IPython notebook.


>>> run_command(["python", "-c", "print(1+2)"])
3
>>> run_command(["python", "-c", "print(1+2)"], verbose=False)
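The behaviour described above can be approximated with subprocess.Popen, reading the child's output line by line and echoing it with print. This is only a sketch under assumptions: run_and_stream is a hypothetical name, and returning the captured text is a choice made for this example, not something the documentation states about run_command.

```python
import subprocess
import sys


def run_and_stream(args, verbose=True):
    """Run a command, echo its output via print, raise on non-zero exit."""
    proc = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge stderr into the same stream
        universal_newlines=True,    # decode bytes to text
    )
    lines = []
    try:
        for line in proc.stdout:
            lines.append(line)
            if verbose:
                # print() goes through sys.stdout, so no real file
                # descriptor is required (useful inside notebooks).
                print(line, end="")
        returncode = proc.wait()
    except BaseException:
        # Stop the child if anything goes wrong, including Ctrl-C.
        proc.terminate()
        proc.wait()
        raise
    finally:
        proc.stdout.close()
    if returncode != 0:
        raise subprocess.CalledProcessError(returncode, args)
    return "".join(lines)


output = run_and_stream([sys.executable, "-c", "print(1+2)"], verbose=False)
```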

Join tokens without adding unneeded spaces before punctuation:

>>> smart_join(['Hello', ',', 'world', '!'])
'Hello, world!'

>>> smart_join(['(', '303', ')', '444-7777'])
'(303) 444-7777'
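One way to get this effect, shown here only as a sketch, is to join the tokens with spaces and then remove the spaces that end up before closing punctuation or after opening brackets. The name smart_join_sketch and the exact punctuation sets are assumptions; the library may use different rules.

```python
import re


def smart_join_sketch(tokens):
    """Join tokens with spaces, then tidy spacing around punctuation."""
    text = " ".join(tokens)
    # Drop whitespace before closing punctuation: "Hello ," -> "Hello,".
    text = re.sub(r"\s+([,.!?;:)\]])", r"\1", text)
    # Drop whitespace after opening brackets: "( 303" -> "(303".
    text = re.sub(r"([(\[])\s+", r"\1", text)
    return text


greeting = smart_join_sketch(["Hello", ",", "world", "!"])
phone = smart_join_sketch(["(", "303", ")", "444-7777"])
```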
webstruct.utils.substrings(txt, min_length, max_length, pad='')[source]
>>> substrings("abc", 1, 100)
['a', 'ab', 'abc', 'b', 'bc', 'c']
>>> substrings("abc", 2, 100)
['ab', 'abc', 'bc']
>>> substrings("abc", 1, 2)
['a', 'ab', 'b', 'bc', 'c']
>>> substrings("abc", 1, 3, '$')
['$a', 'a', '$ab', 'ab', '$abc', 'abc', 'abc$', 'b', 'bc', 'bc$', 'c', 'c$']
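The examples above are consistent with the following reading: enumerate substrings by start position and then by length, and when pad is non-empty, also emit a pad-prefixed copy for substrings that start at the beginning of the text and a pad-suffixed copy for those that reach its end. The sketch below implements that reading; it is inferred from the doctests alone, and substrings_sketch is a hypothetical name.

```python
def substrings_sketch(txt, min_length, max_length, pad=""):
    """List substrings of txt with lengths in [min_length, max_length]."""
    result = []
    for start in range(len(txt)):
        longest = min(max_length, len(txt) - start)
        for length in range(min_length, longest + 1):
            end = start + length
            piece = txt[start:end]
            if pad and start == 0:
                result.append(pad + piece)   # marks the start of the text
            result.append(piece)
            if pad and end == len(txt):
                result.append(piece + pad)   # marks the end of the text
    return result


padded = substrings_sketch("abc", 1, 3, "$")
```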
webstruct.utils.train_test_split_noshuffle(*arrays, **options)[source]

Split arrays or matrices into train and test subsets without shuffling.

It allows you to write

X_train, X_test, y_train, y_test = train_test_split_noshuffle(X, y, test_size=test_size)

instead of

X_train, X_test = X[:-test_size], X[-test_size:]
y_train, y_test = y[:-test_size], y[-test_size:]

Parameters

*arrays : sequence of lists

test_size : float, int, or None (default is None)

If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, test size is set to 0.25.

Returns

splitting : list of lists, length=2 * len(arrays)

List containing the train-test split of the input arrays.


>>> train_test_split_noshuffle([1,2,3], ['a', 'b', 'c'], test_size=1)
[[1, 2], [3], ['a', 'b'], ['c']]
>>> train_test_split_noshuffle([1,2,3,4], ['a', 'b', 'c', 'd'], test_size=0.5)
[[1, 2], [3, 4], ['a', 'b'], ['c', 'd']]
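The splitting logic can be sketched in a few lines of plain Python. This is an illustration only: split_noshuffle_sketch is a hypothetical name, and rounding a fractional test size up with math.ceil is an assumption, since the text above does not say how fractions are rounded.

```python
import math


def split_noshuffle_sketch(*arrays, test_size=None):
    """Split each array into a head (train) and a tail (test) part."""
    if test_size is None:
        test_size = 0.25
    n_samples = len(arrays[0])
    if isinstance(test_size, float):
        # A float is a proportion of the dataset (rounding is assumed).
        n_test = int(math.ceil(test_size * n_samples))
    else:
        # An int is an absolute number of test samples.
        n_test = test_size
    cut = n_samples - n_test  # index where the test part begins
    result = []
    for arr in arrays:
        result.append(arr[:cut])
        result.append(arr[cut:])
    return result


parts = split_noshuffle_sketch([1, 2, 3], ["a", "b", "c"], test_size=1)
```

Using an explicit cut index instead of X[:-test_size] also behaves sensibly when the test size is 0, because X[:-0] would be an empty slice.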

Key function for sorting strings according to numerical value.


A version of sorted that uses alphanum_key() as the key function.
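A common way to build such a key, sketched below, is to split the string into digit and non-digit runs and convert the digit runs to integers, so that "file10" sorts after "file2". The names alphanum_key_sketch and human_sorted_sketch are hypothetical, and the library's own key function may differ in detail.

```python
import re


def alphanum_key_sketch(s):
    """Split 'file10' into ['file', 10, ''] so digit runs compare as ints."""
    return [int(part) if part.isdigit() else part
            for part in re.split(r"(\d+)", s)]


def human_sorted_sketch(iterable):
    """sorted() with the natural-order key above."""
    return sorted(iterable, key=alphanum_key_sketch)


names = human_sorted_sketch(["file10", "file2", "file1"])
```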

Text Tokenization

class webstruct.text_tokenizers.DefaultTokenizer[source]
class webstruct.text_tokenizers.TextToken

chars

Alias for field number 0

length

Alias for field number 2

position

Alias for field number 1
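The field layout suggests a namedtuple with chars, position, and length, in that order. The sketch below builds an equivalent tuple type and fills it from a plain whitespace split, to show how position and length point back into the original string. TextTokenSketch and whitespace_segment are hypothetical names, and a whitespace split is much simpler than the tokenizers documented here.

```python
import re
from collections import namedtuple

# Same field order as shown in the TextToken reprs in this document.
TextTokenSketch = namedtuple("TextTokenSketch", ["chars", "position", "length"])


def whitespace_segment(text):
    """Return one token per run of non-whitespace characters."""
    return [
        TextTokenSketch(m.group(), m.start(), m.end() - m.start())
        for m in re.finditer(r"\S+", text)
    ]


text = "population of 100,000"
tokens = whitespace_segment(text)
# Each token can be mapped back to its source span.
spans = [text[t.position:t.position + t.length] for t in tokens]
```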

class webstruct.text_tokenizers.WordTokenizer[source]

This tokenizer is a copy-pasted version of TreebankWordTokenizer that doesn't split on '@' and ':' symbols and doesn't split contractions. Its segment_words() method is the counterpart of span_tokenize() in NLTK tokenizers:

>>> s = '''Good muffins cost $3.88\nin New York. Email: muffins@gmail.com'''
>>> WordTokenizer().segment_words(s)
[TextToken(chars='Good', position=0, length=4),
 TextToken(chars='muffins', position=5, length=7),
 TextToken(chars='cost', position=13, length=4),
 TextToken(chars='$', position=18, length=1),
 TextToken(chars='3.88', position=19, length=4),
 TextToken(chars='in', position=24, length=2),
 TextToken(chars='New', position=27, length=3),
 TextToken(chars='York.', position=31, length=5),
 TextToken(chars='Email:', position=37, length=6),
 TextToken(chars='muffins@gmail.com', position=44, length=17)]
>>> s = '''Shelbourne Road,'''
>>> WordTokenizer().segment_words(s)
[TextToken(chars='Shelbourne', position=0, length=10),
 TextToken(chars='Road', position=11, length=4),
 TextToken(chars=',', position=15, length=1)]
>>> s = '''population of 100,000'''
>>> WordTokenizer().segment_words(s)
[TextToken(chars='population', position=0, length=10),
 TextToken(chars='of', position=11, length=2),
 TextToken(chars='100,000', position=14, length=7)]
>>> s = '''Hello|World'''
>>> WordTokenizer().segment_words(s)
[TextToken(chars='Hello', position=0, length=5),
 TextToken(chars='|', position=5, length=1),
 TextToken(chars='World', position=6, length=5)]
>>> s2 = '"We beat some pretty good teams to get here," Slocum said.'
>>> WordTokenizer().segment_words(s2)  
[TextToken(chars='``', position=0, length=1),
 TextToken(chars='We', position=1, length=2),
 TextToken(chars='beat', position=4, length=4),
 TextToken(chars='some', position=9, length=4),
 TextToken(chars='pretty', position=14, length=6),
 TextToken(chars='good', position=21, length=4),
 TextToken(chars='teams', position=26, length=5),
 TextToken(chars='to', position=32, length=2),
 TextToken(chars='get', position=35, length=3),
 TextToken(chars='here', position=39, length=4),
 TextToken(chars=',', position=43, length=1),
 TextToken(chars="''", position=44, length=1),
 TextToken(chars='Slocum', position=46, length=6),
 TextToken(chars='said', position=53, length=4),
 TextToken(chars='.', position=57, length=1)]
>>> s3 = '''Well, we couldn't have this predictable,
... cliche-ridden, \"Touched by an
... Angel\" (a show creator John Masius
... worked on) wanna-be if she didn't.'''
>>> WordTokenizer().segment_words(s3)  
[TextToken(chars='Well', position=0, length=4),
 TextToken(chars=',', position=4, length=1),
 TextToken(chars='we', position=6, length=2),
 TextToken(chars="couldn't", position=9, length=8),
 TextToken(chars='have', position=18, length=4),
 TextToken(chars='this', position=23, length=4),
 TextToken(chars='predictable', position=28, length=11),
 TextToken(chars=',', position=39, length=1),
 TextToken(chars='cliche-ridden', position=41, length=13),
 TextToken(chars=',', position=54, length=1),
 TextToken(chars='``', position=56, length=1),
 TextToken(chars='Touched', position=57, length=7),
 TextToken(chars='by', position=65, length=2),
 TextToken(chars='an', position=68, length=2),
 TextToken(chars='Angel', position=71, length=5),
 TextToken(chars="''", position=76, length=1),
 TextToken(chars='(', position=78, length=1),
 TextToken(chars='a', position=79, length=1),
 TextToken(chars='show', position=81, length=4),
 TextToken(chars='creator', position=86, length=7),
 TextToken(chars='John', position=94, length=4),
 TextToken(chars='Masius', position=99, length=6),
 TextToken(chars='worked', position=106, length=6),
 TextToken(chars='on', position=113, length=2),
 TextToken(chars=')', position=115, length=1),
 TextToken(chars='wanna-be', position=117, length=8),
 TextToken(chars='if', position=126, length=2),
 TextToken(chars='she', position=129, length=3),
 TextToken(chars="didn't", position=133, length=6),
 TextToken(chars='.', position=139, length=1)]
>>> WordTokenizer().segment_words('"')
[TextToken(chars='``', position=0, length=1)]
>>> WordTokenizer().segment_words('" a')
[TextToken(chars='``', position=0, length=1),
 TextToken(chars='a', position=2, length=1)]
>>> WordTokenizer().segment_words('["a')
[TextToken(chars='[', position=0, length=1),
 TextToken(chars='``', position=1, length=1),
 TextToken(chars='a', position=2, length=1)]

Some issues:

>>> WordTokenizer().segment_words("Copyright © 2014 Foo Bar and Buzz Spam. All Rights Reserved.")
[TextToken(chars='Copyright', position=0, length=9),
 TextToken(chars=u'\xa9', position=10, length=1),
 TextToken(chars='2014', position=12, length=4),
 TextToken(chars='Foo', position=17, length=3),
 TextToken(chars='Bar', position=21, length=3),
 TextToken(chars='and', position=25, length=3),
 TextToken(chars='Buzz', position=29, length=4),
 TextToken(chars='Spam.', position=34, length=5),
 TextToken(chars='All', position=40, length=3),
 TextToken(chars='Rights', position=44, length=6),
 TextToken(chars='Reserved', position=51, length=8),
 TextToken(chars='.', position=59, length=1)]
open_quotes = <_sre.SRE_Pattern object>
rules = [(<_sre.SRE_Pattern object>, u''), (<_sre.SRE_Pattern object>, u'``'), (<_sre.SRE_Pattern object>, u"''"), (<_sre.SRE_Pattern object>, None), (<_sre.SRE_Pattern object>, u'...'), (<_sre.SRE_Pattern object>, None), (<_sre.SRE_Pattern object>, None), (<_sre.SRE_Pattern object>, None), (<_sre.SRE_Pattern object>, None), (<_sre.SRE_Pattern object>, None)]
webstruct.text_tokenizers.tokenize(self, text)

Sequence Encoding

class webstruct.sequence_encoding.InputTokenProcessor(tagset=None)[source]
>>> tp = InputTokenProcessor()
>>> tp.classify('foo')
('token', 'foo')
>>> tp.classify('__START_ORG__')
('start', 'ORG')
>>> tp.classify('__END_ORG__')
('end', 'ORG')
class webstruct.sequence_encoding.IobEncoder(token_processor=None)[source]

Utility class for encoding tagged token streams using IOB2 encoding.

Encode input tokens using encode method:

>>> iob_encoder = IobEncoder()
>>> input_tokens = ["__START_PER__", "John", "__END_PER__", "said"]
>>> def encode(encoder, tokens): return [p for p in IobEncoder.from_indices(encoder.encode(tokens), tokens)]
>>> encode(iob_encoder, input_tokens)
[('John', 'B-PER'), ('said', 'O')]

>>> input_tokens = ["hello", "__START_PER__", "John", "Doe", "__END_PER__", "__START_PER__", "Mary", "__END_PER__", "said"]
>>> tokens = encode(iob_encoder, input_tokens)
>>> tokens, tags = iob_encoder.split(tokens)
>>> tokens, tags
(['hello', 'John', 'Doe', 'Mary', 'said'], ['O', 'B-PER', 'I-PER', 'B-PER', 'O'])

Note that IobEncoder is stateful. This means you can encode an incomplete stream and continue the encoding later:

>>> iob_encoder = IobEncoder()
>>> input_tokens_partial = ["__START_PER__", "John"]
>>> encode(iob_encoder, input_tokens_partial)
[('John', 'B-PER')]
>>> input_tokens_partial = ["Mayer", "__END_PER__", "said"]
>>> encode(iob_encoder, input_tokens_partial)
[('Mayer', 'I-PER'), ('said', 'O')]

To reset internal state, use reset method:

>>> iob_encoder.reset()

Group results to entities:

>>> iob_encoder.group(encode(iob_encoder, input_tokens))
[(['hello'], 'O'), (['John', 'Doe'], 'PER'), (['Mary'], 'PER'), (['said'], 'O')]

The input token stream is processed by InputTokenProcessor() by default; you can pass another token-processing class to customize which tokens are considered start/end tags.
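To make the encoding concrete, here is a small standalone sketch of IOB2 tagging driven by __START_X__ and __END_X__ markers. It is not the IobEncoder implementation: iob_encode_sketch is a hypothetical name, it is stateless, and it returns (token, tag) pairs directly rather than indices.

```python
import re

# Markers such as __START_PER__ or __END_ORG__.
_MARKER = re.compile(r"^__(START|END)_(\w+?)__$")


def iob_encode_sketch(tokens):
    """Turn a marked-up token list into (token, IOB2 tag) pairs."""
    current = None   # tag of the entity we are inside, if any
    first = False    # True until the first token of the entity is emitted
    encoded = []
    for token in tokens:
        match = _MARKER.match(token)
        if match:
            kind, name = match.groups()
            if kind == "START":
                current, first = name, True
            else:
                current = None
            continue  # markers themselves are not emitted
        if current is None:
            encoded.append((token, "O"))
        elif first:
            encoded.append((token, "B-" + current))
            first = False
        else:
            encoded.append((token, "I-" + current))
    return encoded


pairs = iob_encode_sketch(
    ["hello", "__START_PER__", "John", "Doe", "__END_PER__",
     "__START_PER__", "Mary", "__END_PER__", "said"]
)
```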

classmethod from_indices(indices, input_tokens)[source]
classmethod group(data, strict=False)[source]

Group IOB2-encoded entities. data should be an iterable of (info, iob_tag) tuples. info could be any Python object, iob_tag should be a string with a tag.


>>> data = [("hello", "O"), (",", "O"), ("John", "B-PER"),
...         ("Doe", "I-PER"), ("Mary", "B-PER"), ("said", "O")]
>>> for items, tag in IobEncoder.iter_group(data):
...     print("%s %s" % (items, tag))
['hello', ','] O
['John', 'Doe'] PER
['Mary'] PER
['said'] O

By default, invalid sequences are fixed:

>>> data = [("hello", "O"), ("John", "I-PER"), ("Doe", "I-PER")]
>>> for items, tag in IobEncoder.iter_group(data):
...     print("%s %s" % (items, tag))
['hello'] O
['John', 'Doe'] PER

Pass the strict=True argument to raise an exception for invalid sequences:

>>> for items, tag in IobEncoder.iter_group(data, strict=True):
...     print("%s %s" % (items, tag))
Traceback (most recent call last):
ValueError: Invalid sequence: I-PER tag can't start sequence
classmethod iter_group(data, strict=False)[source]
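The grouping behaviour described for group() can be sketched as a generator that buffers items until the tag changes. This is an approximation written from the examples in this section, not the library code; iter_group_sketch is a hypothetical name.

```python
def iter_group_sketch(data, strict=False):
    """Yield (items, tag) groups from (info, iob_tag) pairs."""
    buffer, tag = [], None  # tag is None until the first item is seen
    for info, iob_tag in data:
        if iob_tag.startswith("I-") and tag != iob_tag[2:]:
            # An I- tag with no matching open entity is invalid.
            if strict:
                raise ValueError(
                    "Invalid sequence: %s tag can't start sequence" % iob_tag
                )
            iob_tag = "B-" + iob_tag[2:]  # repair it as a new entity

        if iob_tag.startswith("B-"):
            if buffer:
                yield buffer, tag
            buffer, tag = [], iob_tag[2:]
        elif iob_tag == "O" and tag != "O":
            if buffer:
                yield buffer, tag
            buffer, tag = [], "O"
        buffer.append(info)
    if buffer:
        yield buffer, tag


data = [("hello", "O"), (",", "O"), ("John", "B-PER"),
        ("Doe", "I-PER"), ("Mary", "B-PER"), ("said", "O")]
groups = list(iter_group_sketch(data))
```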

Reset the sequence


Split a [(token, tag)] list into a ([token], [tag]) tuple.

Webpage domain inferring

Module for getting the most likely base URL (domain) for a page. It is useful if you've downloaded HTML files but haven't preserved the URLs explicitly, and you still want cross-validation to be done right. Grouping pages by domain name is a reasonable way to do that.

WebAnnotator data has either <base> tags with the original URLs (or at least the original domains) or commented-out base tags.

Unfortunately, GATE-annotated data doesn't have this tag. So the idea is to use the most popular domain mentioned in a page as the page's domain.


Return the href of a base tag; the base tag may be commented out.

webstruct.infer_domain.get_tree_domain(tree, blacklist=set(['flickr.com', 'pinterest.com', 'youtube.com', 'google.com', 'fonts.com', 'paypal.com', 'twitter.com', 'fonts.net', 'addthis.com', 'facebook.com', 'googleapis.com', 'linkedin.com']), get_domain=<function get_domain>)[source]

Return the most likely domain for the tree. The domain is extracted from the base tag, or guessed if there is no base tag. If the domain can't be detected, an empty string is returned.

webstruct.infer_domain.guess_domain(tree, blacklist=set(['flickr.com', 'pinterest.com', 'youtube.com', 'google.com', 'fonts.com', 'paypal.com', 'twitter.com', 'fonts.net', 'addthis.com', 'facebook.com', 'googleapis.com', 'linkedin.com']), get_domain=<function get_domain>)[source]

Return the most common domain that is not in the blacklist.
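The "most popular domain" idea can be illustrated with collections.Counter. The sketch below differs from the functions above in ways that are assumptions made for the example: it takes a plain list of URLs instead of an lxml tree, it uses the URL hostname (minus a leading "www.") rather than get_domain, and guess_domain_sketch is a hypothetical name.

```python
from collections import Counter
from urllib.parse import urlparse


def guess_domain_sketch(urls, blacklist=frozenset(["facebook.com", "twitter.com"])):
    """Return the most frequent non-blacklisted hostname, or ''."""
    counts = Counter()
    for url in urls:
        host = urlparse(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]
        if host and host not in blacklist:
            counts[host] += 1
    if not counts:
        return ""  # mirrors the "empty string if undetected" convention
    return counts.most_common(1)[0][0]


links = [
    "http://www.example.com/about",
    "http://example.com/contact",
    "https://facebook.com/example",
    "https://facebook.com/share",
    "https://facebook.com/like",
    "http://other.org/",
]
domain = guess_domain_sketch(links)
```

The blacklist matters because social and CDN links often outnumber a site's own links, as the facebook.com entries do in this sample.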