Miscellaneous

Utils

class webstruct.utils.BestMatch(known)[source]

Bases: object

Class for finding the best non-overlapping matches in a sequence of tokens. Override the get_sorted_ranges() method to define which results are considered best.

find_ranges(tokens)[source]
get_sorted_ranges(ranges, tokens)[source]
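
A minimal subclassing sketch (an illustration, not part of the library; it assumes get_sorted_ranges() receives the candidate ranges as (start, end, matched_text) tuples, the same shape that find_ranges() yields for LongestMatch below, and that candidates sorted earlier win when ranges overlap):

>>> from webstruct.utils import BestMatch
>>> class ShortestMatch(BestMatch):
...     """Illustrative subclass: prefer the shortest candidate matches."""
...     def get_sorted_ranges(self, ranges, tokens):
...         # ranges: assumed to be (start, end, matched_text) candidates;
...         # returning them shortest-first makes short matches win on overlap.
...         return sorted(ranges, key=lambda r: r[1] - r[0])
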
class webstruct.utils.LongestMatch(known)[source]

Bases: webstruct.utils.BestMatch

Class for finding the longest non-overlapping matches in a sequence of tokens.

>>> known = {'North Las', 'North Las Vegas', 'North Pole', 'Vegas USA', 'Las Vegas', 'USA', "Toronto"}
>>> lm = LongestMatch(known)
>>> lm.max_length
3
>>> tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
>>> for start, end, matched_text in lm.find_ranges(tokens):
...     print(start, end, tokens[start:end], matched_text)
(0, 1, ['Toronto'], 'Toronto')
(2, 5, ['North', 'Las', 'Vegas'], 'North Las Vegas')
(5, 6, ['USA'], 'USA')
get_sorted_ranges(ranges, tokens)[source]
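
The ranges returned by LongestMatch.find_ranges() can be used, for example, to wrap gazetteer matches in marker tokens (an illustrative sketch; the CITY label and the marker format are hypothetical, chosen to mirror the __START_*__ / __END_*__ convention used by IobEncoder in the Sequence Encoding section below):

>>> from webstruct.utils import LongestMatch
>>> lm = LongestMatch({'North Las Vegas', 'Las Vegas'})
>>> tokens = ["to", "North", "Las", "Vegas", "USA"]
>>> out, prev_end = [], 0
>>> for start, end, match in lm.find_ranges(tokens):
...     out += tokens[prev_end:start]
...     out += ["__START_CITY__"] + tokens[start:end] + ["__END_CITY__"]
...     prev_end = end
>>> out + tokens[prev_end:]
['to', '__START_CITY__', 'North', 'Las', 'Vegas', '__END_CITY__', 'USA']
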
webstruct.utils.flatten(sequence) → list[source]

Return a single, flat list which contains all elements retrieved from the sequence and all recursively contained sub-sequences (iterables).

Examples:

>>> flatten([1, 2, [3,4], (5,6)])
[1, 2, 3, 4, 5, 6]
>>> flatten([[[1,2,3], (42,None)], [4,5], [6], 7, (8,9,10)])
[1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]
webstruct.utils.get_combined_keys(dicts)[source]

Return the keys combined from all the passed dicts:

>>> sorted(get_combined_keys([{'foo': 'egg'}, {'bar': 'spam'}]))
['bar', 'foo']
webstruct.utils.html_document_fromstring(data, encoding=None)[source]

Load an HTML document from a string using lxml.html.HTMLParser.
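
A small usage sketch (illustrative; it assumes the return value is the root lxml element, as with lxml.html.document_fromstring):

>>> from webstruct.utils import html_document_fromstring
>>> root = html_document_fromstring(b'<html><body><p>Hello</p></body></html>', encoding='utf-8')
>>> root.tag
'html'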

webstruct.utils.kill_html_tags(doc, tagnames, keep_child=True)[source]

Remove elements with the given tag names from the tree. If keep_child is True (the default), the text and children of a removed element are kept in place; if it is False, the whole subtree is dropped:

>>> from lxml.html import fragment_fromstring, tostring
>>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
>>> kill_html_tags(root, ['h1'])
>>> tostring(root)
'<div>head 1</div>'
>>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
>>> kill_html_tags(root, ['h1'], False)
>>> tostring(root)
'<div></div>'
webstruct.utils.merge_dicts(*dicts)[source]

Merge all the passed dicts into a single dict:

>>> sorted(merge_dicts({'foo': 'bar'}, {'bar': 'baz'}).items())
[('bar', 'baz'), ('foo', 'bar')]
webstruct.utils.replace_html_tags(root, tag_replaces)[source]

Replace element tags throughout the tree according to the tag_replaces mapping (old tag → new tag).

>>> from lxml.html import fragment_fromstring, document_fromstring, tostring
>>> root = fragment_fromstring('<h1>head 1</h1>')
>>> replace_html_tags(root, {'h1': 'strong'})
>>> tostring(root)
'<strong>head 1</strong>'
>>> root = document_fromstring('<h1>head 1</h1> <H2>head 2</H2>')
>>> replace_html_tags(root, {'h1': 'strong', 'h2': 'strong', 'h3': 'strong', 'h4': 'strong'})
>>> tostring(root)
'<html><body><strong>head 1</strong> <strong>head 2</strong></body></html>'
webstruct.utils.run_command(args, verbose=True)[source]

Execute a command in a subprocess; terminate the subprocess if an exception occurs, and raise a CalledProcessError if the command returns a non-zero exit code.

If verbose is True, the command's output is printed as it appears, using print. Unlike subprocess.check_call, it doesn't assume that stdout has a file descriptor; this allows printing to work in an IPython notebook.
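
A minimal usage sketch (illustrative; the output naturally depends on the environment, so the example is marked to be skipped by doctest):

>>> from webstruct.utils import run_command
>>> run_command(['echo', 'hello'])  # doctest: +SKIP
hello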

webstruct.utils.smart_join(tokens)[source]

Join tokens without adding unneeded spaces before punctuation:

>>> smart_join(['Hello', ',', 'world', '!'])
'Hello, world!'

>>> smart_join(['(', '303', ')', '444-7777'])
'(303) 444-7777'
webstruct.utils.substrings(txt, min_length=2, max_length=10, pad='')[source]
>>> substrings("abc", 1)
['a', 'ab', 'abc', 'b', 'bc', 'c']
>>> substrings("abc", 2)
['ab', 'abc', 'bc']
>>> substrings("abc", 1, 2)
['a', 'ab', 'b', 'bc', 'c']
>>> substrings("abc", 1, 3, '$')
['$a', 'a', '$ab', 'ab', '$abc', 'abc', 'abc$', 'b', 'bc', 'bc$', 'c', 'c$']
webstruct.utils.tostr(val)[source]
webstruct.utils.alphanum_key(s)[source]

Key function for natural ("human") sorting: digit runs inside a string are compared by their numerical value instead of character by character.

webstruct.utils.human_sorted()

sorted() that uses alphanum_key() as the key function, giving a natural, human-friendly ordering.
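
For example (an illustrative sketch, assuming human_sorted() accepts the same arguments as sorted()):

>>> from webstruct.utils import human_sorted
>>> sorted(['item10', 'item2', 'item1'])
['item1', 'item10', 'item2']
>>> human_sorted(['item10', 'item2', 'item1'])
['item1', 'item2', 'item10']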

Text Tokenization

class webstruct.tokenizers.DefaultTokenizer[source]
tokenize(text)[source]
class webstruct.tokenizers.WordTokenizer[source]

This tokenizer is a copy-pasted version of nltk's TreebankWordTokenizer that doesn't split on '@' and ':' symbols and doesn't split contractions:

>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> s = u'''Good muffins cost $3.88\nin New York. Email: muffins@gmail.com'''
>>> TreebankWordTokenizer().tokenize(s)
[u'Good', u'muffins', u'cost', u'$', u'3.88', u'in', u'New', u'York.', u'Email', u':', u'muffins', u'@', u'gmail.com']
>>> WordTokenizer().tokenize(s)
[u'Good', u'muffins', u'cost', u'$', u'3.88', u'in', u'New', u'York.', u'Email:', u'muffins@gmail.com']

>>> s = u'''Shelbourne Road,'''
>>> WordTokenizer().tokenize(s)
[u'Shelbourne', u'Road', u',']

>>> s = u'''population of 100,000'''
>>> WordTokenizer().tokenize(s)
[u'population', u'of', u'100,000']

>>> s = u'''Hello|World'''
>>> WordTokenizer().tokenize(s)
[u'Hello', u'|', u'World']
tokenize(text)[source]
webstruct.tokenizers.tokenize(self, text)

Sequence Encoding

class webstruct.sequence_encoding.InputTokenProcessor(tagset=None)[source]
classify(token)[source]
>>> tp = InputTokenProcessor()
>>> tp.classify('foo')
('token', 'foo')
>>> tp.classify('__START_ORG__')
('start', 'ORG')
>>> tp.classify('__END_ORG__')
('end', 'ORG')
class webstruct.sequence_encoding.IobEncoder(token_processor=None)[source]

Utility class for encoding tagged token streams using IOB2 encoding.

Encode input tokens using the encode method:

>>> iob_encoder = IobEncoder()
>>> input_tokens = ["__START_PER__", "John", "__END_PER__", "said"]
>>> iob_encoder.encode(input_tokens)
[('John', 'B-PER'), ('said', 'O')]

Get the result in another format using the encode_split method:

>>> input_tokens = ["hello", "__START_PER__", "John", "Doe", "__END_PER__", "__START_PER__", "Mary", "__END_PER__", "said"]
>>> tokens, tags = iob_encoder.encode_split(input_tokens)
>>> tokens, tags
(['hello', 'John', 'Doe', 'Mary', 'said'], ['O', 'B-PER', 'I-PER', 'B-PER', 'O'])

Note that IobEncoder is stateful: you can encode an incomplete stream and continue the encoding later:

>>> iob_encoder = IobEncoder()
>>> iob_encoder.encode(["__START_PER__", "John"])
[('John', 'B-PER')]
>>> iob_encoder.encode(["Mayer", "__END_PER__", "said"])
[('Mayer', 'I-PER'), ('said', 'O')]

To reset the internal state, use the reset method:

>>> iob_encoder.reset()

Group results into entities:

>>> iob_encoder.group(iob_encoder.encode(input_tokens))
[(['hello'], 'O'), (['John', 'Doe'], 'PER'), (['Mary'], 'PER'), (['said'], 'O')]

The input token stream is processed by InputTokenProcessor() by default; pass a different token-processing class to customize which tokens are treated as start/end tags.
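
For instance, a sketch of a custom processor (a hypothetical class for illustration; it assumes IobEncoder only requires a classify() method returning the same ('token' | 'start' | 'end', value) pairs as InputTokenProcessor.classify() above):

>>> import re
>>> class AngleBracketTokenProcessor(object):
...     """Hypothetical processor: treats <PER> / </PER> as start/end tags."""
...     def classify(self, token):
...         m = re.match(r'^<(/?)([A-Z]+)>$', token)
...         if m is None:
...             return 'token', token
...         return ('end' if m.group(1) else 'start'), m.group(2)
>>> encoder = IobEncoder(token_processor=AngleBracketTokenProcessor())
>>> encoder.encode(["<PER>", "John", "</PER>", "said"])
[('John', 'B-PER'), ('said', 'O')]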

encode(input_tokens)[source]
encode_split(input_tokens)[source]

The same as encode, but returns a (tokens, tags) tuple.

classmethod group(data, strict=False)[source]

Group IOB2-encoded entities. data should be an iterable of (info, iob_tag) tuples; info can be any Python object, and iob_tag should be a tag string.

Example:

>>> data = [("hello", "O"), (",", "O"), ("John", "B-PER"),
...         ("Doe", "I-PER"), ("Mary", "B-PER"), ("said", "O")]
>>> for items, tag in IobEncoder.iter_group(data):
...     print("%s %s" % (items, tag))
['hello', ','] O
['John', 'Doe'] PER
['Mary'] PER
['said'] O

By default, invalid sequences are fixed:

>>> data = [("hello", "O"), ("John", "I-PER"), ("Doe", "I-PER")]
>>> for items, tag in IobEncoder.iter_group(data):
...     print("%s %s" % (items, tag))
['hello'] O
['John', 'Doe'] PER

Pass strict=True to raise an exception for invalid sequences:

>>> for items, tag in IobEncoder.iter_group(data, strict=True):
...     print("%s %s" % (items, tag))
Traceback (most recent call last):
...
ValueError: Invalid sequence: I-PER tag can't start sequence
iter_encode(input_tokens)[source]

The same as encode, but returns an iterator of (token, tag) pairs.
classmethod iter_group(data, strict=False)[source]
reset()[source]

Reset the sequence