Miscellaneous

Utils
- class webstruct.utils.BestMatch(known)

  Bases: object

  Class for finding best non-overlapping matches in a sequence of tokens.
  Override the get_sorted_ranges() method to define which results are best.
- class webstruct.utils.LongestMatch(known)

  Bases: webstruct.utils.BestMatch

  Class for finding longest non-overlapping matches in a sequence of tokens.

  >>> known = {'North Las', 'North Las Vegas', 'North Pole', 'Vegas USA', 'Las Vegas', 'USA', "Toronto"}
  >>> lm = LongestMatch(known)
  >>> lm.max_length
  3
  >>> tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
  >>> for start, end, matched_text in lm.find_ranges(tokens):
  ...     print(start, end, tokens[start:end], matched_text)
  0 1 ['Toronto'] Toronto
  2 5 ['North', 'Las', 'Vegas'] North Las Vegas
  5 6 ['USA'] USA

  LongestMatch also accepts a dict instead of a list/set for the known
  argument. In this case dict keys are used:

  >>> lm = LongestMatch({'North': 'direction', 'North Las Vegas': 'location'})
  >>> tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
  >>> for start, end, matched_text in lm.find_ranges(tokens):
  ...     print(start, end, tokens[start:end], matched_text)
  2 5 ['North', 'Las', 'Vegas'] North Las Vegas
- webstruct.utils.flatten(sequence) → list

  Return a single, flat list which contains all elements retrieved from
  the sequence and all recursively contained sub-sequences (iterables).

  Examples:

  >>> [1, 2, [3,4], (5,6)]
  [1, 2, [3, 4], (5, 6)]
  >>> flatten([[[1,2,3], (42,None)], [4,5], [6], 7, (8,9,10)])
  [1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]
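A minimal recursive sketch of this behaviour (the real implementation may differ, for instance in how it detects which items count as iterables; here only lists and tuples recurse):

```python
def flatten(sequence):
    """Recursively expand nested lists/tuples into one flat list."""
    result = []
    for item in sequence:
        if isinstance(item, (list, tuple)):
            result.extend(flatten(item))  # descend into sub-sequences
        else:
            result.append(item)
    return result

print(flatten([[[1, 2, 3], (42, None)], [4, 5], [6], 7, (8, 9, 10)]))
# [1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]
```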
- webstruct.utils.get_combined_keys(dicts)

  >>> sorted(get_combined_keys([{'foo': 'egg'}, {'bar': 'spam'}]))
  ['bar', 'foo']
- webstruct.utils.html_document_fromstring(data, encoding=None)

  Load an HTML document from a string using lxml.html.HTMLParser.

- webstruct.utils.kill_html_tags(doc, tagnames, keep_child=True)

  Remove elements with the given tag names from the tree. By default
  their children are preserved; pass keep_child=False to drop them too.

  >>> from lxml.html import fragment_fromstring, tostring
  >>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
  >>> kill_html_tags(root, ['h1'])
  >>> tostring(root).decode()
  '<div>head 1</div>'

  >>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
  >>> kill_html_tags(root, ['h1'], False)
  >>> tostring(root).decode()
  '<div></div>'
- webstruct.utils.merge_dicts(*dicts)

  >>> sorted(merge_dicts({'foo': 'bar'}, {'bar': 'baz'}).items())
  [('bar', 'baz'), ('foo', 'bar')]

- webstruct.utils.replace_html_tags(root, tag_replaces)

  Replace lxml elements' tag.

  >>> from lxml.html import fragment_fromstring, document_fromstring, tostring
  >>> root = fragment_fromstring('<h1>head 1</h1>')
  >>> replace_html_tags(root, {'h1': 'strong'})
  >>> tostring(root).decode()
  '<strong>head 1</strong>'

  >>> root = document_fromstring('<h1>head 1</h1> <H2>head 2</H2>')
  >>> replace_html_tags(root, {'h1': 'strong', 'h2': 'strong', 'h3': 'strong', 'h4': 'strong'})
  >>> tostring(root).decode()
  '<html><body><strong>head 1</strong> <strong>head 2</strong></body></html>'
- webstruct.utils.run_command(args, verbose=True)

  Execute a command in a subprocess, terminate it if an exception occurs,
  and raise a CalledProcessError if the command returns a non-zero exit
  code.

  If verbose == True, output is printed as it appears using "print".
  Unlike subprocess.check_call, it doesn't assume that stdout has a file
  descriptor; this allows printing to work in the IPython notebook.

  Example:

  >>> run_command(["python", "-c", "print(1+2)"])
  3
  >>> run_command(["python", "-c", "print(1+2)"], verbose=False)
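A sketch of how such a helper can be built on subprocess.Popen, streaming output line by line instead of relying on a stdout file descriptor. This is an illustrative reimplementation under stated assumptions, not webstruct's actual code:

```python
import subprocess

def run_command(args, verbose=True):
    """Stream a subprocess's output line by line; raise on non-zero exit."""
    process = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge stderr into the same stream
        universal_newlines=True,
    )
    try:
        for line in process.stdout:
            if verbose:
                print(line, end='')
        retcode = process.wait()
        if retcode:
            raise subprocess.CalledProcessError(retcode, args)
    finally:
        if process.poll() is None:  # still running: an exception occurred
            process.terminate()
```

Usage, avoiding a hard-coded interpreter name: run_command([sys.executable, "-c", "print(1+2)"]).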
- webstruct.utils.smart_join(tokens)

  Join tokens without adding unneeded spaces before punctuation:

  >>> smart_join(['Hello', ',', 'world', '!'])
  'Hello, world!'

  >>> smart_join(['(', '303', ')', '444-7777'])
  '(303) 444-7777'
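One way to get this behaviour is a plain join followed by regex cleanup. A minimal sketch, assuming the rules are "no space before closing punctuation, no space after opening brackets" (the real heuristics may cover more cases):

```python
import re

def smart_join(tokens):
    """Join tokens, then remove spaces that a naive ' '.join adds
    before punctuation and around brackets."""
    text = ' '.join(tokens)
    text = re.sub(r'\s+([.,;:!?)\]])', r'\1', text)  # before closers
    text = re.sub(r'([(\[])\s+', r'\1', text)        # after openers
    return text

print(smart_join(['Hello', ',', 'world', '!']))   # Hello, world!
print(smart_join(['(', '303', ')', '444-7777']))  # (303) 444-7777
```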
- webstruct.utils.substrings(txt, min_length, max_length, pad='')

  >>> substrings("abc", 1, 100)
  ['a', 'ab', 'abc', 'b', 'bc', 'c']
  >>> substrings("abc", 2, 100)
  ['ab', 'abc', 'bc']
  >>> substrings("abc", 1, 2)
  ['a', 'ab', 'b', 'bc', 'c']
  >>> substrings("abc", 1, 3, '$')
  ['$a', 'a', '$ab', 'ab', '$abc', 'abc', 'abc$', 'b', 'bc', 'bc$', 'c', 'c$']
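A sketch consistent with the doctests above: when pad is given, substrings touching the left edge get a pad-prefixed variant and those touching the right edge a pad-suffixed one. This is a reconstruction from the examples, not webstruct's actual code:

```python
def substrings(txt, min_length, max_length, pad=''):
    """All substrings with lengths in [min_length, max_length]; with `pad`,
    also emit pad-prefixed/suffixed variants at the string edges."""
    result = []
    n = len(txt)
    for start in range(n):
        for length in range(min_length, min(max_length, n - start) + 1):
            seg = txt[start:start + length]
            if pad and start == 0:
                result.append(pad + seg)   # touches the left edge
            result.append(seg)
            if pad and start + length == n:
                result.append(seg + pad)   # touches the right edge
    return result

print(substrings("abc", 1, 100))
# ['a', 'ab', 'abc', 'b', 'bc', 'c']
```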
- webstruct.utils.train_test_split_noshuffle(*arrays, **options)

  Split arrays or matrices into train and test subsets without shuffling.

  It allows writing

  X_train, X_test, y_train, y_test = train_test_split_noshuffle(X, y, test_size=test_size)

  instead of

  X_train, X_test = X[:-test_size], X[-test_size:]
  y_train, y_test = y[:-test_size], y[-test_size:]

  Parameters:
      *arrays : sequence of lists
      test_size : float, int, or None (default is None)
          If float, should be between 0.0 and 1.0 and represent the
          proportion of the dataset to include in the test split. If int,
          represents the absolute number of test samples. If None, test
          size is set to 0.25.

  Returns:
      splitting : list of lists, length = 2 * len(arrays)
          List containing the train-test split of the input arrays.

  Examples

  >>> train_test_split_noshuffle([1,2,3], ['a', 'b', 'c'], test_size=1)
  [[1, 2], [3], ['a', 'b'], ['c']]
  >>> train_test_split_noshuffle([1,2,3,4], ['a', 'b', 'c', 'd'], test_size=0.5)
  [[1, 2], [3, 4], ['a', 'b'], ['c', 'd']]
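The whole function boils down to tail slicing. A minimal sketch (illustrative only; it does not handle test_size=0 or empty inputs):

```python
def train_test_split_noshuffle(*arrays, test_size=None):
    """Split each array into head (train) and tail (test) without shuffling."""
    n = len(arrays[0])
    if test_size is None:
        test_size = 0.25                 # scikit-learn-style default
    if isinstance(test_size, float):
        test_size = int(n * test_size)   # fraction -> absolute count
    result = []
    for a in arrays:
        result.extend([a[:-test_size], a[-test_size:]])
    return result

print(train_test_split_noshuffle([1, 2, 3], ['a', 'b', 'c'], test_size=1))
# [[1, 2], [3], ['a', 'b'], ['c']]
```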
- webstruct.utils.human_sorted()

  sorted that uses alphanum_key() as a key function.
Text Tokenization
- class webstruct.text_tokenizers.WordTokenizer

  This tokenizer is a copy-pasted version of TreebankWordTokenizer that
  doesn't split on @ and ':' symbols and doesn't split contractions:

  >>> from nltk.tokenize.treebank import TreebankWordTokenizer
  >>> s = '''Good muffins cost $3.88\nin New York. Email: muffins@gmail.com'''
  >>> TreebankWordTokenizer().tokenize(s)
  ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Email', ':', 'muffins', '@', 'gmail.com']
  >>> WordTokenizer().tokenize(s)
  ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Email:', 'muffins@gmail.com']

  >>> s = '''Shelbourne Road,'''
  >>> WordTokenizer().tokenize(s)
  ['Shelbourne', 'Road', ',']

  >>> s = '''population of 100,000'''
  >>> WordTokenizer().tokenize(s)
  ['population', 'of', '100,000']

  >>> s = '''Hello|World'''
  >>> WordTokenizer().tokenize(s)
  ['Hello', '|', 'World']

  >>> s2 = '"We beat some pretty good teams to get here," Slocum said.'
  >>> WordTokenizer().tokenize(s2)
  ['``', 'We', 'beat', 'some', 'pretty', 'good', 'teams', 'to', 'get', 'here', ',', "''", 'Slocum', 'said', '.']
  >>> s3 = '''Well, we couldn't have this predictable,
  ... cliche-ridden, \"Touched by an
  ... Angel\" (a show creator John Masius
  ... worked on) wanna-be if she didn't.'''
  >>> WordTokenizer().tokenize(s3)
  ['Well', ',', 'we', "couldn't", 'have', 'this', 'predictable', ',', 'cliche-ridden', ',', '``', 'Touched', 'by', 'an', 'Angel', "''", '(', 'a', 'show', 'creator', 'John', 'Masius', 'worked', 'on', ')', 'wanna-be', 'if', 'she', "didn't", '.']
  Some issues:

  >>> WordTokenizer().tokenize("Phone:855-349-1914")
  ['Phone', ':', '855-349-1914']

  >>> WordTokenizer().tokenize("Copyright © 2014 Foo Bar and Buzz Spam. All Rights Reserved.")
  ['Copyright', '\xc2\xa9', '2014', 'Foo', 'Bar', 'and', 'Buzz', 'Spam', '.', 'All', 'Rights', 'Reserved', '.']

  >>> WordTokenizer().tokenize("Powai Campus, Mumbai-400077")
  ['Powai', 'Campus', ',', 'Mumbai', '-', '400077']

  >>> WordTokenizer().tokenize("1 5858/ 1800")
  ['1', '5858', '/', '1800']

  >>> WordTokenizer().tokenize("Saudi Arabia-")
  ['Saudi', 'Arabia', '-']
  - open_quotes
    A compiled regular expression matching opening quotes.

  - rules
    A list of (compiled regex pattern, replacement) pairs used by the
    tokenizer; replacements seen in the source include '``', "''", '...',
    '' and None.
- webstruct.text_tokenizers.tokenize(self, text)
Sequence Encoding
- class webstruct.sequence_encoding.IobEncoder(token_processor=None)

  Utility class for encoding tagged token streams using IOB2 encoding.

  Encode input tokens using the encode method:

  >>> iob_encoder = IobEncoder()
  >>> input_tokens = ["__START_PER__", "John", "__END_PER__", "said"]
  >>> iob_encoder.encode(input_tokens)
  [('John', 'B-PER'), ('said', 'O')]

  Get the result in another format using the encode_split method:

  >>> input_tokens = ["hello", "__START_PER__", "John", "Doe", "__END_PER__", "__START_PER__", "Mary", "__END_PER__", "said"]
  >>> tokens, tags = iob_encoder.encode_split(input_tokens)
  >>> tokens, tags
  (['hello', 'John', 'Doe', 'Mary', 'said'], ['O', 'B-PER', 'I-PER', 'B-PER', 'O'])

  Note that IobEncoder is stateful. This means you can encode an
  incomplete stream and continue the encoding later:

  >>> iob_encoder = IobEncoder()
  >>> iob_encoder.encode(["__START_PER__", "John"])
  [('John', 'B-PER')]
  >>> iob_encoder.encode(["Mayer", "__END_PER__", "said"])
  [('Mayer', 'I-PER'), ('said', 'O')]

  To reset the internal state, use the reset method:

  >>> iob_encoder.reset()

  Group results into entities:

  >>> iob_encoder.group(iob_encoder.encode(input_tokens))
  [(['hello'], 'O'), (['John', 'Doe'], 'PER'), (['Mary'], 'PER'), (['said'], 'O')]

  The input token stream is processed by InputTokenProcessor() by default;
  you can pass another token processing class to customize which tokens
  are considered start/end tags.
  - classmethod group(data, strict=False)

    Group IOB2-encoded entities. data should be an iterable of
    (info, iob_tag) tuples. info can be any Python object; iob_tag
    should be a string with a tag.

    Example:

    >>> data = [("hello", "O"), (",", "O"), ("John", "B-PER"),
    ...         ("Doe", "I-PER"), ("Mary", "B-PER"), ("said", "O")]
    >>> for items, tag in IobEncoder.iter_group(data):
    ...     print("%s %s" % (items, tag))
    ['hello', ','] O
    ['John', 'Doe'] PER
    ['Mary'] PER
    ['said'] O

    By default, invalid sequences are fixed:

    >>> data = [("hello", "O"), ("John", "I-PER"), ("Doe", "I-PER")]
    >>> for items, tag in IobEncoder.iter_group(data):
    ...     print("%s %s" % (items, tag))
    ['hello'] O
    ['John', 'Doe'] PER

    Pass the strict=True argument to raise an exception for invalid
    sequences:

    >>> for items, tag in IobEncoder.iter_group(data, strict=True):
    ...     print("%s %s" % (items, tag))
    Traceback (most recent call last):
    ...
    ValueError: Invalid sequence: I-PER tag can't start sequence
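The grouping logic, including the repair of I- tags that start a sequence, can be sketched as a generator. An illustrative reimplementation consistent with the examples above, not webstruct's actual code:

```python
def iter_group(data, strict=False):
    """Group (info, iob_tag) pairs into (items, tag) chunks per IOB2."""
    items, tag = [], 'O'
    for info, iob_tag in data:
        if iob_tag.startswith('I-') and tag != iob_tag[2:]:
            if strict:
                raise ValueError(
                    "Invalid sequence: %s tag can't start sequence" % iob_tag)
            iob_tag = 'B-' + iob_tag[2:]   # repair: treat as a new entity
        if iob_tag.startswith('B-'):
            if items:
                yield items, tag
            items, tag = [info], iob_tag[2:]
        elif iob_tag.startswith('I-'):
            items.append(info)             # continue the current entity
        else:                              # 'O' tokens are grouped together
            if items and tag != 'O':
                yield items, tag
                items, tag = [], 'O'
            items.append(info)
            tag = 'O'
    if items:
        yield items, tag

data = [("hello", "O"), (",", "O"), ("John", "B-PER"),
        ("Doe", "I-PER"), ("Mary", "B-PER"), ("said", "O")]
print(list(iter_group(data)))
# [(['hello', ','], 'O'), (['John', 'Doe'], 'PER'), (['Mary'], 'PER'), (['said'], 'O')]
```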
-
classmethod