Miscellaneous¶
Utils¶
- class webstruct.utils.BestMatch(known)[source]¶
  Bases: object

  Class for finding best non-overlapping matches in a sequence of tokens. Override the get_sorted_ranges() method to define which results are best.
- class webstruct.utils.LongestMatch(known)[source]¶
  Bases: webstruct.utils.BestMatch

  Class for finding the longest non-overlapping matches in a sequence of tokens.

  >>> known = {'North Las', 'North Las Vegas', 'North Pole', 'Vegas USA', 'Las Vegas', 'USA', "Toronto"}
  >>> lm = LongestMatch(known)
  >>> lm.max_length
  3
  >>> tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
  >>> for start, end, matched_text in lm.find_ranges(tokens):
  ...     print(start, end, tokens[start:end], matched_text)
  0 1 ['Toronto'] Toronto
  2 5 ['North', 'Las', 'Vegas'] North Las Vegas
  5 6 ['USA'] USA

  LongestMatch also accepts a dict instead of a list/set for the known argument; in this case dict keys are used:

  >>> lm = LongestMatch({'North': 'direction', 'North Las Vegas': 'location'})
  >>> tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
  >>> for start, end, matched_text in lm.find_ranges(tokens):
  ...     print(start, end, tokens[start:end], matched_text)
  2 5 ['North', 'Las', 'Vegas'] North Las Vegas
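The longest-match behavior above can be sketched in plain Python: collect every candidate range whose joined text is in known, prefer longer ranges, and accept them greedily when they don't overlap ranges accepted earlier. This is an illustrative standalone sketch under those assumptions, not webstruct's actual implementation:

```python
def find_nonoverlapping(tokens, known, max_length):
    """Greedy longest-first non-overlapping matching (illustrative sketch)."""
    # Collect every candidate range whose joined text is a known phrase.
    candidates = []
    for start in range(len(tokens)):
        for end in range(start + 1, min(start + max_length, len(tokens)) + 1):
            text = " ".join(tokens[start:end])
            if text in known:
                candidates.append((start, end, text))
    # Prefer longer matches first; break ties by position.
    candidates.sort(key=lambda r: (-(r[1] - r[0]), r[0]))
    taken = [False] * len(tokens)
    result = []
    for start, end, text in candidates:
        if not any(taken[start:end]):   # skip ranges overlapping accepted ones
            for i in range(start, end):
                taken[i] = True
            result.append((start, end, text))
    return sorted(result)
```

With the tokens and known set from the doctest above, this yields the same three ranges: Toronto, North Las Vegas, and USA.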
- webstruct.utils.flatten(sequence) → list[source]¶
  Return a single, flat list which contains all elements retrieved from the sequence and all recursively contained sub-sequences (iterables).

  Examples:

  >>> flatten([1, 2, [3,4], (5,6)])
  [1, 2, 3, 4, 5, 6]
  >>> flatten([[[1,2,3], (42,None)], [4,5], [6], 7, (8,9,10)])
  [1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]
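The recursive behavior shown in the doctests can be sketched as follows; treating strings as atomic values is an assumption here, since the doctests above don't exercise string inputs:

```python
def flatten_sketch(sequence):
    """Recursively flatten nested iterables into one flat list (illustrative sketch)."""
    result = []
    for item in sequence:
        # Treat strings/bytes as atoms, not as iterables of characters.
        if hasattr(item, '__iter__') and not isinstance(item, (str, bytes)):
            result.extend(flatten_sketch(item))   # recurse into sub-sequences
        else:
            result.append(item)
    return result
```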
- webstruct.utils.get_combined_keys(dicts)[source]¶

  >>> sorted(get_combined_keys([{'foo': 'egg'}, {'bar': 'spam'}]))
  ['bar', 'foo']
- webstruct.utils.html_document_fromstring(data, encoding=None)[source]¶
  Load an HTML document from a string using lxml.html.HTMLParser.

- webstruct.utils.kill_html_tags(doc, tagnames, keep_child=True)[source]¶

  >>> from lxml.html import fragment_fromstring, tostring
  >>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
  >>> kill_html_tags(root, ['h1'])
  >>> tostring(root).decode()
  '<div>head 1</div>'

  >>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
  >>> kill_html_tags(root, ['h1'], False)
  >>> tostring(root).decode()
  '<div></div>'
- webstruct.utils.merge_dicts(*dicts)[source]¶

  >>> sorted(merge_dicts({'foo': 'bar'}, {'bar': 'baz'}).items())
  [('bar', 'baz'), ('foo', 'bar')]

- webstruct.utils.replace_html_tags(root, tag_replaces)[source]¶
  Replace lxml elements' tag.

  >>> from lxml.html import fragment_fromstring, document_fromstring, tostring
  >>> root = fragment_fromstring('<h1>head 1</h1>')
  >>> replace_html_tags(root, {'h1': 'strong'})
  >>> tostring(root).decode()
  '<strong>head 1</strong>'

  >>> root = document_fromstring('<h1>head 1</h1> <H2>head 2</H2>')
  >>> replace_html_tags(root, {'h1': 'strong', 'h2': 'strong', 'h3': 'strong', 'h4': 'strong'})
  >>> tostring(root).decode()
  '<html><body><strong>head 1</strong> <strong>head 2</strong></body></html>'
- webstruct.utils.run_command(args, verbose=True)[source]¶
  Execute a command in a subprocess, terminate it if an exception occurs, and raise a CalledProcessError exception if the command returns a non-zero exit code.

  If verbose == True then output is printed as it appears using "print". Unlike subprocess.check_call it doesn't assume that stdout has a file descriptor; this allows printing to work in IPython notebook.

  Example:

  >>> run_command(["python", "-c", "print(1+2)"])
  3
  >>> run_command(["python", "-c", "print(1+2)"], verbose=False)
- webstruct.utils.smart_join(tokens)[source]¶
  Join tokens without adding unneeded spaces before punctuation:

  >>> smart_join(['Hello', ',', 'world', '!'])
  'Hello, world!'
  >>> smart_join(['(', '303', ')', '444-7777'])
  '(303) 444-7777'
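A rough sketch of the joining rule consistent with the two doctests above, assuming a "no space before closing punctuation, no space after an opening bracket" heuristic (the real implementation may use different rules):

```python
import re

def smart_join_sketch(tokens):
    """Join tokens, suppressing spaces around punctuation (illustrative sketch)."""
    out = ""
    for token in tokens:
        if not out:
            out = token
        elif re.match(r"[,:;.!?)\]]", token):   # no space before closing punctuation
            out += token
        elif out.endswith(("(", "[")):          # no space after an opening bracket
            out += token
        else:
            out += " " + token
    return out
```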
- webstruct.utils.substrings(txt, min_length, max_length, pad='')[source]¶

  >>> substrings("abc", 1, 100)
  ['a', 'ab', 'abc', 'b', 'bc', 'c']
  >>> substrings("abc", 2, 100)
  ['ab', 'abc', 'bc']
  >>> substrings("abc", 1, 2)
  ['a', 'ab', 'b', 'bc', 'c']
  >>> substrings("abc", 1, 3, '$')
  ['$a', 'a', '$ab', 'ab', '$abc', 'abc', 'abc$', 'b', 'bc', 'bc$', 'c', 'c$']
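Reading the doctests above, the pad behavior appears to be: every substring with length in [min_length, max_length] is emitted in position order, substrings touching the start of the text additionally get a pad-prefixed variant, and substrings touching the end get a pad-suffixed variant. A sketch under that reading (not necessarily the library's exact code):

```python
def substrings_sketch(txt, min_length, max_length, pad=''):
    """All substrings with length in [min_length, max_length]; boundary
    substrings also get pad-prefixed/suffixed variants (illustrative sketch)."""
    result = []
    n = len(txt)
    for start in range(n):
        for length in range(min_length, min(max_length, n - start) + 1):
            sub = txt[start:start + length]
            if pad and start == 0:
                result.append(pad + sub)        # substring touches the start
            result.append(sub)
            if pad and start + length == n:
                result.append(sub + pad)        # substring touches the end
    return result
```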
- webstruct.utils.train_test_split_noshuffle(*arrays, **options)[source]¶
  Split arrays or matrices into train and test subsets without shuffling.

  It allows one to write

  X_train, X_test, y_train, y_test = train_test_split_noshuffle(X, y, test_size=test_size)

  instead of

  X_train, X_test = X[:-test_size], X[-test_size:]
  y_train, y_test = y[:-test_size], y[-test_size:]

  Parameters:
  *arrays : sequence of lists
  test_size : float, int, or None (default is None)
      If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, test size is set to 0.25.

  Returns:
  splitting : list of lists, length = 2 * len(arrays)
      List containing the train-test split of the input arrays.

  Examples

  >>> train_test_split_noshuffle([1,2,3], ['a', 'b', 'c'], test_size=1)
  [[1, 2], [3], ['a', 'b'], ['c']]
  >>> train_test_split_noshuffle([1,2,3,4], ['a', 'b', 'c', 'd'], test_size=0.5)
  [[1, 2], [3, 4], ['a', 'b'], ['c', 'd']]
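The tail-based split described above can be sketched in a few lines; the simplified signature (test_size as the only option) is an assumption for illustration, not the documented **options interface:

```python
def train_test_split_noshuffle_sketch(*arrays, test_size=None):
    """Tail-based train/test split without shuffling (illustrative sketch)."""
    if not arrays:
        raise ValueError("at least one array is required")
    n = len(arrays[0])
    if test_size is None:
        test_size = 0.25
    if isinstance(test_size, float):
        test_size = int(n * test_size)   # proportion -> absolute sample count
    result = []
    for a in arrays:
        # Train part is everything but the tail; test part is the tail.
        result.extend([a[:-test_size], a[-test_size:]])
    return result
```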
- webstruct.utils.human_sorted()¶
  sorted that uses alphanum_key() as a key function.
Text Tokenization¶
- class webstruct.text_tokenizers.WordTokenizer[source]¶
  This tokenizer is a copy-pasted version of TreebankWordTokenizer that doesn't split on @ and ':' symbols and doesn't split contractions:

  >>> from nltk.tokenize.treebank import TreebankWordTokenizer
  >>> s = '''Good muffins cost $3.88\nin New York. Email: muffins@gmail.com'''
  >>> TreebankWordTokenizer().tokenize(s)
  ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Email', ':', 'muffins', '@', 'gmail.com']
  >>> WordTokenizer().tokenize(s)
  ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Email:', 'muffins@gmail.com']

  >>> s = '''Shelbourne Road,'''
  >>> WordTokenizer().tokenize(s)
  ['Shelbourne', 'Road', ',']

  >>> s = '''population of 100,000'''
  >>> WordTokenizer().tokenize(s)
  ['population', 'of', '100,000']

  >>> s = '''Hello|World'''
  >>> WordTokenizer().tokenize(s)
  ['Hello', '|', 'World']

  >>> s2 = '"We beat some pretty good teams to get here," Slocum said.'
  >>> WordTokenizer().tokenize(s2)
  ['``', 'We', 'beat', 'some', 'pretty', 'good', 'teams', 'to', 'get', 'here', ',', "''", 'Slocum', 'said', '.']
  >>> s3 = '''Well, we couldn't have this predictable,
  ... cliche-ridden, \"Touched by an
  ... Angel\" (a show creator John Masius
  ... worked on) wanna-be if she didn't.'''
  >>> WordTokenizer().tokenize(s3)
  ['Well', ',', 'we', "couldn't", 'have', 'this', 'predictable', ',', 'cliche-ridden', ',', '``', 'Touched', 'by', 'an', 'Angel', "''", '(', 'a', 'show', 'creator', 'John', 'Masius', 'worked', 'on', ')', 'wanna-be', 'if', 'she', "didn't", '.']
  Some issues:

  >>> WordTokenizer().tokenize("Phone:855-349-1914")
  ['Phone', ':', '855-349-1914']

  >>> WordTokenizer().tokenize("Copyright © 2014 Foo Bar and Buzz Spam. All Rights Reserved.")
  ['Copyright', '\xc2\xa9', '2014', 'Foo', 'Bar', 'and', 'Buzz', 'Spam', '.', 'All', 'Rights', 'Reserved', '.']

  >>> WordTokenizer().tokenize("Powai Campus, Mumbai-400077")
  ['Powai', 'Campus', ',', 'Mumbai', '-', '400077']

  >>> WordTokenizer().tokenize("1 5858/ 1800")
  ['1', '5858', '/', '1800']

  >>> WordTokenizer().tokenize("Saudi Arabia-")
  ['Saudi', 'Arabia', '-']
  - open_quotes = <compiled regex>¶
  - rules = [(<compiled regex>, u''), (<compiled regex>, u'``'), (<compiled regex>, u"''"), (<compiled regex>, None), (<compiled regex>, u'...'), (<compiled regex>, None), (<compiled regex>, None), (<compiled regex>, None), (<compiled regex>, None), (<compiled regex>, None)]¶
- webstruct.text_tokenizers.tokenize(self, text)¶
Sequence Encoding¶
- class webstruct.sequence_encoding.IobEncoder(token_processor=None)[source]¶
  Utility class for encoding tagged token streams using IOB2 encoding.

  Encode input tokens using the encode method:

  >>> iob_encoder = IobEncoder()
  >>> input_tokens = ["__START_PER__", "John", "__END_PER__", "said"]
  >>> iob_encoder.encode(input_tokens)
  [('John', 'B-PER'), ('said', 'O')]

  Get the result in another format using the encode_split method:

  >>> input_tokens = ["hello", "__START_PER__", "John", "Doe", "__END_PER__", "__START_PER__", "Mary", "__END_PER__", "said"]
  >>> tokens, tags = iob_encoder.encode_split(input_tokens)
  >>> tokens, tags
  (['hello', 'John', 'Doe', 'Mary', 'said'], ['O', 'B-PER', 'I-PER', 'B-PER', 'O'])
  Note that IobEncoder is stateful. This means you can encode an incomplete stream and continue the encoding later:

  >>> iob_encoder = IobEncoder()
  >>> iob_encoder.encode(["__START_PER__", "John"])
  [('John', 'B-PER')]
  >>> iob_encoder.encode(["Mayer", "__END_PER__", "said"])
  [('Mayer', 'I-PER'), ('said', 'O')]

  To reset the internal state, use the reset method:

  >>> iob_encoder.reset()

  Group results into entities:

  >>> iob_encoder.group(iob_encoder.encode(input_tokens))
  [(['hello'], 'O'), (['John', 'Doe'], 'PER'), (['Mary'], 'PER'), (['said'], 'O')]
  The input token stream is processed by InputTokenProcessor() by default; you can pass another token-processing class to customize which tokens are considered start/end tags.

  - classmethod group(data, strict=False)[source]¶
    Group IOB2-encoded entities. data should be an iterable of (info, iob_tag) tuples. info can be any Python object; iob_tag should be a string with a tag.

    Example:
    >>> data = [("hello", "O"), (",", "O"), ("John", "B-PER"),
    ...         ("Doe", "I-PER"), ("Mary", "B-PER"), ("said", "O")]
    >>> for items, tag in IobEncoder.iter_group(data):
    ...     print("%s %s" % (items, tag))
    ['hello', ','] O
    ['John', 'Doe'] PER
    ['Mary'] PER
    ['said'] O
By default, invalid sequences are fixed:
    >>> data = [("hello", "O"), ("John", "I-PER"), ("Doe", "I-PER")]
    >>> for items, tag in IobEncoder.iter_group(data):
    ...     print("%s %s" % (items, tag))
    ['hello'] O
    ['John', 'Doe'] PER
    Pass strict=True to raise an exception for invalid sequences:
    >>> for items, tag in IobEncoder.iter_group(data, strict=True):
    ...     print("%s %s" % (items, tag))
    Traceback (most recent call last):
    ...
    ValueError: Invalid sequence: I-PER tag can't start sequence
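The grouping and fixing behavior documented above can be sketched as a standalone generator; this is an illustration of the IOB2 grouping logic, not webstruct's actual implementation:

```python
def iter_group_iob2(data, strict=False):
    """Group (item, iob_tag) pairs into (items, entity) chunks (illustrative sketch)."""
    items, tag = [], None
    for item, iob_tag in data:
        if iob_tag == 'O':
            new_tag, starts = 'O', (tag != 'O')
        else:
            prefix, new_tag = iob_tag.split('-', 1)
            if prefix == 'I' and tag != new_tag:
                # An I- tag without a matching preceding B-/I- tag is invalid.
                if strict:
                    raise ValueError(
                        "Invalid sequence: %s tag can't start sequence" % iob_tag)
                prefix = 'B'   # fix it by treating the I- tag as B-
            starts = (prefix == 'B')
        if starts and items:
            yield items, tag   # a new entity begins: flush the previous group
            items = []
        tag = new_tag
        items.append(item)
    if items:
        yield items, tag
```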
  - classmethod