Miscellaneous

Utils
- class webstruct.utils.BestMatch(known)

  Bases: object

  Class for finding best non-overlapping matches in a sequence of tokens.
  Override the get_sorted_ranges() method to define which results are best.
- class webstruct.utils.LongestMatch(known)

  Bases: webstruct.utils.BestMatch

  Class for finding longest non-overlapping matches in a sequence of tokens.

  >>> known = {'North Las', 'North Las Vegas', 'North Pole', 'Vegas USA', 'Las Vegas', 'USA', "Toronto"}
  >>> lm = LongestMatch(known)
  >>> lm.max_length
  3
  >>> tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
  >>> for start, end, matched_text in lm.find_ranges(tokens):
  ...     print(start, end, tokens[start:end], matched_text)
  0 1 ['Toronto'] Toronto
  2 5 ['North', 'Las', 'Vegas'] North Las Vegas
  5 6 ['USA'] USA

  LongestMatch also accepts a dict instead of a list/set for the known
  argument. In this case dict keys are used:

  >>> lm = LongestMatch({'North': 'direction', 'North Las Vegas': 'location'})
  >>> tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
  >>> for start, end, matched_text in lm.find_ranges(tokens):
  ...     print(start, end, tokens[start:end], matched_text)
  2 5 ['North', 'Las', 'Vegas'] North Las Vegas
- webstruct.utils.flatten(sequence) → list

  Return a single, flat list which contains all elements retrieved from
  the sequence and all recursively contained sub-sequences (iterables).

  Examples:

  >>> [1, 2, [3,4], (5,6)]
  [1, 2, [3, 4], (5, 6)]
  >>> flatten([[[1,2,3], (42,None)], [4,5], [6], 7, (8,9,10)])
  [1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]
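A minimal recursive sketch of this behaviour (the real implementation may differ, for instance in how it detects which items count as iterables; here only lists and tuples recurse):

```python
def flatten(sequence):
    """Recursively expand nested lists/tuples into one flat list."""
    result = []
    for item in sequence:
        if isinstance(item, (list, tuple)):
            result.extend(flatten(item))  # descend into sub-sequences
        else:
            result.append(item)
    return result

print(flatten([[[1, 2, 3], (42, None)], [4, 5], [6], 7, (8, 9, 10)]))
# [1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]
```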
- webstruct.utils.get_combined_keys(dicts)

  >>> sorted(get_combined_keys([{'foo': 'egg'}, {'bar': 'spam'}]))
  ['bar', 'foo']
- webstruct.utils.html_document_fromstring(data, encoding=None)

  Load an HTML document from a string using lxml.html.HTMLParser.

- webstruct.utils.kill_html_tags(doc, tagnames, keep_child=True)

  Remove elements with the given tag names from the tree. By default
  their children are preserved; pass keep_child=False to drop them too.

  >>> from lxml.html import fragment_fromstring, tostring
  >>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
  >>> kill_html_tags(root, ['h1'])
  >>> tostring(root).decode()
  '<div>head 1</div>'

  >>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
  >>> kill_html_tags(root, ['h1'], False)
  >>> tostring(root).decode()
  '<div></div>'
- webstruct.utils.merge_dicts(*dicts)

  >>> sorted(merge_dicts({'foo': 'bar'}, {'bar': 'baz'}).items())
  [('bar', 'baz'), ('foo', 'bar')]

- webstruct.utils.replace_html_tags(root, tag_replaces)

  Replace lxml elements' tag.

  >>> from lxml.html import fragment_fromstring, document_fromstring, tostring
  >>> root = fragment_fromstring('<h1>head 1</h1>')
  >>> replace_html_tags(root, {'h1': 'strong'})
  >>> tostring(root).decode()
  '<strong>head 1</strong>'

  >>> root = document_fromstring('<h1>head 1</h1> <H2>head 2</H2>')
  >>> replace_html_tags(root, {'h1': 'strong', 'h2': 'strong', 'h3': 'strong', 'h4': 'strong'})
  >>> tostring(root).decode()
  '<html><body><strong>head 1</strong> <strong>head 2</strong></body></html>'
- webstruct.utils.run_command(args, verbose=True)

  Execute a command in a subprocess, terminate it if an exception occurs,
  and raise a CalledProcessError if the command returns a non-zero exit
  code.

  If verbose == True, output is printed as it appears using "print".
  Unlike subprocess.check_call, it doesn't assume that stdout has a file
  descriptor; this allows printing to work in the IPython notebook.

  Example:

  >>> run_command(["python", "-c", "print(1+2)"])
  3
  >>> run_command(["python", "-c", "print(1+2)"], verbose=False)
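A sketch of how such a helper can be built on subprocess.Popen, streaming output line by line instead of relying on a stdout file descriptor. This is an illustrative reimplementation under stated assumptions, not webstruct's actual code:

```python
import subprocess

def run_command(args, verbose=True):
    """Stream a subprocess's output line by line; raise on non-zero exit."""
    process = subprocess.Popen(
        args,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # merge stderr into the same stream
        universal_newlines=True,
    )
    try:
        for line in process.stdout:
            if verbose:
                print(line, end='')
        retcode = process.wait()
        if retcode:
            raise subprocess.CalledProcessError(retcode, args)
    finally:
        if process.poll() is None:  # still running: an exception occurred
            process.terminate()
```

Usage, avoiding a hard-coded interpreter name: run_command([sys.executable, "-c", "print(1+2)"]).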
- webstruct.utils.smart_join(tokens)

  Join tokens without adding unneeded spaces before punctuation:

  >>> smart_join(['Hello', ',', 'world', '!'])
  'Hello, world!'

  >>> smart_join(['(', '303', ')', '444-7777'])
  '(303) 444-7777'
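One way to get this behaviour is a plain join followed by regex cleanup. A minimal sketch, assuming the rules are "no space before closing punctuation, no space after opening brackets" (the real heuristics may cover more cases):

```python
import re

def smart_join(tokens):
    """Join tokens, then remove spaces that a naive ' '.join adds
    before punctuation and around brackets."""
    text = ' '.join(tokens)
    text = re.sub(r'\s+([.,;:!?)\]])', r'\1', text)  # before closers
    text = re.sub(r'([(\[])\s+', r'\1', text)        # after openers
    return text

print(smart_join(['Hello', ',', 'world', '!']))   # Hello, world!
print(smart_join(['(', '303', ')', '444-7777']))  # (303) 444-7777
```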
- webstruct.utils.substrings(txt, min_length, max_length, pad='')

  >>> substrings("abc", 1, 100)
  ['a', 'ab', 'abc', 'b', 'bc', 'c']
  >>> substrings("abc", 2, 100)
  ['ab', 'abc', 'bc']
  >>> substrings("abc", 1, 2)
  ['a', 'ab', 'b', 'bc', 'c']
  >>> substrings("abc", 1, 3, '$')
  ['$a', 'a', '$ab', 'ab', '$abc', 'abc', 'abc$', 'b', 'bc', 'bc$', 'c', 'c$']
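A sketch consistent with the doctests above: when pad is given, substrings touching the left edge get a pad-prefixed variant and those touching the right edge a pad-suffixed one. This is a reconstruction from the examples, not webstruct's actual code:

```python
def substrings(txt, min_length, max_length, pad=''):
    """All substrings with lengths in [min_length, max_length]; with `pad`,
    also emit pad-prefixed/suffixed variants at the string edges."""
    result = []
    n = len(txt)
    for start in range(n):
        for length in range(min_length, min(max_length, n - start) + 1):
            seg = txt[start:start + length]
            if pad and start == 0:
                result.append(pad + seg)   # touches the left edge
            result.append(seg)
            if pad and start + length == n:
                result.append(seg + pad)   # touches the right edge
    return result

print(substrings("abc", 1, 100))
# ['a', 'ab', 'abc', 'b', 'bc', 'c']
```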
- webstruct.utils.train_test_split_noshuffle(*arrays, **options)

  Split arrays or matrices into train and test subsets without shuffling.

  It allows writing

  X_train, X_test, y_train, y_test = train_test_split_noshuffle(X, y, test_size=test_size)

  instead of

  X_train, X_test = X[:-test_size], X[-test_size:]
  y_train, y_test = y[:-test_size], y[-test_size:]

  Parameters:
      *arrays : sequence of lists
      test_size : float, int, or None (default is None)
          If float, should be between 0.0 and 1.0 and represent the
          proportion of the dataset to include in the test split. If int,
          represents the absolute number of test samples. If None, test
          size is set to 0.25.

  Returns:
      splitting : list of lists, length = 2 * len(arrays)
          List containing the train-test split of the input arrays.

  Examples

  >>> train_test_split_noshuffle([1,2,3], ['a', 'b', 'c'], test_size=1)
  [[1, 2], [3], ['a', 'b'], ['c']]
  >>> train_test_split_noshuffle([1,2,3,4], ['a', 'b', 'c', 'd'], test_size=0.5)
  [[1, 2], [3, 4], ['a', 'b'], ['c', 'd']]
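The whole function boils down to tail slicing. A minimal sketch (illustrative only; it does not handle test_size=0 or empty inputs):

```python
def train_test_split_noshuffle(*arrays, test_size=None):
    """Split each array into head (train) and tail (test) without shuffling."""
    n = len(arrays[0])
    if test_size is None:
        test_size = 0.25                 # scikit-learn-style default
    if isinstance(test_size, float):
        test_size = int(n * test_size)   # fraction -> absolute count
    result = []
    for a in arrays:
        result.extend([a[:-test_size], a[-test_size:]])
    return result

print(train_test_split_noshuffle([1, 2, 3], ['a', 'b', 'c'], test_size=1))
# [[1, 2], [3], ['a', 'b'], ['c']]
```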
- webstruct.utils.human_sorted()

  sorted that uses alphanum_key() as a key function.
Text Tokenization
- class webstruct.text_tokenizers.WordTokenizer

  This tokenizer is a copy-pasted version of TreebankWordTokenizer that
  doesn't split on @ and ':' symbols and doesn't split contractions:

  >>> from nltk.tokenize.treebank import TreebankWordTokenizer
  >>> s = '''Good muffins cost $3.88\nin New York. Email: muffins@gmail.com'''
  >>> TreebankWordTokenizer().tokenize(s)
  ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Email', ':', 'muffins', '@', 'gmail.com']
  >>> WordTokenizer().tokenize(s)
  ['Good', 'muffins', 'cost', '$', '3.88', 'in', 'New', 'York.', 'Email:', 'muffins@gmail.com']

  >>> s = '''Shelbourne Road,'''
  >>> WordTokenizer().tokenize(s)
  ['Shelbourne', 'Road', ',']

  >>> s = '''population of 100,000'''
  >>> WordTokenizer().tokenize(s)
  ['population', 'of', '100,000']

  >>> s = '''Hello|World'''
  >>> WordTokenizer().tokenize(s)
  ['Hello', '|', 'World']

  >>> s2 = '"We beat some pretty good teams to get here," Slocum said.'
  >>> WordTokenizer().tokenize(s2)
  ['``', 'We', 'beat', 'some', 'pretty', 'good', 'teams', 'to', 'get', 'here', ',', "''", 'Slocum', 'said', '.']
  >>> s3 = '''Well, we couldn't have this predictable,
  ... cliche-ridden, \"Touched by an
  ... Angel\" (a show creator John Masius
  ... worked on) wanna-be if she didn't.'''
  >>> WordTokenizer().tokenize(s3)
  ['Well', ',', 'we', "couldn't", 'have', 'this', 'predictable', ',', 'cliche-ridden', ',', '``', 'Touched', 'by', 'an', 'Angel', "''", '(', 'a', 'show', 'creator', 'John', 'Masius', 'worked', 'on', ')', 'wanna-be', 'if', 'she', "didn't", '.']
  Some issues:

  >>> WordTokenizer().tokenize("Phone:855-349-1914")
  ['Phone', ':', '855-349-1914']

  >>> WordTokenizer().tokenize("Copyright © 2014 Foo Bar and Buzz Spam. All Rights Reserved.")
  ['Copyright', '\xc2\xa9', '2014', 'Foo', 'Bar', 'and', 'Buzz', 'Spam', '.', 'All', 'Rights', 'Reserved', '.']

  >>> WordTokenizer().tokenize("Powai Campus, Mumbai-400077")
  ['Powai', 'Campus', ',', 'Mumbai', '-', '400077']

  >>> WordTokenizer().tokenize("1 5858/ 1800")
  ['1', '5858', '/', '1800']

  >>> WordTokenizer().tokenize("Saudi Arabia-")
  ['Saudi', 'Arabia', '-']
  - open_quotes
    A compiled regular expression matching opening quotes.

  - rules
    A list of (compiled regex pattern, replacement) pairs used by the
    tokenizer; replacements seen in the source include '``', "''", '...',
    '' and None.
- webstruct.text_tokenizers.tokenize(self, text)
Sequence Encoding
- class webstruct.sequence_encoding.IobEncoder(token_processor=None)

  Utility class for encoding tagged token streams using IOB2 encoding.

  Encode input tokens using the encode method:

  >>> iob_encoder = IobEncoder()
  >>> input_tokens = ["__START_PER__", "John", "__END_PER__", "said"]
  >>> iob_encoder.encode(input_tokens)
  [('John', 'B-PER'), ('said', 'O')]

  Get the result in another format using the encode_split method:

  >>> input_tokens = ["hello", "__START_PER__", "John", "Doe", "__END_PER__", "__START_PER__", "Mary", "__END_PER__", "said"]
  >>> tokens, tags = iob_encoder.encode_split(input_tokens)
  >>> tokens, tags
  (['hello', 'John', 'Doe', 'Mary', 'said'], ['O', 'B-PER', 'I-PER', 'B-PER', 'O'])

  Note that IobEncoder is stateful. This means you can encode an
  incomplete stream and continue the encoding later:

  >>> iob_encoder = IobEncoder()
  >>> iob_encoder.encode(["__START_PER__", "John"])
  [('John', 'B-PER')]
  >>> iob_encoder.encode(["Mayer", "__END_PER__", "said"])
  [('Mayer', 'I-PER'), ('said', 'O')]

  To reset the internal state, use the reset method:

  >>> iob_encoder.reset()

  Group results into entities:

  >>> iob_encoder.group(iob_encoder.encode(input_tokens))
  [(['hello'], 'O'), (['John', 'Doe'], 'PER'), (['Mary'], 'PER'), (['said'], 'O')]

  The input token stream is processed by InputTokenProcessor() by default;
  you can pass another token processing class to customize which tokens
  are considered start/end tags.
  - classmethod group(data, strict=False)

    Group IOB2-encoded entities. data should be an iterable of
    (info, iob_tag) tuples. info can be any Python object; iob_tag
    should be a string with a tag.

    Example:

    >>> data = [("hello", "O"), (",", "O"), ("John", "B-PER"),
    ...         ("Doe", "I-PER"), ("Mary", "B-PER"), ("said", "O")]
    >>> for items, tag in IobEncoder.iter_group(data):
    ...     print("%s %s" % (items, tag))
    ['hello', ','] O
    ['John', 'Doe'] PER
    ['Mary'] PER
    ['said'] O

    By default, invalid sequences are fixed:

    >>> data = [("hello", "O"), ("John", "I-PER"), ("Doe", "I-PER")]
    >>> for items, tag in IobEncoder.iter_group(data):
    ...     print("%s %s" % (items, tag))
    ['hello'] O
    ['John', 'Doe'] PER

    Pass the strict=True argument to raise an exception for invalid
    sequences:

    >>> for items, tag in IobEncoder.iter_group(data, strict=True):
    ...     print("%s %s" % (items, tag))
    Traceback (most recent call last):
    ...
    ValueError: Invalid sequence: I-PER tag can't start sequence
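The grouping logic, including the repair of I- tags that start a sequence, can be sketched as a generator. An illustrative reimplementation consistent with the examples above, not webstruct's actual code:

```python
def iter_group(data, strict=False):
    """Group (info, iob_tag) pairs into (items, tag) chunks per IOB2."""
    items, tag = [], 'O'
    for info, iob_tag in data:
        if iob_tag.startswith('I-') and tag != iob_tag[2:]:
            if strict:
                raise ValueError(
                    "Invalid sequence: %s tag can't start sequence" % iob_tag)
            iob_tag = 'B-' + iob_tag[2:]   # repair: treat as a new entity
        if iob_tag.startswith('B-'):
            if items:
                yield items, tag
            items, tag = [info], iob_tag[2:]
        elif iob_tag.startswith('I-'):
            items.append(info)             # continue the current entity
        else:                              # 'O' tokens are grouped together
            if items and tag != 'O':
                yield items, tag
                items, tag = [], 'O'
            items.append(info)
            tag = 'O'
    if items:
        yield items, tag

data = [("hello", "O"), (",", "O"), ("John", "B-PER"),
        ("Doe", "I-PER"), ("Mary", "B-PER"), ("said", "O")]
print(list(iter_group(data)))
# [(['hello', ','], 'O'), (['John', 'Doe'], 'PER'), (['Mary'], 'PER'), (['said'], 'O')]
```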
-
classmethod