Miscellaneous¶
Utils¶
- class webstruct.utils.BestMatch(known)[source]¶
Bases: object
Class for finding the best non-overlapping matches in a sequence of tokens. Override the get_sorted_ranges() method to define which results are best.
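For illustration, a hypothetical subclass that prefers the shortest candidates might look like the sketch below. This is an assumption-laden sketch: it supposes get_sorted_ranges() receives the candidate ranges together with the token list, with each range starting with (start, end) indices as in the find_ranges() output shown for LongestMatch.
>>> class ShortestMatch(BestMatch):               # hypothetical subclass, for illustration
...     def get_sorted_ranges(self, ranges, tokens):
...         # sort candidate ranges by length, shortest first
...         return sorted(ranges, key=lambda r: r[1] - r[0])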
- class webstruct.utils.LongestMatch(known)[source]¶
Bases: webstruct.utils.BestMatch
Class for finding longest non-overlapping matches in a sequence of tokens.
>>> known = {'North Las', 'North Las Vegas', 'North Pole', 'Vegas USA', 'Las Vegas', 'USA', "Toronto"}
>>> lm = LongestMatch(known)
>>> lm.max_length
3
>>> tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
>>> for start, end, matched_text in lm.find_ranges(tokens):
...     print(start, end, tokens[start:end], matched_text)
(0, 1, ['Toronto'], 'Toronto')
(2, 5, ['North', 'Las', 'Vegas'], 'North Las Vegas')
(5, 6, ['USA'], 'USA')
- webstruct.utils.flatten(sequence) → list[source]¶
Return a single, flat list which contains all elements retrieved from the sequence and all recursively contained sub-sequences (iterables).
Examples:
>>> flatten([1, 2, [3,4], (5,6)])
[1, 2, 3, 4, 5, 6]
>>> flatten([[[1,2,3], (42,None)], [4,5], [6], 7, (8,9,10)])
[1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]
- webstruct.utils.get_combined_keys(dicts)[source]¶
>>> sorted(get_combined_keys([{'foo': 'egg'}, {'bar': 'spam'}]))
['bar', 'foo']
- webstruct.utils.html_document_fromstring(data, encoding=None)[source]¶
Load HTML document from string using lxml.html.HTMLParser
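A minimal usage sketch; the return value is assumed to behave like an lxml.html element tree:
>>> from webstruct.utils import html_document_fromstring
>>> tree = html_document_fromstring('<html><body><p>hello</p></body></html>')
>>> tree.tag
'html'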
- webstruct.utils.kill_html_tags(doc, tagnames, keep_child=True)[source]¶
Remove elements with the given tag names; by default their children are kept in the tree.
>>> from lxml.html import fragment_fromstring, tostring
>>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
>>> kill_html_tags(root, ['h1'])
>>> tostring(root)
'<div>head 1</div>'
>>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
>>> kill_html_tags(root, ['h1'], False)
>>> tostring(root)
'<div></div>'
- webstruct.utils.merge_dicts(*dicts)[source]¶
>>> sorted(merge_dicts({'foo': 'bar'}, {'bar': 'baz'}).items()) [('bar', 'baz'), ('foo', 'bar')]
- webstruct.utils.replace_html_tags(root, tag_replaces)[source]¶
Replace lxml elements' tags.
>>> from lxml.html import fragment_fromstring, document_fromstring, tostring
>>> root = fragment_fromstring('<h1>head 1</h1>')
>>> replace_html_tags(root, {'h1': 'strong'})
>>> tostring(root)
'<strong>head 1</strong>'
>>> root = document_fromstring('<h1>head 1</h1> <H2>head 2</H2>')
>>> replace_html_tags(root, {'h1': 'strong', 'h2': 'strong', 'h3': 'strong', 'h4': 'strong'})
>>> tostring(root)
'<html><body><strong>head 1</strong> <strong>head 2</strong></body></html>'
- webstruct.utils.run_command(args, verbose=True)[source]¶
Execute a command in a subprocess; terminate it if an exception occurs, and raise a CalledProcessError exception if the command returns a non-zero exit code.
If verbose is True, print the output as it appears using print. Unlike subprocess.check_call, it doesn't assume that stdout has a file descriptor; this allows printing to work in IPython notebooks.
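A usage sketch; the output depends on the environment, so the examples are marked to be skipped by doctest:
>>> from webstruct.utils import run_command
>>> run_command(['python', '-c', 'print("hello")'])  # doctest: +SKIP
hello
>>> run_command(['python', '-c', 'import sys; sys.exit(1)'])  # doctest: +SKIP
Traceback (most recent call last):
...
CalledProcessError: ...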
- webstruct.utils.smart_join(tokens)[source]¶
Join tokens without adding unneeded spaces before punctuation:
>>> smart_join(['Hello', ',', 'world', '!'])
'Hello, world!'
>>> smart_join(['(', '303', ')', '444-7777'])
'(303) 444-7777'
- webstruct.utils.substrings(txt, min_length=2, max_length=10, pad='')[source]¶
>>> substrings("abc", 1) ['a', 'ab', 'abc', 'b', 'bc', 'c'] >>> substrings("abc", 2) ['ab', 'abc', 'bc'] >>> substrings("abc", 1, 2) ['a', 'ab', 'b', 'bc', 'c'] >>> substrings("abc", 1, 3, '$') ['$a', 'a', '$ab', 'ab', '$abc', 'abc', 'abc$', 'b', 'bc', 'bc$', 'c', 'c$']
- webstruct.utils.human_sorted()¶
sorted() that uses alphanum_key() as a key function.
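A hedged example, assuming alphanum_key() splits runs of digits so that numeric parts compare numerically rather than lexicographically:
>>> human_sorted(['item10', 'item2', 'item1'])
['item1', 'item2', 'item10']
>>> sorted(['item10', 'item2', 'item1'])  # plain sorted, for contrast
['item1', 'item10', 'item2']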
Text Tokenization¶
- class webstruct.tokenizers.WordTokenizer[source]¶
This tokenizer is a copy-pasted version of TreebankWordTokenizer that doesn't split on '@' and ':' symbols and doesn't split contractions:
>>> from nltk.tokenize.treebank import TreebankWordTokenizer
>>> s = u'''Good muffins cost $3.88\nin New York. Email: muffins@gmail.com'''
>>> TreebankWordTokenizer().tokenize(s)
[u'Good', u'muffins', u'cost', u'$', u'3.88', u'in', u'New', u'York.', u'Email', u':', u'muffins', u'@', u'gmail.com']
>>> WordTokenizer().tokenize(s)
[u'Good', u'muffins', u'cost', u'$', u'3.88', u'in', u'New', u'York.', u'Email:', u'muffins@gmail.com']
>>> s = u'''Shelbourne Road,'''
>>> WordTokenizer().tokenize(s)
[u'Shelbourne', u'Road', u',']
>>> s = u'''population of 100,000'''
>>> WordTokenizer().tokenize(s)
[u'population', u'of', u'100,000']
>>> s = u'''Hello|World'''
>>> WordTokenizer().tokenize(s)
[u'Hello', u'|', u'World']
- webstruct.tokenizers.tokenize(text)¶
Sequence Encoding¶
- class webstruct.sequence_encoding.IobEncoder(token_processor=None)[source]¶
Utility class for encoding tagged token streams using IOB2 encoding.
Encode input tokens using the encode method:
>>> iob_encoder = IobEncoder()
>>> input_tokens = ["__START_PER__", "John", "__END_PER__", "said"]
>>> iob_encoder.encode(input_tokens)
[('John', 'B-PER'), ('said', 'O')]
Get the result in another format using the encode_split method:
>>> input_tokens = ["hello", "__START_PER__", "John", "Doe", "__END_PER__", "__START_PER__", "Mary", "__END_PER__", "said"] >>> tokens, tags = iob_encoder.encode_split(input_tokens) >>> tokens, tags (['hello', 'John', 'Doe', 'Mary', 'said'], ['O', 'B-PER', 'I-PER', 'B-PER', 'O'])
Note that IobEncoder is stateful. This means you can encode an incomplete stream and continue the encoding later:
>>> iob_encoder = IobEncoder()
>>> iob_encoder.encode(["__START_PER__", "John"])
[('John', 'B-PER')]
>>> iob_encoder.encode(["Mayer", "__END_PER__", "said"])
[('Mayer', 'I-PER'), ('said', 'O')]
To reset the internal state, use the reset method:
>>> iob_encoder.reset()
Group results into entities:
>>> iob_encoder.group(iob_encoder.encode(input_tokens))
[(['hello'], 'O'), (['John', 'Doe'], 'PER'), (['Mary'], 'PER'), (['said'], 'O')]
The input token stream is processed by InputTokenProcessor() by default; pass another token processing class to customize which tokens are considered start/end tags.
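For illustration, a hypothetical processor that recognizes a different marker style might look like the sketch below. It assumes the processor contract is a classify(token) method returning ('start', TAG), ('end', TAG) or ('token', token), which is how the default InputTokenProcessor appears to report __START_TAG__/__END_TAG__ markers.
>>> class AngleTokenProcessor(object):   # hypothetical, for illustration
...     def classify(self, token):
...         # treat <PER> / </PER> as start/end markers
...         if token.startswith('</') and token.endswith('>'):
...             return 'end', token[2:-1]
...         if token.startswith('<') and token.endswith('>'):
...             return 'start', token[1:-1]
...         return 'token', token
>>> encoder = IobEncoder(token_processor=AngleTokenProcessor())
>>> encoder.encode(["<PER>", "John", "</PER>", "said"])
[('John', 'B-PER'), ('said', 'O')]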
- classmethod group(data, strict=False)[source]¶
Group IOB2-encoded entities. data should be an iterable of (info, iob_tag) tuples, where info can be any Python object and iob_tag is a string with a tag.
Example:
>>> data = [("hello", "O"), (",", "O"), ("John", "B-PER"),
...         ("Doe", "I-PER"), ("Mary", "B-PER"), ("said", "O")]
>>> for items, tag in IobEncoder.iter_group(data):
...     print("%s %s" % (items, tag))
['hello', ','] O
['John', 'Doe'] PER
['Mary'] PER
['said'] O
By default, invalid sequences are fixed:
>>> data = [("hello", "O"), ("John", "I-PER"), ("Doe", "I-PER")] >>> for items, tag in IobEncoder.iter_group(data): ... print("%s %s" % (items, tag)) ['hello'] O ['John', 'Doe'] PER
Pass strict=True to raise an exception for invalid sequences:
>>> for items, tag in IobEncoder.iter_group(data, strict=True):
...     print("%s %s" % (items, tag))
Traceback (most recent call last):
...
ValueError: Invalid sequence: I-PER tag can't start sequence