Miscellaneous¶
Utils¶
-
class webstruct.utils.BestMatch(known)[source]¶
Bases: object
Class for finding best non-overlapping matches in a sequence of tokens. Override the get_sorted_ranges() method to define which results are best.
-
class webstruct.utils.LongestMatch(known)[source]¶
Bases: webstruct.utils.BestMatch
Class for finding longest non-overlapping matches in a sequence of tokens.
>>> known = {'North Las', 'North Las Vegas', 'North Pole', 'Vegas USA', 'Las Vegas', 'USA', "Toronto"}
>>> lm = LongestMatch(known)
>>> lm.max_length
3
>>> tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
>>> for start, end, matched_text in lm.find_ranges(tokens):
...     print(start, end, tokens[start:end], matched_text)
0 1 ['Toronto'] Toronto
2 5 ['North', 'Las', 'Vegas'] North Las Vegas
5 6 ['USA'] USA
LongestMatch also accepts a dict instead of a list/set for the known argument. In this case dict keys are used:
>>> lm = LongestMatch({'North': 'direction', 'North Las Vegas': 'location'})
>>> tokens = ["Toronto", "to", "North", "Las", "Vegas", "USA"]
>>> for start, end, matched_text in lm.find_ranges(tokens):
...     print(start, end, tokens[start:end], matched_text)
2 5 ['North', 'Las', 'Vegas'] North Las Vegas
-
webstruct.utils.flatten(sequence) → list[source]¶
Return a single, flat list which contains all elements retrieved from the sequence and all recursively contained sub-sequences (iterables).
Examples:
>>> flatten([1, 2, [3, 4], (5, 6)])
[1, 2, 3, 4, 5, 6]
>>> flatten([[[1, 2, 3], (42, None)], [4, 5], [6], 7, (8, 9, 10)])
[1, 2, 3, 42, None, 4, 5, 6, 7, 8, 9, 10]
-
webstruct.utils.get_combined_keys(dicts)[source]¶
>>> sorted(get_combined_keys([{'foo': 'egg'}, {'bar': 'spam'}]))
['bar', 'foo']
-
webstruct.utils.get_domain(url)[source]¶
>>> get_domain("http://example.com/path")
'example.com'
>>> get_domain("https://hello.example.com/foo/bar")
'example.com'
>>> get_domain("http://hello.example.co.uk/foo?bar=1")
'example.co.uk'
-
webstruct.utils.html_document_fromstring(data, encoding=None)[source]¶
Load HTML document from string using lxml.html.HTMLParser.
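A minimal sketch of the equivalent plain-lxml call, assuming only that lxml is installed; html_document_fromstring wraps a similar HTMLParser setup, but the exact wrapper logic is not shown here:

```python
# Parse raw bytes with an explicit encoding via lxml's HTMLParser.
# This mirrors what a helper like html_document_fromstring would do
# (an assumption; the real implementation may handle more edge cases).
import lxml.html

parser = lxml.html.HTMLParser(encoding='utf-8')
root = lxml.html.document_fromstring(b'<h1>Hello</h1>', parser=parser)
print(root.tag, root.findtext('.//h1'))
```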
-
webstruct.utils.kill_html_tags(doc, tagnames, keep_child=True)[source]¶
Remove elements with the given tag names; when the third argument is true (the default), their children are kept:
>>> from lxml.html import fragment_fromstring, tostring
>>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
>>> kill_html_tags(root, ['h1'])
>>> tostring(root).decode()
'<div>head 1</div>'
>>> root = fragment_fromstring('<div><h1>head 1</h1></div>')
>>> kill_html_tags(root, ['h1'], False)
>>> tostring(root).decode()
'<div></div>'
-
webstruct.utils.merge_dicts(*dicts)[source]¶
>>> sorted(merge_dicts({'foo': 'bar'}, {'bar': 'baz'}).items())
[('bar', 'baz'), ('foo', 'bar')]
-
webstruct.utils.replace_html_tags(root, tag_replaces)[source]¶
Replace lxml elements' tags.
>>> from lxml.html import fragment_fromstring, document_fromstring, tostring
>>> root = fragment_fromstring('<h1>head 1</h1>')
>>> replace_html_tags(root, {'h1': 'strong'})
>>> tostring(root).decode()
'<strong>head 1</strong>'
>>> root = document_fromstring('<h1>head 1</h1> <H2>head 2</H2>')
>>> replace_html_tags(root, {'h1': 'strong', 'h2': 'strong', 'h3': 'strong', 'h4': 'strong'})
>>> tostring(root).decode()
'<html><body><strong>head 1</strong> <strong>head 2</strong></body></html>'
-
webstruct.utils.run_command(args, verbose=True)[source]¶
Execute a command in a subprocess; terminate it if an exception occurs, and raise a CalledProcessError exception if the command returned a non-zero exit code.
If verbose == True, the output is printed as it appears, using "print". Unlike subprocess.check_call, it doesn't assume that stdout has a file descriptor; this allows printing to work in the IPython notebook.
Example:
>>> run_command(["python", "-c", "print(1+2)"])
3
>>> run_command(["python", "-c", "print(1+2)"], verbose=False)
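The streaming behaviour described above can be sketched with the stdlib alone; this is a hypothetical re-implementation for illustration, not webstruct's actual code:

```python
# Hypothetical sketch of run_command: stream child output line by line
# (so printing works even when sys.stdout has no file descriptor, e.g.
# in IPython notebooks), terminate the child if an exception occurs,
# and raise CalledProcessError on a non-zero exit code.
import subprocess
import sys

def run_command_sketch(args, verbose=True):
    proc = subprocess.Popen(args, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT,
                            universal_newlines=True)
    try:
        for line in proc.stdout:
            if verbose:
                print(line, end='')
        retcode = proc.wait()
    except Exception:
        proc.terminate()
        raise
    if retcode:
        raise subprocess.CalledProcessError(retcode, args)

run_command_sketch([sys.executable, "-c", "print(1 + 2)"])  # prints 3
```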
-
webstruct.utils.smart_join(tokens)[source]¶
Join tokens without adding unneeded spaces before punctuation:
>>> smart_join(['Hello', ',', 'world', '!'])
'Hello, world!'
>>> smart_join(['(', '303', ')', '444-7777'])
'(303) 444-7777'
-
webstruct.utils.substrings(txt, min_length, max_length, pad='')[source]¶
>>> substrings("abc", 1, 100)
['a', 'ab', 'abc', 'b', 'bc', 'c']
>>> substrings("abc", 2, 100)
['ab', 'abc', 'bc']
>>> substrings("abc", 1, 2)
['a', 'ab', 'b', 'bc', 'c']
>>> substrings("abc", 1, 3, '$')
['$a', 'a', '$ab', 'ab', '$abc', 'abc', 'abc$', 'b', 'bc', 'bc$', 'c', 'c$']
-
webstruct.utils.train_test_split_noshuffle(*arrays, **options)[source]¶
Split arrays or matrices into train and test subsets without shuffling.
It allows one to write
X_train, X_test, y_train, y_test = train_test_split_noshuffle(X, y, test_size=test_size)
instead of
X_train, X_test = X[:-test_size], X[-test_size:]
y_train, y_test = y[:-test_size], y[-test_size:]
Parameters:
- *arrays : sequence of lists
- test_size : float, int, or None (default is None)
  If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, test size is set to 0.25.
Returns:
- splitting : list of lists, length = 2 * len(arrays)
  List containing the train-test split of the input arrays.
Examples
>>> train_test_split_noshuffle([1,2,3], ['a', 'b', 'c'], test_size=1)
[[1, 2], [3], ['a', 'b'], ['c']]
>>> train_test_split_noshuffle([1,2,3,4], ['a', 'b', 'c', 'd'], test_size=0.5)
[[1, 2], [3, 4], ['a', 'b'], ['c', 'd']]
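The split described above can be sketched in a few lines; this is a hypothetical re-implementation for illustration, and the real function's edge-case handling (e.g. test_size=0) may differ:

```python
# Hypothetical sketch of a no-shuffle train/test split: every array is
# cut at the same point, so ordering (and therefore any grouping, such
# as pages grouped by domain) is preserved across the split.
def train_test_split_noshuffle_sketch(*arrays, test_size=None):
    if test_size is None:
        test_size = 0.25
    n = len(arrays[0])
    if isinstance(test_size, float):
        # interpret a float as a proportion of the dataset
        test_size = int(n * test_size)
    result = []
    for a in arrays:
        result += [a[:-test_size], a[-test_size:]]
    return result

print(train_test_split_noshuffle_sketch([1, 2, 3], ['a', 'b', 'c'], test_size=1))
# [[1, 2], [3], ['a', 'b'], ['c']]
```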
-
webstruct.utils.human_sorted()¶
sorted that uses alphanum_key() as a key function.
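A common way to implement such a key function is to split digit runs out of the string and compare them numerically; the sketch below is an assumption about what alphanum_key() does, not its actual code:

```python
# Hypothetical natural-sort key: split "item10" into ['item', 10, ''] so
# embedded numbers compare by value rather than character by character.
import re

def alphanum_key_sketch(s):
    return [int(part) if part.isdigit() else part
            for part in re.split(r'(\d+)', s)]

print(sorted(['item10', 'item2', 'item1'], key=alphanum_key_sketch))
# ['item1', 'item2', 'item10']
```

Plain lexicographic sorting would order these as item1, item10, item2.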
Text Tokenization¶
-
class webstruct.text_tokenizers.TextToken¶
-
chars¶
Alias for field number 0
-
length¶
Alias for field number 2
-
position¶
Alias for field number 1
-
class webstruct.text_tokenizers.WordTokenizer[source]¶
This tokenizer is a copy-pasted version of TreebankWordTokenizer that doesn't split on @ and ':' symbols and doesn't split contractions. It supports a span_tokenize (in terms of nltk tokenizers) method - segment_words():
>>> s = '''Good muffins cost $3.88\nin New York. Email: muffins@gmail.com'''
>>> WordTokenizer().segment_words(s)
[TextToken(chars='Good', position=0, length=4),
 TextToken(chars='muffins', position=5, length=7),
 TextToken(chars='cost', position=13, length=4),
 TextToken(chars='$', position=18, length=1),
 TextToken(chars='3.88', position=19, length=4),
 TextToken(chars='in', position=24, length=2),
 TextToken(chars='New', position=27, length=3),
 TextToken(chars='York.', position=31, length=5),
 TextToken(chars='Email:', position=37, length=6),
 TextToken(chars='muffins@gmail.com', position=44, length=17)]
>>> s = '''Shelbourne Road,'''
>>> WordTokenizer().segment_words(s)
[TextToken(chars='Shelbourne', position=0, length=10),
 TextToken(chars='Road', position=11, length=4),
 TextToken(chars=',', position=15, length=1)]
>>> s = '''population of 100,000'''
>>> WordTokenizer().segment_words(s)
[TextToken(chars='population', position=0, length=10),
 TextToken(chars='of', position=11, length=2),
 TextToken(chars='100,000', position=14, length=7)]
>>> s = '''Hello|World'''
>>> WordTokenizer().segment_words(s)
[TextToken(chars='Hello', position=0, length=5),
 TextToken(chars='|', position=5, length=1),
 TextToken(chars='World', position=6, length=5)]
>>> s2 = '"We beat some pretty good teams to get here," Slocum said.'
>>> WordTokenizer().segment_words(s2)
[TextToken(chars='``', position=0, length=1),
 TextToken(chars='We', position=1, length=2),
 TextToken(chars='beat', position=4, length=4),
 TextToken(chars='some', position=9, length=4),
 TextToken(chars='pretty', position=14, length=6),
 TextToken(chars='good', position=21, length=4),
 TextToken(chars='teams', position=26, length=5),
 TextToken(chars='to', position=32, length=2),
 TextToken(chars='get', position=35, length=3),
 TextToken(chars='here', position=39, length=4),
 TextToken(chars=',', position=43, length=1),
 TextToken(chars="''", position=44, length=1),
 TextToken(chars='Slocum', position=46, length=6),
 TextToken(chars='said', position=53, length=4),
 TextToken(chars='.', position=57, length=1)]
>>> s3 = '''Well, we couldn't have this predictable,
... cliche-ridden, \"Touched by an
... Angel\" (a show creator John Masius
... worked on) wanna-be if she didn't.'''
>>> WordTokenizer().segment_words(s3)
[TextToken(chars='Well', position=0, length=4),
 TextToken(chars=',', position=4, length=1),
 TextToken(chars='we', position=6, length=2),
 TextToken(chars="couldn't", position=9, length=8),
 TextToken(chars='have', position=18, length=4),
 TextToken(chars='this', position=23, length=4),
 TextToken(chars='predictable', position=28, length=11),
 TextToken(chars=',', position=39, length=1),
 TextToken(chars='cliche-ridden', position=41, length=13),
 TextToken(chars=',', position=54, length=1),
 TextToken(chars='``', position=56, length=1),
 TextToken(chars='Touched', position=57, length=7),
 TextToken(chars='by', position=65, length=2),
 TextToken(chars='an', position=68, length=2),
 TextToken(chars='Angel', position=71, length=5),
 TextToken(chars="''", position=76, length=1),
 TextToken(chars='(', position=78, length=1),
 TextToken(chars='a', position=79, length=1),
 TextToken(chars='show', position=81, length=4),
 TextToken(chars='creator', position=86, length=7),
 TextToken(chars='John', position=94, length=4),
 TextToken(chars='Masius', position=99, length=6),
 TextToken(chars='worked', position=106, length=6),
 TextToken(chars='on', position=113, length=2),
 TextToken(chars=')', position=115, length=1),
 TextToken(chars='wanna-be', position=117, length=8),
 TextToken(chars='if', position=126, length=2),
 TextToken(chars='she', position=129, length=3),
 TextToken(chars="didn't", position=133, length=6),
 TextToken(chars='.', position=139, length=1)]
>>> WordTokenizer().segment_words('"')
[TextToken(chars='``', position=0, length=1)]
>>> WordTokenizer().segment_words('" a')
[TextToken(chars='``', position=0, length=1),
 TextToken(chars='a', position=2, length=1)]
>>> WordTokenizer().segment_words('["a')
[TextToken(chars='[', position=0, length=1),
 TextToken(chars='``', position=1, length=1),
 TextToken(chars='a', position=2, length=1)]
Some issues:
>>> WordTokenizer().segment_words("Copyright © 2014 Foo Bar and Buzz Spam. All Rights Reserved.")
[TextToken(chars='Copyright', position=0, length=9),
 TextToken(chars=u'\xa9', position=10, length=1),
 TextToken(chars='2014', position=12, length=4),
 TextToken(chars='Foo', position=17, length=3),
 TextToken(chars='Bar', position=21, length=3),
 TextToken(chars='and', position=25, length=3),
 TextToken(chars='Buzz', position=29, length=4),
 TextToken(chars='Spam.', position=34, length=5),
 TextToken(chars='All', position=40, length=3),
 TextToken(chars='Rights', position=44, length=6),
 TextToken(chars='Reserved', position=51, length=8),
 TextToken(chars='.', position=59, length=1)]
-
open_quotes = <_sre.SRE_Pattern object>¶
-
rules = [(<_sre.SRE_Pattern object>, u''), (<_sre.SRE_Pattern object>, u'``'), (<_sre.SRE_Pattern object>, u"''"), (<_sre.SRE_Pattern object>, None), (<_sre.SRE_Pattern object>, u'...'), (<_sre.SRE_Pattern object>, None), (<_sre.SRE_Pattern object>, None), (<_sre.SRE_Pattern object>, None), (<_sre.SRE_Pattern object>, None), (<_sre.SRE_Pattern object>, None)]¶
-
webstruct.text_tokenizers.tokenize(text)¶
Sequence Encoding¶
-
class webstruct.sequence_encoding.IobEncoder(token_processor=None)[source]¶
Utility class for encoding tagged token streams using IOB2 encoding.
Encode input tokens using the encode method:
>>> iob_encoder = IobEncoder()
>>> input_tokens = ["__START_PER__", "John", "__END_PER__", "said"]
>>> def encode(encoder, tokens): return [p for p in IobEncoder.from_indices(encoder.encode(tokens), tokens)]
>>> encode(iob_encoder, input_tokens)
[('John', 'B-PER'), ('said', 'O')]
>>> input_tokens = ["hello", "__START_PER__", "John", "Doe", "__END_PER__", "__START_PER__", "Mary", "__END_PER__", "said"]
>>> tokens = encode(iob_encoder, input_tokens)
>>> tokens, tags = iob_encoder.split(tokens)
>>> tokens, tags
(['hello', 'John', 'Doe', 'Mary', 'said'], ['O', 'B-PER', 'I-PER', 'B-PER', 'O'])
Note that IobEncoder is stateful. This means you can encode an incomplete stream and continue the encoding later:
>>> iob_encoder = IobEncoder()
>>> input_tokens_partial = ["__START_PER__", "John"]
>>> encode(iob_encoder, input_tokens_partial)
[('John', 'B-PER')]
>>> input_tokens_partial = ["Mayer", "__END_PER__", "said"]
>>> encode(iob_encoder, input_tokens_partial)
[('Mayer', 'I-PER'), ('said', 'O')]
To reset internal state, use the reset method:
>>> iob_encoder.reset()
Group results to entities:
>>> iob_encoder.group(encode(iob_encoder, input_tokens))
[(['hello'], 'O'), (['John', 'Doe'], 'PER'), (['Mary'], 'PER'), (['said'], 'O')]
The input token stream is processed by InputTokenProcessor() by default; you can pass another token processing class to customize which tokens are considered start/end tags.
-
classmethod group(data, strict=False)[source]¶
Group IOB2-encoded entities. data should be an iterable of (info, iob_tag) tuples. info could be any Python object; iob_tag should be a string with a tag.
Example:
>>> data = [("hello", "O"), (",", "O"), ("John", "B-PER"),
...         ("Doe", "I-PER"), ("Mary", "B-PER"), ("said", "O")]
>>> for items, tag in IobEncoder.iter_group(data):
...     print("%s %s" % (items, tag))
['hello', ','] O
['John', 'Doe'] PER
['Mary'] PER
['said'] O
By default, invalid sequences are fixed:
>>> data = [("hello", "O"), ("John", "I-PER"), ("Doe", "I-PER")]
>>> for items, tag in IobEncoder.iter_group(data):
...     print("%s %s" % (items, tag))
['hello'] O
['John', 'Doe'] PER
Pass strict=True to raise an exception for invalid sequences:
>>> for items, tag in IobEncoder.iter_group(data, strict=True):
...     print("%s %s" % (items, tag))
Traceback (most recent call last):
...
ValueError: Invalid sequence: I-PER tag can't start sequence
Webpage domain inferring¶
Module for getting the most likely base URL (domain) for a page. It is useful if you've downloaded HTML files but haven't preserved URLs explicitly, and you still want to have cross-validation done right. Grouping pages by domain name is a reasonable way to do that.
WebAnnotator data has either <base> tags with original URLs (or at least original domains), or commented-out base tags.
Unfortunately, GATE-annotated data doesn't have this tag, so the idea is to use the most popular domain mentioned in a page as the page's domain.
-
webstruct.infer_domain.get_base_href(tree)[source]¶
Return the href of a base tag; the base tag could be commented out.
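Because the base tag may live inside an HTML comment, one way to recover it is a plain-text regular-expression scan. The sketch below is hypothetical; the real function works on an lxml tree rather than on a raw string:

```python
# Hypothetical sketch: find a <base href=...> whether or not it is
# commented out, by scanning the raw HTML text with a regex (comments
# are transparent to a plain text search).
import re

_BASE_HREF_RE = re.compile(r'<base\s+href=["\']([^"\']+)', re.I)

def get_base_href_sketch(html):
    m = _BASE_HREF_RE.search(html)
    return m.group(1) if m else None

print(get_base_href_sketch('<!-- <base href="http://example.com/"> -->'))
# http://example.com/
```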
-
webstruct.infer_domain.get_tree_domain(tree, blacklist=set(['flickr.com', 'pinterest.com', 'youtube.com', 'google.com', 'fonts.com', 'paypal.com', 'twitter.com', 'fonts.net', 'addthis.com', 'facebook.com', 'googleapis.com', 'linkedin.com']), get_domain=<function get_domain>)[source]¶
Return the most likely domain for the tree. The domain is extracted from the base tag, or guessed if there is no base tag. If the domain can't be detected, an empty string is returned.
-
webstruct.infer_domain.guess_domain(tree, blacklist=set(['flickr.com', 'pinterest.com', 'youtube.com', 'google.com', 'fonts.com', 'paypal.com', 'twitter.com', 'fonts.net', 'addthis.com', 'facebook.com', 'googleapis.com', 'linkedin.com']), get_domain=<function get_domain>)[source]¶
Return the most common domain not in a blacklist.
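The "most popular domain" heuristic can be sketched as counting the domains of a page's links and picking the most frequent one outside the blacklist. This is a hypothetical illustration; the real guess_domain works on an lxml tree and uses webstruct's get_domain helper for registered-domain extraction:

```python
# Hypothetical sketch of the most-common-domain heuristic. The registered
# domain is approximated as the last two labels of the netloc, which is
# cruder than the real get_domain (e.g. it mishandles .co.uk suffixes).
from collections import Counter
from urllib.parse import urlparse

def guess_domain_sketch(hrefs, blacklist=frozenset({'twitter.com',
                                                    'facebook.com'})):
    domains = []
    for href in hrefs:
        netloc = urlparse(href).netloc
        domain = '.'.join(netloc.split('.')[-2:]) if netloc else ''
        if domain and domain not in blacklist:
            domains.append(domain)
    counts = Counter(domains)
    return counts.most_common(1)[0][0] if counts else ''

print(guess_domain_sketch([
    'http://example.com/about',
    'http://example.com/contact',
    'http://twitter.com/foo',
]))
# example.com
```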