Entity Grouping
Often it is not enough to find all entities on a webpage. For example, one may want to extract separate “entity groups” with combined information about individual offices from a page that has contact details of several offices. In this case, an “entity group” may consist of the office name along with its address (street, city, ZIP code) and contact details (phones, faxes).
The webstruct.grouping module provides a simple unsupervised
algorithm for grouping extracted entities into clusters. It works as follows:
- Each HTML token is assigned a position (an integer). The position increases with each token and also when the HTML element changes.
- Distances between consecutive entities are calculated.
- If the distance between two consecutive entities is greater than a certain threshold, a new “cluster” is started.
- Clusters are scored: longer clusters get larger scores, but clusters with several entities of the same type are penalized (unless the user explicitly asked not to penalize this entity type). The total clustering score is the sum of the scores of the individual clusters.
- The threshold value for the final clustering is selected to maximize the total clustering score from the previous step. Each input page gets its own threshold.
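The steps above can be illustrated with a small, self-contained sketch. It does not use webstruct itself: the entity positions are assumed to be precomputed, and the helper names (split_by_threshold, toy_score, choose_threshold) and the scoring formula are hypothetical stand-ins chosen only to show the idea, not the library's actual implementation.

```python
from collections import Counter

# Hypothetical input: (tag, position) pairs for the entities found on one page,
# already sorted by position. Two offices sit far apart on the page.
entities = [
    ("ORG", 0), ("STREET", 3), ("CITY", 5), ("TEL", 8),
    ("ORG", 30), ("STREET", 33), ("CITY", 35), ("TEL", 38),
]


def split_by_threshold(entities, threshold):
    """Start a new cluster whenever the gap to the previous entity exceeds threshold."""
    clusters = []
    prev_pos = None
    for tag, pos in entities:
        if prev_pos is None or pos - prev_pos > threshold:
            clusters.append([])
        clusters[-1].append((tag, pos))
        prev_pos = pos
    return clusters


def toy_score(clusters):
    """Assumed toy objective: reward distinct tags per cluster, penalize repeats."""
    total = 0
    for cluster in clusters:
        counts = Counter(tag for tag, _ in cluster)
        unique = len(counts)
        duplicates = len(cluster) - unique
        total += unique ** 2 - duplicates ** 2
    return total


def choose_threshold(entities):
    """Try every observed gap as a threshold and keep the best-scoring one.

    Assumes at least two entities, so that there is at least one gap.
    """
    gaps = [b[1] - a[1] for a, b in zip(entities, entities[1:])]
    candidates = sorted(set(gaps), reverse=True)
    best = max(candidates, key=lambda t: toy_score(split_by_threshold(entities, t)))
    return best, split_by_threshold(entities, best)


best_threshold, best_clusters = choose_threshold(entities)
print(best_threshold, [[tag for tag, _ in c] for c in best_clusters])
```

With this input the largest gap (22) merges both offices into one cluster with every tag repeated, the smallest gap (2) fragments each office, and the middle value (3) yields one cluster per office, so it scores highest.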
webstruct.grouping.choose_best_clustering(html_tokens, tags, score_func=None, score_kwargs=None)
Select the best way to split html_tokens and tags into clusters of named entities. Return a (threshold, score, clusters) tuple. clusters in the resulting tuple is a list of clusters; each cluster is a list of named entities: (html_tokens, tag, distance) tuples. html_tokens and tags could be the result of webstruct.model.NER.extract_raw().

If score_func is None, choose_best_clustering() uses default_clustering_score() to compute the score of a set of clusters under consideration (the optimization objective). You can pass your own scoring function to change the heuristic used. Your function must have 2 positional parameters, clusters and threshold (and any number of keyword arguments), and return a score (a number) that is large if the clustering is good and small or negative if it is bad.

score_kwargs is a dict of keyword arguments passed to the scoring function. For example, if you use the default score_func, the goal is to group contact information, and you want to allow several phones (TEL) and faxes (FAX) in the same group, pass score_kwargs={'dont_penalize': {'TEL', 'FAX'}}.
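As an illustration of the scoring-function contract described above, here is a minimal sketch of a custom scoring function. The function name score_requiring_tag, its required_tag keyword, and the scoring rule are all hypothetical; the hand-made clusters use plain strings in place of real HTML tokens so that the example runs without webstruct.

```python
def score_requiring_tag(clusters, threshold, required_tag="ORG"):
    """Hypothetical heuristic: reward clusters that contain required_tag exactly once.

    threshold is accepted to match the documented signature; this sketch ignores it.
    """
    score = 0
    for cluster in clusters:
        # Each named entity is an (html_tokens, tag, distance) tuple.
        tags = [tag for _tokens, tag, _distance in cluster]
        if tags.count(required_tag) == 1:
            score += len(cluster)
        else:
            score -= len(cluster)
    return score


# Hand-made clusters; the strings stand in for real HTML tokens.
good = [
    [(["Acme"], "ORG", 0), (["555-0100"], "TEL", 4)],
    [(["Globex"], "ORG", 25), (["555-0199"], "TEL", 3)],
]
bad = [good[0] + good[1]]  # both offices merged into a single cluster

print(score_requiring_tag(good, threshold=4))
print(score_requiring_tag(bad, threshold=25))
```

Following the signature documented above, such a function would presumably be passed as score_func=score_requiring_tag, with score_kwargs={'required_tag': 'ORG'} supplying the keyword argument.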
webstruct.grouping.default_clustering_score(clusters, threshold, dont_penalize=None)
Heuristic scoring function for clusters:
- larger clusters get bigger scores;
- clusters that have multiple entities of the same tag are penalized (unless the tag is in the dont_penalize set);
- total score is computed as the sum of the scores of all clusters.
dont_penalize is a set of tags for which duplicates are not penalized. It is empty by default.
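To show how a dont_penalize set can change the outcome, the sketch below implements a stand-in scoring function with the same signature. The formula is an assumption made for illustration and is not taken from the library; only the three properties listed above are preserved.

```python
from collections import Counter


def sketch_clustering_score(clusters, threshold, dont_penalize=None):
    """Stand-in heuristic (assumed formula, not the library's).

    threshold is accepted to match the documented signature; this sketch ignores it.
    """
    exempt = set(dont_penalize or ())
    total = 0
    for cluster in clusters:
        counts = Counter(tag for _tokens, tag, _distance in cluster)
        # Repeated entities count as a penalty unless their tag is exempt.
        penalized = sum(c - 1 for tag, c in counts.items() if tag not in exempt)
        kept = len(cluster) - penalized
        total += kept ** 2 - penalized ** 2
    return total


# One office with two phone numbers; strings stand in for real HTML tokens.
office = [[
    (["Acme"], "ORG", 0),
    (["555-0100"], "TEL", 3),
    (["555-0101"], "TEL", 2),
    (["555-0102"], "FAX", 2),
]]

default_score = sketch_clustering_score(office, threshold=3)
exempt_score = sketch_clustering_score(office, threshold=3, dont_penalize={"TEL", "FAX"})
print(default_score, exempt_score)
```

In this sketch the second TEL lowers the score under the default settings, while exempting TEL and FAX lets the same cluster keep its full score.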