Entity Grouping

Finding all entities on a webpage is often not enough. For example, given a page that lists contact details for several offices, one may want to extract a separate “entity group” with the combined information for each office. In this case an “entity group” may consist of the office name along with its address (street, city, zip code) and contacts (phones, faxes).

The webstruct.grouping module provides a simple unsupervised algorithm for grouping extracted entities into clusters. It works as follows:

  1. Each HTML token is assigned a position (an integer). The position increases with each token and whenever the HTML element changes.
  2. Distances between subsequent entities are calculated.
  3. If the distance between two subsequent entities is greater than a certain threshold, a new “cluster” is started.
  4. Clusters are scored: longer clusters get larger scores, but clusters with several entities of the same type are penalized (unless the user explicitly asked not to penalize this entity type). The total clustering score is the sum of the scores of the individual clusters.
  5. The threshold value for the final clustering is selected to maximize the total clustering score from step 4. Each input page gets its own threshold.
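
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration and not webstruct's actual implementation: entities are given as hypothetical (position, tag) pairs with positions already assigned, and the scoring formula (squared cluster size minus a fixed penalty per duplicate tag) is an assumption chosen only to show the trade-off.

```python
from collections import Counter

def split_by_threshold(entities, threshold):
    """Start a new cluster whenever the gap to the previous entity exceeds threshold."""
    clusters = []
    prev_pos = None
    for pos, tag in entities:
        if prev_pos is None or pos - prev_pos > threshold:
            clusters.append([])
        clusters[-1].append((pos, tag))
        prev_pos = pos
    return clusters

def toy_score(clusters):
    """Assumed formula: reward cluster size, penalize repeated tags in a cluster."""
    total = 0
    for cluster in clusters:
        counts = Counter(tag for _pos, tag in cluster)
        duplicates = sum(n - 1 for n in counts.values())
        total += len(cluster) ** 2 - 10 * duplicates
    return total

def best_clustering(entities):
    """Try each observed distance as a threshold and keep the best-scoring split."""
    positions = [pos for pos, _tag in entities]
    distances = [b - a for a, b in zip(positions, positions[1:])]
    best = None
    for threshold in sorted({0, *distances}):
        clusters = split_by_threshold(entities, threshold)
        score = toy_score(clusters)
        if best is None or score > best[1]:
            best = (threshold, score, clusters)
    return best

# Two offices, far apart on the page (positions are made up for the example).
entities = [
    (0, "ORG"), (2, "STREET"), (4, "CITY"), (6, "TEL"),
    (30, "ORG"), (32, "STREET"), (34, "CITY"), (36, "TEL"),
]
threshold, score, clusters = best_clustering(entities)
print(threshold, score, len(clusters))  # 2 32 2
```

With a threshold of 0 every entity ends up alone, and with a threshold of 24 both offices merge into one cluster with duplicated tags; the threshold of 2 scores highest because it yields two complete groups with no duplicates.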

webstruct.grouping.choose_best_clustering(html_tokens, tags, score_func=None, score_kwargs=None)

Select the best way to split html_tokens and tags into clusters of named entities. Returns a (threshold, score, clusters) tuple.

clusters in the resulting tuple is a list of clusters; each cluster is a list of named entities, represented as (html_tokens, tag, distance) tuples.

html_tokens and tags can be the result of webstruct.model.NER.extract_raw().

If score_func is None, choose_best_clustering() uses default_clustering_score() to compute the score of a set of clusters under consideration (the optimization objective). You can pass your own scoring function to change the heuristic used. Your function must accept two positional parameters, clusters and threshold, plus any number of keyword arguments, and return a numeric score that is large if the clustering is good and small or negative if it is bad.

score_kwargs is a dict of keyword arguments passed to the scoring function. For example, if you use the default score_func, the goal is to group contact information, and you want to allow several phones (TEL) and faxes (FAX) in the same group, pass score_kwargs={'dont_penalize': {'TEL', 'FAX'}}.
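
As an illustration of the score_func contract described above, here is a sketch of a custom scoring function. The function name, the max_size keyword, and the formula are all hypothetical; the hand-made clusters follow the documented (html_tokens, tag, distance) shape, with plain strings standing in for real HTML tokens and placeholder distance values.

```python
def size_capped_score(clusters, threshold, max_size=4):
    """Hypothetical heuristic: reward clusters of up to max_size entities,
    give a negative score to anything larger. threshold is accepted to
    match the required signature but is not used here."""
    score = 0
    for cluster in clusters:
        size = len(cluster)
        score += size * size if size <= max_size else -size
    return score

# Hand-made clusters: lists of (html_tokens, tag, distance) tuples.
example_clusters = [
    [(["Acme"], "ORG", 0), (["Main", "St"], "STREET", 2), (["555-0100"], "TEL", 3)],
    [(["Acme", "West"], "ORG", 40), (["555-0199"], "TEL", 2)],
]

print(size_capped_score(example_clusters, threshold=5))              # 13
print(size_capped_score(example_clusters, threshold=5, max_size=2))  # 1

# With webstruct installed, the function and its keyword arguments
# would be passed like this:
#
# threshold, score, clusters = choose_best_clustering(
#     html_tokens, tags,
#     score_func=size_capped_score,
#     score_kwargs={"max_size": 4},
# )
```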

webstruct.grouping.default_clustering_score(clusters, threshold, dont_penalize=None)

Heuristic scoring function for clusters:

  • larger clusters get bigger scores;
  • clusters that have multiple entities of the same tag are penalized (unless the tag is in dont_penalize set);
  • total score is computed as a sum of scores of all clusters.

dont_penalize is a set of tags for which duplicates are not penalized. It is empty by default.
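
To make the effect of dont_penalize concrete, the following toy function mimics the three rules above. It is not the library's formula: the squared-size reward and the fixed penalty of 10 per duplicate are assumptions made for the example, and the cluster contents are invented.

```python
from collections import Counter

def toy_clustering_score(clusters, threshold, dont_penalize=None):
    """Toy stand-in for a duplicate-penalizing score (assumed formula)."""
    exempt = set(dont_penalize or ())
    total = 0
    for cluster in clusters:
        counts = Counter(tag for _tokens, tag, _distance in cluster)
        # Count repeated tags, skipping those the caller exempted.
        duplicates = sum(n - 1 for tag, n in counts.items() if tag not in exempt)
        total += len(cluster) ** 2 - 10 * duplicates
    return total

# One office with two phones and two faxes.
office = [[
    (["Acme"], "ORG", 0),
    (["555-0100"], "TEL", 2),
    (["555-0101"], "TEL", 1),
    (["555-0102"], "FAX", 1),
    (["555-0103"], "FAX", 1),
]]

print(toy_clustering_score(office, threshold=3))                                # 5
print(toy_clustering_score(office, threshold=3, dont_penalize={"TEL"}))         # 15
print(toy_clustering_score(office, threshold=3, dont_penalize={"TEL", "FAX"}))  # 25
```

Exempting TEL and FAX removes the pressure to split an office that legitimately lists several numbers, which is the situation the score_kwargs example for contact information is meant to handle.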