Adaptive graph walk based similarity measures in entity-relation graphs

  • Authors:
  • William W. Cohen;Einat Minkov

  • Affiliations:
  • Carnegie Mellon University;Carnegie Mellon University

  • Venue:
  • Adaptive graph walk based similarity measures in entity-relation graphs
  • Year:
  • 2008

Abstract

Relational or semi-structured data is naturally represented by a graph, where nodes denote entities and directed, typed edges represent the relations between them. Such graphs are heterogeneous in the sense that they describe different types of objects and multiple types of links. For example, email data can be described by a graph that includes messages, persons, dates and other objects; in this graph, a message may be associated with a person through different relations, such as “sent-to” and “sent-from”. In the past, researchers have suggested applying random graph walks to derive a measure of similarity between entities that are not directly connected in a graph. In this thesis, we suggest a general framework in which arbitrary queries (for instance, “what persons are most related to this email message?”) are addressed using random walks. Many types of queries are possible, corresponding to various flavors of inter-entity similarity; several learning techniques that adapt the graph-walk based search to a query type are therefore suggested and evaluated. The framework is applied in the thesis to two different domains. The first domain is personal information management, where it is shown how seemingly different tasks, such as alias finding, intelligent message threading and person name disambiguation, can be addressed uniformly as search queries using the adaptive graph-walk based similarity measure. The second domain is the processing of parsed text, where a graph represents a corpus of structured parsed text, and adaptive graph walks are applied to induce inter-word similarity measures for tasks such as coordinate term extraction. Finally, design and scalability considerations are discussed.
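To make the idea of a graph-walk based similarity query concrete, the following is a minimal Python sketch, not the implementation used in the thesis. It assumes a lazy random walk with a fixed number of steps, in which the probability of following an edge is proportional to a weight attached to its type. The toy email graph, the function names `graph_walk` and `rank_by_type`, and the parameters `stay_prob` and `edge_weights` are all hypothetical, and the weights are set by hand here rather than learned.

```python
from collections import defaultdict

# Hypothetical toy email graph: node -> list of (edge_type, neighbor).
# Each relation is paired with an inverse edge so the walk can move both ways.
EDGES = {
    "msg1": [("sent-from", "alice"), ("sent-to", "bob"), ("on-date", "2008-01-05")],
    "msg2": [("sent-from", "bob"), ("sent-to", "carol"), ("on-date", "2008-01-05")],
    "alice": [("sent-from-inv", "msg1")],
    "bob": [("sent-to-inv", "msg1"), ("sent-from-inv", "msg2")],
    "carol": [("sent-to-inv", "msg2")],
    "2008-01-05": [("on-date-inv", "msg1"), ("on-date-inv", "msg2")],
}

NODE_TYPE = {
    "msg1": "message", "msg2": "message",
    "alice": "person", "bob": "person", "carol": "person",
    "2008-01-05": "date",
}


def graph_walk(edges, start, edge_weights=None, steps=3, stay_prob=0.5):
    """Propagate a probability distribution over nodes for a fixed number of steps.

    start:        dict node -> initial probability (the query distribution)
    edge_weights: dict edge_type -> non-negative weight (default 1.0 per type)
    stay_prob:    probability of remaining at the current node at each step
    """
    edge_weights = edge_weights or {}
    dist = dict(start)
    for _ in range(steps):
        nxt = defaultdict(float)
        for node, p in dist.items():
            out = edges.get(node, [])
            total = sum(edge_weights.get(t, 1.0) for t, _ in out)
            if total == 0:
                # No usable outgoing edge: keep all probability mass here.
                nxt[node] += p
                continue
            nxt[node] += stay_prob * p
            for edge_type, neighbor in out:
                w = edge_weights.get(edge_type, 1.0)
                nxt[neighbor] += (1.0 - stay_prob) * p * w / total
        dist = dict(nxt)
    return dist


def rank_by_type(dist, node_type, wanted):
    """Return (node, score) pairs of the requested type, highest score first."""
    hits = [(n, s) for n, s in dist.items() if node_type.get(n) == wanted]
    return sorted(hits, key=lambda pair: pair[1], reverse=True)


# Query: "what persons are most related to this email message?"
scores = graph_walk(EDGES, {"msg1": 1.0})
for person, score in rank_by_type(scores, NODE_TYPE, "person"):
    print(person, round(score, 4))
```

Changing `edge_weights` changes the ranking produced for the same query, which is the hook an adaptive method could use: in this sketch, giving "sent-from" a larger weight favors the sender of the message, whereas a learning procedure would tune such weights from labeled examples of a given query type.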