Building structured web community portals: a top-down, compositional, and incremental approach

Authors:
Pedro DeRose;Warren Shen;Fei Chen;AnHai Doan;Raghu Ramakrishnan
Affiliations:
University of Wisconsin-Madison;University of Wisconsin-Madison;University of Wisconsin-Madison;University of Wisconsin-Madison;Yahoo! Research
Venue:
VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Year:
2007

Citing 23
Cited 33

The anatomy of a large-scale hypertextual Web search engine

WWW7 Proceedings of the seventh international conference on World Wide Web 7
DEADLINER: building a new niche search engine

Proceedings of the ninth international conference on Information and knowledge management
A vector space model for automatic indexing

Communications of the ACM
SEAL: a framework for developing SEmantic PortALs

Proceedings of the 1st international conference on Knowledge capture
Accelerated focused crawling through online relevance feedback

Proceedings of the 11th international conference on World Wide Web
OntoWeb - A Semantic Web Community Portal

PAKM '02 Proceedings of the 4th International Conference on Practical Aspects of Knowledge Management
A Machine Learning Approach to Building Domain-Specific Search Engines

IJCAI '99 Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence
Declarative specification of Web sites with S

The VLDB Journal — The International Journal on Very Large Data Bases
SemTag and seeker: bootstrapping the semantic web via automated semantic annotation

WWW '03 Proceedings of the 12th international conference on World Wide Web
WebODE in a nutshell

AI Magazine
How to build a WebFountain: An architecture for very large-scale text analytics

IBM Systems Journal
Finding Related Pages Using the Link Structure of the WWW

WI '04 Proceedings of the 2004 IEEE/WIC/ACM International Conference on Web Intelligence
UIMA: an architectural approach to unstructured information processing in the corporate research environment

Natural Language Engineering
An integrated, conditional model of information extraction and coreference with application to citation matching

UAI '04 Proceedings of the 20th conference on Uncertainty in artificial intelligence
Unsupervised named-entity extraction from the web: an experimental study

Artificial Intelligence
Efficient Batch Top-k Search for Dictionary-based Entity Recognition

ICDE '06 Proceedings of the 22nd International Conference on Data Engineering
POLYPHONET: an advanced social network extraction system from the web

Proceedings of the 15th international conference on World Wide Web
To search or to crawl?: towards a query optimizer for text-centric tasks

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Managing information extraction: state of the art and research directions

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Record linkage: similarity measures and algorithms

Proceedings of the 2006 ACM SIGMOD international conference on Management of data
Data integration: the teenage years

VLDB '06 Proceedings of the 32nd international conference on Very large data bases
Seeking stable clusters in the blogosphere

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Declarative information extraction using datalog with embedded extraction predicates

VLDB '07 Proceedings of the 33rd international conference on Very large data bases

Community systems research at Yahoo!

ACM SIGMOD Record
OLAP over imprecise data with domain constraints

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Declarative information extraction using datalog with embedded extraction predicates

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
A relational approach to incrementally extracting and querying structure in unstructured data

VLDB '07 Proceedings of the 33rd international conference on Very large data bases
Toward best-effort information extraction

Proceedings of the 2008 ACM SIGMOD international conference on Management of data
YAGO: A Large Ontology from Wikipedia and WordNet

Web Semantics: Science, Services and Agents on the World Wide Web
Multidimensional content eXploration

Proceedings of the VLDB Endowment
On the provenance of non-answers to queries over extracted data

Proceedings of the VLDB Endowment
Harvesting, searching, and ranking knowledge on the web: invited talk

Proceedings of the Second ACM International Conference on Web Search and Data Mining
Database and information-retrieval methods for knowledge discovery

Communications of the ACM - A Direct Path to Dependable Software
Information extraction challenges in managing unstructured data

ACM SIGMOD Record
Purple SOX extraction management system

ACM SIGMOD Record
The YAGO-NAGA approach to knowledge discovery

ACM SIGMOD Record
SOFIE: a self-organizing framework for information extraction

Proceedings of the 18th international conference on World wide web
A web of concepts

Proceedings of the twenty-eighth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Efficiently incorporating user feedback into information extraction and integration programs

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Optimizing complex extraction programs over evolving text data

Proceedings of the 2009 ACM SIGMOD International Conference on Management of data
Intelligence in wikipedia

AAAI'08 Proceedings of the 23rd national conference on Artificial intelligence - Volume 3
Qualitative effects of knowledge rules and user feedback in probabilistic data integration

The VLDB Journal — The International Journal on Very Large Data Bases
Data integration for the relational web

Proceedings of the VLDB Endowment
From information to knowledge: harvesting entities and relationships from web sources

Proceedings of the twenty-ninth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
Find your advisor: robust knowledge gathering from the web

Procceedings of the 13th International Workshop on the Web and Databases
Entity-relationship queries over wikipedia

SMUC '10 Proceedings of the 2nd international workshop on Search and mining user-generated contents
Extracting local web communities using lexical similarity

DASFAA'10 Proceedings of the 15th international conference on Database systems for advanced applications
Just-in-time data integration in action

Proceedings of the VLDB Endowment
Collective extraction from heterogeneous web lists

Proceedings of the fourth ACM international conference on Web search and data mining
An analysis of structured data on the web

Proceedings of the VLDB Endowment
Automatic web-scale information extraction

SIGMOD '12 Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data
Entity-Relationship Queries over Wikipedia

ACM Transactions on Intelligent Systems and Technology (TIST)
Incrementally improving dataspaces based on user feedback

Information Systems
Building, maintaining, and using knowledge bases: a report from the trenches

Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
Beyond search: Retrieving complete tuples from a text-database

Information Systems Frontiers
Entity extraction, linking, classification, and tagging for social media: a wikipedia-based approach

Proceedings of the VLDB Endowment

Quantified Score

Hi-index	0.00

Visualization

Abstract

Structured community portals extract and integrate information from raw Web pages to present a unified view of entities and relationships in the community. In this paper we argue that to build such portals, a top-down, compositional, and incremental approach is a good way to proceed. Compared to current approaches that employ complex monolithic techniques, this approach is easier to develop, understand, debug, and optimize. In this approach, we first select a small set of important community sources. Next, we compose plans that extract and integrate data from these sources, using a set of extraction/integration operators. Executing these plans yields an initial structured portal. We then incrementally expand this portal by monitoring the evolution of current data sources, to detect and add new data sources. We describe our initial solutions to the above steps, and a case study of employing these solutions to build DBLife, a portal for the database community. We found that DBLife could be built quickly and achieve high accuracy using simple extraction/integration operators, and that it can be maintained and expanded with little human effort. The initial solutions together with the case study demonstrate the feasibility and potential of our approach.