This paper studies the problem of extracting data from a Web page that contains several structured data records. The objective is to segment these data records, extract the data items (fields) from them, and store the data in a database table. Several researchers have studied this problem, but existing methods still have serious limitations. The first class of methods is based on machine learning, which requires a human to label many examples from each Web site of interest. This process is time-consuming because of the large number of sites and pages on the Web. The second class of methods is based on automatic pattern discovery; these methods are either inaccurate or make many assumptions. This paper proposes a new method that performs the task automatically. It consists of two steps: (1) identifying individual data records in a page, and (2) aligning and extracting data items from the identified data records. For step 1, we propose a method based on visual information to segment data records, which is more accurate than existing methods. For step 2, we propose a novel partial alignment technique based on tree matching. Partial alignment means that we align only those data fields in a pair of data records that can be aligned (or matched) with certainty, and make no commitment on the remaining data fields. This approach enables very accurate alignment of multiple data records. Experimental results on a large number of Web pages from diverse domains show that the proposed two-step technique segments data records, and aligns and extracts data from them, very accurately.
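To make the tree-matching idea in step 2 concrete, the following is a minimal sketch and not the paper's implementation. The abstract says only that alignment is "based on tree matching", so the sketch assumes the classic simple tree matching recurrence: a top-down, order-preserving mapping computed by dynamic programming. The `Node` class, the function names, the normalized `similarity` score, and the two toy records are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tag-tree node (hypothetical structure, for illustration only)."""
    tag: str
    children: List["Node"] = field(default_factory=list)


def simple_tree_match(a: Node, b: Node) -> int:
    """Number of node pairs in the largest top-down, order-preserving
    mapping between trees a and b (assumed recurrence, see lead-in)."""
    if a.tag != b.tag:
        return 0  # roots differ, so nothing below them can be mapped
    m, n = len(a.children), len(b.children)
    # dp[i][j]: best mapping size between the first i children of a
    # and the first j children of b
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            dp[i][j] = max(
                dp[i - 1][j],
                dp[i][j - 1],
                dp[i - 1][j - 1]
                + simple_tree_match(a.children[i - 1], b.children[j - 1]),
            )
    return dp[m][n] + 1  # +1 counts the matched roots


def size(t: Node) -> int:
    """Total number of nodes in the tree rooted at t."""
    return 1 + sum(size(c) for c in t.children)


def similarity(a: Node, b: Node) -> float:
    """Match count normalized by the average size of the two trees."""
    return simple_tree_match(a, b) / ((size(a) + size(b)) / 2)


# Two toy data records; the second lacks the third field.
rec1 = Node("tr", [Node("td", [Node("img")]),
                   Node("td", [Node("a")]),
                   Node("td", [Node("span")])])
rec2 = Node("tr", [Node("td", [Node("img")]),
                   Node("td", [Node("a")])])

print(simple_tree_match(rec1, rec2))  # 5 matched node pairs
```

In this toy pair, the `td`/`span` field of `rec1` has no counterpart in `rec2` and does not take part in the mapping. Under the partial alignment described above, a field like this would presumably be left unaligned instead of being forced into a match.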