One goal of text mining is to provide readers with automatic methods for quickly finding the key ideas in individual documents and whole corpora. To this end, we propose a statistically well-founded method for identifying the original ideas that a document contributes to a corpus, focusing on self-referential diachronic corpora such as research publications, blogs, email, and news articles. Our statistical model of passage impact defines (interesting) original content through a combination of impact and novelty, and it can be used to identify the most original passages in a document. Unlike heuristic approaches, this statistical model is extensible and open to analysis. We evaluate the approach on both synthetic and real data, showing that the passage impact model outperforms a heuristic baseline method.