Experiments in discourse analysis impact on information classification and retrieval algorithms

Authors:
J. Morato; J. Llorens; G. Genova; J. A. Moreiro
Affiliations:
Department of Computer Science, Universidad Carlos III de Madrid, Av. Universidad 30, 28911 Leganés, Madrid, Spain (all authors)
Venue:
Information Processing and Management: an International Journal
Year:
2003


Abstract

Researchers in indexing and retrieval systems have advocated the inclusion of more contextual information to improve results. The proliferation of full-text databases and advances in computer storage capacity have made it possible to analyse text using both linguistic and extralinguistic knowledge. Since the mid-1980s, research has paid increasing attention to context, giving discourse analysis a more central role. This paper examines whether discourse variables have an impact on modern information retrieval and classification algorithms. To evaluate this hypothesis, a functional framework for information analysis in an automated environment is proposed, in which n-gram filtering and two classification algorithms, k-means and Chen's, are tested against sub-collections of documents defined by the following discourse variables: "Genre", "Register", "Domain terminology", and "Document structure". The results obtained with each algorithm on the different sub-collections were compared to the MeSH information structure. They indicate that n-gram filtering shows no clear dependence on the discourse variables, that the k-means classification algorithm depends only on domain terminology and document structure, and that Chen's algorithm clearly depends on all of the discourse variables. This information could be used to design better classification algorithms that take discourse variables into account. Other minor conclusions drawn from these results are also presented.
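
The experimental design summarised in the abstract (split a collection by a discourse variable, classify each sub-collection, compare against a reference structure) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes character trigrams as the document representation, a toy cosine-based k-means seeded with the first k documents, and cluster purity as a stand-in for the comparison with MeSH. The field names `text`, `reference`, and `genre` are hypothetical, and Chen's algorithm is not shown.

```python
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, whitespace-normalised text."""
    s = " ".join(text.lower().split())
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(v * b.get(key, 0) for key, v in a.items())
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iters=10):
    """Toy k-means under cosine similarity; seeds are the first k vectors (an assumption)."""
    k = min(k, len(vectors))
    centroids = [Counter(v) for v in vectors[:k]]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # Assign each document to its most similar centroid.
        assign = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        # Recompute each centroid as the sum of its members (scale does not affect cosine).
        for c in range(k):
            members = [v for v, a in zip(vectors, assign) if a == c]
            if members:
                total = Counter()
                for m in members:
                    total.update(m)
                centroids[c] = total
    return assign

def purity(assign, labels):
    """Fraction of documents that belong to the majority reference class of their cluster."""
    by_cluster = defaultdict(list)
    for a, label in zip(assign, labels):
        by_cluster[a].append(label)
    majority = sum(Counter(ls).most_common(1)[0][1] for ls in by_cluster.values())
    return majority / len(labels)

def purity_by_subcollection(docs, variable, k=2):
    """Cluster each sub-collection defined by one discourse variable and score it."""
    groups = defaultdict(list)
    for d in docs:
        groups[d[variable]].append(d)
    scores = {}
    for value, members in groups.items():
        vectors = [char_ngrams(d["text"]) for d in members]
        scores[value] = purity(kmeans(vectors, k), [d["reference"] for d in members])
    return scores
```

Calling `purity_by_subcollection(docs, "genre")` returns one score per genre value. Under these assumptions, large differences between the scores would suggest that the classifier is sensitive to that discourse variable.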