Besoins lexicaux à la lumière de l'analyse statistique du corpus de textes du projet "BREF": le lexique "BDLEX" du français écrit et oral

  • Authors:
  • I. Ferrane;M. de Calmes;D. Cotto;J. M. Pecatte;G. Perennou

  • Affiliations:
  • Université Paul Sabatier, Toulouse, France;Université Paul Sabatier, Toulouse, France;Université Paul Sabatier, Toulouse, France;Université Paul Sabatier, Toulouse, France;Université Paul Sabatier, Toulouse, France

  • Venue:
  • COLING '92 Proceedings of the 14th conference on Computational linguistics - Volume 4
  • Year:
  • 1992

Abstract

In this paper, we describe the lexical needs of surface processing for spoken and written French, such as automatic text correction, speech recognition and speech synthesis. We present statistical observations made on a vocabulary compiled from real texts such as articles. These texts were used to build a recorded speech database called BREF. Developed by LIMSI within the research group GDR-PRC CHM (Groupe De Recherche - Programme de Recherches Concertées, Communication Homme-Machine; Research Group - Concerted Research Program, Man-Machine Communication), this database is intended for the development and assessment of dictation machines.

In this study, the information available in our lexical database BDLEX (Base de Données LEXicales, Lexical Database) is used as reference material. BDLEX, which belongs to the same research group as BREF, has been developed for spoken and written French. Its purpose is to create, organize and provide lexical material for automatic speech and text processing.

Lexical coverage plays an important part in the assessment of such systems. Our first aim is to evaluate the lexical coverage rate that a 50,000-word lexicon can reach. By comparing the vocabulary provided (LexBref, composed of 84,900 items, mainly distinct inflected forms) with the forms generated from BDLEX, we find that about 62% of the forms are known, taking into account some acronyms and abbreviations.

We then address the question of unexpected words by examining the remaining 38% of forms. Among them we find numbers, neologisms, foreign words and proper names, as well as other acronyms and abbreviations. Thus, to obtain broad text coverage, a lexical component must take all these kinds of words into account and must be fault tolerant, particularly with respect to typographical errors.

Finally, we give a general description of the BDLEX project, especially of its lexical content. We describe some lexical data recently added to BDLEX in light of the observations made on real texts. These concern in particular the representation of lexical items using phonograms (i.e. letter/sound associations), information about acronyms and abbreviations, and morphological knowledge about derived words. We also present a set of linguistic tools connected to BDLEX that operate at the phonological, orthographic and morphosyntactic levels.
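
The coverage measurement described in the abstract (the share of distinct corpus forms that also appear among the forms generated from the lexicon) can be illustrated with a minimal sketch. It is not the authors' tooling: the function name `coverage_rate`, the case-insensitive matching, and the toy word lists are all assumptions made for illustration, and the real LexBref and BDLEX data are not reproduced here.

```python
def coverage_rate(corpus_forms, lexicon_forms):
    """Return (rate, unknown) for a corpus vocabulary against a lexicon.

    rate    -- fraction of distinct corpus forms found in the lexicon
    unknown -- set of distinct corpus forms not found in the lexicon

    Assumption: forms are compared case-insensitively; the abstract does
    not say how case was handled in the original study.
    """
    corpus = {form.lower() for form in corpus_forms}
    lexicon = {form.lower() for form in lexicon_forms}
    if not corpus:
        return 0.0, set()
    known = corpus & lexicon
    return len(known) / len(corpus), corpus - known


# Hypothetical toy data standing in for LexBref and the BDLEX-generated forms.
toy_corpus = ["maison", "maisons", "chante", "CNRS", "1992", "Toulouse", "software"]
toy_lexicon = ["maison", "maisons", "chante", "chantes", "cnrs"]

rate, unknown = coverage_rate(toy_corpus, toy_lexicon)
print(f"coverage: {rate:.1%}")       # coverage: 57.1%
print("unknown:", sorted(unknown))   # unknown: ['1992', 'software', 'toulouse']
```

In this toy run the unknown forms are a number, a proper name and a foreign word, which mirrors the categories the abstract reports among the uncovered forms.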