SALS-SIG Seminar


A Framework for Complex Tokenisation and its Application to Newspaper Text


Robert Dale
MRI Language Technology Group

When: Tuesday, 4th March 1997

Time: 11:30am

Where: Room E6A357, Macquarie University

Abstract:

A word is more than a sequence of characters between two spaces. This fact has generally been ignored in research on natural language processing; but recognising the complexity of what it is to be a word is of crucial importance if we are to add sophisticated natural language processing techniques to existing document processing applications to make them more language-sensitive.

This paper describes a framework for the tokenisation of text that tries to address this problem by providing a parameterisable approach to the tokenisation task, so that NLP components can be provided with a richer analysis of real texts. We demonstrate the ideas by applying them to the wide variety of word forms that appear in newspaper text.
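As a rough illustration only (not the framework presented in the talk), the sketch below contrasts splitting on spaces with a tokeniser whose token classes are supplied as a parameter. The class names, the regular expressions, the helper `make_tokeniser` and the sample sentence are all assumptions made for this example.

```python
import re
from typing import Callable, List, Tuple

Token = Tuple[str, str]  # (token class, matched text)

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical token classes for newspaper-like text. Order matters:
# more specific patterns are listed before more general ones.
NEWSPAPER_PATTERNS: List[Tuple[str, str]] = [
    ("DATE", rf"\d{{1,2}}(?:st|nd|rd|th)? (?:{MONTHS}) \d{{4}}"),
    ("MONEY", r"\$\d+(?:\.\d+)?(?:m|bn)?"),
    ("TIME", r"\d{1,2}:\d{2}(?:am|pm)"),
    ("ABBREV", r"(?:[A-Z]\.){2,}"),
    ("WORD", r"[A-Za-z]+(?:['-][A-Za-z]+)*"),
    ("NUMBER", r"\d+(?:[.,]\d+)*"),
    ("PUNCT", r"[^\w\s]"),
]


def make_tokeniser(patterns: List[Tuple[str, str]]) -> Callable[[str], List[Token]]:
    """Build a tokeniser from a list of (class name, regex) pairs."""
    # One named group per class; the first alternative that matches wins.
    combined = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in patterns))

    def tokenise(text: str) -> List[Token]:
        # m.lastgroup is the name of the class whose pattern matched.
        return [(m.lastgroup, m.group()) for m in combined.finditer(text)]

    return tokenise


sentence = "The U.S. firm paid $4.5m on 4th March 1997 at 11:30am."

# Naive view: a word is whatever sits between two spaces.
naive_tokens = sentence.split()

# Parameterised view: the same text, analysed with the pattern set above.
rich_tokens = make_tokeniser(NEWSPAPER_PATTERNS)(sentence)
```

Here the space-based split breaks the date into three pieces and leaves the final full stop attached to the time, while the pattern-based version keeps "4th March 1997" together as one token and separates the full stop. Passing a different pattern list to `make_tokeniser` changes the analysis without changing the code, which is one simple reading of "parameterisable".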


Enquiries: sals@mri.mq.edu.au

Last modified: July 1997