A Framework for Complex Tokenisation and its Application to Newspaper Text
Robert Dale
MRI Language Technology Group
When: Tuesday, 4th March 1997
Time: 11:30am
Where: Room E6A357, Macquarie University
Abstract:
A word is more than a sequence of characters between two spaces. Research on natural language processing has generally ignored this fact, but recognising the complexity of what it is to be a word is crucial if we are to add sophisticated natural language processing techniques to existing document processing applications and make them more language-sensitive.
This paper describes a framework for the tokenisation of text that tries to address this problem by providing a parameterisable approach to the tokenisation task, so that NLP components can be provided with a richer analysis of real texts. We demonstrate the ideas by applying them to the wide variety of word forms that appear in newspaper text.
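To make the idea of a parameterisable tokeniser concrete, the following is a minimal illustrative sketch only; it is not the framework presented in the talk. It assumes that the "parameters" are an ordered list of labelled regular expressions. The names `TokeniserConfig` and `tokenise`, the rule labels, and the patterns themselves are all hypothetical and chosen purely for illustration.

```python
import re
from dataclasses import dataclass, field


def _default_rules():
    # Hypothetical rule set: ordered (label, pattern) pairs, earlier rules win.
    return [
        ("abbreviation", r"(?:[A-Za-z]\.){2,}"),       # e.g. U.S.
        ("money", r"\$\d+(?:\.\d+)?[mbk]?"),           # e.g. $3.5m
        ("number", r"\d+(?:[.,]\d+)*"),                # e.g. 1,000 or 3.5
        ("word", r"[A-Za-z]+(?:['-][A-Za-z]+)*"),      # e.g. don't, well-known
        ("punct", r"[^\w\s]"),                         # any other visible symbol
    ]


@dataclass
class TokeniserConfig:
    """The tokeniser's parameters: which token classes exist and their order."""
    rules: list = field(default_factory=_default_rules)


def tokenise(text, config=None):
    """Return (token, label) pairs for text under the given configuration."""
    config = config or TokeniserConfig()
    # Combine the rules into one alternation of named groups, so that
    # match.lastgroup reports which rule produced each token.
    pattern = re.compile(
        "|".join(f"(?P<{label}>{pat})" for label, pat in config.rules)
    )
    return [(m.group(), m.lastgroup) for m in pattern.finditer(text)]


if __name__ == "__main__":
    sentence = "The U.S. paid $3.5m."
    # Richer analysis: each token carries a class label.
    print(tokenise(sentence))
    # Swapping the parameters reproduces the naive whitespace view of a word.
    naive = TokeniserConfig(rules=[("chunk", r"\S+")])
    print(tokenise(sentence, naive))
```

Under these assumptions, changing the rule list changes what counts as a token without touching the tokenising function, which is the sense of "parameterisable" illustrated here.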
Enquiries: sals@mri.mq.edu.au
Last modified: July 1997