Honours Projects
This page lists possible Honours projects in Language Technology for 2005. If you find that a project listed here is close to something you're interested in, but isn't quite what you were looking for, you should speak to the project supervisor to see if an appropriate project can be constructed. More generally you'll find that members of staff are usually open to suggestions for projects. Note that you need to provide the honours convenor with your project title and supervisor's name by the Monday of Week 2, and you have to submit your proposal and make a presentation in Week 4.
There is also a collection of project topics using Nokia's state-of-the-art mobile network laboratory. See also our information on:

Projects in the area of Speech Recognition

EMU: Tools for Annotation and Corpus Querying
Supervisor: Steve Cassidy

I'm the main author of Emu, which is a set of software tools for research with annotated speech corpora. The development of Emu is ongoing and there are likely to be various projects apart from the ones listed here. Most of these projects require no knowledge of speech and can be seen as general Software Engineering/Database projects. For more details consult Steve Cassidy's list of Honours projects.

Speaker Identification in Meetings
Supervisor: Steve Cassidy

We have an ongoing project to analyse audio recordings made in meetings. In the first phase we are trying to segment the audio stream according to who is talking: speaker segmentation and identification. Possible student projects in this area might involve evaluating different speaker identification algorithms; applying speech recognition to the audio stream to build an index for information retrieval; or investigating algorithms for coping with varying room acoustics in different meeting rooms. If you are interested in hardware, there are ideas to follow up in building a special-purpose meeting recorder device, something like a PDA which can be used to obtain high-quality recordings of meetings and do some of the indexing work on the captured speech signal.

Recognising Australian Speech
Supervisor: Steve Cassidy

This project involves training a speech recogniser to work on Australian speech. This would fit in with the Centre's DARPA Communicator project, using the Sphinx speech recognition engine. The project would involve getting to know Sphinx well, adapting it to and training it on the Australian data we have, and then evaluating its performance, perhaps in the context of an application like the Department's information kiosk.
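
One common way to evaluate a recogniser's performance is word error rate (WER): the word-level edit distance between a reference transcript and the recogniser's output, divided by the length of the reference. The project description does not prescribe a metric, so the sketch below is only an assumption about how the evaluation might start; the function name `word_error_rate` is hypothetical and is not part of Sphinx.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # i deletions turn i reference words into an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative transcripts only (not taken from any real corpus).
wer = word_error_rate("g'day mate how are you", "good day mate how are you")
```

Here the hypothesis needs one substitution and one insertion against a five-word reference, so `wer` is 0.4. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.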

Projects related to Question Answering and Information Retrieval

Most of the projects in this section are related to AnswerFinder. AnswerFinder is a question answering system that finds and returns the answer to an arbitrary question by exploring text documents. To do this, AnswerFinder constructs the logical forms of the questions and compares them with the logical forms of the answers. To speed up the process and make it possible to explore considerable volumes of text, AnswerFinder incorporates additional methods based on shallow but fast processing of text.

Representing the Semantics of Sentences
Supervisor: Diego Mollá Aliod

AnswerFinder uses logical forms to determine if a sentence contains the answer to the question. These logical forms, however, are rather difficult for humans to understand, and therefore the process of manually discovering inference rules to add to the system is time-consuming. The goal of this project is to determine methods to simplify the presentation of logical forms to the user. The methods may be a combination of graphical expressions (e.g. represent the dependencies between the concepts graphically) or natural language generation (e.g. write a paraphrase that accurately describes the contents of the logical form), or something else. You decide!

Graph-based Question Answering
Supervisor: Diego Mollá Aliod

A sentence is a structured collection of words. This structure can be represented as a graph where the nodes are the concepts expressed in the sentence, and the arcs are the relations between the concepts. The aim of this project is to explore the use of such graphs as a means of sentence representation for the task of question answering. The project will involve the automatic creation of the graph, the use of graph theory methods to determine if a sentence can answer a specific question, and the extraction of the exact answer from the sentence.
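
To give a feel for the graph idea in the Graph-based Question Answering project, here is a minimal sketch. It assumes a sentence graph is stored as a set of (head, relation, dependent) triples and that the question graph contains one placeholder node, `?x`, standing for the unknown answer. The graphs are written by hand, the relation labels `agent` and `object` are invented for illustration, and `answer_from_graph` is a hypothetical helper; none of this is AnswerFinder's actual representation.

```python
from typing import Optional, Set, Tuple

# (head concept, relation label, dependent concept)
Edge = Tuple[str, str, str]


def answer_from_graph(question: Set[Edge], sentence: Set[Edge],
                      var: str = "?x") -> Optional[str]:
    """Return a sentence concept that can replace `var` so that every
    question edge also occurs in the sentence graph, or None."""
    nodes = {node for head, _, dep in sentence for node in (head, dep)}
    for candidate in sorted(nodes):
        # Substitute the candidate for the placeholder in every question edge.
        bound = {
            (candidate if head == var else head,
             rel,
             candidate if dep == var else dep)
            for head, rel, dep in question
        }
        if bound <= sentence:  # the question graph is a subgraph of the sentence
            return candidate
    return None


# "Steve Cassidy wrote Emu."  (hand-built; the project would build this automatically)
sentence_graph = {
    ("wrote", "agent", "Steve Cassidy"),
    ("wrote", "object", "Emu"),
}
# "Who wrote Emu?"
question_graph = {
    ("wrote", "agent", "?x"),
    ("wrote", "object", "Emu"),
}

answer = answer_from_graph(question_graph, sentence_graph)
```

With these two graphs, `answer` is `"Steve Cassidy"`. Exact subgraph matching like this is brittle; a real system would need to tolerate paraphrase and partial overlap, which is where the graph theory methods mentioned above come in.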

Question Answering from Speech Data
Supervisor: Diego Mollá Aliod

A continuous speech recognition system may have a recognition error rate of up to 50%. This high percentage of recognition errors presents new challenges to question answering systems. This project aims at developing a question answering system that uses the output of a speech recognition system as the input data.

Classification of Bibliographic References
Supervisor: Diego Mollá Aliod

We have a BibTeX database of bibliography entries, where every entry typically contains information about the author, title, abstract, and additional comments. Every entry is also tagged with keywords according to a keyword ontology. However, the process of updating the keywords in the bibliography entries when the ontology changes is too time-consuming and prone to errors. The goal of this project is to automatically assign keywords to the bibliography entries given an arbitrary keyword ontology.

Retrieval of Bibliographic References
Supervisor: Diego Mollá Aliod

It is always difficult to remember who said what in what document. We have a BibTeX database of bibliography entries, where every entry typically contains information about the author, title, abstract, and additional comments. Every entry is also tagged with keywords according to a keyword ontology. The goal of this project is to retrieve the bibliography entries that are relevant to the topic given in an arbitrary user query. An important part of the project is to account for variations of terms describing related concepts.

Classification of Questions
Supervisor: Diego Mollá Aliod

Our system currently uses very simple rules to determine the type of information a question is asking for. The goal of this project is to build a question classification system that automatically learns the types of questions by analysing a corpus that is annotated with the correct question types.
This project is especially suitable for those who are doing COMP348 in the first semester of 2007.

Answering Complex Questions
Supervisor: Diego Mollá Aliod

Currently we are developing a system that answers complex questions where the answer needs to be composed by exploring several documents. The current system simply presents all sentences that have some part of the answer, but this can be done better. The goal of this project is to combine the independent answers in such a way that the resulting answer is coherent and has reduced redundancy.

Processing Wikipedia

This set of projects is about extending Wikipedia to make it easier to find information in it.

Question Answering on Wikipedia
Supervisor: Diego Mollá Aliod

The goal of this project is to use a 2-stage question answering system that converts the user question into a series of Web queries on Wikipedia pages, queries Wikipedia, and collects the result. The result is processed to find the exact answer to the query by combining AnswerFinder technology with other state-of-the-art technology on question answering.

Find Related Information
Supervisor: Diego Mollá Aliod

Given a Wikipedia page, find other Wikipedia articles that are related to it and propose them as links from the page.

Find Translations
Supervisor: Diego Mollá Aliod

Given a Wikipedia page in one language, find its equivalent pages in other languages.

Search and Summarise
Supervisor: Diego Mollá Aliod

Find all documents relevant to a topic, and with them compose a summary (this could easily be extended to a PhD project).

Learn Entailments
Supervisor: Diego Mollá Aliod

Use Wikipedia to learn text patterns that indicate entailment between two words. This could be done in two steps:

KELP: Knowledge Extraction and Linguistic Presentation

KELP is a new project aimed at carrying out sophisticated extraction of information from online resources, and then combining and collating this information in novel ways, re-presenting it to users via both speech and text. The project involves the use of techniques in information extraction, natural language analysis, natural language generation, user modelling, and spoken language dialogue systems.

Extracting Tabular Information from Web Pages
Supervisor: Robert Dale, Rolf Schwitter or Diego Mollá Aliod

Much important information in web pages is presented in tables. However, it turns out to be quite difficult to extract the information from tables in a meaningful way, because the authors of web pages use tables for a range of purposes besides laying out data. This project will explore how information extraction techniques can be used to construct well-organised data structures from the information embedded in web pages.

An Information Extraction Toolkit
Supervisor: Robert Dale, Rolf Schwitter or Diego Mollá Aliod

Much work in information extraction involves searching for patterns in text and then extracting specific pieces of information on the basis of the patterns that are found. Although this is generally accomplished using regular expressions written expressly for the task at hand, it turns out that there are many patterns which recur from one domain to another, and a number of operations applied to manipulate these patterns that also recur. The aim of this project is to construct a toolkit that operationalises these observations, and so provides a way of easily moving KELP from one domain to another.
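
As a rough illustration of the toolkit idea, the sketch below keeps a small library of named, domain-independent regular expressions and lets each domain pick the ones it needs. The pattern names, the patterns themselves, and the `extract` function are all assumptions made for this example; they are not part of KELP, and the patterns are deliberately simple rather than robust.

```python
import re
from typing import Dict, List

_MONTHS = ("January|February|March|April|May|June|July|August|"
           "September|October|November|December")

# Hypothetical library of patterns that recur across domains.
PATTERNS: Dict[str, re.Pattern] = {
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "date": re.compile(r"\b\d{1,2} (?:" + _MONTHS + r") \d{4}\b"),
    "amount": re.compile(r"\$\d[\d,]*(?:\.\d+)?"),
}


def extract(text: str, names: List[str]) -> Dict[str, List[str]]:
    """Apply the selected named patterns and return every match for each."""
    return {name: PATTERNS[name].findall(text) for name in names}


# Made-up announcement text, for illustration only.
text = ("Submissions are due on 14 March 2005; contact chair@example.org. "
        "Registration costs $250.")
found = extract(text, ["date", "email", "amount"])
```

For this text, `found` maps `"date"` to `["14 March 2005"]`, `"email"` to `["chair@example.org"]` and `"amount"` to `["$250"]`. Moving to a new domain would then mean choosing a different subset of pattern names, or registering a few new ones, rather than rewriting the extraction code.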

Other Language Technology Projects

Stock Portfolio Reporting
Supervisor: Robert Dale

Over the years we have developed a number of research prototypes that dynamically generate textual summaries of stock market behaviour: see, for example, the StockReporter system. In this project you will develop an application in the same domain. There are a number of possible directions here: for example
There are many other possibilities in this domain.

Information Extraction from Job Descriptions
Supervisor: Robert Dale

We have a corpus of over 1000 emailed job descriptions, all in the language technology domain or related areas. Searching through this amount of data for a job that you might be interested in is painful; and although simple information retrieval and search techniques based on keywords can help a little, ultimately what we really want is to be able to derive more structured information from this data, so that the job descriptions can be processed in order to populate a database. This would enable more robust queries to be posed, so that for example you might look for jobs in specific geographic areas that require specific programming languages. The aim of this project is to develop an information extraction system that can locate and extract useful elements of information from these job ads.

An Automatically Constructed Conferences Website
Supervisor: Robert Dale

We have a corpus of around 6000 conference announcements, all in the language technology domain or related areas. The aim of this project is to develop robust technology that extracts key information (such as the title of the conference, where it is being held, and the dates of the event) from these announcements, stores this information as XML, and then uses XSLT and related technologies to provide a highly functional web interface for browsing the information. The project involves research in both information extraction and web technology.

Coreference Resolution
Supervisor: Robert Dale

There has been a lot of research into pronominal reference resolution, but the problem of determining co-reference between proper names is much less explored. This project, which will attract a $5000 scholarship from the Capital Markets CRC, is concerned with working out when two proper names refer to the same person.
The aim is to develop new techniques that can be applied to a previously unseen text domain in order to (a) identify the proper names that appear in that domain and (b) determine when multiple names refer to the same entity. The techniques will be developed to handle both person names (as in 'Mrs Clinton', 'Hillary Clinton', and 'Bill Clinton's wife') and company names (as in 'BHP' and 'Broken Hill Proprietary Limited').

Machine Translation
Supervisor: Mark Dras

This would involve investigating a specific language pair and examining issues in machine translation with respect to that pair. Very recent work at Johns Hopkins University has explored integrating structural approaches (where you design rules for translation) with statistical approaches (where the system "learns" translation). A specific project would be to replicate the preliminary work from Johns Hopkins with a closer language pair (say, English-French), and to evaluate results relative to purely structural or purely statistical approaches. A more general project in this area is also possible. For more details of this project consult Mark Dras' Honours project page.

Paraphrase
Supervisor: Mark Dras

This project would be related to some of my research on paraphrasing. The idea would be to build a system, using an existing broad-coverage parser together with an existing mathematical optimisation package, that would take a text (e.g. a paper) and fit it to a set of constraints (e.g. a 2000 word limit with sentences of middling complexity).

Using the Web for Term Translation
Supervisor: Diego Mollá Aliod

Human translators often find it difficult to determine the exact translation of technical terms in specialised areas. The goal of this project is to build a system that, given a term in a specific document, uses the Web to find the most likely translations in the target language.
This project combines multilingual information retrieval techniques with machine translation techniques.