Un-Natural Language Processing

John Harris, Jon Tapson and Adam McLeod

Telluride Cognitive Neuromorphic Workshop

July 14, 2011

Abstract: We implemented a biologically-inspired natural language processing system in Matlab on a laptop computer. The user interacts with the system through spoken sentences using a simplified sentence format. The system learns from statements it hears and responds to questions using synthetic speech. At the heart of the system is a spiking neural network implementing an associative memory that “remembers” simple binding relations spoken by the user. The network stores and recalls these relations as patterns of activation among its neurons.

1. Introduction

We were initially inspired by O’Reilly’s sentence gestalt model, in which a Leabra network learns to parse a sentence into entities (such as the agent, co-agent, action, and object) purely by being presented with a series of sentences drawn from a small vocabulary. Even with the simplified vocabulary, the system takes more than 10,000 epochs of training to learn from scratch. We therefore decided to shift our focus to the interaction of a simplified natural language with a neural working memory system.

2. Language model

The system is called the “Un-natural” language processing system because the language is not natural at all: the position of each word in the sentence directly indicates its part of speech. Valid sentences consist of 3 words: a subject, a verb and an object, strictly in that order. No articles such as “a” or “the” are considered. There is a limited vocabulary for each type of word. Valid sentences include “Bob has dog” and “Charlie has book”. If the system is told these two statements and is then asked “Who has dog?”, it responds “Bob has dog”. If it is then asked “Charlie has what?”, it responds “Charlie has book”. The possible subjects are “Adam”, “Bob”, “Charlie”, “Dave” and “Who”. The only verb currently allowed is “has”. The possible objects are “apple”, “book”, “cow”, “dog” and “what”.
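The sentence format above can be illustrated with a minimal Python sketch. It is not the authors' Matlab code: the function name parse_sentence and the returned tuple layout are hypothetical, and rejecting “Who has what?” is an assumption, since the text does not say how such a query is handled.

```python
# Illustrative sketch only; names and return format are assumptions.
SUBJECTS = {"adam", "bob", "charlie", "dave"}
VERBS = {"has"}
OBJECTS = {"apple", "book", "cow", "dog"}


def parse_sentence(text):
    """Parse a 3-word sentence into (kind, subject, object).

    kind is "statement" or "question". In a question, the queried
    slot ("who" or "what") is returned as None.
    Raises ValueError if the sentence does not fit the fixed format.
    """
    words = text.strip().rstrip(".?!").lower().split()
    if len(words) != 3:
        raise ValueError("expected exactly 3 words: subject verb object")
    subject, verb, obj = words

    # Word position alone determines the part of speech.
    if subject not in SUBJECTS | {"who"}:
        raise ValueError(f"unknown subject: {subject}")
    if verb not in VERBS:
        raise ValueError(f"unknown verb: {verb}")
    if obj not in OBJECTS | {"what"}:
        raise ValueError(f"unknown object: {obj}")

    asks_subject = subject == "who"
    asks_object = obj == "what"
    if asks_subject and asks_object:
        # Assumption: the source does not define "Who has what?".
        raise ValueError("only one slot may be queried at a time")
    if asks_subject:
        return ("question", None, obj)
    if asks_object:
        return ("question", subject, None)
    return ("statement", subject, obj)


if __name__ == "__main__":
    for sentence in ["Bob has dog", "Who has dog?", "Charlie has what?"]:
        print(sentence, "->", parse_sentence(sentence))
```

Because each slot has its own closed vocabulary, validation reduces to three set-membership checks and no grammar is needed.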

3. Memory model

There are N subjects and N objects in the vocabulary of the system, giving N² possible bindings between the words. A memory system needs to store the relations extracted from the spoken statements and to recall them when the user asks a question. There are two primary classes of associative memory that can be used in this kind of problem. The first involves synaptic weight updates, which typically take a long time to train but are more permanent. If learning is accelerated in such a system, a second memory will likely erase the first. To work around this problem, both memories must be presented in sequence and trained over many epochs, allowing the two memories to develop distributed representations that can coexist with one another.

The second type of learning is activation-based learning, which is the solution we chose. For this type, no synaptic weights need to be modified. The pattern of activation among the neurons and the underlying attractor dynamics store the information. This leads to much faster learning, but the memories are more transient since they rely on the pattern of activation persisting. In our spiking neuron model, there is one neuron representing each subject and one representing each object. When a relation is presented to the network, weights are applied to a third neuron that represents the relation between these two neurons. A recurrent connection holds the activation at that neuron.
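The idea of a binding neuron held active by a recurrent connection can be sketched in Python as follows. This is a simplified discrete-time integrate-and-fire illustration, not the authors' spiking model: the class name BindingMemory and every parameter value (threshold, w_input, w_self, leak) are assumptions chosen so that only coincident subject and object input crosses threshold.

```python
# Illustrative sketch only; not the authors' network or parameters.
class BindingMemory:
    """One binding neuron per (subject, object) pair, N*N in total."""

    def __init__(self, subjects, objects, threshold=1.0,
                 w_input=0.6, w_self=1.2, leak=0.3):
        self.subjects = list(subjects)
        self.objects = list(objects)
        self.threshold = threshold
        self.w_input = w_input    # weight from a subject or object neuron
        self.w_self = w_self      # recurrent self-excitation
        self.leak = leak          # fraction of potential kept per step
        n_s, n_o = len(self.subjects), len(self.objects)
        self.v = [[0.0] * n_o for _ in range(n_s)]
        self.spiking = [[False] * n_o for _ in range(n_s)]

    def step(self, active_subject=None, active_object=None):
        """Advance every binding neuron by one time step."""
        new_spiking = []
        for i, s in enumerate(self.subjects):
            row = []
            for j, o in enumerate(self.objects):
                drive = 0.0
                if s == active_subject:
                    drive += self.w_input
                if o == active_object:
                    drive += self.w_input
                if self.spiking[i][j]:
                    drive += self.w_self  # recurrent connection
                v = self.leak * self.v[i][j] + drive
                fired = v >= self.threshold
                self.v[i][j] = 0.0 if fired else v  # reset after a spike
                row.append(fired)
            new_spiking.append(row)
        self.spiking = new_spiking

    def store(self, subject, obj):
        """Present subject and object together for one step."""
        self.step(subject, obj)

    def idle(self, steps):
        """Run the network with no input."""
        for _ in range(steps):
            self.step()

    def who_has(self, obj):
        j = self.objects.index(obj)
        return [s for i, s in enumerate(self.subjects) if self.spiking[i][j]]

    def has_what(self, subject):
        i = self.subjects.index(subject)
        return [o for j, o in enumerate(self.objects) if self.spiking[i][j]]


if __name__ == "__main__":
    memory = BindingMemory(["adam", "bob", "charlie", "dave"],
                           ["apple", "book", "cow", "dog"])
    memory.store("bob", "dog")
    memory.store("charlie", "book")
    memory.idle(20)
    print(memory.who_has("dog"), memory.has_what("charlie"))
```

In this sketch a single input (0.6) never reaches threshold, even when sustained, because the leaky potential saturates below 1.0. Coincident input (1.2) triggers a spike, after which the self-connection alone re-triggers the neuron on every step. No weight changes, so storage takes one step, but the memory exists only while the activity persists.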

An alternative that lies between the two extremes of slow weight-based learning and activation-based learning is a model of the hippocampus. Here fast learning occurs through weight updates, and the memories can persist over the long term. We looked closely at the hippocampus model in Leabra described in O’Reilly’s textbook, but there were several problems with using it directly. First, there wasn’t time to build the appropriate plug-ins to Leabra to handle the real-time speech input and output. Second, learning in this model still takes a few epochs, which is inconsistent with the working memory system we required. We plan a future frontal-cortex-type model that does not require N² neurons to bind N subjects to N objects. Instead, quick Hebbian learning will be used to increase the weights between the two neurons representing the subject and object to be bound. If the network is queried with partial information that stimulates either the subject or the object, the other neuron will then be driven to fire.
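The planned Hebbian binding scheme can be sketched in Python as a rough illustration. The model is described only as future work, so everything here is an assumption: the class name HebbianBinder, the binary activations, the learning rate and the firing threshold.

```python
# Illustrative sketch only; the planned model is not specified in detail.
class HebbianBinder:
    """N subject neurons and N object neurons joined by fast weights."""

    def __init__(self, subjects, objects, learning_rate=1.0, threshold=0.5):
        self.subjects = list(subjects)
        self.objects = list(objects)
        self.learning_rate = learning_rate
        self.threshold = threshold
        # One weight per subject-object connection, initially zero.
        self.w = {(s, o): 0.0 for s in self.subjects for o in self.objects}

    def bind(self, subject, obj):
        """Hebbian update: both neurons are active (activity 1.0),
        so the weight between them grows by learning_rate * 1.0 * 1.0."""
        self.w[(subject, obj)] += self.learning_rate

    def recall_objects(self, subject):
        """Stimulate one subject neuron; return object neurons that fire."""
        return [o for o in self.objects
                if self.w[(subject, o)] >= self.threshold]

    def recall_subjects(self, obj):
        """Stimulate one object neuron; return subject neurons that fire."""
        return [s for s in self.subjects
                if self.w[(s, obj)] >= self.threshold]


if __name__ == "__main__":
    binder = HebbianBinder(["adam", "bob", "charlie", "dave"],
                           ["apple", "book", "cow", "dog"])
    binder.bind("bob", "dog")
    binder.bind("charlie", "book")
    print(binder.recall_subjects("dog"), binder.recall_objects("charlie"))
```

The sketch uses 2N neurons rather than N² binding neurons. The bindings live in the connection weights, so they do not depend on sustained activity.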

4. Speech input/output concerns

The speech recognition component is the most challenging part of the I/O functionality. For training purposes, the program first prompts the user to speak all of the words in the vocabulary. An energy-based segmentation routine was written to break the utterances up into individual words. A temporal threshold was used so that utterances containing short, low-energy phonemes (such as fricatives) are still classified as a single word. Words were then labeled and stored for the duration of the session. During the actual program, the user is prompted to speak a sentence. The segmentation algorithm is then called to break the sentence up into individual words via the same energy/temporal thresholding procedure. Windowed Mel-Frequency Cepstral Coefficients (MFCCs) are computed for each word using Malcolm Slaney’s Auditory Toolbox. Dynamic time warping (DTW) is applied to the extracted features of each word to find the closest word in the trained vocabulary. On the demo night the speech recognition failed only once, when there was a very loud conversation going on in the room. The system was trained on only one speaker and no attempt was made to generalize to other speakers. There are a few obvious ways to extend the DTW speech recognition to other speakers, but this was not the focus of our work.
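The two main steps above, energy/temporal segmentation and DTW template matching, can be sketched in Python. This is a simplified illustration rather than the authors' Matlab routines: the function names, frame length and thresholds are assumptions, and the feature vectors stand in for the MFCCs, which are not computed here.

```python
# Illustrative sketch only; parameters and names are assumptions.
import math


def segment_words(samples, frame_len=4, energy_thresh=0.01, min_gap_frames=2):
    """Return (start, end) sample ranges of words, end exclusive.

    A frame is active if its mean energy exceeds energy_thresh. A word
    ends only after min_gap_frames consecutive quiet frames, so a short
    low-energy dip inside a word does not split it.
    """
    n_frames = len(samples) // frame_len
    segments, start, gap = [], None, 0
    for k in range(n_frames):
        frame = samples[k * frame_len:(k + 1) * frame_len]
        energy = sum(x * x for x in frame) / frame_len
        if energy > energy_thresh:
            if start is None:
                start = k
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap_frames:
                # The last active frame was k - gap.
                segments.append((start * frame_len,
                                 (k - gap + 1) * frame_len))
                start, gap = None, 0
    if start is not None:
        segments.append((start * frame_len, (n_frames - gap) * frame_len))
    return segments


def dtw_distance(a, b):
    """Dynamic time warping distance between two sequences of
    equal-dimension feature vectors (Euclidean local cost)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j],      # stretch b
                                 cost[i][j - 1],      # stretch a
                                 cost[i - 1][j - 1])  # advance both
    return cost[n][m]


def recognize(features, templates):
    """Return the label of the template closest to features under DTW."""
    return min(templates, key=lambda w: dtw_distance(features, templates[w]))


if __name__ == "__main__":
    signal = [0.0] * 8 + [1.0] * 8 + [0.0] * 4 + [1.0] * 4 + [0.0] * 12
    print(segment_words(signal))
    templates = {"bob": [[0.0], [1.0], [0.0]],
                 "dog": [[1.0], [1.0], [2.0], [2.0]]}
    print(recognize([[0.0], [0.0], [1.0], [1.0], [0.0]], templates))
```

DTW suits this setting because the same speaker says a word at different speeds on different occasions, and the warping path absorbs those timing differences before the distances are compared.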

Speech output just uses the existing speech synthesis routine available in OS X on the Mac.
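As a rough illustration of this step, the Python sketch below calls the macOS command-line tool say. This is an assumption: the text does not name the routine that the Matlab system invoked, and the helper names say_command and speak are hypothetical.

```python
# Illustrative sketch only; the use of the `say` tool is an assumption.
import shutil
import subprocess


def say_command(text):
    """Build the argument list for the macOS `say` tool."""
    return ["say", text]


def speak(text):
    """Speak text aloud if `say` is installed, else print it.

    Returns True if the text was sent to the synthesizer.
    """
    if shutil.which("say") is None:
        print(text)  # fallback on systems without `say`
        return False
    subprocess.run(say_command(text), check=True)
    return True


if __name__ == "__main__":
    speak("Bob has dog")
```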

5. Conclusions and future work

There are numerous extensions planned for this work, including the development of a more natural language input, the implementation of a more robust speech recognition system, and the development of a more realistic neural associative memory. Our main interest is to provide mechanisms, based on reinforcement learning, for the system to automatically understand a wider vocabulary of words and what they mean. For instance, we would like to present the system with sentences like “Bob has the ball” and “Bob gave the ball to Charlie”. The system should learn that Bob no longer has the ball and that the word “gave” means a transfer of possession. Rather than explicitly building a network that understands all of these words, we want the system to learn these concepts through reinforcement learning, from the presentation of possibly thousands of examples with queries that provide the right answer.

The “Un-Natural Language Processing” project was the first effort at Telluride to develop a spoken dialog between a robot and a human. As the drive towards cognition continues, such dialog systems will be necessary to interact with systems at the cognitive level.