Wednesday, March 31, 2010

On grandmother cells and parallel distributed models

Jeff Bowers has published a paper or two arguing for the viability of grandmother cells -- cells that represent whole "objects" such as a specific face (or your grandmother's face). At issue, of course, is whether the brain represents information in a localist or distributed fashion and Jeff has used his case for grandmother cells as evidence against a basic assumption of parallel distributed processing (PDP) models. But the PDP folks don't seem to think "distributed" is a necessary property of PDP models. So in the guest post below, Jeff asks, What does the D in PDP actually mean? This is an interesting question, and Jeff would like to know your thoughts (see the new survey to respond). I'd also be interested in your thoughts on grandmother cells!

Guest Post from Jeff Bowers:
I’ve been involved in a recent debate regarding the relative merits of localist representations and the distributed representations learned in Parallel Distributed Processing (PDP) networks. By localist, I mean something like a word unit in the interactive activation (IA) model – a unit that represents a specific word (like a “grandmother cell”). By distributed, I mean that a familiar word (or an object or a face, etc.) is coded as a pattern of activation across a set of units, with no single unit sufficient for representing an item (you need to consider the complete pattern). In Bowers (2009, 2010) I argue that the neuroscience is more consistent with localist coding compared to the distributed representations in PDP networks, contrary to the widespread assumption in the cognitive science community. That is, single-cell recordings of neurons in cortex and hippocampus often reveal neurons that are remarkably selective in their responding (e.g., a neuron that responds to one face out of many). I took this to be more consistent with localist compared to distributed PDP theories.
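To make the two definitions concrete, here is a minimal sketch in Python. It is not taken from the IA model or from any PDP model: the three words, the unit patterns, and the helper `identifying_units` are all hypothetical and chosen only for illustration. It checks, for each word, whether any single unit is active for that word and for no other.

```python
# Toy illustration only: the three words and the unit patterns are hypothetical.

# Localist code: one dedicated unit per word, like a word unit in the IA model.
LOCALIST = {
    "dog": [1, 0, 0],
    "cat": [0, 1, 0],
    "cup": [0, 0, 1],
}

# Distributed code: each word is a pattern over shared units, and every unit
# takes part in coding more than one word.
DISTRIBUTED = {
    "dog": [1, 1, 0],
    "cat": [1, 0, 1],
    "cup": [0, 1, 1],
}


def identifying_units(code):
    """Map each word to the units that are active for it and for no other word."""
    result = {}
    for word, pattern in code.items():
        others = [p for w, p in code.items() if w != word]
        result[word] = [
            i
            for i, activation in enumerate(pattern)
            if activation > 0 and all(other[i] == 0 for other in others)
        ]
    return result


if __name__ == "__main__":
    print(identifying_units(LOCALIST))     # one identifying unit per word
    print(identifying_units(DISTRIBUTED))  # no identifying unit for any word
```

In the localist toy code every word has exactly one unit that identifies it; in the distributed toy code no word does, so the complete pattern has to be consulted. Real models sit somewhere along this range, which is part of what the debate is about.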

This post, however, is not about whether localist or PDP models are more biologically plausible. Rather, I’m curious as to what people think is the theory behind PDP models; specifically, what is your understanding regarding the relation between distributed representations and PDP models? In Bowers (2009, 2010) I argue that PDP models are committed to the claim that information is coded in a distributed format rather than a localist format. On this view, the IA model of word identification that includes single units to code for specific words (e.g., a DOG unit) is not a PDP model. Neither are neural networks that learn localist representations, like the ART models of Grossberg. On my understanding, a key (necessary) feature of the Seidenberg and McClelland model of word naming that makes it part of the PDP family is that it learns distributed representations of words – it gets rid of localist word representations.
However, Plaut and McClelland (2010) challenge this characterization of PDP models. That is, they write:

In accounting for human behavior, one aspect of PDP models that is especially critical is their reliance on interactivity and graded constraint satisfaction to derive an interpretation of an input or to select an action that is maximally consistent with all of the system’s knowledge (as encoded in connection weights between units). In this regard, models with local and distributed representations can be very similar, and a number of localist models remain highly useful and influential (e.g., Dell, 1986; McClelland & Elman, 1986; McClelland & Rumelhart, 1981; McRae, Spivey-Knowlton, & Tanenhaus, 1998). In fact, given their clear and extensive reliance on parallel distributed processing, we think it makes perfect sense to speak of localist PDP models alongside distributed ones. (p. 289)

That is, they argue that the PDP approach is not in fact committed to distributed representations. Elsewhere they write:

In fact, the approach takes no specific stance on the number of units that should be active in representing a given entity or in the degree of similarity of the entities to which a given unit responds. Rather, one of the main tenets of the approach is to discover rather than stipulate representations (p. 286)

So on this view, the PDP approach does not rule out the possibility that a neural network might actually learn localist grandmother cells in the appropriate training conditions.
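The idea of a "localist PDP model" in the first quoted passage, with localist units that still settle through parallel, graded interaction, can be sketched roughly as follows. This is a deliberately simplified competition among word units, loosely in the spirit of interactive activation and not the published IA equations. The function `settle`, its parameters (`steps`, `rate`, `inhibition`), and the evidence values are assumptions made for illustration.

```python
def settle(evidence, steps=50, rate=0.2, inhibition=0.5):
    """Let localist word units compete in parallel until activations settle.

    evidence: dict mapping each word unit to its bottom-up support (0 to 1).
    All units are updated simultaneously from the previous state. Each unit
    is excited by its own evidence and inhibited by the summed activation of
    its competitors (hypothetical, simplified update rule).
    """
    act = {word: 0.0 for word in evidence}
    for _ in range(steps):
        total = sum(act.values())
        updated = {}
        for word, current in act.items():
            net = evidence[word] - inhibition * (total - current)
            value = current + rate * (net - current)  # move part-way toward net input
            updated[word] = min(1.0, max(0.0, value))  # keep activation in [0, 1]
        act = updated
    return act


if __name__ == "__main__":
    # Hypothetical input that fits "dog" best, "dot" partly, "cat" hardly at all.
    final = settle({"dog": 0.9, "dot": 0.6, "cat": 0.1})
    for word, activation in sorted(final.items(), key=lambda kv: -kv[1]):
        print(f"{word}: {activation:.2f}")
```

Each word here is a single unit, yet the outcome depends on every unit influencing every other at once. That is one reading of how processing can be parallel and interactive even when the representations are localist.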

With this as background, I would be interested in people’s views on this. Here is my question:

Are PDP theories of cognition committed to the claim that knowledge is coded in a distributed rather than a localist format? [see new survey]

Thanks for your thoughts,

Jeff

References

Bowers JS (2009). On the biological plausibility of grandmother cells: implications for neural network theories in psychology and neuroscience. Psychological Review, 116 (1), 220-251 PMID: 19159155

Bowers JS (2010). More on grandmother cells and the biological implausibility of PDP models of cognition: a reply to Plaut and McClelland (2010) and Quian Quiroga and Kreiman (2010). Psychological Review, 117 (1) PMID: 20063980

Plaut, D., & McClelland, J. (2010). Locating object knowledge in the brain: Comment on Bowers’s (2009) attempt to revive the grandmother cell hypothesis. Psychological Review, 117 (1), 284-288 DOI: 10.1037/a0017101

7 comments:

Greig de Zubicaray said...

Hi Jeff,

Thanks for the interesting post. I touched on some of these issues in relation to neuroimaging in my 2006 article in B & C, citing both your own work and Mike Page's BBS position paper. So, I think PDP theories are not committed to that claim. Similarly, most localist models incorporate some form of distributed processing, so I don't see the localist approach as being strict either.

best wishes,

Greig

Page, M. (2000). Connectionist modelling in psychology: A localist manifesto. Behavioral and Brain Sciences, 23, 443–467. http://journals.cambridge.org/action/displayAbstract?aid=65457

de Zubicaray, G. (2006). Cognitive neuroimaging: cognitive science out of the armchair. Brain and Cognition, 60, 272-281. http://dx.doi.org/10.1016/j.bandc.2005.11.008

Marc Joanisse said...

I'll go out on a limb here and say: I do think actual neural coding happens in a distributed sense, to the extent that at the very least a) groupings of cells will be used to code individual perceptuomotor or cognitive categories; and b) that at least a subset of these cells are likely re-used for coding other categories; and c) the degree of similarity among categories can be captured as the degree of similarity in the activity of neurons used to code them.

I'm sure that Jeff has much to say on all these points, but to the extent that what I'm saying is right, this is what I believe to be true. The issue is the extent to which a PDP model must use distributed coding to capture the phenomena at hand. In that sense I don't think distributed coding is the sine qua non of the PDP enterprise. Rather, what is distributed is the connections among these different neurons, and indeed this is where much of the work (i.e., processing) is being done.

A broader comment that is still relevant to this: I've always felt that it's dangerous to judge all connectionist/PDP enterprises equally. Researchers vary widely in their commitment/adherence to any of the basic tenets of parallel distributed processing. Even within the original PDP volumes (McClelland & Rumelhart & the PDP research group, 1986, MIT Press) there is considerable variability in how distributed coding is used or not. For instance, the TRACE-II model (now known as TRACE) codes word identities using unique units that could be thought of as lexical entries or grandmother cells; in contrast, the McClelland & Rumelhart past tense model encodes words as assemblies of Wickelfeatures, which is an arguably distributed position-sensitive coding system for phonology (the same system is used in the above-mentioned Seidenberg & McClelland 1989 reading model, though this was updated in the Plaut et al study referenced in the original post).

More tangentially, there can be some confusion about whether PDP is a framework or a theory. I see it as a framework for implementing a theory, and consequently individual models need to be assessed on their own merits (including what representational coding scheme they use).

Coming back to the original question, the appropriateness of localist versus distributed features hinges on whether it represents a theoretical claim, or a simplifying assumption that is made in the interest of computational simplicity and interpretability. In the TRACE model example above, they use distributed coding for acoustic information, but localist coding for lexical units. The motivation is presumably that nothing about the phenomena being simulated hinges on whether the model is able to code degrees of similarity in the lexical/semantic domain, whereas it is critical that it code acoustic-phonetic similarity among phonemes. The fact that, by doing so, it abstracts away from semantic relatedness does not invalidate it with respect to its claims about the sublexical phenomena it was designed to address.

Do we need distributed coding in at least some cases? I think we do, in the sense that many explanations for cognitive phenomena turn on the idea of degrees of similarity (for an exploration of this with respect to semantic memory, see Cree & McRae, 2003). That said, is it necessary in all cases? I don't think so. Thinking of my own work, we have previously used a connectionist model to examine effects of semantic vs. phonological deficits in aphasia to explain dissociations in these patients' problems with past tense (Joanisse & Seidenberg, 1998). We used a distributed code for phonology but used localist coding for semantics. The broad claim there was that semantic relatedness was perhaps not as important to the phenomena being explored as was phonological similarity. Simplifying assumptions like these are part and parcel of all models; otherwise they wouldn't be models.
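The point about degrees of similarity can be illustrated with a small sketch. The unit patterns below are hypothetical and are not drawn from any of the models mentioned above. The sketch uses cosine similarity as one common way to compare activation patterns; other measures would serve equally well.

```python
import math


def cosine(u, v):
    """Cosine similarity between two activation patterns of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norms


# Hypothetical distributed code: "dog" and "cat" re-use two of the same units,
# while "cup" uses a separate set of units.
DISTRIBUTED = {
    "dog": [1, 1, 1, 0, 0, 0],
    "cat": [1, 1, 0, 1, 0, 0],
    "cup": [0, 0, 0, 0, 1, 1],
}

# Hypothetical localist code: one dedicated unit per item.
LOCALIST = {
    "dog": [1, 0, 0],
    "cat": [0, 1, 0],
    "cup": [0, 0, 1],
}

if __name__ == "__main__":
    for name, code in (("distributed", DISTRIBUTED), ("localist", LOCALIST)):
        print(name)
        print("  dog vs cat:", round(cosine(code["dog"], code["cat"]), 3))
        print("  dog vs cup:", round(cosine(code["dog"], code["cup"]), 3))
```

With these toy patterns the distributed code makes "dog" more similar to "cat" than to "cup", whereas in the localist code every pair of items is equally dissimilar. A localist model can still capture relatedness, but it has to do so through its connections and not through overlap in the codes themselves.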

Jeff said...

Hi Marc, you may be right that the brain relies on distributed coding – the main point of my articles was to point out that localist coding schemes should not be dismissed so quickly. In conversations, I’ve been struck by the extent to which advocates of the PDP approach rely on the biological plausibility of distributed compared to localist coding in support of their view. That underlying assumption makes it difficult to even take localist models seriously, I think. What I show in the papers (I think) is that this widespread assumption is unsafe.

You write:
Even with the original PDP volumes (McClelland & Rumelhart & the PDP research group, 1986, MIT Press) there is considerable variability in how distributed coding is used or not. For instance the TRACE-II model (now known as TRACE) codes word identities using unique units that could be thought of as lexical entries or grandmother cells;

It is very true that almost all PDP models include localist representations – more often they include localist input and output units and learn distributed internal representations (which is considered key to making them PDP models). And sometimes advocates of the PDP approach develop models that include localist word representations. But in both cases, this is considered an implementational convenience. The claim still remains that knowledge at the input and hidden layers should be distributed - it was just easier to use localist codes in the simplified model. Localist modelers, however, take the localist codes as a core feature of their model – not something to be gotten rid of in a more realistic model.

Jeff said...

Hi Greig, thanks for the reference to your article. In it you write:

“At its outset, connectionism made the central claim that knowledge is coded in a distributed manner (Rumelhart, McClelland, & the PDP Research Group, 1986). However, localist representations do emerge in connectionist models, and more recent approaches to connectionist modelling have explicitly incorporated localist representations (see Bowers, 2002 and Page, 2000).”

I agree with you that PDP modelers originally claimed that knowledge is coded in a distributed manner. In fact, this was the key claim that distinguished this type of neural network model from previous models that included (and sometimes learned) localist representations. So while I agree with you that neural network models can learn localist representations, this was not supposed to be the case with PDP models. Here is a nice quote that captures this, I think:

“Every once in a while, by some unknown means, people come up with ideas that change the way we think. I believe that connectionism embodies some genuinely original ideas. In particular, there is a novel way of representing knowledge – in terms of pattern of activation over units encoding distributed representations... The people who developed this framework—Rumelhart, McClelland, Hinton, and the others—have managed to come up with some insights that extend the range of ways of thinking about cognition. And this is leading to new discoveries about the mind and brain (Seidenberg, 1993, p. 234)”

And then there is the term Parallel Distributed Processing – what does the D stand for? I’ve been told that the "distributed" in PDP is a modifier of "processing” (so that the D does not imply a commitment to distributed representations). But if that is the case, what is the difference between “distributed” and “parallel” in PDP? Why not call the distributed distributed processing? If PDP models are now said to be consistent with learning localist or distributed representations, what is the core claim of the approach? Why introduce the term PDP at all? Why not just stick with the term neural network?

So my question to everyone who votes NO to the poll question: Are you using the terms PDP and neural network synonymously? In that case, are the symbolic neural network models by Hummel, ART models by Grossberg, “implementational neural networks” advocated by Pinker, etc. all PDP models (despite the fact that these authors all take their models to be inconsistent with the PDP approach)?

I don’t think this is just a terminological issue. There are long ongoing debates between PDP modelers and researchers like Coltheart, Pinker, Besner, Hummel, Grainger, Davis, etc., and it is important that everyone be clear about what the debate is about. One thing it is not about is whether the brain is a neural network – everyone agrees that it is. Everyone also agrees that neural networks learn (and can learn both localist and distributed representations). Coltheart, Pinker, Besner, Hummel, Grainger, Davis, etc., all disagree with the PDP approach, because they think this approach rejects localist (and symbolic) representations. Are they wrong about this – should they also be described as PDP modelers (just PDP modelers who believe in localist as opposed to distributed representations)? It is more productive to have an argument if everyone agrees on the terms.

Marc Joanisse said...

Hi Jeff, I appreciate your response and perspective.

For me it all boils down to the idea that PDP is a framework for implementing and testing theories, and not a theory in itself. The up-side is that people who purport to disagree with connectionism might find something to like in symbolic-based approaches - in a similar way that I've found many things to like in current theories of generative grammar that admit stochastic rather than strictly deterministic rules.

Greig de Zubicaray said...

Hi Jeff,

I see what you are getting at. However, I agree with Marc. I think that PDP is a framework for modelling, and we need to examine the explicit assumptions instantiated in the models, otherwise we could just as easily ask why many localist/symbolic models incorporate forms of distributed processing.

Pavan said...

Hi there,

Thanks for the illuminating discussion. I have yet to look up the various papers cited here, and I am an outsider to the field who may be unfamiliar with the jargon, but I have given the issue some thought. Thoughts in no particular order:

It seems to me that the "parallel" in PDP refers to distribution in time, whereas "distributed" refers to distribution in space.

I do agree with Marc that "PDP" (or distributed spatiotemporal representation) is a framework and not a theory. If it were a complete theory, then we would just be explaining away grandmother cells with grandmother networks, i.e. we would instantiate a new network which takes on a distributed spatiotemporal representation for each and every new sensory/cognitive phenomenon or memory representation. This would not be a theory with predictive power, but just an arbitrary look-up table of networks and functions. I feel that neuroimagers should be careful of this eventuality.

Another question I have is: have connectionist modelers tried to explain with their models how unified percepts arise from distributed representations? Presumably, a localist/GMcell approach does not suffer from this issue. Or do scientists believe this is a completely independent problem, with models having no bearing on such emergence?

Any feedback or pointers are welcome.

Thanks and best regards,
Pavan