Wednesday, July 15, 2009

The Irvine Phonotactic Online Dictionary (iPhod)

The Irvine Phonotactic Online Dictionary, or iPhod, was developed in the Hickok Lab by my (now former) grad student Kenny Vaden. iPhod provides values for word frequency, phonotactic probability, neighborhood density, and other measures for a large number of English words, as well as measurements for nonwords. The dictionary is publicly available for research use, either as a download or through the online search that Kenny has recently set up. Check it out at:

www.iphod.com

Kenny has also set up an iPhod blog to provide a forum for questions and future development of the database.

Here is Kenny's more detailed description of what iPhod does:

The Irvine Phonotactic Online Dictionary (IPhOD) is a resource that was developed at UC Irvine in 2003 for research on phonological processing of words and pseudowords. The database can be used to select words and pseudowords in order to control or manipulate sublexical or lexical phonological aspects of stimuli. IPhOD contains 33,432 words and 815,066 pseudowords with Kucera-Francis (1967) word frequencies, CMU Pronouncing Dictionary transcriptions (Weide, 1994), and several values that we derived: phonological neighborhood density, positional probabilities, and second- and third-order phoneme-sequence probabilities. The database is publicly available online to search or download, so other researchers may use it in their studies. If a word or pseudoword is not included in the database, some IPhOD values can be calculated online from input phonological transcriptions. On the website, we describe the motivation for the database, the computations used, and examples of their use in experiments concerned with phonological processes in speech. There is also a blog so users can give us feedback, ask questions, and make suggestions for other interesting phonological measures. http://www.iphod.com
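For readers unfamiliar with these measures, here is a rough Python sketch of two of them, neighborhood density and positional probability, computed over a tiny made-up lexicon. It is not IPhOD's code. The lexicon, the function names, and the unweighted definitions (a neighbor is any entry one phoneme substitution, addition, or deletion away; positional probability is a simple proportion of entries) are all assumptions for illustration. The computations actually used are described on the IPhOD website and may differ, for example by weighting with word frequency.

```python
# Toy lexicon of ARPAbet-style transcriptions without stress markers.
# Illustrative only: this is NOT the IPhOD database or its format.
LEXICON = {
    "cat": ("K", "AE", "T"),
    "bat": ("B", "AE", "T"),
    "cut": ("K", "AH", "T"),
    "at": ("AE", "T"),
    "cats": ("K", "AE", "T", "S"),
    "dog": ("D", "AO", "G"),
}


def is_neighbor(a, b):
    """True if a and b differ by exactly one phoneme substitution,
    addition, or deletion (a common definition, assumed here)."""
    if a == b:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ.
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    # Lengths differ by one: deleting one phoneme from the longer
    # sequence must yield the shorter one.
    shorter, longer = sorted((a, b), key=len)
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))


def neighborhood_density(target, lexicon):
    """Unweighted count of lexicon entries that neighbor the target."""
    return sum(is_neighbor(target, phones) for phones in lexicon.values())


def positional_probability(phoneme, position, lexicon):
    """Proportion of entries long enough to have `position` (0-based)
    whose phoneme at that position equals `phoneme`."""
    eligible = [p for p in lexicon.values() if len(p) > position]
    if not eligible:
        return 0.0
    return sum(p[position] == phoneme for p in eligible) / len(eligible)


if __name__ == "__main__":
    # A real word and a pseudoword ("gat") as targets.
    print(neighborhood_density(("K", "AE", "T"), LEXICON))
    print(neighborhood_density(("G", "AE", "T"), LEXICON))
    print(positional_probability("K", 0, LEXICON))
```

Because the target is passed as a transcription rather than looked up by spelling, the same functions work for pseudowords that are not in the lexicon.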

4 comments:

Mark Seidenberg said...

This seems like a really useful database. I'm just regretting that the frequency data is from KF. Was there some particular reason for using KF? There are better databases and the problems with the KF sample (too small; 1960s usage) are significant. I have thought about writing a paper with a title like "the corpus from the black lagoon" or "KF must die" (or something genuinely witty) because it just won't go away.

best, Mark Seidenberg

Kenny said...

We originally wanted to replicate effects observed in studies that calculated their measures in this manner. However, criticism of KF in the literature has grown despite its continued brand recognition and use. I could add a different frequency measure, so researchers can select which databases their computations consult. Is there a frequency source that you would recommend?

Best,
Kenny Vaden

marc said...

Hi Kenny,

The Kucera and Francis word frequencies are bad indeed. We have an article on this:

Brysbaert, M. & New, B. (in press) Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods.

(see http://expsy.ugent.be/subtlexus/Brysbaert&NewBehaviorResearchMethods.pdf)

The new SUBTL frequency measure, based on film and sitcom subtitles, explains up to 10% more of the variance in the Elexicon naming and lexical decision times.

You can download excel and text files at http://expsy.ugent.be/subtlexus

We have preliminary data suggesting that the frequency measure does particularly well for spoken word recognition (compared with frequencies based on written text).

All the best, marc brysbaert

Kenny said...

I have been considering replacing KF for some time, so I think I'll go ahead with version 2.0 of IPhOD. I was reading about the SUBTLEXus database, and it sounds like a perfect candidate.

Thanks for the helpful suggestions, Mark and Marc!