A growing literature shows that children are highly sensitive to statistical features of their linguistic input (Saffran et al. 1996; Gomez & Gerken 2000). This literature assumes that children can reliably encode all of the information available in this input, an assumption at odds with the observation that children are sometimes sensitive to features of their language out of proportion with their statistical reliability. This contrast highlights the difference between the input, the linguistic information available in the environment, and the intake, the portion of the input the child actually uses. The current paper explains this difference, as it appears in noun classification, by introducing uncertainty in the detection of certain features.

Many languages classify nouns according to grammatical gender. Cross-linguistically, these noun classes correlate with both semantic and phonological features of nouns. For example, in Tsez, a Nakh-Dagestanian language of the Northeast Caucasus with 4 noun classes, semantic features such as natural gender (of humans) and animacy (of nonhumans) are very reliable predictors of noun class. Phonological features, such as the first segment of the noun, can also predict noun class but do so less reliably. Behavioral experiments have shown that adults and children are sensitive to these semantic and phonological regularities and can use them to classify novel nouns. Surprisingly, while adults behave in line with the statistical reliability of the cues in question, 4- to 7-year-olds prefer to use phonological features, rather than the more predictive semantic features, when the two types make conflicting predictions (Gagliardi & Lidz, under review). Here we present a Bayesian model of noun classification to show how this behavior might arise from simple misperception of semantic features.

We propose that children's classification patterns for novel nouns are not random, but instead reflect children's beliefs about the features on the nouns in their lexicon. One possibility is that semantic features are more difficult to perceive than phonological features, so that children misperceive semantic cues. Thus, when they try to estimate the predictiveness of a semantic cue, they have sparse or distorted data from which to make this estimate. We test this hypothesis by making a formal link between the feature counts in a child's lexicon and classification behavior through a Bayesian model.
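To make the link between feature counts and classification concrete, the following is a minimal sketch rather than the model used in this paper. It assumes a naive Bayes classifier with add-alpha smoothing, which the text does not specify. The function names (`class_posterior`, `thin_feature`), the two classes "A" and "B", and all counts are hypothetical illustrations, not Tsez data. Misperception is approximated, again as an assumption, by keeping only a fraction of the counts for one feature.

```python
def class_posterior(feature_counts, class_sizes, observed, alpha=1.0):
    """Posterior over noun classes for a novel noun (naive Bayes sketch).

    feature_counts: {class: {feature: {value: count}}} from the lexicon
    class_sizes:    {class: number of nouns in that class}
    observed:       {feature: value} perceived on the novel noun
    alpha:          add-alpha smoothing constant (an assumption)
    """
    total = sum(class_sizes.values())
    scores = {}
    for cls, size in class_sizes.items():
        score = size / total  # prior P(class)
        for feat, val in observed.items():
            counts = feature_counts[cls][feat]
            # Smoothed likelihood P(value | class) from the stored counts.
            score *= (counts[val] + alpha) / (
                sum(counts.values()) + alpha * len(counts)
            )
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}


def thin_feature(feature_counts, feature, keep):
    """Return a copy of the counts in which only a fraction `keep` of one
    feature's observations was encoded (hypothetical misperception)."""
    return {
        cls: {
            feat: {val: n * keep if feat == feature else n
                   for val, n in vals.items()}
            for feat, vals in feats.items()
        }
        for cls, feats in feature_counts.items()
    }


# Hypothetical toy lexicon: the semantic cue is the more reliable predictor.
TOY_SIZES = {"A": 100, "B": 100}
TOY_COUNTS = {
    "A": {"semantic": {"present": 95, "absent": 5},
          "phonological": {"present": 20, "absent": 80}},
    "B": {"semantic": {"present": 5, "absent": 95},
          "phonological": {"present": 80, "absent": 20}},
}
# Novel noun with conflicting cues: semantics favors A, phonology favors B.
NOVEL = {"semantic": "present", "phonological": "present"}

full = class_posterior(TOY_COUNTS, TOY_SIZES, NOVEL)
sparse = class_posterior(thin_feature(TOY_COUNTS, "semantic", 0.02),
                         TOY_SIZES, NOVEL)
```

Under these toy counts, a learner with the full counts favors the class predicted by the semantic cue, while a learner who encoded only a small fraction of the semantic observations ends up favoring the class predicted by the phonological cue, because smoothing pulls the sparsely attested semantic likelihoods toward uniform.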