Next up in our geek-out over the current issue of the Proceedings of the National Academy of Sciences: a fascinating paper on advances in unsupervised learning of languages.
Unsupervised learning of natural languages
We address the problem, fundamental to linguistics, bioinformatics, and certain other disciplines, of using corpora of raw symbolic sequential data to infer underlying rules that govern their production. Given a corpus of strings (such as text, transcribed speech, chromosome or protein sequence data, sheet music, etc.), our unsupervised algorithm recursively distills from it hierarchically structured patterns. The ADIOS (automatic distillation of structure) algorithm relies on a statistical method for pattern extraction and on structured generalization, two processes that have been implicated in language acquisition. It has been evaluated on artificial context-free grammars with thousands of rules, on natural languages as diverse as English and Chinese, and on protein data correlating sequence with function. This unsupervised algorithm is capable of learning complex syntax, generating grammatical novel sentences, and proving useful in other fields that call for structure discovery from raw data, such as bioinformatics.
Scalable, speedy, and accurate algorithms for unsupervised learning of natural languages have long been a kind of grail quest in computer science, so this paper really caught my eye. The authors claim to have developed an approach (called ADIOS) that generalizes across domains as diverse as bioinformatics and Chinese. (Consider this from the bioinformatics results: "Despite using exclusively the raw sequence information, ADIOS attained classification performance comparable with that of the SVM-PROT system (success rate of 95%).")
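To get a feel for the two ingredients the abstract names, pattern extraction and structured generalization, here is a toy Python sketch. It is emphatically not the ADIOS algorithm: it just counts recurring word sequences and groups words that show up in the same context. The function names (`frequent_ngrams`, `slot_fillers`) and the three-sentence corpus are made up for illustration.

```python
from collections import Counter, defaultdict


def frequent_ngrams(sentences, n, min_count=2):
    """Return word n-grams that occur at least min_count times in the corpus."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - n + 1):
            counts[tuple(words[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_count}


def slot_fillers(sentences):
    """Group words that appear between the same left and right neighbours."""
    fillers = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for i in range(1, len(words) - 1):
            fillers[(words[i - 1], words[i + 1])].add(words[i])
    # Keep only contexts where more than one word fits the slot.
    return {ctx: ws for ctx, ws in fillers.items() if len(ws) > 1}


# Hypothetical toy corpus, not data from the paper.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
    "the cat slept on the mat",
]

patterns = frequent_ngrams(corpus, n=3)  # recurring three-word sequences
classes = slot_fillers(corpus)           # words interchangeable in a context

print(patterns)
print(classes)
```

On this corpus the recurring patterns are "sat on the" and "on the mat", and the interchangeable groups are cat/dog (between "the" and "sat") and sat/slept (between "cat" and "on"). The real system works at a very different scale, of course; the abstract mentions grammars with thousands of rules.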