This GRF project concerns the crowdsourcing methodology for language resource construction and experimental linguistic studies, semantic transparency, and mental lexicon...
Progress: Empirical approaches to the scientific study of language have developed rapidly in the last few decades thanks to the introduction of psychological experiments and electronic corpora. As experimental and measurement tools become more sophisticated, and corpora grow larger and more diverse, new research topics are frequently introduced and exciting discoveries are made. However, despite the success of these two directions, one very basic bottleneck in linguistic research remains: a reasonably representative sample size. Language is an ability shared by thousands, even millions, of speakers. So far, the experimental approach can access the language production data of no more than a few score speakers, while corpus sampling cannot reflect distributional variation across different speakers. Ideally, linguistic studies should be based on data produced by a substantial sample of speakers from different backgrounds. The recent development of crowdsourcing offers a new and unique opportunity to collect linguistic behaviour data from a substantial number of speakers effectively and economically.
The internet has emerged as one of the most dominant media of linguistic communication, yet this medium is under-explored for linguistic research. Crowdsourcing has recently offered efficient tools for mining public opinion and for large-scale collection and annotation of language resources. It allows researchers to collect reliable data on tasks requiring human intelligence from a much larger number of subjects than traditional experimental methods do. This study aims to explore and establish a new research paradigm in the language sciences by applying internet-based tools for crowdsourcing. The overarching goal is to apply research methodologies that enable efficient collection of large-scale felicitous linguistic judgements and/or behaviours. The success of our research will greatly increase the number of subjects in linguistic studies and allow generalizations to be based on the linguistic judgements of a significantly larger number of native speakers. We plan to establish this new paradigm by comparing the data and generalizations obtained from internet crowdsourcing with corresponding studies that use psycholinguistic experiments or corpus-based human annotation. We propose three sets of experiments, concerning the segmentation and transparency of compounds, to generalize over Chinese native speakers' performance in identifying word boundaries and the internal composition of a word. Crowdsourced data, collected through Mechanical Turk (MTurk), will be compared with both annotated corpus data and corresponding psycholinguistic research to establish a theoretical interpretation of the data. This cross-validation approach will not only create synergy among computational, psychological, and linguistic approaches, but will also bring new perspectives to the scientific study of language.
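One way the comparison between crowdsourced segmentation judgements and annotated corpus data could be set up is sketched below. It is a minimal illustration and not the project's own procedure: the majority-vote aggregation, the boundary-level F1 score, the function names, and the example sentence and judgements are all assumptions made for the sake of the example.

```python
from collections import Counter


def boundaries(segmented):
    """Return the character offsets at which word boundaries fall."""
    offsets, position = set(), 0
    for word in segmented[:-1]:  # the end of the last word is not an internal boundary
        position += len(word)
        offsets.add(position)
    return offsets


def majority_boundaries(judgements):
    """Keep only the boundaries marked by more than half of the workers."""
    counts = Counter(b for seg in judgements for b in boundaries(seg))
    return {b for b, n in counts.items() if 2 * n > len(judgements)}


def boundary_f1(predicted, reference):
    """F1 agreement between two sets of boundary offsets."""
    if not predicted and not reference:
        return 1.0
    hits = len(predicted & reference)
    if hits == 0:
        return 0.0
    precision = hits / len(predicted)
    recall = hits / len(reference)
    return 2 * precision * recall / (precision + recall)


# Hypothetical judgements from three workers on one six-character sentence.
workers = [
    ["我", "喜歡", "語言學"],
    ["我", "喜歡", "語言", "學"],
    ["我", "喜", "歡", "語言學"],
]
# Hypothetical corpus annotation of the same sentence.
reference = ["我", "喜歡", "語言學"]

crowd = majority_boundaries(workers)
agreement = boundary_f1(crowd, boundaries(reference))
```

Here only the boundaries that at least two of the three workers marked survive aggregation, and `agreement` measures how closely that consensus matches the corpus annotation. Variation among individual workers, which the project treats as data in its own right, could be examined by scoring each worker's segmentation separately with `boundary_f1`.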
The study has three major impacts: on research methodology in the language sciences, on our understanding of how the concept of word works for the Chinese language, and on how large-scale Chinese language resources can be collected. The first major impact will be to introduce a new research methodology to the language sciences and to establish how its results can be evaluated and interpreted in comparison with previous studies. The availability of internet crowdsourcing tools, such as Mechanical Turk, allows researchers to design tasks that require human intelligence and to gather data from a great number of subjects performing these tasks. This new methodology allows us to gain a more complete picture of the status of the concept of word in Chinese, a still contentious subject among Chinese linguists. Most Chinese speakers are able to identify words given some instruction. However, there is great variation among them as to what constitutes a word. As crowdsourcing tools were developed in an English environment, our study will identify and resolve any issues that arise when they are applied to the Chinese language. Our in-depth linguistic study will also inform future language technology research and applications using crowdsourcing in the Chinese context, and open a new way to construct large-scale annotated language resources for Chinese efficiently.
Name | Affiliation |
---|---|
Chu-Ren Huang | Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University |
Name | Affiliation |
---|---|
Shichang Wang |
The term World Chineses (全球華語), though not as common as World Englishes, is becoming more and more widely used with the increasing popularity of Chinese as a second language and with the Chinese diaspora spreading and growing...
Progress: The term World Chineses (全球華語), though not as common as World Englishes, is becoming more and more widely used with the increasing popularity of Chinese as a second language and with the Chinese diaspora spreading and growing.
The lexical variations among World Chineses, such as region-specific neologisms (new words), meaning variations of the same word, and the use of different words to express the same meaning, are easily observed and often studied. However, such studies are typically based on incidental observations and lack both the coverage and the rigor needed to give a systematic account of the core and variable properties of World Chineses.
The availability of comparable corpora (i.e. two or more corpora with similar topics and coverage) of different variants of Chinese has enabled corpus studies of such lexical variations and opened up many possibilities for research on World Chineses. For instance, the LIVAC synchronic Chinese corpus has generated exciting and comprehensive studies on lexical variations among different Chinese communities, and the recently completed Chinese Gigaword Corpus offers additional possibilities. Yet, to better understand the dynamics of World Chineses and how their variants can overcome their differences to communicate efficiently, we need to study grammatical variations of World Chineses.
In this study, we propose a comparable-corpus-based approach to study grammatical variations among three of the most dominant forms of World Chineses: Mainland, Taiwan, and Hong Kong, with Singapore Mandarin included in some of the studies. An innovative statistical method for automatic comparison and extraction of possible patterns of grammatical variation is adopted. Such variations are then carefully studied by linguists to offer explanatory generalizations and to make possible predictions.
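The project's statistical method is not specified here. As an illustration only of how frequencies of a construction in two comparable corpora can be compared automatically, the sketch below uses the two-term log-likelihood (G2) statistic that is widely used in corpus comparison. The function name and all counts are hypothetical.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-term log-likelihood (G2) for one item's frequency in two corpora.

    freq_* are occurrence counts of the construction; size_* are corpus sizes
    in tokens. Larger values indicate a larger difference in relative frequency.
    """
    total = freq_a + freq_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts of one light verb construction in two corpora of equal size.
score = log_likelihood(freq_a=30, size_a=1000, freq_b=10, size_b=1000)
```

With one degree of freedom, a score above roughly 3.84 corresponds to a difference significant at the 0.05 level. Constructions ranked by such a score would form a candidate list for the linguists' closer study, not a finding in themselves.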
We will focus on three sets of constructions anchored by verbs: light verb constructions, VO compounds, and aspectual markers. All grammatical variations will be documented and generalizations will be given. Ours will be the first such comprehensive study of World Chineses and will shed light on how the typological characteristics of Mandarin Chinese and its unique orthography contribute to and restrict possible variations. We also expect the results of our study to be incorporated in the proposed World Chinese Dictionary.
Name | Affiliation |
---|---|
Chu-Ren Huang | Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University |
Name | Affiliation |
---|---|
Menghan Jiang |
While primary emotions are probably sums of our survival instincts, success in our highly sophisticated society often depends on how we harness our emotions, and address the emotional needs underlying a specific task...
Progress: While primary emotions are probably sums of our survival instincts, success in our highly sophisticated society often depends on how we harness our emotions and address the emotional needs underlying a specific task. This is one of the crucial premises of Roger Martin's book 'The Design of Business: Why Design Thinking Is the Next Competitive Advantage.' By introducing design thinking, Martin incorporated the skills of observing empathy and resolving competing emotional needs as essential to business. The way designers think can rarely be separated from emotion, as each design aims to evoke certain emotions and avoid others. It is hence not surprising that recent developments and applications in the automatic extraction of sentiments attracted a full-page report in the technology section of The New York Times (2009.08.23). Illustrated with a few successful cases, the report showed that sentiment analysis has attracted growing business interest, particularly for online information extraction. Indeed, several companies, such as Scout Labs, Jodange and Newssift, have provided this kind of service. Furthermore, besides specialized searches in areas like e-commerce, sentiment analysis is also beginning to influence general-purpose Web searching.
However, in terms of linguistic and computational theories of emotion, the field is surprisingly underdeveloped. Sentiment computing has succeeded in identifying positive and negative sentiments in a context, but cannot yet reliably identify underlying emotions such as anger, fear, and happiness. Nor is there a comprehensive linguistic theory that predicts the emotion expressed from the words used. Most crucially, theories fail to make explicit the links between emotions and the events that evoke them, and between felt emotions and the events and activities they cause. The current proposal aims to develop a theory predicting the dependencies between emotions and events, based on linguistic cues in context. Our study starts by developing a formal theory of representation, and will then develop a large corpus annotated with rich information based on this theory. Lastly, the annotated data will be used both to verify a qualitative framework and to develop a stochastic model for identifying and classifying emotions and events automatically.
Name | Affiliation |
---|---|
Chu-Ren Huang | Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University |
Name | Affiliation |
---|---|
LEE Yat-mei Sophia |
Name | Affiliation |
---|---|
Hongzhi Xu |
We developed a large and comparatively high-quality emotion corpus, which may be used for emotion computing. The corpus is freely available: click on this project and fill in the form.
Name | Affiliation |
---|---|
Chu-Ren Huang | Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University |
Name | Affiliation |
---|---|
LEE Yat-mei Sophia |
Name | Affiliation |
---|---|
Hongzhi Xu |
A dataset containing semantic relations and metadata, for training and evaluating distributional semantic models. The dataset is freely available at the link.
Name | Affiliation |
---|---|
Chu-Ren Huang | Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University |
Name | Affiliation |
---|---|
Enrico Santus |