Victoria Rubin's Research Interests


Natural Language Processing

Within NLP, my research contributions fall into five themes: modeling certainty, emotions, credibility, trust, and deception. What unites these concepts is that each of these complex human behaviors is, at least partly, expressed through language. The language in each case exhibits subtle but discernible properties that can be identified and traced computationally. I have been investigating and modeling these five concepts, and developing methodologies for acquiring appropriate and reliable cues, first manually and then computationally using NLP techniques.

  1. Modeling Certainty
     I empirically analyzed a writer's (un)certainty, or epistemic modality, as a linguistic expression of the estimated likelihood of a proposition being true. I proposed an analytical framework for certainty categorization and used it to describe how explicitly marked certainty can be predictably and dependably identified in newspaper article data (Rubin, 2007). The certainty identification framework serves as a foundation for a novel type of text analysis that can enhance question answering, search, and information retrieval capabilities (Rubin, Kando, & Liddy, 2004; Rubin, Liddy, & Kando, 2005). I have contextualized this framework for user interactions with information retrieval (IR) systems in the information-seeking context, drawing on linguistic awareness of the levels of epistemic modality that are inherently expressed in texts (Rubin, in press). Certainty identification is part of a new and exciting direction in IR, NLP, and text mining concerned with exploring subjective, attitudinal, and affective aspects of texts (Shanahan, Qu, & Wiebe, 2005).

  2. Linguistic Expression of Emotions
     Emotional coloration is another subtle but discernible text property of interest for automated identification. Rubin, Stanton, and Liddy (2004) used an affect classification model from social and personality psychology to conduct a web-based evaluation of perceived emotions in online personal diaries (blogs), and analyzed inter-rater agreement per emotion category and emotion rating strength. The study provided empirical data that linked emotions to linguistic clues identified by a large number of non-expert raters, and validated the model with respect to agreement on the perceived structure of emotions.

  3. Credibility Assessment
     In Rubin and Liddy (2006), with partial automation in mind, we defined a framework for assessing blog credibility, consisting of 25 indicators in four main categories: blogger's expertise and offline identity disclosure; blogger's trustworthiness and value system; information quality; and appeals and triggers of a personal nature. The framework contributes to product review-quality determination (Pang & Lee, 2008). Weerkamp and de Rijke (2008) described how to estimate several of the proposed indicators and how to integrate them into their retrieval approach; using the TREC Blog track test set, they showed that combining credibility indicators significantly improves retrieval effectiveness. Another manuscript that investigates the connection between trust, credibility, and trustworthiness is currently under review.

  4. Trust and Distrust Rhetoric
     Rubin (2009) formulated a new challenge in sentiment analysis and opinion mining: distinguishing trust, distrust, and hypocritical use of trust rhetoric with ulterior motives, such as an attempt to manipulate readers or gain trustworthiness. Textual indicators were analyzed and brought into an analytical model of trust incident accounts. Twelve information extraction frame components were proposed, including trustor and trustee, source, textual clue, trust valence, reasons, actions, trustor-trustee relationship, narrow context, and broad domain. The study drew a cross-disciplinary theoretical bridge from the trust literature in social science and information technology to opinion mining, and emphasized the value of understanding trust in longer-term social relations.

  5. Deception Detection
     This research agenda is driven by the idea that publicly available messages in computer-mediated communication can yield cues of explicit or implicit awareness of deception. A pilot study (manuscript in preparation) tested whether search queries based on common words people use about lying (e.g., mislead, cheat, phony) could uncover meaningful deception cues in various domains such as insurance, law enforcement, personal relations, and politics. A more comprehensive study, aimed at creating an ontology of deception cues and an automated deception detection system (to enhance and augment human abilities), is in the proposal stage and under review with funding agencies.
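As a rough illustration of the query-based approach in the deception pilot study, the following Python sketch pairs seed words with domains to form search queries and then keeps the sentences that mention a seed word. It is a minimal sketch under stated assumptions, not the study's actual procedure: only the three seed words and four domains named above come from the text, while the function names, the quoted query format, and the prefix-matching rule are hypothetical.

```python
import re
from itertools import product

# Seed words and domains are the examples named in the text; the pilot
# study's full lists are not given, so these are illustrative only.
SEED_WORDS = ["mislead", "cheat", "phony"]
DOMAINS = ["insurance", "law enforcement", "personal relations", "politics"]


def build_queries(seed_words, domains):
    """Pair every seed word with every domain as a quoted query string.

    The quoted "word" "domain" format is an assumption, not the study's.
    """
    return [f'"{word}" "{domain}"' for word, domain in product(seed_words, domains)]


def find_cue_sentences(text, seed_words):
    """Return the sentences that contain a word starting with any seed word.

    Prefix matching (e.g. "misleading" for "mislead") is a simplification;
    irregular forms such as "misled" are not caught.
    """
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(w) for w in seed_words) + r")\w*",
        re.IGNORECASE,
    )
    # Naive sentence split on ., ! or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if pattern.search(s)]


queries = build_queries(SEED_WORDS, DOMAINS)
sample = "The agent was misleading us. The weather was fine. He is a phony!"
cues = find_cue_sentences(sample, SEED_WORDS)
print(len(queries), "queries;", len(cues), "candidate cue sentences")
```

Sentences retrieved this way would only be candidates; deciding whether they carry meaningful deception cues remains a manual or further computational step.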

For my future research, I plan to investigate the connection between subtle text properties and information credibility, and to broaden my research to other linguistic, social, and cultural factors that contribute to establishing or undermining the credibility of text and other digital media information such as images. I am interested in answering questions such as: How do information users recognize and evaluate epistemic content and linguistic expression of biases, opinions, overstatements, information authenticity, and overall quality? What role do users' prior expectations and world knowledge play in establishing information credibility? How is information credibility carried across contexts, languages, and cultures?
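The information extraction frame described under Trust and Distrust Rhetoric can be pictured as a simple record type. The Python sketch below is an assumption-laden illustration, not the published model: it includes only the ten components named in the text (the remaining two of the twelve are not listed here and are left out), and the field types, the valence labels, and the example sentence are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TrustIncidentFrame:
    """One trust incident account, using only the components named in the text."""

    trustor: str                      # who extends (or withholds) trust
    trustee: str                      # who is (dis)trusted
    source: str                       # where the account was found
    textual_clue: str                 # the wording that signals trust or distrust
    trust_valence: str                # e.g. "trust" or "distrust" (labels assumed)
    reasons: List[str] = field(default_factory=list)
    actions: List[str] = field(default_factory=list)
    relationship: Optional[str] = None    # trustor-trustee relationship
    narrow_context: Optional[str] = None
    broad_domain: Optional[str] = None


# Invented example sentence, for illustration only:
# "I no longer trust my insurer after they denied the claim."
frame = TrustIncidentFrame(
    trustor="I",
    trustee="my insurer",
    source="blog post",
    textual_clue="no longer trust",
    trust_valence="distrust",
    reasons=["they denied the claim"],
    relationship="customer-provider",
    broad_domain="insurance",
)
print(frame.trust_valence, "->", frame.trustee)
```

Keeping reasons and actions as lists allows one account to record several of each, and the optional fields can stay empty when a text does not state them.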

Information Science and Technology

More specifically, I have studied library classification and categorization. Based on a study of kinship terminology across 14 non-English languages and cultures, Dr. Kwasnik and I identified structural shifts that occur in the process of translating classification schemes (Kwasnik & Rubin, 2004). Using the representations of kinship terms in the Library of Congress and the Dewey Decimal Classifications as examples, we suggested that possible mapping errors could be reduced if translation processes were modified to preserve the inter-term relations.

I have also been interested in using textual access mechanisms to retrieve images. Goodrum (2003) articulated that image queries exhibit a greater level of specificity than requests for textual materials. Dr. Goodrum and I considered the cognitive processing that occurs after an image need is identified in the mind but before it is expressed verbally. I hypothesized that two mental representation modes, imagistic and propositional (Paivio, 1986), are activated. Discrepancies may occur between the specificity of the mental image fixations and the vagueness of the abstract thought language, resulting in an inability to translate both accurately and completely into words.

Two of my students, Lynne Thorimbert and Yimin Chen, and I reviewed two types of natural language interaction systems (text-based and embodied conversational agents) for libraries and information centers. We surveyed 20 Canadian library websites (10 academic and 10 public) to determine the extent to which broader language technologies are offered in sample library systems. This paper is targeted toward practicing librarians and information professionals, as we argue that it is timely and important to consider adopting conversational agents in electronic library and information settings.


References (for self-citations see Publications):

  • Goodrum, A. 2003. Image Intermediation: Visual Resource Reference Services for Digital Libraries. In R. D. Lankes, S. Nicholson, & A. Goodrum (Eds.), The Digital Reference Research Agenda: Publications in Librarianship, Association of College and Research Libraries. Chicago, IL.
  • Paivio, A. 1986. Mental Representations: A Dual Coding Approach. Oxford: Oxford University Press.
  • Pang, B., & Lee, L. 2008. Opinion Mining and Sentiment Analysis. Foundations and Trends in Information Retrieval, 2(1-2): 1-135.
  • Shanahan, J. G., Qu, Y., & Wiebe, J. (Eds.). 2005. Computing Attitude and Affect in Text: Theory and Applications (The Information Retrieval Series). Springer-Verlag New York, Inc.
  • Weerkamp, W., & de Rijke, M. 2008. Credibility Improves Topical Blog Post Retrieval. In Proceedings of the North American Association for Computational Linguistics - Human Language Technologies, 923-931.

Copyright © 2015, Victoria Locktionova Rubin. All rights reserved.
Last Edited: 16 December 2011