Applicable to Problems of Cyber-Security and Cyber-Bullying
If you take a moment to listen to the people around you, to absorb the ebb and flow of their language, you’ll notice that each person has a slightly different way of speaking. Whether it’s how we phrase our thoughts or what words we choose to use, our use of language is an imprint of who we are.
Thamar Solorio’s work on cross-domain authorship analysis will help in cases involving cyber-bullying and security threats.

Thamar Solorio, associate professor of computer science at the University of Houston, has made it her career to study the distinct patterns in which individuals use language. Her area of research, part of a field known as natural language processing, addresses authorship analysis.
Authorship Attribution Predicts Authors of Documents Based on Writing Style
“The problem I am trying to solve is known as authorship attribution: if you are given a document, can you predict who the author is?” said Solorio, who joined the College of Natural Sciences and Mathematics in 2014.
Solorio is addressing one of the big complications of authorship attribution, which is the fact that the way we write varies depending on what we are writing and for whom. A text written to a friend is going to differ from an email written to a boss, which will differ from an essay written for a scholarly audience.
Authorship attribution is often needed in cases of cyber-bullying or security threats, when the author of an anonymous document needs to be determined. Sometimes the goal is to predict characteristics such as age, gender, education level or native language to narrow down the list of probable authors. In other circumstances, a document with an unknown author can be compared against documents of known authors to predict authorship.
Writing Style Varies Based on Intended Audience
“For practical problems such as cyber-bullying, a big problem is that the documents that have undisputed authorship are probably going to come from a different source than the one you are trying to resolve,” Solorio said. For example, the document with an unknown author could be a text message or a social media post, while the samples with known authorship could be homework assignments.
This scenario, in which writing samples from one type of document or genre are used to predict authorship of a different type of document, is known as cross-domain authorship attribution.
“Cross-domain authorship attribution is one of the hardest scenarios you can have,” Solorio said. “There is very little work being done in this area.”
Authors Tailor Writing Style to Fit Expected Conventions
To address the issue of cross-domain authorship attribution, Solorio started from the insight that although people adapt their writing to fit the expected norms of each circumstance, the way they tailor their style will probably still contain traces of their own unique use of language.
“My assumption is that we unconsciously adapt a writing style to the genre in which we are writing, but it’s also true that each genre has its own characteristics,” Solorio said. “I want to map how an author will adapt their writing style to match the characteristics of the genre.”
Rate and Frequency of Word Fragments Predict Authorship in Differing Circumstances
Solorio’s research group modified an algorithm known as structural correspondence learning (SCL) to help solve this cross-domain attribution problem. SCL uses features that behave similarly across multiple domains, called pivot features, to transfer what is learned in one domain to another in natural language processing tasks.
Solorio’s research group looked at sequences of characters, called n-grams, as possible pivot features. N-grams can be related to both style and syntax, as these character strings include features such as prefixes, suffixes and punctuation marks.
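As a minimal sketch of the idea, character n-grams can be extracted by sliding a window over the text, and n-grams that are frequent in both domains can serve as pivot candidates. The function names and the minimum-frequency ranking heuristic below are illustrative assumptions, not Solorio's published method:

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Overlapping character n-grams, keeping spaces and punctuation."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def select_pivot_candidates(source_docs, target_docs, n=3, k=10):
    """Rank n-grams that occur in BOTH domains as candidate pivot features."""
    src = Counter(g for doc in source_docs for g in char_ngrams(doc, n))
    tgt = Counter(g for doc in target_docs for g in char_ngrams(doc, n))
    shared = set(src) & set(tgt)
    # Rank by the minimum count across the two domains, so a pivot
    # must be common in both rather than dominant in just one.
    return sorted(shared, key=lambda g: min(src[g], tgt[g]), reverse=True)[:k]
```

Note that, unlike word tokens, these character windows naturally pick up suffixes (such as "ing "), contractions and punctuation habits.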
What Solorio discovered was that the common feature in the way we write, regardless of intended audience or format, appears to lie not in the words we use but in these word fragments.
“Our research showed that these character strings capture a little bit of everything: they capture word choice (vocabulary) and they also capture a bit of syntax,” said Solorio. “The frequency of these types of character strings and the rate at which they appear are very telling of the author of the document.”
In the future, Solorio hopes to build upon this research and address authorship prediction based on different modalities, such as comparisons between writing samples and speech transcriptions.
In 2014, Solorio was the recipient of the Denice Denton Emerging Leader ABIE Award from the Anita Borg Institute, which recognizes junior faculty members for their high-quality research and significant positive impacts on diversity.
Solorio’s research is funded by a prestigious CAREER award from the National Science Foundation.
- Rachel Fairbank, College of Natural Sciences and Mathematics