Vocabulary mismatch
Vocabulary mismatch is a common phenomenon in the usage of natural languages, occurring when different people name the same thing or concept differently. It is also known as the vocabulary problem, vocabulary gap, term mismatch, or semantic gap.[1]
Furnas et al. (1987) were perhaps the first to quantitatively study the vocabulary mismatch problem.[2] Their results show that, on average, different people (even experts in the same field) name the same thing differently about 80% of the time, and that there are usually tens of possible names for the same thing. This research motivated the work on latent semantic indexing.
Causes
One source of vocabulary mismatch is differences in inflectional form, such as using the feminine form of a word instead of the masculine form, or a plural instead of a singular.[3] Stemming and lemmatization are two methods of addressing this source by converting all variations of a word to a single form.[3]
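As a minimal sketch of how collapsing inflected forms can remove a mismatch, the following Python example applies a deliberately crude plural stripper to a query and a document. The function names `naive_stem` and `normalize` and the suffix rules are assumptions made for illustration; this is not the Porter stemmer or any published algorithm, and it handles only a few English plural endings.

```python
def naive_stem(word: str) -> str:
    """Collapse a few English plural endings to a shared form (illustrative only)."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # queries -> query
    if w.endswith("es") and len(w) > 3 and w[-3] in "sxz":
        return w[:-2]                # indexes -> index
    if w.endswith("s") and not w.endswith("ss") and len(w) > 3:
        return w[:-1]                # documents -> document
    return w


def normalize(text: str) -> set[str]:
    """Split on whitespace and stem every token."""
    return {naive_stem(tok) for tok in text.split()}


query = "index queries"
document = "indexes query"

# Without normalization the two texts share no surface forms.
raw_overlap = set(query.split()) & set(document.split())

# After stemming, both reduce to the same set of terms.
stemmed_overlap = normalize(query) & normalize(document)

print(raw_overlap)
print(stemmed_overlap)
```

A production system would more likely rely on an established stemmer or lemmatizer; the point here is only that mapping variants to one form lets the query and the document meet on shared terms.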
Vocabulary mismatch also occurs when language changes over time. For example, a doctor may search for papers about "type 1 diabetes mellitus" and not find papers about "juvenile diabetes" due to a change in terminology.[1]
In information retrieval
The vocabulary mismatch between user-created queries and the relevant documents in a corpus causes the term mismatch problem in information retrieval. Zhao and Callan (2010)[4] were perhaps the first to quantitatively study the vocabulary mismatch problem in a retrieval setting. Their results show that an average query term fails to appear in 30-40% of the documents that are relevant to the user's query. They also showed that this probability of mismatch is a central probability in one of the fundamental probabilistic retrieval models, the Binary Independence Model. They developed novel term weight prediction methods that can potentially yield 50-80% gains in retrieval accuracy over strong keyword retrieval models. Further research along these lines shows that expert users can use Boolean conjunctive normal form (CNF) expansion to improve retrieval performance by 50-300% over unexpanded keyword queries.[5]
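The mismatch probability of a query term can be read as the fraction of relevant documents that do not contain the term. The Python sketch below estimates it on a toy set of relevance judgments. The documents, the query, and the function name `mismatch_rate` are invented for illustration and are not taken from Zhao and Callan's experiments; whitespace tokenization is a simplifying assumption.

```python
def mismatch_rate(term: str, relevant_docs: list[str]) -> float:
    """Fraction of relevant documents that do NOT contain `term`."""
    if not relevant_docs:
        raise ValueError("need at least one relevant document")
    missing = sum(1 for doc in relevant_docs if term not in doc.lower().split())
    return missing / len(relevant_docs)


# Hypothetical documents, all assumed relevant to the query below.
RELEVANT_DOCS = [
    "juvenile diabetes treatment in children",
    "type 1 diabetes mellitus insulin therapy",
    "insulin therapy for type 1 diabetes",
    "juvenile onset diabetes and diet",
]
QUERY = ["juvenile", "diabetes"]

rates = {term: mismatch_rate(term, RELEVANT_DOCS) for term in QUERY}
average = sum(rates.values()) / len(rates)

print(rates)
print(average)
```

In this toy data, "juvenile" is missing from half of the relevant documents while "diabetes" appears in all of them, so a system that requires both terms would miss two of the four relevant documents. The complement of this quantity, the probability that a term does occur in a relevant document, is the per-term estimate that the Binary Independence Model needs.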
Mitigation techniques
- Full-text indexing instead of only indexing keywords or abstracts
- Use of controlled vocabularies in both indexing and retrieval, such as taxonomies or ontologies[6]
- Indexing the text of inbound links from other documents (or other forms of social tagging)
- Query expansion. Query expansion may be interactive, meaning the user chooses related words, or automatic, meaning the retrieval system adds extra words to the query without user input.[3] A 2012 study by Zhao and Callan[5] using expert-created manual conjunctive normal form queries showed that searchonym expansion in Boolean conjunctive normal form is much more effective than traditional bag-of-words expansion, such as Rocchio expansion.
- Translation-based models
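To make the contrast between the two styles of query expansion concrete, the following Python sketch compares a Boolean conjunctive normal form query, in which each original query term becomes an OR-group of alternatives and all groups must match, with a bag-of-words expansion that simply counts matching terms. The documents, the expansion terms, and the function names `matches_cnf` and `bag_of_words_score` are assumptions for illustration only; this is not the method or evaluation of Zhao and Callan, and the overlap count is a stand-in for real weighting schemes such as Rocchio.

```python
def matches_cnf(cnf: list[set[str]], doc: str) -> bool:
    """True if every clause (AND) has at least one of its terms (OR) in the document."""
    tokens = set(doc.lower().split())
    return all(clause & tokens for clause in cnf)


def bag_of_words_score(terms: set[str], doc: str) -> int:
    """Number of distinct query terms found in the document (unweighted)."""
    return len(terms & set(doc.lower().split()))


# Hypothetical query "juvenile diabetes", before and after expansion.
UNEXPANDED = [{"juvenile"}, {"diabetes"}]
EXPANDED_CNF = [{"juvenile", "childhood"}, {"diabetes"}]
EXPANDED_BAG = {"juvenile", "childhood", "diabetes"}

on_topic = "childhood diabetes insulin therapy"
off_topic = "juvenile arthritis in childhood"

# The unexpanded query misses the on-topic document because of "childhood".
print(matches_cnf(UNEXPANDED, on_topic))

# CNF expansion recovers it and still rejects the off-topic document,
# which never mentions the second concept.
print(matches_cnf(EXPANDED_CNF, on_topic), matches_cnf(EXPANDED_CNF, off_topic))

# The flat bag gives both documents the same overlap score.
print(bag_of_words_score(EXPANDED_BAG, on_topic), bag_of_words_score(EXPANDED_BAG, off_topic))
```

The design point is that CNF keeps each expansion term attached to the concept it stands in for, so extra synonyms for one concept cannot compensate for the absence of another.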
Other contexts
In software engineering, vocabulary mismatch has been described as a barrier to duplicate issue detection.[7]
References
- [1] Fitzgerald, Kyle Andrew; de la Harpe, Andre Charles; Uys, Corrie Susanna; Bytheway, Andrew John (9 December 2021). "Information retrieval: Solving mismatching vocabulary in closed document collections". South African Journal of Libraries and Information Science. 87 (2).
- [2] Furnas, G.; et al. (1987). "The Vocabulary Problem in Human-System Communication". Communications of the ACM. 30 (11): 964-971.
- [3] Shekarpour, Saeedeh; Marx, Edgard; Auer, Sören; Sheth, Amit (2017). "RQUERY: Rewriting Natural Language Queries on Knowledge Graphs to Alleviate the Vocabulary Mismatch Problem". Thirty-First AAAI Conference on Artificial Intelligence (AAAI-17). p. 1.
- [4] Zhao, L.; Callan, J. (2010). "Term Necessity Prediction". Proceedings of the 19th ACM Conference on Information and Knowledge Management (CIKM 2010). Toronto, Canada.
- [5] Zhao, L.; Callan, J. (2012). "Automatic term mismatch diagnosis for selective query expansion". SIGIR 2012.
- [6] Asim, M. N.; Wasim, M.; Ghani Khan, M. U.; Mahmood, N.; Mahmood, W. (2019). "The Use of Ontology in Retrieval: A Study on Textual, Multilingual, and Multimedia Retrieval". IEEE Access. 7: 21662-21686. doi:10.1109/ACCESS.2019.2897849.
- [7] Chaparro, Oscar; Florez, Juan Manuel; Marcus, Andrian (2016). "On the vocabulary agreement in software issue descriptions". 2016 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE.