The WordSimilarity-353 Test Collection contains two sets of English word pairs along with human-assigned similarity judgements. The collection can be used to train and/or test computer algorithms implementing semantic similarity measures (i.e., algorithms that numerically estimate similarity of natural language words).
The first set (set1) contains 153 word pairs along with their similarity scores assigned by 13 subjects. The second set (set2) contains 200 word pairs, with their similarity assessed by 16 subjects. Subjects' names have been replaced by ordinal numbers (1..13, or 1..16) to protect their privacy; identical numbers in the two sets do not necessarily correspond to the same individual.
All the subjects in both experiments possessed near-native command of English. Their instructions were to estimate the relatedness of the words in pairs on a scale from 0 (totally unrelated words) to 10 (very much related or identical words). The precise instructions are available in file instructions.txt inside the ZIP archive (see section "Availability and usage" below).
Each set provides the raw scores assigned by each subject, as well as the mean score for each word pair. For convenience, a combined set (combined) is provided that contains a list of all 353 word pairs, along with their mean similarity scores. The combined set is merely a concatenation of the two smaller sets.
All sets (set1, set2 and combined) are available in two formats:
The first two columns in each file contain word pairs, followed by a column with the (floating-point) mean score of the subjects' individual assessments. In set1 and set2 there are additional columns with individual subjects' scores (one column per subject). In the general case, all scores are floating-point, although many appear as integers.
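As a minimal sketch of reading a file with the column layout described above, the following Python snippet parses two words, a mean score, and any per-subject scores from each row. The comma delimiter, the presence of a header row, the header names, and the function name read_pairs are assumptions made for illustration; the sample rows are invented and are not taken from the data set.

```python
import csv
import io
from statistics import mean

# Toy rows in the layout described above. The words and scores are invented
# for illustration and are NOT taken from the data set. The header row and
# its column names are assumptions.
SAMPLE = """\
word1,word2,mean,s1,s2,s3
alpha,beta,7.00,7,8,6
gamma,delta,2.50,3,2,2.5
"""


def read_pairs(handle, delimiter=",", has_header=True):
    """Return a list of (word1, word2, mean_score, subject_scores) tuples."""
    reader = csv.reader(handle, delimiter=delimiter)
    if has_header:
        next(reader, None)  # skip the header row (assumed to be present)
    rows = []
    for record in reader:
        if not record:
            continue  # ignore blank lines
        word1, word2 = record[0], record[1]
        mean_score = float(record[2])
        # Any remaining columns are per-subject scores (set1 and set2 only).
        subject_scores = [float(value) for value in record[3:]]
        rows.append((word1, word2, mean_score, subject_scores))
    return rows


pairs = read_pairs(io.StringIO(SAMPLE))
for word1, word2, mean_score, subject_scores in pairs:
    # When per-subject columns exist, the mean column should agree with
    # their average, up to rounding.
    if subject_scores:
        assert abs(mean(subject_scores) - mean_score) < 0.01
    print(word1, word2, mean_score)
```

To read a real file, an open file handle can be passed in place of the io.StringIO object, with delimiter="\t" for tab-separated input. All scores are converted with float() because many values appear as integers in the files.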
Note: set1 includes, among others, all the 30 noun pairs from G.A. Miller and W.G. Charles, "Contextual correlates of semantic similarity", Language and Cognitive Processes, Vol. 6, No. 1, 1991, pp. 1-28 (although similarity scores have been obtained anew).
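As a rough illustration of testing a similarity measure against the mean human scores, one common choice is to compute a rank correlation between the algorithm's scores and the human means. The sketch below uses Spearman's rank correlation, which is an assumption here and not something this page prescribes. The function names and the toy numbers are hypothetical and do not come from the data set.

```python
def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions
        for position in range(start, end + 1):
            ranks[order[position]] = shared
        start = end + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (non-constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented numbers for illustration only: mean human scores on the 0..10
# scale and the output of a hypothetical similarity measure for four pairs.
human_means = [9.0, 7.5, 3.0, 1.0]
predicted = [0.8, 0.9, 0.2, 0.1]

rho = spearman(human_means, predicted)
print(f"Spearman correlation: {rho:.3f}")
```

A rank correlation only compares orderings, so the algorithm's scores do not need to lie on the same 0 to 10 scale as the human judgements. The sketch does not guard against constant input, for which the correlation is undefined.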
Download the data set as a ZIP file:
If you publish results based on this data set, please cite as
Lev Finkelstein, Evgeniy Gabrilovich, Yossi Matias, Ehud Rivlin, Zach Solan, Gadi Wolfman, and Eytan Ruppin, "Placing Search in Context: The Concept Revisited", ACM Transactions on Information Systems, 20(1):116-131, January 2002.
Please also inform your readers of the current location of the data set:
http://www.gabrilovich.com/resources/data/wordsim353/wordsim353.html

The WordSimilarity-353 Test Collection by http://www.gabrilovich.com/resources/data/wordsim353/wordsim353.html is licensed under a Creative Commons Attribution 4.0 International License.
Eneko Agirre et al. proposed to split the WordSimilarity-353 collection into two datasets, one focused on measuring similarity, and the other one on relatedness. The data is available here: http://alfonseca.org/eng/research/wordsim353.html
If you have questions or comments, please email me at gabr@cs.technion.ac.il.
If you are using the WordSimilarity-353 Test Collection and want your article(s) listed here, please email me at gabr@cs.technion.ac.il.