by Lansey, Jonathan C & Bukiet, Bruce
Abstract
We study the number of internet search results returned for multi-word queries based on the number of results returned when each word is searched for individually. We derive a model to describe search result counts for multi-word queries using the total number of pages indexed by Google, applying the Zipf power law to the words-per-page distribution on the internet and Heaps' law for unique word counts. Based on data from 351 word pairs, each with exactly one hit when searched for together, and a Zipf law coefficient determined in other studies, we approximate the Heaps' law coefficient for the indexed worldwide web (about 8 billion pages) to be b = 0.52; previous studies used fewer than 20,000 pages. We demonstrate through examples how the model can be used to automatically analyze the relatedness of word pairs, assigning each a value we call "strength of associativity". We demonstrate the validity of our method with word triplets and through two experiments conducted 8 months apart. We then use our model to compare the index sizes of competing search giants Yahoo and Google.

This one is from the "applied computational google fight!" dept. I think it's awesome that you could accurately model the success of searching the internet using multiple keywords. Awesomer still is the result that the model provides word associativity.
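The abstract doesn't reproduce the model itself, but the core idea (predict the joint hit count for a word pair from the individual hit counts, then flag pairs that beat the prediction) can be sketched with a naive independence baseline. The Python below is a rough illustration under that assumption, not the paper's actual formula, which additionally folds in the Zipf and Heaps' law terms; the function names and example numbers are made up, and the 8-billion-page index size is the abstract's rough figure.

```python
# Naive independence baseline for "strength of associativity".
# If two words occurred independently across N indexed pages, the expected
# number of pages containing both would be hits_a * hits_b / N. The ratio
# of the observed joint hit count to that expectation measures association.
# NOTE: illustrative only; the paper's model also uses Zipf's law for word
# frequencies and Heaps' law V(n) ~ K * n^b (with b ~ 0.52 estimated there).

N_INDEXED_PAGES = 8e9  # rough Google index size cited in the abstract


def expected_joint_hits(hits_a: float, hits_b: float,
                        n_pages: float = N_INDEXED_PAGES) -> float:
    """Expected pages containing both words if they occur independently."""
    return hits_a * hits_b / n_pages


def associativity(joint_hits: float, hits_a: float, hits_b: float,
                  n_pages: float = N_INDEXED_PAGES) -> float:
    """Observed-to-expected ratio; values well above 1 suggest the words
    co-occur far more often than chance, i.e. they are 'associated'."""
    return joint_hits / expected_joint_hits(hits_a, hits_b, n_pages)


# Hypothetical hit counts: a strongly associated pair vs. an unrelated one.
print(associativity(joint_hits=2.0e6, hits_a=5.0e6, hits_b=9.0e6))  # ~356
print(associativity(joint_hits=4.0e3, hits_a=5.0e6, hits_b=9.0e6))  # ~0.7
```

With these made-up numbers, the first pair co-occurs a few hundred times more often than independence would predict (strongly associated), while the second sits near 1 (essentially unrelated).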
It's pretty crazy what can be done with this immense database we call 'the internet'. In the near future, we'll figure out ways in which this database can be mined for answers to all of life's questions. It will also gain sentience and become our overlord, and that's ok.
It seems unfortunate, however, that the results (as presented in the abstract) come with no suggestion of why anyone (except maybe big search engine designers) should care. I'm 100% sure there is at least one reason, if not several, why I should care. But it isn't clear what that reason is, and that's a shame.