Find adjectives related to noun input
I want to try and determine the characteristics of a user's personality based on the words they input into a search box. Here's an example:
Search term: "computers"
Personality/descriptors detected: analytical, logical, systematic, methodical
I understand that this task is extremely non-trivial. I have used WordNet before, but I'm not sure if it includes adjective clouds for each noun node. Part-of-speech tagging is a beast of its own, so I'm not sure that building my own corpus and searching for adjective term-frequencies that coexist with keywords is the best idea, but I'll explain it below.
I am currently working with a Wikipedia dump, processing each article for term frequency after removing stop words (and, or, of, to, a, etc.). My thought was to search the corpus for co-occurrences of adjectives (using WordNet for POS tagging) and nouns (e.g. the adjective "logical" often co-occurs with the noun "computer") and, based on the relative frequency of the stemmed adjective, judge whether it is semantically related to the noun. The potential applications are immense.
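To make the co-occurrence idea concrete, here is a minimal sketch. It assumes sentences are already tokenized and stop words removed, skips stemming, and uses a hypothetical hard-coded `ADJECTIVES` set where a WordNet lookup would go. The score is just the fraction of an adjective's sentences that also contain the noun; other weightings are possible.

```python
from collections import Counter

# Hypothetical adjective list; in practice this could come from a WordNet lookup.
ADJECTIVES = {"logical", "analytical", "fast", "red"}

def adjective_scores(sentences, noun, adjectives=ADJECTIVES):
    """Score each adjective by the share of its sentences that also contain `noun`."""
    near = Counter()     # sentences containing both the adjective and the noun
    overall = Counter()  # sentences containing the adjective at all
    for tokens in sentences:
        present = set(tokens) & adjectives
        overall.update(present)
        if noun in tokens:
            near.update(present)
    return {adj: near[adj] / overall[adj] for adj in near}

corpus = [
    ["computer", "logical", "machine"],
    ["logical", "computer", "analytical"],
    ["red", "apple"],
    ["red", "computer"],
    ["red", "car"],
]
scores = adjective_scores(corpus, "computer")
```

In this toy corpus "logical" only ever appears next to "computer", so it scores higher than "red", which mostly appears elsewhere. On real data you would probably also want a minimum-count threshold so that rare adjectives don't dominate.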
Another idea is to stem the noun, search for adjectives that begin with that stem, then search for synonyms of that adjective. Example:
Search term: "computers"
Adjectives with stem: computational
The problem is that nouns don't always have adjective forms, and some noun stems will match horribly wrong adjectives. *BAD* example:
Search term: "running" (technically a gerund, but still a noun)
Adjectives with stem: runny
Synonyms: NOT THE WORDS I WANT. Would like to find words like "athletic", "motivated", "disciplined"
Is this something that has been done before? Do you have suggestions regarding how I might approach this? It's almost as if I'm seeking to generate adjective clouds for the "important" words in a document.
EDIT: I realize that there is no "correct" answer to this problem. I will award the bounty to whoever presents a method with the best theoretical potential.
Assuming you have some hefty computational resources to throw at this, I would suggest using something simple like Hyperspace Analogue to Language (HAL) to build up a Term X Term matrix for your dump of Wikipedia. Then, your algorithm could be something like:
- Given a query word/term, find its (HAL) vector.
- For the vector, find the adjective components with the highest weights.
- To do this efficiently, you would probably want to use a dictionary (like WordNet) to preprocess your list of terms (i.e., those extracted by HAL) such that you know (prior to processing queries) which ones could be used as adjectives.
- For each adjective, find the N most similar vectors in your HAL space.
- Optional: You could narrow this list down by looking for words that co-occur across your search terms.
This approach basically trades memory and computational efficiency for simplicity of code and data structures. Yet it should do pretty well for what I think you want. The first two steps give you the adjectives most commonly associated with the query term. The vector similarity in the HAL space (step 3) then gives you words that are paradigmatically related to those adjectives, i.e. words that can roughly be substituted for one another: if you start with an adjective of a certain sort, you should get more adjectives "like it" in terms of its relationship with the query term. That should be a fairly good proxy for the "cloud" you are looking for.
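A rough sketch of the steps above, under some simplifying assumptions: the matrix is symmetric (original HAL keeps preceding and following context separate), weights fall off linearly with distance inside the window, and the similarity search in step 3 is restricted to known adjectives. The function names and the tiny token list are made up for illustration.

```python
from collections import defaultdict
import math

def build_hal(tokens, window=2):
    """Simplified, symmetric HAL-style matrix: closer neighbours get larger weights."""
    space = defaultdict(lambda: defaultdict(float))
    for i, word in enumerate(tokens):
        for d in range(1, window + 1):
            if i + d >= len(tokens):
                break
            other = tokens[i + d]
            weight = window - d + 1  # adjacent words weigh the most
            space[word][other] += weight
            space[other][word] += weight
    return space

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def adjective_cloud(space, query, adjectives, top_k=2, n_similar=2):
    vec = space.get(query, {})
    # Step 2: adjective components of the query vector with the highest weights.
    seeds = sorted((a for a in vec if a in adjectives),
                   key=lambda a: (-vec[a], a))[:top_k]
    cloud = list(seeds)
    # Step 3: for each seed, the N most similar adjective vectors.
    for seed in seeds:
        candidates = [a for a in adjectives if a in space and a != seed]
        candidates.sort(key=lambda a: (-cosine(space[seed], space[a]), a))
        for a in candidates[:n_similar]:
            if a not in cloud:
                cloud.append(a)
    return cloud

tokens = "the logical computer runs an analytical program on a logical computer".split()
space = build_hal(tokens, window=2)
cloud = adjective_cloud(space, "computer", {"logical", "analytical", "runny"})
```

For a full Wikipedia dump the nested dicts would need replacing with a sparse matrix, and the similarity search with some kind of index, but the logic stays the same.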
WordNet doesn't have what you need: it contains (almost) no information about relations between words that aren't synonyms or aren't linked hierarchically (chair->furniture), etc.
Just use OpenNLP (http://opennlp.apache.org) and parse large amounts of text. The OpenNLP parser will detect verb-adjective / noun-adjective pairs in sentences, allowing you to build a relations database. All that is left at that point is to filter the database against a predefined list of adjectives.
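The database and filtering step could look roughly like the sketch below. It does not call OpenNLP; it assumes the parser output has already been reduced to hypothetical (head word, modifier) pairs, and `build_relations` is an illustrative name rather than part of any library.

```python
from collections import Counter, defaultdict

def build_relations(pairs, allowed_adjectives):
    """Count modifiers per head word, keeping only modifiers on the adjective list."""
    db = defaultdict(Counter)
    for head, modifier in pairs:
        if modifier in allowed_adjectives:
            db[head][modifier] += 1
    return db

# Hypothetical pairs, standing in for what a parser would extract from text.
pairs = [
    ("computer", "fast"),
    ("computer", "logical"),
    ("computer", "logical"),
    ("computer", "the"),
    ("runner", "athletic"),
]
relations = build_relations(pairs, {"fast", "logical", "athletic"})
top = relations["computer"].most_common(1)
```

Here `top` holds the most frequent adjective seen with "computer", and "the" is dropped because it is not on the adjective list.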