Using SQL/Python to build probability distribution
Suppose the following jobs table:
`jobpost` - name - position - is_featured (boolean)
I would like to build a list of suggested jobs for a given user, where jobpost.position matches the user's position (for example, an accountant would receive jobs in accounting).
The basic query to accomplish this would be something like:
SELECT name FROM jobpost WHERE jobpost.position IN (list of user positions) LIMIT 10
I also want to make sure that jobs that are featured (is_featured=True) receive extra weight. Then I need to build a probility distribution list from which a random number of jobs would be selected. For this I was thinking of building a python list of tuples, with the job name and probability, and then using random.random(). For example, something like (in pseudocode):
x = [('job 1', 0.2), ('job 2', 0.2), ('job 3', 0.2), ('job 4', 0.4)] # pick three out of the list of jobs above random.random.sample(x, 3)
I have three questions related to this:
Does this seem like the right approach?
How would I use the random module (or another one) to select n number of objects with each object having a certain given probability?
In terms of giving a weighted average to a featured job over a non-featured job, would the following query be the correct approach? If not, what would be a better way?
SELECT name, 1 * (CASE WHEN is_featured=True THEN % ELSE 1) as weighted_average FROM ...
This would give me tuple with the job name and the relative weight.
I'd recommend using SQLAlchemy and make use of Python's syntax, not SQL's. But to solve your problem, here's what I'd try:
import random x = [('job 1', 0.2), ('job 2', 0.2), ('job 3', 0.2), ('job 4', 0.2), ('job 5', 0.4)] def random_choice(jobs_list, number=1): jobs = jobs_list[:] choice = random.uniform(0, 1) total = 0 while total < number: index, job_item = random.choice(list(enumerate(jobs))) job, weight = job_item if weight > random.uniform(0, 1): yield job jobs.pop(index) total += 1 print list(random_choice(x, 3))
Just as a side note, in order for your "weights" to follow a discrete PDF, they need to add up exactly to 1.0.
Truthfully, I don't think this "weighting" is a proper solution to the problem. You don't have a proper PDF to create the discrete "weight" for a given job. Instead, why not just have separate categories for jobs?