Edit: Now looking at the dataset, I wouldn't be able to use the model I developed personally.
We've run hundreds of machine learning competitions & offer a real-time leaderboard to encourage competitive participation, a very active community of data scientists, and many other features that simplify running this type of challenge.
The prize is that we'll use your algorithm to validate any matches that you go on. If that doesn't seem worthwhile to you, feel free to pass on this contest.
EDIT: ah fuck it might hack on this dataset anyways--if i get a beer and a date out of it ill help fiddle while rome burns
If this is really a competition (and not just "Here, have fun with our dataset!"), you need to define the rules a little bit more clearly. How are you weighing recall vs. precision? Or are you just looking at % correct labels, where the only two labels possible are "FRIENDS" and "NOT FRIENDS"?
You get 1 point for each friendship that you correctly predict did or did not occur. In the test data set ~50% of pairs became friends, so predicting "everyone became friends" would get 250 points, whereas a perfect algorithm would get 500 points.
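For anyone who wants to sanity-check a submission locally, the scoring rule described above can be sketched in a few lines. This is only an illustration: the `score` function name and the synthetic 500-pair label list are assumptions, not the contest's actual scoring code or data.

```python
def score(predicted, actual):
    """One point for every pair whose predicted label matches the true label."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    return sum(1 for p, a in zip(predicted, actual) if p == a)


# Hypothetical test set: 500 pairs, exactly half of which became friends.
actual = [True] * 250 + [False] * 250

# Baseline that predicts "everyone became friends".
everyone_friends = [True] * len(actual)

print(score(everyone_friends, actual))  # 250
print(score(actual, actual))            # 500
```

Because the labels are roughly balanced, this score is plain accuracy: false positives and false negatives cost the same, so there is no separate precision/recall weighting.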
I'm updating the README now to make our scoring system more clear.
The dominant factor here is going to be the rate at which the participants send and accept connection requests on Facebook. Some people send them to everyone they meet, some people never use Facebook.
KPI overfitting, yay!
(The best second-order effect is probably a multi-feature similarity measure between the participants and the person's current Facebook Friends, including graph distance to current Friends. In case anyone is taking a run at this.)
Developers who are considering participation in this, I'd suggest you build something for yourself with data acquired elsewhere.
People this may be aimed at:
* Experienced devs in boring day-jobs who are seeking some kind of off-time challenge.
* People just getting into ML who want to solve something real.
* CS students with spare time.
You know more about ML than me, but it doesn't sound like they're looking for a cancer cure; just fishing around for a one-off challenge. Or maybe they're taking names for future interview candidates.
> Developers who are considering participation in this, I'd suggest you build something for yourself with data acquired elsewhere.
Relax, dude. If people think this is an interesting problem to solve, what's that to you?
1. How is it collected? From a survey, or grabbed from user FB profiles?
2. What is the platinum albums variable? Maybe the number of platinum albums that the user likes on FB??
3. Why are there some "male" entries in the f_gender column, and some "female" entries in the m_gender column?
Why does everyone have "7.5" - 8 siblings and 7.5 - 8 "weekly workouts" and 7.5 - 8 platinum albums?
- Is that the person's height in inches?
- What does the asterisk in certain column-names indicate?
- Why do the pets, platinum_albums, weekly_workouts, number_of_siblings and pokemon_collected values seem to fall in the range of 7 - 8?
Also, this dataset is far too small. There is a single male-male relationship and that's not going to provide any significant data if we're looking at genders at all.
I would also argue that it's not the best set of metrics to use to determine whether people will become friends. Age and facebook_friends_count might give you some hints, but I seriously doubt that shoe size has as big an impact on the potential for friendship as, say, common interests, shared culture, income class, or other socioeconomic factors.
I(f_facebook_friends_count,members_became_friends) = 0.117320113379
I(m_facebook_friends_count,members_became_friends) = 0.113972809724
I(m_facebook_photos_count,members_became_friends) = 0.0449092782303
I(f_facebook_photos_count,members_became_friends) = 0.0426531483254
I(m_shoe_size,members_became_friends) = 0.00276175766018
I(m_height,members_became_friends) = 0.00255043390135
I(f_shoe_size,members_became_friends) = 0.00233148724025
I(m_age,members_became_friends) = 0.00198005768283
I(f_height,members_became_friends) = 0.0013606978915
I(m_weekly_workouts,members_became_friends) = 0.00123271513215
I(f_age,members_became_friends) = 0.00122660347743
I(m_platinum_albums,members_became_friends) = 0.00111710129455
I(f_number_of_pets,members_became_friends) = 0.00108593667378
I(f_pokemon_collected,members_became_friends) = 0.000880040104571
I(m_number_of_siblings,members_became_friends) = 0.000830295252089
I(f_platinum_albums,members_became_friends) = 0.000820683185117
I(m_number_of_pets,members_became_friends) = 0.000768855827053
I(m_pokemon_collected,members_became_friends) = 0.000720822383999
I(f_weekly_workouts,members_became_friends) = 0.000620666529567
I(f_number_of_siblings,members_became_friends) = 0.00019278884716
I(f_gender,members_became_friends) = 0.000124279429698
I(m_gender,members_became_friends) = 0.000124279429698
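The post doesn't say how the figures above were computed, so the following is only a sketch of one way to get numbers of this kind: plain discrete mutual information in nats, with numeric columns bucketed into equal-width bins first. The `mutual_information` and `bin_values` helpers and the default bin count are assumptions for illustration, and results will differ with other binning or estimators.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information I(X;Y) in nats between two discrete sequences."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    joint = Counter(zip(xs, ys))  # counts of each (x, y) pair
    px = Counter(xs)              # marginal counts of x
    py = Counter(ys)              # marginal counts of y
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) * log(p(x,y) / (p(x) * p(y))), written in terms of counts
        mi += (c / n) * math.log(c * n / (px[x] * py[y]))
    return mi


def bin_values(values, n_bins=10):
    """Equal-width binning of a numeric column (the bin count is an assumption)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / n_bins
    # Clamp the maximum value into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Toy data, not taken from the contest dataset.
friends_count = [10, 20, 900, 1000]
became_friends = [0, 0, 1, 1]
print(mutual_information(bin_values(friends_count, 2), became_friends))
```

A feature that perfectly determines a balanced binary label gives log 2 (about 0.693 nats), and an independent feature gives 0, which puts the values in the list above in perspective.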
The fact that the women are depicted as just three pairs of legs doesn't help, though.