Record Linkage on People's Names with Approximate String Matching

(innerjoin.bit.io)

2 points | data_dan_ | 4y ago | 1 comment

1 comment

I wrote this article. For some background:

In March, we published an article on stock trades by members of congressional committees: https://innerjoin.bit.io/data-cant-tell-us-whether-congressi...

To conduct this research, we needed to know (1) which members of Congress made which stock trades and (2) which members of Congress belonged to which congressional committees. The data for (1) was available from the Senate and House Stock Watcher sites; the data for (2) came from the ProPublica Congress API. There was no primary key available for linking the two datasets: the best we had to work with were the names of the members of Congress.

This would be fine if the names were represented uniquely and consistently. That was not the case. You can't join "Mitch McConnell" to "A. Mitchell McConnell, Jr." without a bit of work.

Manually matching every name in the first data source against every name in the second would be tedious, time-consuming, and error-prone. Instead, we used the Levenshtein distance to compute a similarity metric between each name in the first dataset and each name in the second. Simply taking the best match according to this metric correctly matched more than 95% of the names and made it simple to review the list and manually fix the few incorrect matches.
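For readers who want to try the idea, here is a minimal Python sketch of best-match linkage with Levenshtein distance. It is not the article's actual code. The function names, the lowercasing step, and the normalization by the longer string's length are all assumptions made for illustration, and apart from the McConnell pair quoted above, the names are made up.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (two-row dynamic program)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep on a match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for no overlap."""
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def best_matches(left, right):
    """Pair each name in left with its most similar name in right.

    Returns (left_name, right_name, score) tuples, lowest score first,
    so the pairs most in need of manual review come up at the top.
    """
    matches = []
    for name in left:
        best = max(right, key=lambda candidate: similarity(name, candidate))
        matches.append((name, best, similarity(name, best)))
    return sorted(matches, key=lambda match: match[2])


# Hypothetical inputs: only the McConnell pair comes from the comment above.
trade_names = ["Mitch McConnell", "Jane Doe"]
member_names = ["A. Mitchell McConnell, Jr.", "Jane Q. Doe", "John Smith"]

for left_name, right_name, score in best_matches(trade_names, member_names):
    print(f"{score:.2f}  {left_name!r} -> {right_name!r}")
```

Sorting by ascending score is one way to support the manual review step: the weakest matches appear first, so a reviewer can stop once the pairs start looking obviously right. For larger datasets, a compiled library would likely be a better fit than this pure-Python loop.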

There's also an accompanying Deepnote dashboard where you can compare string distances between pairs of strings of your choosing: https://deepnote.com/@dliden-bitdotio/Whats-in-a-Name-28418c...
