Our client is a nonprofit that has developed STEM curricula since 1997 for US schools, from pre-K through high school, across all 50 states. The nonprofit's data system held 5.5 million student records, a large number of which were duplicates. Many other student records had been lost or disconnected. At the time, the nonprofit had no efficient way to clear the duplicate records and reconnect the lost student data.
Our goal was two-fold: reconnect the lost student records and remove the duplicates.
The EduSource team was given data backups and used Python code to reconnect the majority of the disconnected student records, populating a new Postgres database. Then, using the Dedupe.io library, the team manually trained a machine-learning algorithm to identify duplicates. The algorithm learned which fields were important for identifying duplicates and then applied that knowledge to the rest of the data. Half a million student records were identified as duplicates with 95% accuracy and were merged automatically. Finally, the team worked on a way to re-run this process on a regular basis so that new student records continue to be de-duplicated.
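To give a feel for what field-based duplicate detection involves, here is a minimal sketch using only the Python standard library. It is not the project's code and it does not use the Dedupe.io API: the field names, the hand-set weights, and the 0.85 threshold are all assumptions made for illustration. In the project described above, the importance of each field was learned during training rather than set by hand.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical fields and weights, chosen by hand for this sketch.
# A trained model would learn how much each field matters instead.
FIELD_WEIGHTS = {"first_name": 0.3, "last_name": 0.4, "birth_date": 0.3}
THRESHOLD = 0.85  # assumed cut-off for treating two records as duplicates


def similarity(a, b):
    """Return a weighted average of per-field string similarity (0 to 1)."""
    score = 0.0
    for field, weight in FIELD_WEIGHTS.items():
        left = (a.get(field) or "").strip().lower()
        right = (b.get(field) or "").strip().lower()
        score += weight * SequenceMatcher(None, left, right).ratio()
    return score


def find_duplicate_clusters(records):
    """Group ids of records whose pairwise similarity meets THRESHOLD."""
    # Union-find: every record starts out as its own cluster.
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    # Compare every pair and join the clusters of pairs that look alike.
    for a, b in combinations(records, 2):
        if similarity(a, b) >= THRESHOLD:
            parent[find(b["id"])] = find(a["id"])

    clusters = {}
    for r in records:
        clusters.setdefault(find(r["id"]), []).append(r["id"])
    # Only clusters with more than one record contain duplicates.
    return [sorted(ids) for ids in clusters.values() if len(ids) > 1]


# Made-up sample data, not taken from the client's system.
records = [
    {"id": 1, "first_name": "Maria", "last_name": "Gonzalez", "birth_date": "2008-03-14"},
    {"id": 2, "first_name": "maria", "last_name": "Gonzales", "birth_date": "2008-03-14"},
    {"id": 3, "first_name": "James", "last_name": "Lee", "birth_date": "2010-11-02"},
]
duplicate_clusters = find_duplicate_clusters(records)
```

In this sample, records 1 and 2 differ only in capitalization and one letter of the last name, so they end up in the same cluster, while record 3 stays on its own. Comparing every pair of records is only workable for small datasets; at the scale of millions of records, a dedicated library that narrows the comparisons to likely candidates is the more realistic choice.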