PyData Tel Aviv 2022

Tal Erez Hauer

Tal is a Machine Learning Scientist at PayPal, working on horizontal data science infrastructure and solutions. Her current work focuses on clustering solutions for big-data applications, serving multiple business applications in the risk and credit domains. Tal's past work includes advanced sequence processing solutions, graph-based applications and applied research in the cyber domain. She holds a BSc in Industrial Engineering from Ben Gurion University.


Unleash Big Data Clustering: Parallelize DBSCAN over 400M PayPal Records
Tal Erez Hauer

I was about to give up on my DBSCAN clustering solution when I found out how long it would take to train on 400 million records. The density-based clustering algorithm was exactly what we needed at PayPal to solve a few unsupervised anomaly-detection problems, but with runtime growing as O(n^2), it seemed impossible.

This talk describes how we re-implemented DBSCAN for big data by parallelizing it with a graph algorithm, and walks through our solution, which enables clustering 400M records in a few hours.
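
The abstract does not say which graph algorithm is used. One common way to phrase DBSCAN as a graph problem, shown here only as an illustrative sketch and not as the speaker's implementation, is to treat core points as nodes, connect core points that lie within eps of each other, and read clusters off as connected components. The function name `dbscan_via_components`, the brute-force neighbor search, and the union-find step are all assumptions made for this small single-machine example; at scale, the neighbor search and the component search would each need a distributed counterpart.

```python
import math
from itertools import combinations


def dbscan_via_components(points, eps, min_pts):
    """Illustrative DBSCAN phrased as connected components (hypothetical helper)."""
    n = len(points)

    # Step 1: eps-neighborhoods. Brute force here, O(n^2) pairs; each pair is
    # checked independently, which is the part a real system would partition.
    neighbors = [[] for _ in range(n)]
    for i, j in combinations(range(n), 2):
        if math.dist(points[i], points[j]) <= eps:
            neighbors[i].append(j)
            neighbors[j].append(i)

    # Step 2: a core point has at least min_pts points within eps, itself included.
    core = [len(neighbors[i]) + 1 >= min_pts for i in range(n)]

    # Step 3: connected components over core-to-core edges, via union-find.
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i in range(n):
        if core[i]:
            for j in neighbors[i]:
                if core[j]:
                    parent[find(i)] = find(j)

    # Step 4: one cluster id per component; -1 marks noise.
    labels = [-1] * n
    cluster_ids = {}
    for i in range(n):
        if core[i]:
            labels[i] = cluster_ids.setdefault(find(i), len(cluster_ids))

    # Border points (non-core with a core neighbor) join that neighbor's cluster.
    for i in range(n):
        if not core[i]:
            for j in neighbors[i]:
                if core[j]:
                    labels[i] = labels[j]
                    break
    return labels


sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan_via_components(sample, eps=1.5, min_pts=3))
# prints [0, 0, 0, 1, 1, 1, -1]
```

In this framing the neighborhood computation and the component search are separate stages, so each can be distributed on its own; whether the talk's solution follows the same split is not stated in the abstract.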

Track 1