PyData Tel Aviv 2022

Unleash Big Data Clustering: Parallelize DBSCAN over 400M PayPal Records
12-13, 10:30–11:00 (Asia/Jerusalem), Track 1

I was about to give up my DBSCAN clustering solution when I found out how long it takes to train it with 400 million records. The density-based clustering algorithm was exactly what we needed at PayPal to solve a few unsupervised anomaly-detection problems, but when runtime hits O(n^2) it just seemed impossible.

The talk will introduce how we re-implemented DBSCAN for big data by parallelizing it using a graph algorithm, and walk through our solution which enables clustering of 400M records in a few hours.

Among clustering algorithms, DBSCAN is the pioneer of density-based clustering techniques which can discover clusters of arbitrary shapes, don’t require the number of clusters in advance and handle outliers effectively.

Our challenge began when we realized that it is practically impossible to train DBSCAN with sklearn on 400M records. Looping over such a large scale was indeed optimistic considering O(n^2) complexity, however the density-based algorithm could only be valuable in identifying fraud-trends and anomalies when trained on the entire dataset.

This challenge, instead of blocking the project, granted us with the unique opportunity to open the hood of the DBSCAN algorithm and re-implement it. The combination of the project importance, the lack of robust open source code, and the availability of senior engineers made us take the untypical decision to create a new implementation from scratch that can deal with large scale.

It turned out that the missing piece for the re-implementation was parallelizing the global clustering component using a graph algorithm. The Connected Components algorithm was the one that allowed us to solve the puzzle and create a fully-distributed version of DBSCAN that trains on hundreds of millions of records in a few hours.

If you are into clustering, distributed algorithms, or curious about accelerating algorithms performance and capabilities - join this talk!

Tal is a Machine Learning Scientist at PayPal, working on horizontal data science infrastructures and solutions. Her current work focuses on clustering solutions for big-data applications, serving multiple business applications in the risk and credit domains. Tal's past work includes advanced sequence processing solutions, graph-based applications and applied research in the cyber domain. She holds a BSc in Industrial Engineering from Ben Gurion University.