PyData Tel Aviv 2022

Constructing and querying data model for online harm
12-13, 15:00–15:30 (Asia/Jerusalem), Track 2

One of the biggest challenges facing online platforms today, especially those with user-generated content, is detecting harmful content and malicious behavior. One reason harmful content detection is so challenging is that it is a multidimensional problem. Items can be in any number of formats (video, text, image, and audio), in any language, and violative in any number of ways, from extreme gore and hate to suggestive or ambiguous nudity or bullying. They are uploaded or shared by a myriad of users, some of whom are trying to circumvent bans.

To build algorithms that analyze and detect this harmful activity at scale, we need a data model that can capture the complexities of this online ecosystem. In this talk, we will discuss how ActiveFence models online content, media, creators, and the users who interact with the content through likes, shares, or comments. Modeling the relationships between these items yields a complex connected graph, and to calculate a score that accurately reflects the probability of harm, we need to be able to query and access all of the relations of any given item. We will dive into the details of the complex and adversarial online space, the ActiveFence data model, and how we abstract the complexity of querying a graph-like data model using traditional SQL queries in PySpark to provide maximum value to our algorithms.
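As a rough illustration of querying a graph-like model with plain SQL, here is a minimal sketch. Everything in it is an assumption made for illustration and not ActiveFence's actual schema or scoring: the `entities` and `relations` tables, the `risk` column, and the helpers `neighbors` and `contextual_score` are hypothetical. The standard-library `sqlite3` module stands in for Spark SQL so the sketch runs without a cluster; in PySpark, a comparable query would be passed to `spark.sql` over registered tables.

```python
import sqlite3

# Hypothetical schema: entities are graph nodes (content, users),
# relations are directed edges (uploaded, liked, commented).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entities (id TEXT PRIMARY KEY, kind TEXT, risk REAL);
CREATE TABLE relations (src TEXT, dst TEXT, relation TEXT);
""")
conn.executemany("INSERT INTO entities VALUES (?, ?, ?)", [
    ("video_1", "content", 0.2),
    ("user_a", "user", 0.9),
    ("user_b", "user", 0.1),
    ("user_c", "user", 0.5),
])
conn.executemany("INSERT INTO relations VALUES (?, ?, ?)", [
    ("user_a", "video_1", "uploaded"),
    ("user_b", "video_1", "liked"),
    ("user_c", "video_1", "commented"),
])


def neighbors(conn, item_id):
    """Return (id, kind, relation, risk) for every entity linked to item_id,
    following edges in both directions."""
    query = """
    SELECT e.id, e.kind, r.relation, e.risk
    FROM relations r JOIN entities e ON e.id = r.src
    WHERE r.dst = :item
    UNION ALL
    SELECT e.id, e.kind, r.relation, e.risk
    FROM relations r JOIN entities e ON e.id = r.dst
    WHERE r.src = :item
    ORDER BY 1
    """
    return conn.execute(query, {"item": item_id}).fetchall()


def contextual_score(conn, item_id):
    """Toy aggregate (an assumption, not a real harm model): the larger of
    the item's own risk and the mean risk of its direct neighbors."""
    own = conn.execute(
        "SELECT risk FROM entities WHERE id = ?", (item_id,)
    ).fetchone()[0]
    rows = neighbors(conn, item_id)
    if not rows:
        return own
    mean_neighbor = sum(row[3] for row in rows) / len(rows)
    return max(own, mean_neighbor)


if __name__ == "__main__":
    for row in neighbors(conn, "video_1"):
        print(row)
    print(contextual_score(conn, "video_1"))
```

The point of the sketch is only that one-hop relations of an item reduce to joins over an edge table, which is why ordinary SQL can serve a graph-shaped model; multi-hop traversals would need repeated joins or recursive queries.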


We propose a data modeling approach suited to the era of social networks. We discuss its pros and cons, how to overcome its drawbacks using traditional SQL queries over Spark, and how we leveraged this data model to make smart, contextual predictions.

Matar Haller leads the Data Group at ActiveFence, where her teams are responsible for the data and algorithms which fuel ActiveFence’s ability to ingest, detect and analyze harmful activity and malicious content at scale in an ever-changing, complex online landscape. Prior to joining ActiveFence, Matar was the Director of Algorithmic AI at SparkBeyond where she worked on developing an automated research engine to extract actionable insights from complex datasets. Matar holds a Ph.D. in Neuroscience from the University of California at Berkeley, where she studied the link between perception and action by recording and analyzing signals from electrodes surgically implanted in human brains. Matar is passionate about expanding leadership opportunities for women in STEM fields and together with her husband is raising three wonderful children who surprise and inspire her every day.

Noam Levy is an experienced Big Data Engineer in the Data Group at ActiveFence, and is part of a team responsible for detection through both real-time and asynchronous inference, data modeling and management, and MLOps infrastructure for the entire R&D organization. Noam is also the Kafka domain expert and is responsible for optimizing the Data Group's cloud costs.