PyData Tel Aviv 2022
Generative Adversarial Networks (GANs) are a class of unsupervised learning models well known for their ability to generate new images, videos, or text, but they can also be applied to a much wider range of use cases.
In this talk I will present how we used GANs at Dell to predict users' next activities on Dell’s website, and also cover the fundamentals of GANs and their various applications for those less familiar with them. You should join this talk if you want to learn the basics of GANs and a less conventional way to apply them to a business use case.
I was about to give up on my DBSCAN clustering solution when I found out how long it takes to run on 400 million records. The density-based clustering algorithm was exactly what we needed at PayPal to solve a few unsupervised anomaly-detection problems, but with an O(n^2) runtime it just seemed impossible.
The talk will introduce how we re-implemented DBSCAN for big data by parallelizing it with a graph algorithm, and will walk through our solution, which enables clustering 400M records in a few hours.
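To make the graph reformulation concrete: DBSCAN clusters can be recovered as the connected components of the graph whose nodes are core points and whose edges connect core points within eps of each other. The neighbor search and the component merge are exactly the steps that parallelize well. The sketch below (a toy single-machine version with assumed parameters, not the speakers' actual PayPal implementation) shows the idea with a union-find over the core-point graph:

```python
def dbscan_graph(points, eps=1.5, min_samples=3):
    """Toy DBSCAN via graph connected components (illustrative, O(n^2) here)."""
    n = len(points)
    dist = lambda a, b: sum((p - q) ** 2 for p, q in zip(a, b)) ** 0.5
    neigh = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
             for i in range(n)]                  # eps-neighborhoods (include self)
    core = {i for i in range(n) if len(neigh[i]) >= min_samples}

    # Union-find: clusters are connected components of the core-point graph.
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]        # path halving
            i = parent[i]
        return i
    for i in core:
        for j in neigh[i]:
            if j in core:
                parent[find(j)] = find(i)

    labels, roots = [-1] * n, {}                 # -1 marks noise
    for i in core:
        labels[i] = roots.setdefault(find(i), len(roots))
    for i in range(n):                           # attach border points to a core neighbor
        if labels[i] == -1:
            for j in neigh[i]:
                if j in core:
                    labels[i] = labels[j]
                    break
    return labels
```

At scale, the neighbor lists would come from a distributed spatial join and the component merge from a parallel connected-components algorithm, which is presumably the shape of the solution the talk describes.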
What if I told you everyday recommendation systems can be utilized to detect unwanted behaviors? Now, what if I told you they can also be harnessed to prevent internal security violations in organizations? Well, it’s happening. Kind of neat, right?
Synthetic data generation is now a popular and important addition to standard image augmentation methods and to real-life data acquisition and labeling. We will show how the unique, metadata-driven nature of synthetic data pipelines enables new methods to improve AI explainability, accuracy, and training times.
Imagine you’re conducting a salary survey with the goal of training a model to predict salaries. Cool, right? Not if you don’t handle user privacy… How can we make sure the collected data can’t be used to identify the users, while still being able to properly train our model?
In this session, we’ll eat the cake and leave it whole: we’ll use a lesser-known model called Deming regression to handle our anonymized data, and it’ll achieve quality similar to a model trained on the private data! And it will all be live-coded, starting from an empty Jupyter notebook. Join the fun ;)
Does the indirect protection of the vaccine bias vaccine effectiveness (VE) estimations?
SARS-CoV-2 vaccines provide high protection against infection to the vaccinated individual and indirect protection to their surroundings by blocking further transmission. Divergent results have been reported on the effectiveness of the SARS-CoV-2 vaccines. Here, we argue that this divergence arises because the analyses did not consider indirect protection. Using a novel heterogeneous infection model (implemented in Python) and real-world data, we demonstrate that heterogeneous vaccination rates among families and communities, both spatially and temporally, as well as the study design used, may significantly skew the VE estimations.
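A deterministic toy illustration of the skew (assumed numbers, not the speakers' model or data): per-exposure VE is 80% everywhere, but because vaccinated people cluster in the community with high coverage, and high coverage suppresses circulation, the naive pooled estimate comes out inflated.

```python
TRUE_VE = 0.8  # assumed per-exposure vaccine effectiveness, identical everywhere

communities = [  # (population, vaccination coverage, baseline attack rate)
    (1000, 0.9, 0.10),   # high coverage -> little circulating virus
    (1000, 0.1, 0.50),   # low coverage  -> lots of circulating virus
]

def attack_rates():
    """Pooled attack rates among vaccinated and unvaccinated individuals."""
    cases_v = people_v = cases_u = people_u = 0.0
    for pop, cov, base in communities:
        nv, nu = pop * cov, pop * (1 - cov)
        people_v += nv
        people_u += nu
        cases_v += nv * base * (1 - TRUE_VE)  # vaccinated risk is reduced
        cases_u += nu * base
    return cases_v / people_v, cases_u / people_u

rv, ru = attack_rates()
naive_ve = 1 - rv / ru   # pooled estimate, ignoring community structure
```

With these numbers the pooled estimate is about 0.94 versus a true per-exposure VE of 0.80, while stratifying by community recovers 0.80 exactly; the talk's heterogeneous model presumably quantifies this effect with real spatial and temporal data.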
Ok, I lied, I still write tests. But instead of the example-based tests we normally write, have you heard of property-based testing? With Hypothesis, instead of thinking about what data to test against, the library generates test data, including boundary cases, for you.
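The core idea can be sketched without the library (Hypothesis itself does this far more cleverly, with strategies, automatic edge-case generation, and shrinking of failing inputs): instead of asserting on hand-picked examples, you state a property that must hold for *all* inputs and throw generated data at it. A stdlib-only sketch with an assumed run-length-encoding example:

```python
import random

def encode(s):
    """Run-length encode: 'aaab' -> [('a', 3), ('b', 1)]."""
    out = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out

def decode(pairs):
    return "".join(ch * n for ch, n in pairs)

def check_roundtrip_property(trials=200):
    """Property: decode(encode(s)) == s for every string s."""
    rng = random.Random(42)
    cases = ["", "a", "aa"]                      # hand-picked boundary cases
    cases += ["".join(rng.choice("ab") for _ in range(rng.randrange(20)))
              for _ in range(trials)]            # generated cases
    for s in cases:
        assert decode(encode(s)) == s, f"round-trip failed for {s!r}"
```

In real Hypothesis code, `check_roundtrip_property` would shrink to a `@given(st.text())`-decorated test, and a failing case would be automatically minimized before being reported.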
Interactivity, traceability, transparency and efficiency are becoming increasingly important, yet challenging, in today’s data-rich analysis applications. Data analysis pipelines are often heavily parametrized, and we lack good ways to trace which specific parameters affect a focal downstream result, or to evaluate the effects of changing parameters in ways that are interactive, transparent and computationally efficient.

In the talk, we will introduce “Quibbler” - a new open-source, pure-Python package for building inherently interactive, yet traceable, transparent and efficient data analysis applications. Founded on a data-flow paradigm, Quibbler allows processing data through any series of analysis steps, while automatically tracking the functional relationships between downstream results and upstream parameters. Quibbler facilitates and embraces human interventions as an inherent part of the analysis pipeline: input parameters, as well as algorithmic exceptions and overrides, can be specified interactively, and any such interventions are automatically recorded and documented. Changes to upstream parameters propagate downstream, pinpointing which specific data items, or even slices thereof, are affected, thereby avoiding vast amounts of unnecessary recalculation. Importantly, Quibbler does not require learning any new programming syntax; it integrates seamlessly into standard Python analysis code.

We are just launching Quibbler as an open-source project and are eager to see it used and integrated within a range of data science applications. We are, of course, also looking for feedback, suggestions and help.
Jupyter Notebooks have seen enthusiastic adoption among the data science community to become the default environment for research.
But, are Jupyter Notebooks really the best home for data scientists to develop production-ready projects? The non-linear workflow, lack of versioning capabilities, limited IDE integration, and inadequate debugging tools make it laborious to productionize a project created in a Jupyter Notebook environment.
Should we just throw our Jupyter Notebooks out the window and move to classic IDEs? Probably not – Jupyter Notebooks are, after all, a great tool that gives us superhuman abilities. We can, however, be more production-oriented when using them. How does this look in practice? That is exactly what we'll cover in this talk.
From summarizing distributions to pooling operators and measuring the goodness of fit, averaging plays a unique and universally recognized role in the field of machine learning. In the talk we present the Generalized Average (GA) - a continuous and fully differentiable average operator that allows for flexible interpolation between min, max and three different types of averages: arithmetic, geometric and harmonic. We share the results of two lines of experiments: (1) using GA in hyperparameter tuning for false-positive-averse cases (e.g. fraud detection) and (2) using GA as a pooling operator in Graph Attention Networks to improve the model’s flexibility. Finally, we present an open-source Python package with our implementation of GA. The talk is addressed to machine learning practitioners who are interested in enriching their toolbox.
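The classical power mean has exactly the interpolation behavior the abstract describes (the speakers' GA operator may differ in details, so treat this as an assumed stand-in): a single exponent p sweeps continuously from min through harmonic, geometric and arithmetic means up to max.

```python
import math

def power_mean(xs, p):
    """Power (generalized) mean of positive numbers.
    p=1: arithmetic; p=-1: harmonic; p->0: geometric (taken as the limit);
    p->+inf: max; p->-inf: min. Monotonically increasing in p."""
    n = len(xs)
    if math.isinf(p):
        return max(xs) if p > 0 else min(xs)
    if p == 0:                                   # limit p -> 0 is the geometric mean
        return math.exp(sum(math.log(x) for x in xs) / n)
    return (sum(x ** p for x in xs) / n) ** (1 / p)
```

In a tuning or pooling setting, p becomes one more (differentiable) parameter: pushing p toward -inf makes the aggregate dominated by the worst element, a natural fit for false-positive-averse objectives like fraud detection.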
One of the biggest challenges facing online platforms today and especially those with user-generated content is detecting harmful content and malicious behavior. One of the reasons harmful content detection is so challenging is that it is a multidimensional problem. Items can be in any number of formats (video, text, image, and audio), any language, and violative in any number of ways, from extreme gore and hate to suggestive or ambiguous nudity or bullying, and are uploaded or shared by a myriad of users (some of which are trying to circumvent being banned).
In order to be able to build algorithms that analyze and detect this harmful activity at scale, we need a data model that can capture the complexities of this online ecosystem. In this talk, we will discuss how ActiveFence models the online content, media, creators, and users that interact with the content through likes, shares, or comments. Modeling the relationships between these items yields a complex connected graph, and in order to calculate a score that accurately reflects the probability of harm, we need to be able to query and access all of the relations of any given item. We will dive into the details of the complex and adversarial online space, the ActiveFence data model, and how we abstract the complexity of querying a graph-like data model using traditional SQL and PySpark queries to provide maximum value to our algorithms.
A step-by-step introduction to purchase prediction. Also applicable to survival analysis and churn prediction. Including implementation in PySpark.
When gathering data to train a ML model, the common belief is ‘the more the merrier’. In reality though, individual data samples may have varying effects on the learning process. How can we automatically measure the contribution of samples towards learning, and what can we do with it?
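One simple way to measure a sample's contribution (an assumed baseline; the talk may cover more sophisticated methods such as Shapley-based data valuation) is leave-one-out valuation: a sample's value is how much held-out accuracy drops when it is removed from the training set. A toy sketch with a 1-nearest-neighbour classifier on 1-D data:

```python
def predict_1nn(train, x):
    """1-nearest-neighbour on (feature, label) pairs with scalar features."""
    return min(train, key=lambda t: abs(t[0] - x))[1]

def accuracy(train, test):
    return sum(predict_1nn(train, x) == y for x, y in test) / len(test)

def loo_values(train, test):
    """Leave-one-out value of each training sample: the drop in held-out
    accuracy when that sample is removed. Harmful (e.g. mislabeled)
    samples get negative values."""
    base = accuracy(train, test)
    return [base - accuracy(train[:i] + train[i + 1:], test)
            for i in range(len(train))]
```

A mislabeled training point typically gets a negative value (removing it *improves* accuracy), which is exactly the kind of signal one can act on: down-weight, re-label, or drop such samples.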
How to analyze the extremes of a distribution to predict probabilities and magnitudes of events outside the observed range. Presenting the theoretical framework of Extreme Value Theory with examples from a case study - prediction of the 50-year wind speed across Israel.