Machine Learning Empowered Database Systems

Machine learning has been applied across computing fields, e.g., computer vision, natural language processing, and bioinformatics, where researchers aim to provide solutions that exhibit useful learning behavior autonomously. Undoubtedly, the field of data management is no exception: there has been a flurry of research efforts over the past few decades exploring the use of machine learning to automatically choose database indexes, update query optimizer plans, and materialize database views, among other tasks. Although these efforts demonstrated the important role machine learning can play in improving the performance of database operations, they remain limited trials that have not yet explored the full power of machine learning, because each was designed to learn the behavior of a specific function within a single database component.

The goal of this project is to envision a more holistic approach: an end-to-end machine learning empowered database system that custom-tailors its performance to user workloads and data distributions. The core components of database systems, e.g., data access methods, the query optimizer, query scheduling, and query execution, can be fully replaced with learned components. We have explored using machine learning to improve data access and processing methods ([ICNT@ICNP'24], [CIDR'21], [AIDB@VLDB'20]), multi-query scheduling ([SIGMOD'22]), hashing ([VLDB'23], [AIDB@VLDB'21]), and in-memory join processing ([VLDB'23]).
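
To make the learned-component idea concrete, below is a minimal sketch of a learned index: a linear model predicts where a key sits in a sorted array, and a bounded local search corrects the prediction. This is an illustrative toy under simplifying assumptions, not the design from the cited papers.

    import bisect

    class LearnedIndex:
        """Toy learned index: a linear model predicts a key's position in a
        sorted array; a bounded local search fixes the prediction error."""

        def __init__(self, keys):
            self.keys = sorted(keys)
            n = len(self.keys)
            # Fit position ~ slope * key + intercept by least squares.
            mean_k = sum(self.keys) / n
            mean_p = (n - 1) / 2
            cov = sum((k - mean_k) * (i - mean_p) for i, k in enumerate(self.keys))
            var = sum((k - mean_k) ** 2 for k in self.keys) or 1.0
            self.slope = cov / var
            self.intercept = mean_p - self.slope * mean_k
            # The worst-case prediction error bounds the search window.
            self.err = max(abs(self._predict(k) - i) for i, k in enumerate(self.keys))

        def _predict(self, key):
            return int(self.slope * key + self.intercept)

        def lookup(self, key):
            pos = self._predict(key)
            lo = max(0, pos - self.err)
            hi = min(len(self.keys), pos + self.err + 1)
            i = bisect.bisect_left(self.keys, key, lo, hi)
            return i if i < len(self.keys) and self.keys[i] == key else None

    idx = LearnedIndex([3, 8, 21, 34, 55, 89, 144])
    assert idx.lookup(55) == 4      # model predicts near slot 4, local search confirms
    assert idx.lookup(7) is None    # absent keys are rejected after the bounded search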

Open Source Code: Learned Hashing [GitHub], Learned Joins [GitHub], Learned Spatial Indexes [GitHub]
Awards: [NSF/CRA Computing Innovation Fellowship (CiFellow)'20-'23]
Media: [MIT News Article] [Amazon Science Post]


Diagnosing Large Systems using Causal Inference

Production computer systems exhibit enormous complexity and a recurring need to diagnose phenomena quickly. However, they are observed only imperfectly, often via long, messy, semi-structured logs. Meanwhile, causal inference can quantify cause-effect relationships in domains as varied as medicine, economics, and public policy.

The goal of this project is to accelerate the debugging of large systems by applying causal inference over logs, enabling engineers to diagnose problems and assess interventions in a principled manner. The proposed framework achieves this through two human-in-the-loop modules: (1) the Candidate Cause Ranker, through which one can determine the causes of a variable without running a full causal discovery algorithm; and (2) the Interactive Causal Graph Refiner, which helps engineers compute an unbiased estimate of an effect of interest without extensive manual causal graph verification. Both modules are powered by the insight that only part of the system's causal graph is needed to correctly quantify an effect of interest. In addition, the framework provides a data preparation pipeline that transforms raw, messy, real-world logs into an appropriate tabular input for causal inference, using methods drawn from data transformation, cleaning, and extraction ([VLDB'25], [SIGMOD'24], [GUIDE-AI@SIGMOD'24]).
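
To make this concrete, here is a minimal sketch of the two steps under heavily simplified assumptions: a regex turns hypothetical semi-structured log lines into tabular rows, and a stratified (backdoor-style) adjustment over one assumed confounder estimates the effect of a binary variable on latency. The log format, variable names, and tiny causal graph are all invented for illustration.

    import re
    from collections import defaultdict

    # Hypothetical log format; real logs and variables will differ.
    LOG = """\
    2024-05-01T10:00:01 req_type=read cache=on latency_ms=12
    2024-05-01T10:00:02 req_type=read cache=off latency_ms=30
    2024-05-01T10:00:03 req_type=write cache=on latency_ms=55
    2024-05-01T10:00:04 req_type=write cache=off latency_ms=60
    2024-05-01T10:00:05 req_type=read cache=off latency_ms=28
    2024-05-01T10:00:06 req_type=write cache=on latency_ms=50
    """

    PATTERN = re.compile(r"req_type=(\w+) cache=(\w+) latency_ms=(\d+)")

    def parse(log):
        """Data preparation: semi-structured lines -> tabular rows."""
        rows = []
        for line in log.splitlines():
            m = PATTERN.search(line)
            if m:
                rows.append({"req_type": m.group(1),
                             "cache": m.group(2) == "on",
                             "latency": int(m.group(3))})
        return rows

    def adjusted_effect(rows, confounder="req_type"):
        """Effect of `cache` on latency, stratifying on one assumed
        confounder (backdoor adjustment over a tiny assumed graph).
        Assumes every stratum has both treated and untreated rows."""
        strata = defaultdict(lambda: {True: [], False: []})
        for r in rows:
            strata[r[confounder]][r["cache"]].append(r["latency"])
        n = len(rows)
        effect = 0.0
        for groups in strata.values():
            weight = (len(groups[True]) + len(groups[False])) / n
            diff = (sum(groups[True]) / len(groups[True])
                    - sum(groups[False]) / len(groups[False]))
            effect += weight * diff
        return effect

    print(f"estimated effect of cache on latency: {adjusted_effect(parse(LOG)):+.1f} ms")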

Open Source Code: [GitHub]
Awards: [Best Paper Award in GUIDE-AI Workshop at SIGMOD'24]


Optimizing Video Selection Queries With Commonsense Knowledge

Video selection queries, which select videos that contain target objects, are crucial in video analytics: they enable precise filtering and retrieval of relevant video content from extensive datasets. Advances in neural networks allow us to detect the objects in an image, and thereby let query systems examine the content of a video. Unfortunately, neural network-based approaches have long inference times, so processing this type of query with a standard scan would be time-consuming and would apply complex detectors to numerous irrelevant videos. It is tempting to improve query times by computing an index in advance, but many frames will never be beneficial for any query, and time spent processing them, whether at index time or at query time, is simply wasted computation.

The goal of this project is to propose a novel index mechanism that optimizes video selection queries with commonsense knowledge (i.e., fundamental information about the world, such as the fact that a tennis racket is a tool designed for hitting a tennis ball). To save computation, a lossy index can be created intentionally, but this may result in missed target objects and suboptimal query-time performance. Our mechanism addresses this issue by constructing probabilistic models from commonsense knowledge to patch the lossy index, and then prioritizing predicate-related videos at query time ([VLDB'24], [VLDB'23]).
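
The sketch below illustrates the patching idea on invented data: a lossy index stores only the objects detected at build time, a commonsense co-occurrence model scores how strongly an indexed object implies the queried one, and the query plan runs the expensive detector on index hits first and then on the highest-scoring remaining videos. All labels, probabilities, and index contents are hypothetical.

    # Lossy index: video id -> objects detected when the index was built.
    INDEX = {
        "v1": {"person", "tennis racket"},
        "v2": {"person", "car"},
        "v3": {"dog", "frisbee"},
        "v4": {"person"},
    }

    # Commonsense model: P(target present | related object indexed);
    # e.g., a tennis racket strongly implies a nearby tennis ball.
    COOCCUR = {
        ("tennis racket", "tennis ball"): 0.85,
        ("frisbee", "dog"): 0.60,
        ("person", "tennis ball"): 0.05,
    }

    def score(objects, target):
        """P(at least one indexed object co-occurs with target)."""
        miss = 1.0
        for obj in objects:
            miss *= 1.0 - COOCCUR.get((obj, target), 0.0)
        return 1.0 - miss

    def plan(target):
        """Index hits first, then the rest ordered by commonsense score,
        so the expensive detector sees likely videos as early as possible."""
        hits = [v for v, objs in INDEX.items() if target in objs]
        rest = sorted((v for v in INDEX if v not in hits),
                      key=lambda v: score(INDEX[v], target), reverse=True)
        return hits + rest

    print(plan("tennis ball"))  # v1 comes first: the racket implies the ball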

Open Source Code: [GitHub]


SMLN: Adapting Markov Logic Networks (MLN) for Big Spatial Data and Applications

Recently, there has been a proliferation of spatial data produced by devices such as satellites, space telescopes, and medical devices. Various agencies need to analyze these unprecedented amounts of spatial data to extract useful information and support decisions in their applications. Meanwhile, Markov Logic Networks (MLN) have been introduced as an efficient and user-friendly framework for statistical learning and inference. Unfortunately, researchers have not yet taken advantage of these recent advances in MLN to boost the usability, scalability, and accuracy of the spatial machine learning tasks (e.g., spatial regression and spatial-aware knowledge bases) used in these applications.

The goal of this project was to provide the first full-fledged MLN framework with native support for spatial data, called Spatial Markov Logic Networks (SMLN). In particular, SMLN pushes spatial awareness inside the internal data structures and the core learning and inference functionality of MLN, and hence inside all MLN-based machine learning techniques and applications. In this project, we presented three case studies on the efficiency of SMLN: Sya [ICDE'20] [SIGMOD'18 (Demo)], a system for spatial probabilistic knowledge base construction; TurboReg [TSAS'19] [SIGSPATIAL'18], a framework for scaling up spatial autologistic regression models; and Flash [SIGSPATIAL Special'19] [SIGSPATIAL'19 (SRC)] [VLDB'19 (Demo)], a framework for scalable spatial probabilistic graphical modeling. An introduction to the overall SMLN architecture can be found in [PhD@VLDB'19].
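
As a toy illustration of what spatial awareness inside MLN means, the sketch below grounds a two-rule network over three spatial cells, where a weighted "neighboring cells agree" rule plays the role of spatial autocorrelation, and computes an exact conditional probability by enumeration. The rules, weights, and locations are invented for illustration.

    import itertools
    import math

    # Toy ground MLN over three spatial cells; Neighbor is symmetric.
    CELLS = ["a", "b", "c"]
    NEIGHBORS = {("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")}

    def score(world):
        """Sum of weights of satisfied ground formulas; world maps
        cell -> truth value of Rain(cell)."""
        s = 0.0
        for c in CELLS:
            if world[c]:
                s += 0.5          # w1 = 0.5: prior rule Rain(x)
        for x, y in NEIGHBORS:
            if world[x] == world[y]:
                s += 1.5          # w2 = 1.5: spatial rule, neighbors agree
        return s

    # Exact inference by enumeration: P(world) is proportional to exp(score).
    worlds = [dict(zip(CELLS, vals))
              for vals in itertools.product([True, False], repeat=len(CELLS))]

    num = sum(math.exp(score(w)) for w in worlds if w["a"] and w["b"])
    den = sum(math.exp(score(w)) for w in worlds if w["a"])
    print(f"P(Rain(b) | Rain(a)) = {num / den:.2f}")  # near 1: neighbors agree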

Tutorials related to this project: [MDM'21] [ICDE'20] [VLDB'19] [Slides]
Awards: [University of Minnesota Best Dissertation Honorable Mention'21] [University of Minnesota Doctoral Dissertation Fellowship'19-'20] [Gold Medal of Student Research Competition in SIGSPATIAL'19] [Best Paper Nomination in SIGSPATIAL'18]


CRA: Enabling Data-Intensive Applications in Containerized Environments

Common Runtime for Applications (CRA) is a software layer (library) that makes it easy to create and deploy distributed dataflow-style applications on top of resource managers such as Kubernetes and YARN, as well as in stand-alone cluster deployments. Currently, we support stand-alone execution (just deploy an .exe on every machine in your cluster) as well as execution in a Kubernetes/Docker environment. CRA has been used inside Microsoft to build both offline and streaming analytics platforms, such as Quill, and online microservice fabrics, such as Ambrosia. An introduction to the overall CRA architecture can be found in [ICDE'19] [Full Version].
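
To convey the dataflow-style programming model, here is a deliberately tiny, hypothetical sketch of vertices wired by channels. It is not CRA's actual API (CRA is a .NET library); every name below is invented for illustration.

    import queue
    import threading

    class Vertex:
        """Hypothetical dataflow vertex: applies a function to each input
        item and forwards the result to all connected vertices."""

        def __init__(self, name, fn):
            self.name, self.fn = name, fn
            self.inbox = queue.Queue()
            self.outputs = []

        def connect(self, other):
            self.outputs.append(other.inbox)

        def run(self):
            while True:
                item = self.inbox.get()
                if item is None:             # poison pill: propagate shutdown
                    for out in self.outputs:
                        out.put(None)
                    return
                result = self.fn(item)
                for out in self.outputs:
                    out.put(result)

    source = Vertex("parse", lambda line: line.split(","))
    sink = Vertex("count", lambda fields: print(len(fields), "fields"))
    source.connect(sink)
    threads = [threading.Thread(target=v.run) for v in (source, sink)]
    for t in threads:
        t.start()
    source.inbox.put("a,b,c")
    source.inbox.put(None)
    for t in threads:
        t.join()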

Open Source Code: [GitHub]


Efficient Spatial Query Processing/Optimization in MapReduce-based Data Processing Frameworks

In this project, we focused on supporting efficient built-in spatial query processing and optimization inside popular MapReduce-based data processing frameworks. Specifically, I worked toward this goal within two spatial-aware MapReduce-based systems, SpatialHadoop and Sphinx:

  • Optimizing spatial queries in SpatialHadoop: SpatialHadoop is an open-source MapReduce extension to Apache Hadoop designed specifically to work with spatial data; it is used to analyze huge spatial datasets on a cluster of machines. Recently, SpatialHadoop was adopted by the Eclipse Foundation under the name GeoJinni as one of its LocationTech projects. I have been working on extending SpatialHadoop to support optimizing large-scale spatial queries, e.g., spatial join (see the sketch after this list) [SIGSPATIAL'17] [SIGMOD'17 (SRC)].

    Open Source Code: [GitHub]
    Awards: [Selected among Top 10 Finalists of the Student Research Competition in SIGMOD'17]

  • Efficient spatial indexing and query execution in Sphinx: Sphinx is a lightning-fast distributed SQL engine for petabytes of spatial data, based on Cloudera Impala. The main objective is to implement a full stack of spatial data processing, including the query parser, indexer, query planner, and query executor. I have been working on supporting efficient spatial indexing (e.g., grid and R-tree) and query processing inside Sphinx. An introduction to the overall Sphinx architecture can be found in [SSTD'17].

    Open Source Code: [GitHub]
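
The sketch below illustrates the filter-and-refine step of a grid-partitioned spatial join, the kind of operation these systems parallelize across a cluster: both inputs are partitioned by grid cell, candidate pairs are generated per cell, and an intersection test refines them. This single-process toy uses made-up rectangles and is not the actual SpatialHadoop or Sphinx code.

    from collections import defaultdict

    CELL = 10.0  # grid cell size

    def cells(box):
        """All grid cells overlapped by box = (x1, y1, x2, y2)."""
        x1, y1, x2, y2 = box
        for gx in range(int(x1 // CELL), int(x2 // CELL) + 1):
            for gy in range(int(y1 // CELL), int(y2 // CELL) + 1):
                yield (gx, gy)

    def intersects(a, b):
        return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

    def grid_join(r, s):
        """Partition both inputs by cell (map side), then join within each
        cell (reduce side), de-duplicating pairs that span several cells."""
        grid = defaultdict(lambda: ([], []))
        for i, box in enumerate(r):
            for c in cells(box):
                grid[c][0].append(i)
        for j, box in enumerate(s):
            for c in cells(box):
                grid[c][1].append(j)
        out = set()
        for r_ids, s_ids in grid.values():
            for i in r_ids:
                for j in s_ids:
                    if intersects(r[i], s[j]):
                        out.add((i, j))
        return sorted(out)

    R = [(0, 0, 5, 5), (12, 12, 18, 18)]
    S = [(4, 4, 9, 9), (30, 30, 31, 31)]
    print(grid_join(R, S))  # [(0, 0)]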

Other Projects

Device-Free Passive WLAN Localization

In this project, we developed and evaluated accurate multi-entity tracking solutions that use the human body's effect on radio frequency (RF) signals in WiFi environments to infer human presence and location. To achieve this, we investigated recent wireless technologies (e.g., 802.11n) combined with solid machine learning techniques [TMC'15] [WCNC'13] [GLOBECOM'12] [WINTECH'12 (Demo)].


Collaborative Machine Translation Evaluation

In this project, we focused on providing an efficient platform for integrating the automated evaluation of machine translations with human judgments to produce accurate quality estimates for large-scale translations. The resulting platform has been integrated as a web service for Microsoft Translator Hub, serving both public and private translation requests. An introduction to the details of this platform can be found in [Thesis].

Media: [Egypt Newspapers Coverage]


Efficient Semantic-based Recommendation

In this project, we proposed a novel approach that constructs an ontology from Wikipedia's graphs of categories and articles, solving the problems of using traditional ontologies for text analysis in text-based recommendation systems. In addition, we proposed an efficient structure for user profiles that integrates smoothly with the built ontology [ISDA'10].

Awards: [Alexandria University's Best CS Bachelor's Thesis Award in 2010]