|Regret-ratio Minimizing Set|
|Density based Clustering|
|Semi-Automated Framework for Feature Engineering|
|Knowledge Intensive Crowdsourcing|
|Privacy Implications of Database Ranking|
|Query Reranking over Hidden Web Databases|
|Rank Analytics over Hidden Databases|
|Suppressing Sensitive Aggregates over Hidden Web Databases|
|Data Analytics over Hidden Web Databases|
|Collaborative Social Content Mining|
|Skyline over Categorical Domains|
|Assisting Service Providers|
|Approximate Query Processing|
|Ranking and Top-k Queries|
We are being constantly judged by automated decision systems that have been widely criticised for being discriminatory and unfair. Since an algorithm is only as good as the data it works with, biases in the data can significantly amplify unfairness issues. We initiate research on integrating fairness conditions into database query processing and data management systems. Specifically, in this project we focus on selection bias in range queries.
Ordinary Least Squares (OLS) regression is often susceptible to outliers. Quantile regression is an alternative approach that overcomes this challenge. Unfortunately, quantile regression is much slower in practice. In this project, we aim to make quantile regression usable for practical applications by developing innovative scalable algorithms.
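To make the contrast with OLS concrete, the following minimal sketch uses the standard pinball (check) loss that quantile regression minimizes, restricted to an intercept-only model. The function names and the toy data are illustrative assumptions, and the brute-force search is not the scalable algorithm this project develops.

```python
def pinball_loss(residual, tau):
    """Check (pinball) loss of one residual y - prediction at quantile level tau."""
    return tau * residual if residual >= 0 else (tau - 1) * residual


def total_loss(ys, prediction, tau):
    """Total pinball loss of a constant prediction over all observations."""
    return sum(pinball_loss(y - prediction, tau) for y in ys)


def best_constant(ys, tau):
    """Constant that minimizes the total pinball loss.

    For an intercept-only model a minimizer always lies on a data point,
    so it is enough to try each observed value (brute force, O(n^2)).
    """
    return min(sorted(ys), key=lambda c: total_loss(ys, c, tau))


# Toy data with one large outlier (hypothetical values).
ys = [1.0, 2.0, 3.0, 4.0, 100.0]
median_fit = best_constant(ys, tau=0.5)  # quantile regression at the median
mean_fit = sum(ys) / len(ys)             # intercept-only OLS fit is the mean
```

On this toy data the median fit stays among the bulk of the observations, while the OLS fit is pulled towards the outlier.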
Feature selection is an important step in the data science pipeline, and it is critical to develop efficient algorithms for this step. Mutual Information (MI) is one of the important measures used for feature selection: attributes are sorted in descending order of their MI scores, and the top-k attributes are retained. The goal of this work is to develop a new measure that effectively approximates the top-k attributes without actually calculating MI. Calculating this new measure is faster than calculating the actual MI, resulting in a better runtime for feature selection.
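For reference, the baseline that the new measure aims to avoid can be sketched as follows: exact MI between each discrete attribute and the target, followed by a top-k cut. This is only the standard MI computation under the assumption of discrete attributes; the function names and toy columns are hypothetical, and the project's faster measure is not shown.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Exact mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # p(x,y) * log2(p(x,y) / (p(x) p(y))) with probabilities written as counts / n
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def top_k_by_mi(columns, target, k):
    """Names of the k attributes with the highest MI against the target."""
    scores = {name: mutual_information(col, target) for name, col in columns.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy dataset (hypothetical): attribute "a" copies the target, "b" is independent of it.
target = [0, 0, 1, 1]
columns = {"a": [0, 0, 1, 1], "b": [0, 1, 0, 1], "c": [0, 0, 0, 1]}
selected = top_k_by_mi(columns, target, k=2)
```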
Pediatric obstructive sleep apnea (OSA) is a common clinical condition that results from intermittent airway obstruction during sleep. Several daytime symptoms related to OSA, such as cognitive dysfunction, behavioral irregularities, and atypical growth, are frequently reported by parents. Thus, early diagnosis and treatment may prevent potentially irreversible adverse effects. The gold standard approach to diagnose and stratify OSA in children is overnight polysomnography (PSG), which is expensive, potentially inaccessible, and resource-intensive. In our attempt to design an inexpensive and accessible approach to screen children for OSA, we are using machine learning techniques to evaluate the influence of subjects' clinical and demographic variables in diagnosing severe and moderate OSA.
Location based services (LBS) have become very popular in recent years. They range from map services (e.g., Google Maps) that store geographic locations of points of interest, to online social networks (e.g., WeChat, Sina Weibo, FourSquare) that leverage user geographic locations to enable various recommendation functions. The public query interfaces of these services may be abstractly modeled as a kNN interface over a database of two-dimensional points on a plane: given an arbitrary query point, the system returns the k points in the database that are nearest to the query point. In this paper we consider the problem of obtaining approximate estimates of SUM and COUNT aggregates by only querying such databases via their restrictive public interfaces.
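The kNN interface abstraction described above can be modeled in a few lines. This is a brute-force sketch assuming Euclidean distance on a plane; `knn_query` and the sample points are hypothetical names and values, and a real LBS would answer such queries from a spatial index that the caller cannot inspect.

```python
from math import dist


def knn_query(database, query_point, k):
    """Return the k database points nearest to query_point (Euclidean distance)."""
    return sorted(database, key=lambda p: dist(p, query_point))[:k]


# Toy backend database of 2D points (hypothetical values).
points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (6.0, 5.0)]
nearest = knn_query(points, (0.9, 0.9), k=2)
```

An aggregate estimator in this setting only ever sees the return values of calls like `knn_query`, never the full `points` list.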
Finding the maxima of a database based on a user preference, especially when the ranking function is a linear combination of the attributes, has been the subject of recent research. A critical observation is that the convex hull is the subset of tuples that can be used to find the maxima of any linear function. However, in real-world applications the convex hull can be a significant portion of the database, which greatly reduces its benefit as a compact representative. Thus, computing a subset limited to r tuples that minimizes the regret ratio (a measure of the user’s dissatisfaction with the result from the limited set versus the one from the entire database) is of interest. We make several fundamental theoretical as well as practical advances in developing such a compact set.
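As an illustration, the sketch below evaluates the regret ratio of a candidate subset for given linear ranking functions. It assumes the commonly used definition (relative loss in the best score, with non-negative attributes and a positive best score); the function names and toy tuples are hypothetical, and the code only evaluates a given subset rather than finding an optimal one.

```python
def score(tuple_, weights):
    """Linear ranking function: weighted sum of the attributes."""
    return sum(a * w for a, w in zip(tuple_, weights))


def regret_ratio(database, subset, weights):
    """Relative loss in the top score when answering from subset instead of database."""
    best_all = max(score(t, weights) for t in database)
    best_sub = max(score(t, weights) for t in subset)
    return (best_all - best_sub) / best_all


def max_regret_ratio(database, subset, weight_vectors):
    """Worst regret ratio over a finite set of candidate ranking functions."""
    return max(regret_ratio(database, subset, w) for w in weight_vectors)


# Toy database with two attributes (hypothetical values).
database = [(10, 1), (1, 10), (6, 6), (8, 5)]
subset = [(10, 1), (1, 10)]
functions = [(1, 0), (0, 1), (1, 1)]
worst = max_regret_ratio(database, subset, functions)
```

Here the subset answers the two single-attribute functions perfectly, but loses some score for the function that weights both attributes equally.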
The signal reconstruction problem (SRP) is an important optimization problem where the objective is to identify a solution to an underdetermined system of linear equations that is closest to a given prior. It has a substantial number of applications in diverse areas including network traffic engineering, medical image reconstruction, acoustics, astronomy, and many more. Most common approaches for SRP do not scale to large problem sizes. We propose a dual formulation of this problem and show how adapting database techniques developed for scalable similarity joins provides a significant speedup.
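To make the problem statement concrete, here is a small sketch that solves a toy SRP instance with the classical Kaczmarz method, assuming "closest" means Euclidean distance and that the system is consistent. Started from the prior, the iteration converges to the solution nearest to that prior. This is a textbook technique swapped in for illustration, not the dual formulation proposed in the project; all names and values are hypothetical.

```python
def kaczmarz_closest(A, b, prior, sweeps=200):
    """Approximate the solution of A x = b that is closest (L2) to prior.

    Each step projects the current point onto the hyperplane of one equation.
    Because every update is a multiple of a row of A, the iterate stays in
    prior + rowspace(A), so the limit is the projection of prior onto the
    solution set.
    """
    x = list(prior)
    for _ in range(sweeps):
        for row, target in zip(A, b):
            residual = target - sum(a * xi for a, xi in zip(row, x))
            step = residual / sum(a * a for a in row)
            x = [xi + step * a for xi, a in zip(x, row)]
    return x


# Two equations, three unknowns: underdetermined (hypothetical values).
A = [[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]]
b = [2.0, 2.0]
prior = [0.0, 0.0, 0.0]
solution = kaczmarz_closest(A, b, prior)
```

The repeated sweeps over all equations are what makes this approach slow on large systems, which is the scalability issue the paragraph above refers to.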
Machine learning has become an essential toolkit for complex analytic processing. Data is typically stored in large data warehouses with multiple dimension hierarchies. Often, data used for building an ML model are aligned on OLAP hierarchies such as location or time. In this paper, we investigate the feasibility of efficiently constructing approximate ML models for new queries from previously constructed ML models by leveraging the concepts of model materialization and reuse. For example, is it possible to construct an approximate ML model for data from the year 2017 if one already has ML models for each of its quarters? We propose algorithms that can support a wide variety of ML models such as generalized linear models for classification along with K-Means and Gaussian Mixture models for clustering.
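The idea of reusing materialized models can be illustrated with the simplest possible case: one-variable least-squares regression, where per-quarter sufficient statistics can be added to obtain the model for the whole period. This sketch is an assumption-laden illustration (the merge here is exact, and the names and data are hypothetical); the project itself targets approximate merging for richer models such as generalized linear models, K-Means, and Gaussian Mixture models.

```python
def summarize(xs, ys):
    """Sufficient statistics of one partition: (n, sum x, sum y, sum x^2, sum xy)."""
    return (
        len(xs),
        sum(xs),
        sum(ys),
        sum(x * x for x in xs),
        sum(x * y for x, y in zip(xs, ys)),
    )


def merge(stats_list):
    """Combine the statistics of several partitions by component-wise addition."""
    return tuple(sum(values) for values in zip(*stats_list))


def fit_line(stats):
    """Least-squares slope and intercept computed from sufficient statistics."""
    n, sx, sy, sxx, sxy = stats
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    return slope, intercept


# Materialized summaries for two quarters (hypothetical data on the line y = 2x + 1).
q1 = summarize([1, 2], [3, 5])
q2 = summarize([3, 4], [7, 9])
half_year_model = fit_line(merge([q1, q2]))  # built without rescanning the raw data
```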
An LBS provides a public (often web-based) search interface over its backend database (of tuples with 2D geolocations), taking as input a 2D query point and returning k tuples in the database that are closest to the query point, where k is usually a small constant such as 20 or 50. In this project, we consider a novel problem of enabling density based clustering over the backend database of an LBS using nothing but limited access to the kNN interface provided by the LBS. In order to address the various types of restrictions imposed by the LBS, our goal here is to mine from the LBS a cluster assignment function f(·), such that for any tuple t in the database (which may or may not have been accessed), f(·) can produce the cluster assignment of t with high accuracy.
Feature engineering is an important and time-consuming step in the data science pipeline. Currently there are fully manual and fully automated procedures to perform this task. In a fully manual setting, a human analyst usually makes several passes through the dataset to come up with the most effective derived attributes. Fully automated methods ask the user to upload the data file and do all the processing needed to provide derived attributes. Although faster than the manual approach, fully automated methods are expensive, and they mostly produce derived attributes that are hard for humans to interpret. In this research we investigate a framework that enables a human analyst to find interpretable derived attributes much more quickly than with fully manual methods, which addresses the issue of opaque attributes generated by fully automated methods. Instead of overwhelming the analyst by showing all attributes, the framework creates a few buckets, where each bucket contains a small subset of attributes that are related and highly discriminative. The analyst can focus on creating derived attributes using only the raw attributes in a given bucket. Our research aims to establish a principled way to achieve this goal.
Crowdsourcing systems have gained popularity in a variety of domains. The next generation crowdsourcing systems will be collaborative and knowledge-intensive in nature. They need to treat the crowdsourcing problem not in optimization silos, but as an adaptive optimization problem by seamlessly handling the three main crowdsourcing processes (worker skill estimation, task assignment, task evaluation) and incorporating the uncertainty stemming from human factors. The main thrust behind this project is to develop algorithms for such an adaptive, knowledge-intensive crowdsourcing scenario by quantifying and incorporating the human factors into the three major crowdsourcing processes.
In recent years, there has been much research on the adoption of the ranked retrieval model (in addition to the Boolean retrieval model) in structured databases, especially those in a client-server environment (e.g., web databases). With this model, a search query returns the top-k tuples according to not just exact matches of selection conditions, but also a suitable ranking function. While much research has gone into the design of ranking functions and the efficient processing of top-k queries, this paper studies a novel problem: the privacy implications of database ranking. The motivation is a novel yet serious privacy leakage we found on real-world web databases, which is caused by the ranking function design.
The ranked retrieval model has rapidly become the de facto way of processing search queries in client-server databases, especially those on the web. Despite the extensive efforts in the database community on designing better ranking functions/mechanisms, many such databases in practice still fail to address the diverse and sometimes contradictory preferences of users on tuple ranking, perhaps (at least partially) due to the database owner's lack of expertise and/or motivation to design truly effective ranking functions. This project takes a different route to addressing the issue by defining a novel query reranking problem.
Structured hidden databases are widely prevalent on the Web. They provide restricted form-like search interfaces that allow users to execute search queries by specifying desired attribute values of the sought-after tuples, and the system responds by returning a few (e.g., top-k) tuples that satisfy the selection conditions, sorted by a suitable ranking function. The top-k output constraint prevents many interesting third-party (e.g., mashup) services from being developed over real-world web databases. This research involves developing effective techniques for retrieving more than the top-k tuples for any query and for supporting additional rank-based analytics, such as estimating the rank of a tuple or comparing the ranks of two arbitrary tuples to determine which of them is ranked higher. Our techniques access the hidden structured databases via their public interfaces and operate without any knowledge of the underlying static ranking function.
The objective of this project is to understand, evaluate, and contribute towards the suppression of sensitive aggregates over hidden databases. While owners of hidden databases would like to allow individual search queries, many also want to maintain a certain level of privacy for aggregates over their hidden databases. This has implications in the commercial domain (e.g., to prevent competitors from gaining strategic advantages) as well as in homeland-security related applications (e.g., to prevent potential terrorists from learning flight occupancy distributions). This project investigates techniques to suppress the sensitive aggregates while maintaining the usability of hidden databases for bona fide search users.
Structured hidden databases are widely prevalent on the Web. They provide restricted form-like search interfaces that allow users to execute search queries by specifying desired attribute values of the sought-after tuples, and the system responds by returning a few (e.g., top-k) tuples that satisfy the selection conditions, sorted by a suitable ranking function. Although search interfaces for hidden databases are designed with focused search queries in mind, for certain applications it may be advantageous to infer more aggregated views of the data from the returned results of search queries. This research involves developing effective techniques for performing data analytics, especially sampling, over hidden structured databases via their public interfaces. The outcomes include efficient algorithms for sampling hidden databases with a heterogeneous mix of data types, achievability results for sampling different types of search interfaces, and a prototypical toolset which demonstrates the sampling of real-world hidden databases.
The widespread use and growing popularity of online collaborative content sites have created rich resources for users to consult in order to make purchasing decisions on various items such as movies, restaurants, e-commerce products, etc. They have also created new opportunities for content producers of such web items to design new improved items, compose eye-catching advertisement snippets, etc. in order to improve business. This project concerns developing data mining and exploration algorithms for performing aggregate analytics over user feedback (ratings, tags, likes, visits, etc.) available from collaborative content sites in order to benefit the experience and decision making of both content producers and consumers. The key challenges are information explosion and overload, as well as the intractability of user-item interactions.
The ever-expanding volume and increasing complexity of information on the web have made recommendation systems essential tools for users in a variety of information-seeking or e-commerce activities. Moreover, new research suggests that every digital comment made by users anywhere (a product review, a social bookmark, a tweet, a blog post, an activity on a social network site, an e-mail) can be mined for hints as to emotions and other thoughts. In this body of work, we intend to design novel query answering models considering the paradigm of recommendation. Our previous and ongoing work in that space considers novel recommendation problems, such as recommending items to a group of users, recommending composite items to a user, and so on. In the modeling of these problems, we tap into the latent social information sources, leverage them in a principled way to enhance query-answering tasks, and analyze that information for future learning and opportunities.
Platforms such as AirBnB, Zillow, Yelp, and related sites have transformed the way we search for accommodation, restaurants, etc. The underlying datasets in such applications have numerous attributes that are mostly Boolean or Categorical. Discovering the skyline of such datasets over a subset of attributes would identify entries that stand out while enabling numerous applications.
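A skyline over Boolean attributes can be sketched with a direct pairwise dominance test. The sketch assumes that 1 (attribute present) is preferred to 0 on every attribute; the function names and the toy listings are hypothetical, and the quadratic scan is a baseline rather than the project's algorithm.

```python
def dominates(a, b):
    """a dominates b if it is at least as good everywhere and better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(rows, attrs):
    """Rows not dominated by any other row on the chosen attribute positions."""
    projected = [tuple(row[i] for i in attrs) for row in rows]
    return [
        row
        for row, p in zip(rows, projected)
        if not any(dominates(q, p) for q in projected)
    ]


# Toy listings as (wifi, parking, pool) flags (hypothetical values).
listings = [(1, 1, 0), (1, 0, 0), (0, 0, 1)]
all_attrs = skyline(listings, attrs=[0, 1, 2])
wifi_parking = skyline(listings, attrs=[0, 1])
```

Changing the attribute subset changes the skyline: the listing with only a pool stands out over all three attributes but not when only wifi and parking are considered.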
Peer-to-peer marketplaces enable transactional exchange of services directly between people. In such platforms, those providing a service are faced with various choices. For example, in travel peer-to-peer marketplaces, although some amenities (attributes) in a property are fixed, others are relatively flexible and can be provided without significant effort. Providing an attribute is usually associated with a cost. Naturally, different sets of attributes may have different “gains” (monetary or otherwise) for a service provider. Consequently, given a limited budget, deciding which attributes to offer is challenging. In this project, we propose techniques that help service providers in decision making.
Faceted search is an exploratory search interface that helps users navigate and browse a large information space through facets. A large number of information-seeking and e-commerce applications benefit from such an interface. We analyze the opportunities of adopting principles of the faceted search paradigm for tuple search. However, unlike past works on images and unstructured data such as text, here the challenge is to dynamically determine the facets that are best suited for enabling a faceted search interface. Our proposed faceted search framework is dynamic and completely dependent on user interaction, in contrast to existing faceted search systems where the facets and hierarchies are predefined and static. Our overall goal is to judiciously select, based on the user query, the facet(s) that the user then employs to drill down into the results, such that the user’s navigational cost during this exploration process is minimized. We investigate dynamic minimum-effort driven faceted search in conjunction with structured and unstructured data (primarily Wikipedia).
In many OLAP and decision support environments, it is often desirable to answer complex long-running aggregate database queries approximately, provided some estimate of the error is also given. For example, when a sales manager asks "give me the aggregate sales of Product X, grouped by US state", she/he is probably not interested in getting answers to the nearest cent. We approach this difficult problem using statistical sampling-based techniques. Our objective is to propose practical solutions that require minimal changes to the underlying DBMS.
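A minimal sketch of the sampling idea for a grouped SUM is shown below. It assumes a uniform random sample of rows drawn from a table of known size and ignores the finite-population correction; the function name, the sample rows, and the table size are hypothetical, and this is a textbook estimator rather than the specific techniques developed in the project.

```python
from collections import defaultdict
from math import sqrt


def estimate_group_sums(sample, population_size):
    """Estimate SUM(value) GROUP BY group from a uniform row sample.

    sample: list of (group, value) pairs, with at least two rows.
    Returns {group: (estimated_sum, standard_error)}.
    """
    n = len(sample)
    by_group = defaultdict(list)
    for group, value in sample:
        by_group[group].append(value)

    estimates = {}
    for group, values in by_group.items():
        # Each sampled row contributes its value if it is in the group, else 0.
        mean = sum(values) / n
        mean_sq = sum(v * v for v in values) / n
        variance = max(mean_sq - mean * mean, 0.0) * n / (n - 1)
        estimates[group] = (
            population_size * mean,
            population_size * sqrt(variance / n),
        )
    return estimates


# Four sampled sales rows from a table of 1000 rows (hypothetical values).
sample = [("TX", 10.0), ("TX", 20.0), ("CA", 30.0), ("CA", 30.0)]
result = estimate_group_sums(sample, population_size=1000)
```

The standard error returned with each estimate plays the role of the error estimate mentioned above; a larger sample shrinks it at the cost of reading more rows.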
Ranking: We study ranked retrieval over repositories such as relational databases, e.g., when searching online databases of homes, used cars, and electronic goods. In many of these applications, the user often experiences "information overload", which occurs when the system responds to an under-specified user query by returning an overwhelming number of tuples, each displayed with a huge number of features (or attributes). We have developed a search and retrieval system that tackles this information overload problem from two angles. First, we show how to automatically rank and display the top-n most relevant tuples. Our ranking functions are either based on traditional distance-based metrics, or use probabilistic information retrieval principles that learn user preferences by exploiting past query workloads. Second, our system offers techniques for ordering the attributes of the returned tuples in decreasing order of "usefulness" and selects only a few of the most useful attributes to display. We have built demos of the system on a used-car dataset and a homes-for-sale dataset. User surveys have shown that our system improves the user's query experience.
Top-k: Top-k queries on large multi-attribute data sets are fundamental operations in information retrieval and ranking applications. In particular, given specific top-k algorithms, we were interested in studying their progress towards identifying the correct result at any point of the algorithms' execution. We adopted a probabilistic approach in which we seek to report, at any point, the scores of the top-k results the algorithm has identified so far, and to associate a confidence with this prediction. Such functionality can be a valuable asset when one is interested in reducing the runtime cost of top-k computations. We showed analytically that such probability and confidence are monotone in expectation. We presented a thorough experimental evaluation to validate our techniques using both synthetic and real data sets.
Attribute Recommendation: When advertising a product in the e-marketplace, it is very important to make sure that its content is attractive to potential buyers and that it beats the competing products in the market. We wish to design methods that can assist the seller in deciding which attributes of the product (e.g., product features, keywords, etc.) should be emphasized or recommended when preparing the advertisement. We have developed algorithms for several variants of the problem across different application domains, e.g., car/home sales, product advertising in newspapers, creating catchy titles for an article, discovering useless attributes of an object (e.g., a homebuilder can find out that 'adding a fireplace does not make the home more desirable in this market'), and so on.
© 2019 dbxlab.uta.edu