Projects In Hidden Web Databases


Rank Analytics over Hidden Databases

Structured hidden databases are widely prevalent on the Web. They provide restricted form-like search interfaces that allow users to execute search queries by specifying desired attribute values of the sought-after tuples, and the system responds by returning a few (e.g., top-k) tuples that satisfy the selection conditions, sorted by a suitable ranking function. The top-k output constraint prevents many interesting third-party (e.g., mashup) services from being developed over real-world web databases. This research involves developing effective techniques for retrieving more than top-k tuples for any query and support additional rank based analytics such as estimating the rank of a tuple or compare the rank of two arbitrary tuples to determine which of them is highly ranked. Our techniques access the hidden structured databases via their public interfaces and operate without any knowledge of the underlying static ranking function.

  • [TZD13b] Saravanan Thirumuruganathan, Nan Zhang, and Gautam Das. Rank Discovery From Web Databases. In PVLDB 2013.
  • [TZD13a] Saravanan Thirumuruganathan, Nan Zhang, Gautam Das. Breaking the Top-k Barrier of Hidden Web Databases. In ICDE 2013.
Participating Students

Suppressing Sensitive Aggregates over Hidden Web Databases

The objective of this project is to understand, evaluate, and contribute towards the suppression of sensitive aggregates over hidden databases. While owners of hidden databases would like to allow individual search queries, many also want to maintain a certain level of privacy for aggregates over their hidden databases. This has implications in the commercial domain (e.g., to prevent competitors from gaining strategic advantages) as well as in homeland-security related applications (e.g., to prevent potential terrorists from learning flight occupancy distributions). This project investigates techniques to suppress the sensitive aggregates while maintaining the usability of hidden databases for bona fide search users.

  • [ZZD12] M. Zhang, N. Zhang, and G. Das, Aggregate Suppression for Enterprise Search Engines, SIGMOD 2012.
  • [JMZ+11] X. Jin, A. Mone, N. Zhang, G. Das, Randomized Generalization for Aggregate Suppression Over Hidden Web Databases, VLDB 2011.
  • [DZDC09] A. Dasgupta, N. Zhang, G. Das, and S. Chaudhuri, Privacy Preservation of Aggregates in Hidden Databases: Why and How?, SIGMOD 2009.
  • [MDZD09] Anirban Maiti, Arjun Dasgupta, Nan Zhang, and Gautam Das, Revealing data behind web form interfaces, Demo Paper, in ACM SIGMOD International Conference on Management of Data (SIGMOD), 2009.
  • [DZ09b] Gautam Das and Nan Zhang, "Privacy Risks In Health Databases from Aggregate Disclosure", in Workshop on Privacy and Security in Pervasive e-Health and Assistive Environments (PSPAE), 2009.
  • [DZ09a] Gautam Das and Nan Zhang, "Aggregates disclosure in hidden web databases: an urgent challenge", in NSF Workshop on Data and Applications Security, 2009.
Participating Students
  • Arjun Dasgupta, PhD student, University of Texas at Arlington
  • Xin Jin, PhD student, George Washington University
  • Aditya Mone, PhD student, University of Texas at Arlington
  • Mingyang Zhang, PhD student, George Washington University
  • Anirban Maiti, Masters student, University of Texas at Arlington

Data Analytics over Hidden Web Databases

Structured hidden databases are widely prevalent on the Web. They provide restricted form-like search interfaces that allow users to execute search queries by specifying desired attribute values of the sought-after tuples, and the system responds by returning a few (e.g., top-k) tuples that satisfy the selection conditions, sorted by a suitable ranking function. Although search interfaces for hidden databases are designed with focused search queries in mind, for certain applications it may be advantageous to infer more aggregated views of the data from the returned results of search queries. This research involves developing effective techniques for performing data analytics, especially sampling, over hidden structured databases via their public interfaces. The outcomes include efficient algorithms for sampling hidden databases with a heterogeneous mix of data types, achievability results for sampling different types of search interfaces, and a prototypical toolset which demonstrates the sampling of real-world hidden databases.

  • [DJJZD10] Arjun Dasgupta, Xin Jin, Bradley Jewell, Nan Zhang, and Gautam Das. Unbiased estimation of size and other aggregates over hidden web databases, in ACM SIGMOD International Conference on Management of Data (SIGMOD), 2010.
  • [DZD10]Arjun Dasgupta, Nan Zhang and Gautam Das. Turbo-Charging Hidden Database Samplers with Overfowing Queries and Skew Reduction., in EDBT International Conference on Extending Database Technology, 2010.
  • [DZD09] Arjun Dasgupta, Nan Zhang, and Gautam Das, Leveraging COUNT Information in Sampling Hidden Databases, in IEEE International Conference on Data Engineering (ICDE), 2009.
  • [DDM07] Arjun Dasgupta, Gautam Das, and Heikki Mannila, A random walk approach to sampling hidden databases, in ACM SIGMOD International Conference on Management of Data (SIGMOD), 2007.
Participating Students
  • Arjun Dasgupta, PhD student, University of Texas at Arlington
  • Xin Jin, PhD student, George Washington University
  • Bradley Jewell, Undergraduate student, University of Texas at Arlington

Projects In Social Computing


Collaborative Social Content Mining

The widespread use and growing popularity of online collaborative content sites today has created rich resources for users to consult in order to make purchasing decisions on various items such as movies, restaurants, e-commerce products, etc. It has also created new opportunities for content producers of such web items to design new improved items, compose eye-catching advertisement snippets, etc. in order to improve business. This project concerns developing data mining and exploration algorithms for performing aggregate analytics over user feedback (ratings, tags, likes, visits, etc.) available from collaborative content sites in order to benefit experience and decision making of both content producers and consumers. The key challenges exist in the form of information explosion and overload, besides user-item interaction intractability.

  • [DTADY13] Mahashweta Das, Saravanan Thirumuruganathan, Sihem Amer-Yahia, Gautam Das, and Cong Yu: An Expressive Framework and Efficient Algorithms for the Analysis of Collaborative Tagging. In VLDBJ 2013.
  • [ARTCD14]Sofiane Abbar, Habibur Rahman, Saravanan T., Carlos Castillo, and Gautam Das. Ranking Item Features by Mining Online User-Item Interactions. In Proceedings of ICDE 2014.
  • [DRDH13] Mahashweta Das, Habibur Rahman, Gautam Das, and Vagelis Hristidis: Generating Informative Snippet to Maximize Item Visibility. In Proceedings of ACM CIKM 2013.
  • [DTADY12] Mahashweta Das, Saravanan Thirumuruganathan, Sihem Amer-Yahia, Gautam Das, and Cong Yu: Who Tags What? An Analysis Framework. In PVLDB 2012.
  • [TDDADY12] Mahashweta Das, Saravanan Thirumuruganathan, Shrikant Desai, Sihem Amer-Yahia, Gautam Das, and Cong Yu: MapRat: Meaningful Explanation, Interpretation and Geo-Visualization of Collaborative Rating. In PVLDB 2012.
  • [DADY11] Mahashweta Das, Sihem Amer-Yahia, Gautam Das, and Cong Yu: MRI: Meaningful Interpretations of Collaborative Ratings. In PVLDB 2011.
  • [DDH11] Mahashweta Das, Gautam Das, and Vagelis Hristidis: Leveraging Collaborative Tagging for Web Item Design. In Proceedings of ACM SIGKDD 2011.
Participating Students
  • Mahashweta Das, PhD student, University of Texas at Arlington
  • Saravanan Thirumuruganathan, PhD student, University of Texas at Arlington.
  • Azade Nazi, PhD student, University of Texas at Arlington
  • Habibur Rahman, PhD student, University of Texas at Arlington

Group Recommendation

The ever-expanding volume and increasing complexity of information on the web has made recommendation systems essential tools for users in a variety of information seeking or e-commerce activities. Moreover, new research suggests that every digital comment made by users anywhere - a product review, social book-marking, tweets, blogs, activities on a social network site, e-mails can be mined for hints as to emotions and other thoughts. In this body of work, we intend to design novel query answering models considering the paradigm of recommendation. Our previous and ongoing works in that space consider novel recommendation problems, such as recommending items to a group of users, recommending composite items to a user, and so on. In the modeling of these problems, we tap into the latent social information sources and leverage that in a principled way to enhance query-answering tasks, and analyze that information for future learning and opportunities.

  • [BTADY14] Senjuti Basu Roy, Saravanan Thirumuruganathan, Sihem Amer-Yahia, Gautam Das and Cong Yu: Exploiting Group Recommendation Functions for Flexible Preferences, In Proceedings of ICDE 2014.
  • [BADY11] Senjuti Basu Roy, Sihem Amer-Yahia, Gautam Das and Cong Yu: Interactive Itinerary Planning, In Proceedings of ICDE 2011.
  • [BACDY10] Senjuti Basu Roy, Sihem Amer-Yahia, Ashish Chawla, Gautam Das and Cong Yu. Space Efficiency in Group Recommendations, In VLDB Journal Special Issue on Data Management and Mining for Social Networks and Social Media, 2010.
  • [ABCDY10] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawla, Gautam Das and Cong Yu: Constructing and Exploring Composite Items. In Proceedings of SIGMOD 2010.
  • [ABCDY10] Sihem Amer-Yahia, Senjuti Basu Roy, Ashish Chawla, Gautam Das, Cong Yu: Group Recommendation: Semantics and Efficiency. Full paper, in VLDB 2009.
Participating Students

Projects In Crowdsourcing


Knowledge Intensive Crowdsourcing

Crowdsourcing systems have gained popularity in a variety of domains. The next generation crowdsourcing systems will be collaborative and knowledge-intensive in nature. They need to treat the crowdsourcing problem not in optimization silos, but as an adaptive optimization problem by seamlessly handling the three main crowdsourcing processes (worker skill estimation, task assignment, task evaluation) and incorporating the uncertainty stemming from human factors. The main thrust behind this project is to develop algorithms for such an adaptive, knowledge-intensive crowdsourcing scenario by quantifying and incorporating the human factors into the three major crowdsourcing processes.

  • Habibur Rahman, Senjuti Basu Roy, Saravanan Thirumuruganathan, Sihem Amer-Yahia and Gautam Das: Task Assignment Optimization in Collaborative Crowdsourcing. To Appear in IEEE Conference on Data Mining (ICDM 2015).
  • Habibur Rahman, Saravanan Thirumuruganathan, Senjuti Basu Roy, Sihem Amer-Yahia, and Gautam Das: Worker Skill Estimation in Team-Based Tasks. To appear in PVLDB 2015.
  • Senjuti Basu Roy, Saravanan Thirumuruganathan, Sihem Amer-Yahia, Gautam Das: Task-Assignment Optimization in Knowledge Intensive Crowdsourcing. To appear in VLDB Journal, 2015.​
  • [BLTADY13] Senjuti Basu Roy, Ioanna Lykourentzou, Saravanan Thirumuruganathan, Sihem Amer-Yahia, Gautam Das. Crowds, not Drones: Modeling Human Factors in Interactive Crowdsourcing. In DBCrowd 2013 held in conjunction with VLDB 2013.
Participating Students
  • Saravanan Thirumuruganathan, PhD student, University of Texas at Arlington.
  • Habibur Rahman, PhD student, University of Texas at Arlington

Past Projects


Faceted Search

Faceted Search is an exploratory search interface that helps user to navigate and browse through a large information space through facets. A large number of information seeking and e-commerce applications benefit from such interface. We analyze the opportunities of adopting principles of the faceted search paradigm for tuple search. However, unlike past works on images and unstructured data such as text, here, the challenge is to dynamically determine the facets that are best suited for enabling a faceted search interface. Our proposed faceted search framework is dynamic and completely user interaction dependent as compared to existing faceted search systems where the facets and hierarchies are predefined and static. Our overall goal is to judiciously select the facet(s) dynamically based on user query that the user employs further to drill down in the results, such that user’s navigational cost during this exploration process is minimized. We investigate dynamic minimum effort driven faceted search in conjunction with structured and unstructured data (primarily Wikipedia).

  • [YLBRD10] Ning Yan, Chengkai Li, Senjuti Basu Roy, Rakesh Ramegowda, Gautam Das, Facetedpedia: Enabling Query-Dependent Faceted Search for Wikipedia, Demo paper, in Proceedings of CIKM 2010.
  • [LYBLD10] Chengkai Li, Ning Yan, Senjuti Basu Roy, Lekhendro Lisham, and Gautam Das: Facetedpedia: Dynamic Generation of Query-Dependent Faceted Interfaces for Wikipedia. In Proceedings of WWW 2010
  • [BD09] Senjuti Basu Roy, Gautam Das: Top-k Implementation Techniques of Minimum Effort Driven Faceted Search for Databases. In Proceedings of COMAD 2009
  • [BWNDM09] Senjuti Basu Roy, Haidong Wang, Ullas Nambiar, Gautam Das, Mukesh Mohania: DynaCet: Building Dynamic Faceted Search Systems over Databases. Demo paper, in Proc. ICDE 2009.
  • [BWNDM08] Senjuti Basu Roy, Haidong Wang, Ullas Nambiar, Gautam Das and Mukesh Mohania: Minimum-Effort Driven Dynamic Faceted Search in Structured Databases. In Proc. CIKM 2008.
Participating Students
  • Senjuti Basu Roy, PhD student, University of Texas at Arlington
  • Ning Yan, PhD student, University of Texas at Arlington
  • Rakesh Ramegowda, MS student, University of Texas at Arlington
  • Lekhendro Lisham, MS student, University of Texas at Arlington
  • Haidong Wang, MS student, University of Texas at Arlington

Approximate Query Processing

In many OLAP and decision support environments, it is often desirable to answer complex long-running aggregate database queries approximately, provided some estimate of the error is also given. For example, when a sales manager asks give me the aggregate sales of Product X, grouped by the US states, she/he is probably not interested in getting answers to the nearest cent. We approach this difficult problem using statistical sampling-based techniques. Our objective is to propose practical solutions that require minimal changes to the underlying DBMS systems.

  • [CDN07] Surajit Chaudhuri, Gautam Das, Vivek Narasayya. Optimized Stratified Sampling for Approximate Query Processing. ACM Transactions on Database Systems 2007.
  • [DDM07] Arjun Dasgupta, Gautam Das, Heikki Mannila. A random walk approach to sampling Hidden Databases. SIGMOD 2007.
  • [ADGK06] Benjamin Arai, Gautam Das, Dimitrios Gunopulos and Vana Kalogeraki. Approximating Aggregations in Peer-to-Peer Databases. HDMS 2006.
  • [D05a] Gautam Das: Approximate Query Processing. Tutorial, SBBD 2005.
  • [D05b] Gautam Das: Sampling Methods in Approximate Query Answering Systems. Invited Book Chapter, Encyclopedia of Data Warehousing and Mining. Editor John Wang, Information Science Publishing, 2005.
  • [D05c] Gautam Das: Approximate Query Processing Techniques. Invited Tutorial at the 11th International Conference on Management of Data (COMAD) 2005.
Participating Students
  • Arjun Dasgupta, PhD student, University of Texas at Arlington
  • Benjamin Arai, PhD student, University of California, Riverside

Ranking and Top-k

Ranking: Repositories such as relational databases - e.g., searching online databases of homes, used cars, and electronic goods. In many of these applications, the user often experiences "information overload", which occurs when the system responds to an under-specified user query by returning an overwhelming number of tuples, each displayed with a huge number of features (or attributes). We have developed a search and retrieval system that tackles this information overload problem from two angles. First, we show how to automatically rank and display the top-n most relevant tuples. Our ranking functions are either based on traditional distance-based metrics, or use probabilistic information retrieval principles that learn user preferences by exploiting past query workloads. Second, our system offers techniques for ordering the attributes of the returned tuples in decreasing order of "usefulness" and selects only a few of the most useful attributes to display. We have built demos of the system on used cars and a homes for sale dataset. User surveys have shown that our system improves the user's query experience.

Top-k: Top-k queries on large multi-attribute data sets are fundamental operations in information retrieval and ranking applications. In particular given specific top-k algorithms we were interested in studying their progress towards identification of the correct result at any point of the algorithms' execution. We adopted a probabilistic approach where we seek to report at any point the scores of the top-k results the algorithm has identified, as well as associate a confidence with this prediction. Such functionality can be a valuable asset when one is interested to reduce the runtime cost of top-k computations. We showed analytically that such probability and confidence are monotone in expectation. We presented a thorough experimental evaluation to validate our techniques using both synthetic and real data sets.

Attribute Recommendation: When advertising a product in the e-marketplace, it is very important to make sure that its content should be attractive to potential buyers, and that it beats the competitive products in the market. We wish to design methods which can assist the seller in deciding which attributes of the product (e.g., product features, keywords, etc) should be emphasized or recommended when preparing the advertisement. We have developed algorithms for several variants of the problem across different application domains, e.g., car/home sales, products advertising in newspapers, creating catchy titles for an article, discovering useless attributes of an object (e.g., a homebuilder can find out that 'adding a fireplace does not make the home more desirable in this market') and so on.

  • [MDHM09] Muhammed Miah, Gautam Das, Vagelis Hristidis, Heikki Mannila: Determining Attributes to Maximize Visibility of Objects. In IEEE Transactions on Data Engineering (TKDE), 2009.
  • [MDHM08] Muhammed Miah, Gautam Das, Vagelis Hristidis, Heikki Mannila: Standing Out in a Crowd: Selecting Attributes for Maximum Visibility. In Proceedings of ICDE 2008.
  • [DKPP08] Gautam Das, Nick Koudas, Manos Papagelis, Sushruth Puttaswamy: Efficient Sampling of Information in Social Networks. In Proceedings of CIKM/SSM 2008.
  • [KDHSW07] The STAR system, by Nishant Kapoor, Gautam Das, Vagelis Hristidis, S. Sudarshan, and Gerhard Weikum was demo-ed in ICDE 2007 conference.
  • [CDHW06] Surajit Chaudhuri, Gautam Das, Vagelis Hristidis, Gerhard Weikum. Probabilistic Information Retrieval Approach for Ranking of Database Query Results. To appear in ACM Transactions on Database Systems, 2006.
  • [P06] Sushruth Puttaswamy. Personalizing (Re-Ranking) web search results using information present on a social network. Masters Thesis, University of Texas at Arlington 2006.
  • [DGKT06] Gautam Das, Dimitrios Gunopulos, Nick Koudas, Dimitris Tsirogiannis. Answering Top-k Queries Using Views. VLDB 2006.
  • [DHKS06] Gautam Das, Vagelis Hristidis, Nishant Kapoor and S. Sudarshan. Ordering the Attributes of Query Results. SIGMOD 2006.
Participating Students
  • Sushruth Puttaswamy, PhD student, University of Texas at Arlington
  • Nishant Kapoor, PhD student, University of Texas at Arlington
  • Manos Papagelis, PhD student, University of Toronto
  • Dimitris Tsirogiannis, PhD student, University of Toronto