Advisor: Dr. Da Yan (UAB CS)
Dr. Da Yan (Chair, UAB CS)
Dr. Purushotham Bangalore (UAB CS)
Dr. Chengcui Zhang (UAB CS)
Dr. Sidharth Kumar (UAB CS)
Dr. Carmeliza Navasca (UAB Mathematics)
Dr. Yang Zhou (Auburn University CS)
Scalable Subgraph Mining in a Big Graph
Finding from a big graph those subgraphs that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have a high time complexity, but existing systems to scale them are all IO-bound in execution. As a result, existing applications have to mine other less informative dense structures such as k-core and k-truss over large graphs thanks to their lower time complexity. One aim of this work is to fill the void of efficient pseudo-clique mining in large graphs.
In ICDE 2020, we proposed the G-thinker framework for subgraph finding that avoids the IO bottleneck, so the performance scales well with the number of CPU cores used in a cluster. However, I identified that as a task-based system where each task is spawned from individual vertices, the old G-thinker design does not balance the workloads of different subgraph-mining tasks sufficiently, leading to the straggler problem when mining expensive pseudo-clique structures such as quasi-cliques and k-plexes. In this proposal, we propose a system-algorithm codesign solution which will address this challenge by redesigning G-thinker’s execution engine to prioritize long-running tasks for mining, and by utilizing a novel time-delayed divide-and-conquer strategy to effectively decompose the workloads of long-running tasks to improve load balancing.
Moreover, since cliques are defined over undirected graphs, existing pseudo-clique definitions also only work on undirected graphs, limiting their application in many real networks that are directed. We have explored and confirmed the viability of generalize pseudo-clique definitions to the directed scenarios, and we will design their efficient algorithms including the parallel counterparts in G-thinker.
Another problem that this work will tackle is an online subgraph matching query engine atop G-thinker, for which a proof of concept has been successfully built. We will revise the offline G-thinker engine to answer online subgraph matching queries, with initial graph loading and statistics collection, followed by a task-based decomposition of the search space tree specified by the well-known Ullman’s algorithm strengthened with the latest techniques for matching index construction and matching order selection techniques.
Monday, April 5 at 11:15am to 12:05pmVirtual Event