Illustration of a pangolin with line and bar graphs

The datasets used by many software applications can be represented as graphs, defined by sets of vertices and edges. These graphs are rich with useful information, and can be used to determine patterns and relationships among the stored data. This process of discovering relevant patterns from graphs is called Graph Pattern Mining (GPM). A team of Texas Computer Science (TXCS) researchers advised by Dr. Keshav Pingali has done groundbreaking work to make GPM programs more efficient and accessible. Their work was recently accepted to Very Large Databases (VLDB) 2020, one of the premier conferences in computer science.

GPM has applications in many areas, including chemical engineering, bioinformatics, and social sciences. For example, protein behavior can be mapped to a graph by representing the proteins as the vertices and the interactions between them as edges. Using a  GPM program, researchers would be able to make useful discoveries about the data, such as finding frequent patterns that occur within the graph and what features they relate to. This information would then shine light on the etiology of a disease and enhance drug discovery.

The applications of these programs are seemingly endless. However, GPM problems are computationally complicated, memory-intensive and time-consuming to solve. Although a lot of work has been done to attempt to make it easy for programmers to develop GPM applications, these solutions are generally not very efficient or flexible, using distributed systems or disks to manage the intermediate data structures.

 Julie Hieland, Yi-Shan Lu, Loc Hoang, Keshav Pingali, Sepideh Maleki, Vishwesh Jatala, Bozhi YouThe TXCS research team decided to find a streamlined approach to solve GPM problems, through using a single multi-core CPU or a single GPU and utilizing the memory and computation resources available on the hardware to their maximum potential. “Our system handles all the complexities regarding parallelization, synchronization, memory management and so on,” says Dr. Roshan Dathathri, a collaborator on the project. “So the programmer only has to specify the application logic—what are they trying to find, what are they trying to mine from the graphs.” The system they developed, called Pangolin, provides a simple programming interface to do this.

At the moment, Pangolin runs efficiently on a single multi core CPU or a single GPU, but its functionality still has room to grow. “The goal is to enable larger graphs and also larger patterns that could be mined within a reasonable amount of time,” says Dr. Xuhao Chen, the project’s lead researcher. Their paper is set to be published in VLDB at the end of the month.

News categories: