Tools for analyzing big networks have emerged with the development of graph databases like Neo4j and Amazon Neptune, or packages like GraphX and GraphFrame for Apache Spark. At the same time, the volume of data to be analyzed from social networks and the Internet of Things (IoT), for instance, have also continued to mushroom.
In this blog, I present some applications of graph analytics to glean valuable information from such data.
What is a (mathematical) graph?
In mathematics, a graph comprises data on discrete points (known as nodes) and a host of direct connections (called edges) between the points.
Figure 1: Directional graph
- Facebook can be interpreted as a graph: the users are the nodes, and if two users are friends, the connection between them represents the edge of the graph.
- For a publisher, subscribers are equated to the quantity of nodes in the graph. Two subscribers are connected by an edge, if one of them helped canvass the other. This is an example of a directional graph in which the edges also have a direction: the edge flows from the canvasser to the canvassed.
- Each (sub)page of a website represents a node. A link that directs you from one page to another is an edge. The same function clearly works for the whole site too.
- Each bus or tram stop in Zurich is a node. Each direct connection between two stops (nodes) forms an edge.
Figure 2: Zurich's Public Transportation Network as a Graph
Use case 1: Generate features for machine learning
The output of a machine learning model depends primarily on the features of the input data, for which we typically use aspects like buying behavior, demographics, or externally acquired parameters.
However, large enterprises also have networks within their base customers. Some customers know each other, are related, or are friends, and they communicate among themselves. As customers of the particular enterprise, they also talk about their experiences with ads they receive, new products, and offers. Commercial customers may be connected through corporate affiliations, shareholdings, and/or via business relationships.
Although much of this information remains obscured from the enterprise, there are ways to set up connections between customers. From the base of individuals as customers at a bank, you can identify some networks by looking at joint accounts and authorizations. For telecoms, one can use phone and text messaging data. You can also analyze names and addresses to identify family relationships and households. Customer recommendation programs are good indicators of friendships.
To use such information for machine learning, you need to extract and prepare the information embedded in networks – an ideal playground for graph analysis tools. The following parameters, for instance, could be useful:
- The role of a customer in the network: Does he/she influence other customers? Is this customer basically isolated? Does he/she have contacts outside the customer network?
- The number of direct and indirect contacts (2nd degree contacts)
- Betweenness measures the significance of a customer for communication within the network
There are a host of other parameters that you can compute for graph nodes. Just like when preparing data for machine learning processes, it could be helpful to try out and/or combine some of the above parameters.
Use case 2: Recognizing bots in networks
A key measure of Twitter’s success is the number of retweets. This number can be artificially boosted by setting up bot accounts that automatically retweet the target account repeatedly.
Therefore, in analyzing Twitter data enterprises face the problem that retweets by bots seriously skew the data and thus negate much of the value of the analyses.
One can prepare a directional graph to recognize bot accounts, where the accounts are the nodes and retweets define the edges. The directional nodes are weighted: the higher the retweets, the higher the weight of the node.
By computing appropriate parameters for the nodes, you can identify bot accounts from such graphs, and exclude them from your analysis.
Use Case 3: Customers as multipliers
An enterprise can use the networks in its base of individual customers for primarily marketing purposes. Some customers are very important for enterprises, not just because they are big buyers, but also because of their social contacts with other customers.
Customers in networks are divided into three groups:
- Connectors: ones with many connections to other individuals;
- Experts: those who feed networks with new information;
- Sellers: ones who convince others about something.
Connectors and sellers, in particular, serve as multipliers who can pass on the desired message to the customer base. If you can identify such key individuals, you can reach most of your customers with minimal effort.
Other use cases: Shortest route, optimized matching, etc.
Businesses use many traditional graph algorithms like finding the shortest route between two nodes, the shortest roundtrip through all nodes, or the maximum flow from one node to another. Many such processes require massive computation power and, thus, benefit greatly from new technologies like distributed computing with Spark.
Problem: find the shortest route between two defined nodes of a graph (start to finish). Short here refers to edge weights.
We have very efficient algorithms to solve such a problem, the best known being the one from Dijkstra. Almost everyone has benefited from these algorithms: GPS devices and apps for maps use them a lot.
Figure 3: Directions
Shortest roundtrip, the travelling salesman problem (TSP)
Problem: find the shortest possible roundtrip that visits all nodes of a graph only once and returns to the origin.
The solution for this problem requires a lot more computing power than the shortest route problem. We’ll look at only two of many possible applications:
Directions for field service staff: an enterprise with many staff members in the field can really boost its efficiency by optimizing directions.
Optimized production: the goal is to drill many holes at different locations in a workpiece. The drill travels in a defined sequence over the piece, returning to its origin when finished. TSP algorithms can optimize the hole boring sequence.
Problem: for a given number of points and their connections that have been assigned a value or weight, find a setup such that each point connects to no more than one other point and the sum of the values is maximized.
This approach could, for example, be used to match pending tasks with employees in an enterprise. The graph consists of employees and the tasks to be performed, whereby the weights define the suitability of employees for respective tasks.
The figure below illustrates a typical example: four pending tasks are to be performed by three employees. Each employee is ranked from 1 to 9 for the various tasks, where 1 represents very low expertise and 9 the highest expertise.
Optimized assignment/matching of tasks has a total value of 20. Solution: task 4 is done by employee 1, although employee 3 would be somewhat better. Employee 3 is matched with task 3, because he/she is much better suited than employee 2. For now, task 1 remains undone.
Figure 4: Optimized Employee Assignment
There are numerous applications for graph analytics in the business world. Regardless of whether we address marketing, identify fraud, analyze web data, manage risks, manufacture, or deal with other fields – the volumes of data keep growing, while many datasets represent connections between points. Luckily, we have technologies today that can efficiently store and analyze graphs with massive amounts of data.
In the next blog, we’ll look at using Apache Spark to prepare and analyze graphs with the GraphFrame API.