Probabilistic Entity Resolution with Imperfect Crowd: PERC
A summary of the CIKM 2017 research paper [1] and VLDB 2018 demonstration by Vijaya Krishna Yalavarthi, Xiangyu Ke, and Arijit Khan
Introduction:
In real-world scenarios, we encounter many situations where two objects may not look exactly identical, yet they represent the same entity.
Example 1:
1. Vijaya Krishna Yalavarthi
2. Yalavarthi Vijaya Krishna
3. Xiangyu Ke
4. Arijit Khan
In the list above, the first two of the four names refer to the same person - Vijaya Krishna Yalavarthi - even though they are not written in an identical way.
Example 2:
Among the three images given, A and B represent the Taj Mahal (India), whereas C represents the Merlion (Singapore). Though A and B do not look identical, they represent the same entity - the Taj Mahal.
What is entity resolution?
Entity resolution (ER) [2] is the task of linking and clustering all records that represent the same real-world entity. ER, also known as record linkage or deduplication, is a crucial step in data cleaning and data integration, and has a wide variety of applications, such as data analytics, knowledge representation, comparison shopping, and law enforcement, to name a few.
How to do ER?
Entity resolution can be done in many ways. One way is to use machine-based methods, including text similarity measures or classification via machine learning techniques. Another way of performing ER is by crowdsourcing [4, 5].
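As a concrete illustration of the machine-based route, below is a minimal sketch that declares two records a match when the Jaccard similarity of their word-token sets exceeds a threshold. The function names and the 0.5 threshold are our own illustrative choices, not values from the paper.

```python
# A minimal sketch of machine-based ER with a token-level Jaccard
# similarity measure. The 0.5 threshold is an illustrative choice.

def jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the token sets of two records."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb)

def same_entity(a: str, b: str, threshold: float = 0.5) -> bool:
    return jaccard(a, b) >= threshold

# The two name variants from Example 1 match; unrelated names do not.
print(same_entity("Vijaya Krishna Yalavarthi", "Yalavarthi Vijaya Krishna"))  # True
print(same_entity("Vijaya Krishna Yalavarthi", "Arijit Khan"))                # False
```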
Crowdsourcing for ER:
Here, a record pair is selected and given to people, referred to as crowd workers, who evaluate whether the two records are similar. If the answer from the crowd workers is positive (i.e., YES), then it can be concluded that the record pair represents the same entity.
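For instance, a simple way to combine the workers' answers on one pair is majority voting; the sketch below shows this illustrative aggregation rule (real platforms and the paper may use more sophisticated schemes).

```python
# A minimal sketch of aggregating crowd answers for one record pair by
# majority vote (an illustrative rule, not a specific platform's method).

def majority_says_match(answers: list[bool]) -> bool:
    """True if strictly more than half of the workers answered YES."""
    return sum(answers) > len(answers) / 2

# 7 of 10 workers answer YES, so the pair is resolved as a match.
print(majority_says_match([True] * 7 + [False] * 3))  # True
```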
Recent studies [8] show that, for complex tasks such as classification and clustering of images, video tagging, optical character recognition, and natural language processing, crowdsourcing can provide better results than machine-based techniques. There are various crowdsourcing platforms, e.g., Amazon Mechanical Turk and CrowdFlower, where humans perform tasks for a certain amount of monetary reward.
Challenges in crowdsourcing-based ER:
Though humans provide more insightful information, crowdsourcing suffers from crowd errors due to lack of domain expertise or seriousness, ambiguity, or even malicious intent. Therefore, crowdsourcing all record pairs may provide very good accuracy, but it would be expensive. Hence, our goal is:
Minimize the crowdsourcing cost, while maximizing the quality of the results.
Our problem:
We consider the ER problem as an uncertain graph problem, where every node in the graph represents a record, and an uncertain, undirected edge between two nodes carries the probability with which the two records belong to the same entity. We find the edge probabilities by crowdsourcing. For example, in the figure given below, let us assume that we crowdsourced the record pair <A, C>. Out of 10 people, 3 accepted that the records represent the same entity, whereas 7 did not. Hence, the weight of the edge AC is 0.3.
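A minimal sketch of building such an uncertain graph from crowd answers follows; the data structure (a dictionary keyed by unordered record pairs) is our own illustrative choice.

```python
# A minimal sketch of the uncertain graph: the weight of an edge is the
# fraction of YES answers the crowd gave for that record pair.

graph: dict[frozenset, float] = {}

def add_crowd_answers(u: str, v: str, yes: int, total: int) -> None:
    """Record the crowd's verdict on the pair <u, v> as an edge probability."""
    graph[frozenset((u, v))] = yes / total

# 3 of 10 workers accept <A, C>, so the edge AC gets probability 0.3.
add_crowd_answers("A", "C", yes=3, total=10)
print(graph[frozenset(("A", "C"))])  # 0.3
```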
For ER, we begin by crowdsourcing a few initial record pairs chosen uniformly at random (alternatively, one can choose another strategy, such as forming a spanning tree connecting the records), and we update the graph based on the crowd responses. Then, we cluster the records into initial entities; a simple illustrative clustering is sketched below.
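For illustration, the sketch below treats edges with probability above 0.5 as "same entity" and takes connected components as the initial entities. The 0.5 cut-off and the union-find clustering are our own illustrative choices, not necessarily the clustering used in PERC.

```python
# A minimal sketch of forming initial entities: keep edges whose
# probability exceeds 0.5 and return the connected components,
# computed with a small union-find structure.

def cluster(records: list[str], graph: dict[frozenset, float]) -> list[set]:
    parent = {r: r for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for edge, prob in graph.items():
        u, v = tuple(edge)
        if prob > 0.5:                     # illustrative cut-off
            parent[find(u)] = find(v)

    groups: dict[str, set] = {}
    for r in records:
        groups.setdefault(find(r), set()).add(r)
    return list(groups.values())

print(cluster(["A", "B", "C"], {frozenset(("A", "B")): 0.8,
                                frozenset(("A", "C")): 0.3}))
# [{'A', 'B'}, {'C'}]
```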
Now, the next big question is:
Which record pair to crowdsource next such that the quality of clustering increases maximally?
Our experimental results with real-world datasets attest that PERC reduces the crowdsourcing cost by 50% compared to state-of-the-art techniques, e.g., DENSE [6], PC-Pivot [3], and MinMax [7]. These existing methods consider ad-hoc, local features to select the next questions, such as individual paths (e.g., MinMax), nodes (e.g., PC-Pivot), or the set of either positive or negative edges (e.g., DENSE). Hence, they generally fail to capture the strength of the entire clustering, resulting in a higher crowdsourcing cost to achieve a reasonable ER accuracy.
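To make the contrast concrete, the sketch below shows the kind of local heuristic such baselines embody: crowdsource the single most uncertain edge, i.e., the one whose probability is closest to 0.5. This is an illustrative baseline only, not the selection strategy of PERC, whose actual criterion evaluates the effect of a question on the entire clustering (see the paper [1] for details).

```python
# An illustrative *local* next-question heuristic: pick the edge whose
# probability is closest to 0.5 (the most uncertain one). Shown only
# for contrast with PERC's global, clustering-wide criterion.

def next_pair(graph: dict[frozenset, float]) -> frozenset:
    return min(graph, key=lambda e: abs(graph[e] - 0.5))

print(next_pair({frozenset(("A", "B")): 0.9,
                 frozenset(("A", "C")): 0.4}))  # frozenset({'A', 'C'})
```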
A snippet of our experimental results is given below.
For more information, see the CIKM 2017 full research paper [1], the VLDB 2018 demonstration paper, and the VLDB 2018 demonstration video.
Blog post contributed by: Vijaya Krishna Yalavarthi
References:
[1] V. K. Yalavarthi, X. Ke, and A. Khan. 2017. Select Your Questions Wisely: For Entity Resolution with Crowd Errors. In CIKM.
[2] L. Getoor and A. Machanavajjhala. 2013. Entity Resolution for Big Data. In KDD.
[3] S. Wang, X. Xiao, and C.-H. Lee. 2015. Crowd-Based Deduplication: An Adaptive Approach. In SIGMOD.
[4] J. Wang, G. Li, T. Kraska, M. J. Franklin, and J. Feng. 2013. Leveraging Transitive Relations for Crowdsourced Joins. In SIGMOD.
[5] J. Wang, T. Kraska, M. J. Franklin, and J. Feng. 2012. CrowdER: Crowdsourcing Entity Resolution. In VLDB.
[6] V. Verroios and H. Garcia-Molina. 2015. Entity Resolution with Crowd Errors. In ICDE.
[7] A. Gruenheid, D. Kossmann, S. Ramesh, and F. Widmer. 2012. Crowdsourcing Entity Resolution. Technical Report. ETH Zurich.
[8] R. Gomes, P. Welinder, A. Krause, and P. Perona. 2011. Crowdclustering. In NIPS.