Background

 

In many applications, such as scientific literature management, researcher search, and social network analysis, name disambiguation (working out who is who among people who share a name) has long been a challenging problem. The growth of scientific literature makes the problem both harder and more urgent. Although name disambiguation has been studied extensively in academia and industry, it has not been solved well because the data are noisy and same-name scenarios are complex.

 

Online academic search systems (such as Google Scholar, DBLP, and AMiner) hold large numbers of research papers and have become important, popular platforms for academic communication and paper search. However, because of the limitations of paper assignment algorithms, many papers are assigned to the wrong authors. In addition, these platforms collect a large number of new papers every day. How to assign papers to existing author profiles accurately and quickly, while keeping those profiles consistent, is therefore an urgent problem for current online academic systems and platforms.

 

Because academic systems hold very large amounts of data (AMiner has about 130,000,000 author profiles and more than 200,000,000 papers), the same-name situation is very complicated [1].

 

The competition asks participants to propose models that distinguish papers written by different authors who share the same name, using the detailed information of each paper and the links between authors and papers, and that achieve good disambiguation results.

 

Tasks

 

OAG-WhoIsWho has two tracks.

 

Track 1: Name Disambiguation from Scratch

 

Task Description: Given a set of papers whose authors all share the same name, participants are asked to return clusters of papers. Each cluster corresponds to one author, and different clusters correspond to different authors, even though those authors have the same name.

 

Suggested Methods: A common approach is to extract features from the papers, define similarity measures, and apply a clustering algorithm. The papers are then partitioned into clusters so that papers within a cluster are as similar as possible and papers in different clusters are quite different. Each cluster can then be treated as the work of a single author. [7] is a classic paper that reports such a clustering method. First, strong rules are applied (for example, if two papers share at least two co-authors, they belong to one cluster). Second, weak rules are used to combine the resulting clusters and improve recall. Other work addresses the limitations of traditional feature engineering by working in a low-dimensional semantic space: research papers are mapped to vectors in that space, and clustering methods are applied to the vectors [2].
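The two-stage idea (strong rules first, weak rules second) can be sketched roughly as follows. This is a minimal illustration and not the method of [7]: the field names 'id', 'coauthors' and 'keywords', the keyword-overlap weak rule, and the threshold values are all assumptions made for the example.

```python
from itertools import combinations


def find(parent, i):
    # Union-find lookup with path halving.
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def union(parent, a, b):
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra


def jaccard(a, b):
    both = a | b
    return len(a & b) / len(both) if both else 0.0


def cluster_papers(papers, min_shared_coauthors=2, weak_threshold=0.5):
    """Cluster papers that carry the same author name.

    papers: list of dicts with hypothetical keys 'id' (str),
    'coauthors' (set of names) and 'keywords' (set of terms).
    """
    parent = list(range(len(papers)))

    # Stage 1 (strong rule): papers sharing enough co-authors are merged.
    for i, j in combinations(range(len(papers)), 2):
        shared = papers[i]["coauthors"] & papers[j]["coauthors"]
        if len(shared) >= min_shared_coauthors:
            union(parent, i, j)

    # Collect the clusters produced by the strong rule.
    groups = {}
    for i in range(len(papers)):
        groups.setdefault(find(parent, i), []).append(i)

    # Stage 2 (weak rule, assumed here): merge clusters whose pooled
    # keyword sets overlap strongly, which raises recall.
    pooled = {
        root: set().union(*(papers[i]["keywords"] for i in members))
        for root, members in groups.items()
    }
    for a, b in combinations(list(groups), 2):
        if jaccard(pooled[a], pooled[b]) >= weak_threshold:
            union(parent, a, b)

    clusters = {}
    for i, paper in enumerate(papers):
        clusters.setdefault(find(parent, i), []).append(paper["id"])
    return sorted(sorted(c) for c in clusters.values())


papers = [
    {"id": "p1", "coauthors": {"A", "B", "C"}, "keywords": {"graph", "mining"}},
    {"id": "p2", "coauthors": {"A", "B"}, "keywords": {"graph", "network"}},
    {"id": "p3", "coauthors": {"D"}, "keywords": {"graph", "mining", "network"}},
    {"id": "p4", "coauthors": {"E", "F"}, "keywords": {"protein", "folding"}},
]
print(cluster_papers(papers))
```

In this toy input, p1 and p2 are joined by the strong rule, p3 is attached by the weak rule, and p4 stays on its own. A real system would likely tune the thresholds on labelled data and use richer weak rules than keyword overlap.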

 

Track 2: Continuous Name Disambiguation

 

Task Description: Online academic platforms and systems add a large number of papers every day. How to assign these new papers to existing author profiles accurately and quickly is the most urgent problem for an online academic system. The problem is therefore defined as follows: given a set of new papers and the paper lists of the authors already in the system, the goal is to assign the new papers to the correct authors.

 

Suggested Methods: The incremental disambiguation track differs from the cold-start disambiguation task (Track 1). It builds on existing author profiles and needs to allocate new papers to these profiles. A direct method is therefore to compare each existing author's papers with the newly added papers and extract traditional features such as collaborator, organization, or journal similarity. Traditional classifiers such as SVM can then be applied to these features. It is also possible to use vector representations in a low-dimensional space: by representing authors and papers as low-dimensional vectors, supervised learning methods can be used to extract features and train models.
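The direct method described above can be sketched as below. The sketch covers only the feature extraction step and replaces the trained classifier (such as an SVM) with a hand-set weighted score, purely to keep the example self-contained. The field names 'coauthors', 'org' and 'venue', the weights, and the min_score cutoff are assumptions for illustration.

```python
def jaccard(a, b):
    both = a | b
    return len(a & b) / len(both) if both else 0.0


def pair_features(paper, profile_papers):
    """Features comparing one new paper with one existing author profile.

    Each paper is a dict with hypothetical keys 'coauthors' (set),
    'org' (str) and 'venue' (str).
    """
    coauthors = set().union(*(p["coauthors"] for p in profile_papers))
    orgs = {p["org"] for p in profile_papers}
    venues = {p["venue"] for p in profile_papers}
    return [
        jaccard(paper["coauthors"], coauthors),      # collaborator similarity
        1.0 if paper["org"] in orgs else 0.0,        # organization match
        1.0 if paper["venue"] in venues else 0.0,    # journal/venue match
    ]


def assign(paper, profiles, weights=(0.6, 0.25, 0.15), min_score=0.2):
    """Return the best-scoring author id, or None if no profile fits.

    The weighted sum is a stand-in for a trained classifier's score;
    the weights and cutoff are illustrative, not tuned values.
    """
    best_id, best_score = None, min_score
    for author_id, profile_papers in profiles.items():
        feats = pair_features(paper, profile_papers)
        score = sum(w * f for w, f in zip(weights, feats))
        if score > best_score:
            best_id, best_score = author_id, score
    return best_id


profiles = {
    "a1": [{"coauthors": {"X", "Y"}, "org": "Tsinghua University", "venue": "KDD"}],
    "a2": [{"coauthors": {"Z"}, "org": "MIT", "venue": "ICML"}],
}
new_paper = {"coauthors": {"X", "Q"}, "org": "Tsinghua University", "venue": "WWW"}
print(assign(new_paper, profiles))
```

Returning None when no profile scores above the cutoff leaves room for papers by authors who do not yet have a profile. In practice the feature vectors from pair_features would be fed to a classifier trained on labelled (paper, profile) pairs rather than scored with fixed weights.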

 

References


[1].  Jie Tang, Jing Zhang, Limin Yao, Juanzi Li, Li Zhang, and Zhong Su. ArnetMiner: Extraction and Mining of Academic Social Networks. In Proceedings of the Fourteenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'2008). pp.990-998.

 

[2].  Yutao Zhang, Fanjin Zhang, Peiran Yao, and Jie Tang. Name Disambiguation in AMiner: Clustering, Maintenance, and Human in the Loop. In Proceedings of the Twenty-Fourth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'18).

 

[3].  Jie Tang, A.C.M. Fong, Bo Wang, and Jing Zhang. A Unified Probabilistic Framework for Name Disambiguation in Digital Library. IEEE Transactions on Knowledge and Data Engineering (TKDE), 2012, Volume 24, Issue 6, Pages 975-987.

 

[4].  Xuezhi Wang, Jie Tang, Hong Cheng, and Philip S. Yu. ADANA: Active Name Disambiguation. In Proceedings of 2011 IEEE International Conference on Data Mining (ICDM'11), pages 794-803.

 

[5].  https://biendata.com/competition/scholar2018/data/

 

[6].  The Microsoft Academic Search Dataset and KDD Cup 2013

 

[7].  Wang, F., Li, J., Tang, J., Zhang, J., & Wang, K. (2008). Name Disambiguation Using Atomic Clusters. In The Ninth International Conference on Web-Age Information Management (WAIM '08).

 

 

 

OAG-WhoIsWho Track 1

¥50,000 (~ $3496)

103 teams

Start: 2019-09-30

Final Submissions: 2019-12-02