Enhancing the Microsoft Academic Knowledge Graph via Author Name Disambiguation, Publication Classification, and Embeddings

Tracking #: 2779-3993

Michael Färber
Lin Ao

Responsible editor: 
Elena Demidova

Submission type: 
Full Paper
Abstract:
Although several large knowledge graphs have been proposed in the scholarly field, such graphs are limited with respect to several data quality dimensions such as accuracy and coverage. In this article, we present methods for enhancing the Microsoft Academic Knowledge Graph (MAKG), a recently published large-scale knowledge graph containing metadata about scientific publications and associated authors, venues, and affiliations. Based on a qualitative analysis of the MAKG, we address three aspects. First, we adopt and evaluate unsupervised approaches for large-scale author name disambiguation. Second, we develop and evaluate methods for tagging publications by their discipline and by keywords, facilitating enhanced search and recommendation of publications and associated entities. Third, we compute and evaluate embeddings for all 254 million authors, 210 million papers, 49,000 journals, and 16,000 conference entities in the MAKG based on several state-of-the-art embedding techniques. Finally, we provide statistics for the updated MAKG. Our final MAKG is publicly available at https://makg.org and can be used for the search or recommendation of scholarly entities, as well as enhanced scientific impact quantification.

Solicited Reviews:
Review #1
Anonymous submitted on 04/May/2021
Review Comment:

The authors propose an enhancement of the Microsoft Academic Knowledge Graph by tackling the three tasks of author name disambiguation, field of study classification and tagging, and the creation of embeddings. The MAKG is a valuable and publicly available resource, and any extensions or improvements on it are welcome. All three tasks are reasonable, and a detailed overview of related literature is provided. In general, I see as the main contribution of the paper the application of methods for solving different, well-known problems on a large-scale knowledge graph. However, the solutions mainly reuse existing approaches, and therefore the contributions seem rather technical and do not provide many novel insights into the tackled research questions. Therefore, I see several issues with the originality, the novelty, and the applicability of the results, as well as further issues, as discussed below.

- Evaluation: In the tasks of author name disambiguation and tagging, no comparisons to existing approaches are provided. This becomes particularly evident in the case of author name disambiguation, given that this section starts with a comprehensive overview of approaches from which only one is considered further. An important dimension is the scalability of the approaches, but such analyses are only provided in Section 3.5.3 (rather vaguely) and Table 21 (only briefly).

- Approach: The approaches to all three tasks are rather simple (which is motivated by efficiency reasons) and mainly use existing tools and models, so there is not much novelty (page 22: "custom implementations can also be developed, though such tasks are not suitable to our paper").

- Author name disambiguation: The author name disambiguation follows a very simple approach that relies on many hyperparameters set in an ad-hoc manner. As a reason, the authors argue for the efficiency of that approach and the lack of training data. Indeed, their potential training data only consists of 49 positive pairs. This leads to the following questions: Without blocking, are there still only 49 positive pairs? Given this low number of positive pairs, are the (important) analyses in Table 7 actually representative of the feature importance? Obviously, some features do not contribute at all (e.g., score_coauthors) or even lead to wrong results (e.g., score_titles). I would have expected an increased number of positive examples (e.g., by a manual extension of the positive pairs) and proper learning of the hyperparameters (i.e., feature weights). I have some more questions/comments on the proposed approach: (i) The role of the postprocessing in Figure 6 is not entirely clear - is it to merge blocks when they exceed the size of 500? How is that done? (ii) Does the final clustering algorithm (not the one used for evaluation) consider the ORCID labels? (iii) Is the co-author-based feature updated iteratively once disambiguation has been done? (iv) I agree with the discussion in Section 3.6 that more sophisticated blocking techniques would be interesting; they would also increase the originality of the approach.
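To make the point about learning the feature weights concrete, here is a minimal sketch. All feature names, weights, and pair data below are hypothetical (not the paper's exact setup); it merely contrasts a hand-tuned linear score with the same weights learned via logistic regression on labeled pairs:

```python
import math

# Hypothetical per-pair similarity features; label 1 = same author
# (e.g., an ORCID match). Each row: [score_coauthors, score_titles, score_venue].
pairs = [
    ([0.9, 0.2, 1.0], 1),
    ([0.8, 0.1, 1.0], 1),
    ([0.0, 0.3, 0.0], 0),
    ([0.1, 0.4, 0.0], 0),
]

# Hand-tuned variant: fixed weights and a fixed threshold (ad-hoc hyperparameters).
hand_weights = [0.5, 0.2, 0.3]
hand_pred = [int(sum(w * x for w, x in zip(hand_weights, xs)) >= 0.5)
             for xs, _ in pairs]

# Learned variant: logistic regression fitted by plain gradient descent, so the
# feature weights (and hence their importance) come from the labeled pairs
# instead of being set by hand.
weights, bias, lr = [0.0, 0.0, 0.0], 0.0, 0.5
for _ in range(2000):
    for xs, y in pairs:
        p = 1 / (1 + math.exp(-(sum(w * x for w, x in zip(weights, xs)) + bias)))
        err = p - y
        weights = [w - lr * err * x for w, x in zip(weights, xs)]
        bias -= lr * err
learned_pred = [int(sum(w * x for w, x in zip(weights, xs)) + bias >= 0)
                for xs, _ in pairs]
```

With more positive pairs, the learned coefficients would also give a principled answer to the Table 7 feature-importance question.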

- Tagging: At the end of Section 4.4.5, it says "statistics about the keywords are given in Sec. 6", although this is not the case. Also, there is no evaluation of the tagging, not even an anecdotal example of a single keyword.

- Embeddings: Usage scenarios of the generated embeddings are not provided (and I don't think that "our focus [is] on the MAKG and not on machine learning" (page 22) is a proper excuse when the goal is to make the MAKG accessible for machine learning). One example could be to use it as a feature for the author name disambiguation (obviously, on the original version of the MAKG). Also, the evaluation setting is unclear, e.g., in Table 22 (average MR for "Author"? Is it for the link prediction performance of all triplets where the head is an author?).
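To pin down what clarification is needed for Table 22: under my reading ("MR for Author" = mean rank over all test triples whose head is an author), the evaluation would amount to the following sketch (entity-type names and ranks are hypothetical):

```python
from collections import defaultdict

def mean_rank_by_head_type(results):
    """results: iterable of (head_entity_type, rank) pairs from link
    prediction, where rank is the position of the true entity in the
    ranked candidate list. Returns the mean rank per head entity type,
    i.e., one possible reading of the per-type MR values in Table 22."""
    ranks = defaultdict(list)
    for head_type, rank in results:
        ranks[head_type].append(rank)
    return {t: sum(r) / len(r) for t, r in ranks.items()}

# Hypothetical ranks for test triples, grouped by head entity type.
results = [("Author", 4), ("Author", 10), ("Paper", 1), ("Paper", 3)]
mean_rank_by_head_type(results)  # {"Author": 7.0, "Paper": 2.0}
```

If the table means something else (e.g., filtered vs. raw ranks, or tail-side ranking), that should be stated explicitly.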

- Statistics: Section 6 provides several statistics from which not all are directly relevant to the three tasks discussed in the paper. Also, some parts of the statistics are rather lengthy (e.g., the discussion of Figure 9), while other parts are missing (tags/keywords).

Dataset and website:
- The website's homepage is outdated: the statistics are from 2018, and only embeddings of papers are mentioned.
- Paper abstracts seem to be missing in the newest Zenodo version.
- Similar to the 2018 version, it would be great to also have sample files for the new version.
- As the authors already acknowledge ("We try to fix that in the near future"), the "knowledge graph exploration" has problems.
- The prefixes provided by the SPARQL endpoint are not synchronised with Figure 7 (/the schema on the website).
- I have shamelessly used the SPARQL endpoint to query for my own papers and found a) "duplicate" papers (arXiv and conference publication -> is there also a need for paper disambiguation?) and b) several paper titles such as "TPDL", "ESWC", and "ESWC (Satellite Events)" (which seem to be added as a second title to papers).
- In the resource view ("https://makg.org/entity/..."), prefixes are simply "ns1", "ns2", ..., which is technically fine, but not ideal.

Code (https://github.com/lin-ao/enhancing_the_makg):
- There are many lines such as "file 06 is used to generate data for table 27" in the readme, which is not very helpful and seems to refer to tables in a master thesis ("Code for the Master Thesis") instead of the reviewed paper. On the positive side, there is a proper execution script for the entity resolution (on the other hand, this is missing for the classification).

Other and minor:
- Page 3, Line 3 left (P3L3L): "239 [...] publications"
- List. 1: Does this mean an affiliation that has published 99% in biology, many citations and one machine learning paper is returned here? Citations of the respective machine learning papers would be more relevant.
- Figures 3, 4, 5 plus their "observations" (P4L45R): The observations can hardly be observed from the figures (observation 1) or not at all (observation 3).
- Figure 5: Which plot belongs to which y-axis? Do I understand correctly that there were 11 million Jena queries in total, but there were days when no Jena queries were issued ("avg. # days with min. 1 request" < 1)?
- P5L30L: "For instance" gives the impression there are more examples. It would be interesting to see them.
- P6L5L: "e.g., architecture" is unclear at this point (it is explained later)
- P8L46L: Isn't it also necessary to have a mapping from each author in A to an author in A~?
- P9L7R: "d" is not defined (only "sim")
- Section 3.6: This should be better connected to the passage before.
- Table 11: What do the numbers mean?
- P17L20L: "we prepare [our] training set"
- P18L18R: That part is repeated.
- Table 19: Which numbers are bold?
- P21L11R: "together;[ ]we"
- P22L37R: "select[ed] number"
- Table 24: I have not done the maths, but does it make sense to have 3 authors per paper and 3 papers per author, but 11 co-authors per author?
- P25L43L: "likely misleading" -> this could be checked manually for this specific example.
- P25L47R: How do outliers affect the median?
- Figure 11: There can't be a negative number of authors; the y-axis needs to be cut.
- Dataset names: There are different graphs involved (MAG, MAKG 2018, MAKG 2020 before enhancement and after; evaluation datasets, e.g., in Section 4.4.1) - clear names/abbreviations could help the reader.

Review #2
Anonymous submitted on 16/May/2021
Major Revision
Review Comment:

This paper describes the authors' work on enhancing the MAKG, a knowledge graph of large-scale scholarly data. More specifically, the authors have made improvements to the KG through 1) author name disambiguation, 2) classifying papers into 19 scientific domains, 3) tagging papers with keywords extracted from their abstracts, and 4) computing embeddings for the entities in the MAKG.

- Authors carried out very comprehensive work for enhancing large-scale MAKG from several dimensions.
- State-of-the-art methods from the corresponding fields have been applied for most of the steps.
- Reasonable evaluations have been carried out.

- As mentioned, this paper introduces many different works around the MAKG. However, my impression after reading it is that it contains too much content while no single part is thoroughly described/discussed. In many places, the paper is not self-contained, and some of the work mentioned is a bit disconnected from the focus of the paper.
- Regarding the author name disambiguation, the authors engineered a set of features, built a ground-truth dataset, and created a classifier by manually assigning weights to the features (this hand-tuning process is not described sufficiently). I do not understand why the weights and thresholds are hand-crafted rather than learned with supervised machine learning methods.
- The quality of the writing is satisfactory in general, but the clarity of some statements could be improved.

Detailed comments:
- The authors claim that the author name disambiguation is unsupervised, which I am not convinced of: since the authors tuned the parameters based on evaluation results on the ground-truth dataset, this process is very similar to training a supervised classifier.
- page 4, line 28 “We assume that additional and changing data sources of the MAG resulted in this change.” - this lacks supportive info, I think it’s not difficult to find supportive data for this assumption.
- Section 2.2 is very long and contains a lot of information, but most of it is too vague. It might be better to summarize and highlight the major impact of the MAKG instead of listing many things without showing how they work and how the MAKG was used. For instance, based on the description in Sec. 2.2, I cannot see how it is used in recommender systems, why it is discussed in enterprises, or what its impact was.
- P4L45 (short for page 4, line 45): “Except for in two months” - please be more specific and refer to the corresponding figure
- P4L50: The frequency of “more complex” queries (based on query length) is increasing. - it feels like this comes out of the blue. Please explain in more detail and show data to support it.
- A general comment that might be worth exploring in the future: the name pairs "Wang Wei"/"Wei Wang" and "Zhang Wei"/"Wei Zhang" likely denote the same names; this first-name vs. last-name ordering issue might come from the data extraction.
- Currently, Table 4 only indicates whether each approach is supervised; it could contain more information, e.g., a more descriptive approach name, the research question, the type of data dealt with, the information sources (e.g., features, data sources), etc.
- P9L15, out of curiosity: is "affiliations" a single-valued property by the definition of the MAKG ontology? Intuitively, it should not be.
- P10L32: "We use the version published in December 2019 for evaluation, though our final published ..." - this reads confusingly; please double-check this statement.
- The description of how the parameters were chosen is insufficient.
- P14L46: “higher level” -> lower level?
- Section 4.3 is very brief, maybe populate it with more information about the model setup.
- The approach used in Sec. 4.4.1 for creating dataset 2 could potentially be biased towards the domains with a larger number of subfields, e.g., if math naturally has fewer subfields than computer science: 0.99 (math) < 0.3 + 0.3 + 0.3 + 0.3 (computer science). Is this a reasonable approach?
- P18. Please add data/plot for the discussion regarding the influence of the number of training samples on the model performance.
- Sec 4.4.5 is not talking about evaluation at all, why is it under the “evaluation” section? Also, it seems that the performance of the keyword extraction approach is not evaluated sufficiently.
- Please define/explain terms such as 'head entity' and 'tail entity'.
- I do not see why mentioning the newly added properties and the sameAs links to Wikidata is necessary for this paper. Please either explain them in more detail (e.g., why and how this was done) or remove the irrelevant information for clarity.
- The analysis in Section 6 is interesting, however, I think it’s not all crucial for this work. I would prefer to see more assessment of the quality of the enhanced knowledge graph.
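The concern about dataset 2 in Sec. 4.4.1 can be illustrated with a toy calculation, under the assumption that subfield scores are summed per discipline (all scores below are invented for illustration):

```python
# Toy illustration of the suspected bias: a single very confident math
# subfield loses to several weakly scored computer-science subfields
# when scores are aggregated by summation per discipline.
subfield_scores = {
    ("math", "algebra"): 0.99,
    ("cs", "databases"): 0.3,
    ("cs", "networks"): 0.3,
    ("cs", "ai"): 0.3,
    ("cs", "hci"): 0.3,
}

totals = {}
for (discipline, _subfield), score in subfield_scores.items():
    totals[discipline] = totals.get(discipline, 0.0) + score

predicted = max(totals, key=totals.get)  # "cs", despite the 0.99 math score
```

If summation is indeed how the dataset was constructed, disciplines with many subfields are systematically favored; taking the maximum subfield score instead would avoid this particular effect.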

In general, I think this paper presents very valuable work. Meanwhile, the paper writing is a bit messy at this stage, the comments I gave are just anchors, the authors should go beyond the comments and revise the paper thoroughly.

Review #3
Anonymous submitted on 07/Jun/2021
Review Comment:

This paper is an attempt to enhance the Microsoft Academic Knowledge Graph. First, an approach for author name disambiguation is presented, followed by an approach to generate tags for publications to facilitate search/retrieval/recommendation. Finally, the authors apply entity embedding approaches to provide entity embeddings for the MAKG entities.

Although the research carried out is interesting and (can be) useful, the article does not present any new scientific contributions. In terms of originality and novelty, the paper is essentially based on previous methods. The approaches presented do not provide new insights or advances over the literature. Thus, the only contribution of this paper is a technical one, if any.

The structure and organisation of the paper are a bit awkward; it seems to be a compilation of previously conducted research, with the first approach (author name disambiguation) being extremely detailed while the remaining ones are superficially described and not well connected. The paper is also unnecessarily long.

The authors also make several unjustifiable claims like the ones below (many others can be found throughout the entire paper):
"Surprisingly, the MAKG contains more authors than publications" - Why is it a surprise? The authors should present the fact as this is not a discussion section.
"26,000 institutions is a low number", based on what information? The authors interpret these numbers without grounds.
"...dropped from 50,202 to 16,142. An obvious reason for this reduction is the data cleaning process. " - Would it be reduced to a third due to the cleaning process?
"We assume that additional and changing data sources of the MAG resulted in this change." - Evidence?

The author name disambiguation approach used by the authors is not well justified. The features were chosen empirically, and some of them apparently are not even reasonable, for example, the "title". How would the title help identify the author in a co-authored paper? It is obviously an inappropriate and inefficient feature in this context. The authors could have used additional databases from different fields, such as DBLP for computer science, for the evaluation. The results obtained are not representative and cannot be generalised. In addition, the method does not take advantage of semantic relationships. The fact that the technical contribution is meant to improve the MAKG does not by itself constitute a contribution to the semantic field. Again, the end product can be used by several communities, but it is not a direct contribution to the semantic community.

For the tagging approach, the authors fine-tune a BERT-based model and use TextRank to tag the papers. Again, neither method takes advantage of the semantics in the graph. Moreover, the authors do not provide a comparison with other methods. If the authors want to enhance the MAKG, then the results reported should be much improved. The methods chosen should be better motivated and described (uncased or cased, base or large model, ...).
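For reference, the TextRank family of keyword extractors the authors rely on boils down to ranking words in a co-occurrence graph; the following is a minimal stdlib-only sketch (window size, damping, and iteration count are generic defaults, not the paper's configuration), which also shows why such tagging is purely textual and uses no graph semantics from the MAKG:

```python
import re
from collections import defaultdict

def textrank_keywords(text, window=2, iters=30, damping=0.85, top_k=3):
    """Minimal TextRank-style keyword scoring: nodes are lowercased word
    tokens, edges connect words co-occurring within a sliding window,
    and scores are computed with unweighted PageRank iterations."""
    words = re.findall(r"[a-z]+", text.lower())
    neighbors = defaultdict(set)
    for i, w in enumerate(words):
        for j in range(i + 1, min(i + window + 1, len(words))):
            if words[j] != w:  # skip self-loops
                neighbors[w].add(words[j])
                neighbors[words[j]].add(w)
    score = {w: 1.0 for w in neighbors}
    for _ in range(iters):
        score = {
            w: (1 - damping) + damping * sum(
                score[n] / len(neighbors[n]) for n in neighbors[w])
            for w in neighbors
        }
    return [w for w, _ in sorted(score.items(), key=lambda kv: -kv[1])[:top_k]]

textrank_keywords("graph embeddings improve graph search and graph completion")
# the highest-degree word, "graph", ranks first here
```

An evaluation of the tagging, even against such a trivial baseline, would substantiate the keyword quality.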

The third approach used for generating the embeddings for authors, papers, ... is again not well-motivated. The approach section is superficial, and the main goal is unclear. How does it contribute to enhancing MAKG? The authors should have a better focus in their paper and run in-depth experiments.

Although the authors claim that they are improving the MAG graph, the effect of the proposed approaches on the graph is unclear. Overall, the description is superficial, authors make several empirical decisions, and the experiments carried out are not extensive. There is a lack of novelty and the contribution is marginal.