ScienceON Knowledge Graph System: Exploring New Frontiers in Science and Technology Information Integration System

Tracking #: 3691-4905

Authors: 
Chanuk Lim
Nam-Gyu Kang
Suhyeon Yoo
Hyun Ji Jeong
sungsu

Responsible editor: 
Guest Editors KG Construction 2024

Submission type: 
Full Paper

Abstract: 
The increasing complexity and volume of scientific and technological data necessitate advanced tools for effective data-driven analysis. Knowledge Graphs, with their capacity to encapsulate complex relationships among interconnected entities, have emerged as pivotal structures for organizing this vast amount of information. They enable a deeper understanding and exploration of data across various domains, notably in science and technology where the rapid proliferation of research outputs presents both opportunities and challenges. This paper presents the ScienceON Knowledge Graph System, a comprehensive framework designed to address integral challenges of integrating and analyzing scientific and technological data. We have developed the comprehensive infrastructure of ScienceON, a data ecosystem that harmonizes a wide spectrum of scientific and technological information. This encompasses everything from national R&D projects, scholarly papers, and patents to reports, author profiles, organizational details, keywords, and thematic categories. Our approach significantly advances the field not only by streamlining the aggregation of data via an Extract, Transform, Load process but also by facilitating the creation of a sophisticated knowledge graph. This knowledge graph meticulously interlinks research data, incorporating extensive metadata to accurately reflect the complex web of relationships within the science and technology domains. Our contributions are threefold: Firstly, we detail the creation of the ScienceON data ecosystem, highlighting an automated pipeline that ensures ongoing updates and expansion of data. Secondly, we describe the design of the ScienceON Knowledge Graph, which provides a detailed and interconnected representation of scientific and technological data. Lastly, we explore the application of the ScienceON Knowledge Graph in conducting graph-related experiments and in developing user-centric applications, demonstrating its versatility and utility. By employing rigorous data curation practices and utilizing the Resource Description Framework for data representation, we ensure the high quality and accessibility of our dataset, positioning the ScienceON Knowledge Graph as a gold standard in the realm of science and technology knowledge management. This initiative not only augments data management practices but also fosters the development of innovative applications and services, enhancing access to and understanding of the vast landscape of science and technology.
Tags: 
Reviewed

Decision/Status: 
Reject

Solicited Reviews:
Review #1
By Jose Emilio Labra Gayo submitted on 30/Jun/2024
Suggestion:
Reject
Review Comment:

The paper addresses the important domain of science for the development of knowledge graphs. It describes how the authors created a Knowledge Graph called ScienceON that represents information about science with a special focus on Korean resources.

One issue with the paper is that it has been submitted as a full paper, which according to https://www.semantic-web-journal.net/reviewers implies that it should contain original research results and be reviewed along the dimensions of originality, significance of the results, quality of writing and completion of data artifacts. In my opinion, the paper would probably fit better under the “Application report” type, because it presents a system called “ScienceON” (which seems to be a very nice system), in which case the paper would be reviewed based on the dimensions of quality, importance and impact.

As such, one problem of the paper as written is that it is not clear whether it is presenting new research results (section 5 provides some measurements) or the architecture of the ScienceON system, which would be more in the style of an application report.

In my opinion, if the authors focused their paper on the research results, then they should put more emphasis on the experiments they did and on how those results are original and improve existing systems. In that case, the paper could be seen as one that includes original research results, with a better description of a research question and of the main contributions in document classification, link prediction, network analysis, etc. However, it seems that the paper is not really focused on that and puts more emphasis on the ScienceON system. In that case, the paper should describe better the design decisions that have been taken and the usability of the system, perhaps adding some statistics such as the number of users, the number of requests to the API, how many external applications depend on the ScienceON system, and whether the SPARQL endpoint is public or not. That would be a different type of paper, and it should be submitted as an application report instead of a full paper with original research results.

I followed the link to the GitHub repo, and it does not explain how to run the code or what the code is for. I would expect it to contain the source code of ScienceON (which, in the end, is a web system), along with information about the SKG ontology and the SPARQL endpoint, but I think it does not: it contains three folders of Python code for node classification, link prediction and graph visualization, without descriptions of how to run that code or what it does.

I proceed with a more detailed review following the structure of the paper:
- Title: I think the title is a bit too long and not direct enough… is the part starting with “Exploring new frontiers…”, which looks a bit too grandiloquent, really necessary?
- Abstract: I think the sentence “Our approach significantly advances the field not only by…” is not appropriate for an academic paper, which should contain falsifiable sentences and avoid such claims.
- Abstract: I would avoid the sentence “By employing rigorous data curation practices and utilizing the Resource Description Framework for data representation, we ensure the high quality and accessibility of our dataset…” because RDF by itself does not guarantee the high quality or accessibility of any dataset; that depends on how it is used.
- Abstract: the sentence “...enhancing access to and understanding of the vast landscape of science and technology” is also not falsifiable; if the authors claim it, they should add some studies about how their system enhances them.
- Page 3, the sentence “The rigorously vetted data is highly accurate and publicly accessible, establishing the ScienceON Knowledge Graph as the gold standard in the realm of science and technology.” should be avoided because it is not easy to prove, and I think it is not appropriate for an academic journal.
- Page 3, RDF is not a standard endorsed by the W3C; it is a recommendation.
- Page 3, “This initiative is distinguished by the introduction of an intuitive interface, which significantly enhances the users…”: how can you prove this? If it is true, did you do some usability testing? If so, add a citation or describe it.
- Page 5, “In a(n) education domain…”
- Page 5, about related work, contains a lot of references, but the authors should improve the description of those references and of how they relate to their proposal.
- I think the whole related work section needs a rewrite that describes better what the different aspects of the related work are and how they differ from the work done by the authors. There has been some recent interest in applying knowledge graphs in the research domain, like the ORKG, and other projects at a national level which also collect information from research institutions, like the Hercules one in Spain (https://repository.publisso.de/resource/frl%3A6423282, https://dspacecris.eurocris.org/bitstream/11366/1958/8/Hern%C3%A1ndez-Mo...). Some of those works have been presented in the Damalos workshop: https://zbmed.github.io/damalos/
- Section 3.2 is titled “Knowledge representation”, but I think it is more focused on “knowledge representation learning”. I found the section not easy to read, as it seems to assume several technical concepts that are not introduced, like “transductive conditions”, “spectral based methods”, “homogeneous graphs”, etc.
- I think the related work section is missing a subsection about similar works and approaches that integrate data from different sources and publish them as RDF, together with a comparison of the authors' approach with those and a discussion of alternatives.
- Page 8, “a comprehensive data ecosystem…” why is it comprehensive?
- “...such as data discovery, acquisition, preparation, knowledge discovery and sharing” seems to assume that there is something called data discovery which is different from knowledge discovery… is there something different?
- “It aggregates science and technology”: I think the word “extensive” is subjective and could be avoided in an academic paper.
- “This comprehensive approach allows ScienceOn to serve as a pivotal resource…” is non-academic.
- “Ensuring excellent extensibility”... why not just “improving extensibility”?
- Section 4.2 about data acquisition indicates that the system is based on an ETL process without providing details about that process, which could be relevant if we consider that one key point of the project is how to integrate heterogeneous sources of information into an RDF-based system. Some questions that could be tackled are: how often do the authors run the ETL process? What technologies do they employ? How can they ensure the quality of the transformations or of the mappings between the original sources and the target RDF data? I think answering those questions is relevant for a paper in a special issue about Knowledge Graph Construction, and although the project on which the current paper is based seems to have addressed them, the paper lacks details about those aspects. What design decisions have been employed, and what alternatives have been considered?
- At the end of section 4.3, the authors say “with this rigorous data preparation, we are setting the stage for…”. In my opinion, although the data preparation step may be required, I would not call it “rigorous”, as it may generate errors during the process. The authors also say that “our Knowledge Graph remains an accurate and current reflection of scientific advancements” and, again, I think the process employed for data preparation can generate errors and the accuracy is approximate. I would avoid those qualifiers so the paper's writing style is more scientific.
- Section 4.4.1, “to aid in the construction of the ontology we utilize Protégé, a premier tool for ontology modeling…”: why do you consider it a premier tool? What alternatives did you consider? The authors go on to say it is “known for its effectiveness in facilitating complex ontology designs”, which is again a non-academic statement.
- The authors say that they apply SHACL to validate the RDF. Are those SHACL shapes public? I looked at the GitHub repo and did not find any reference to them (a sketch of the kind of artifact I would expect is given after this list).
- The authors also talk about using R2RML mappings, but no further details are given, and I did not see those mappings in the GitHub repo.
- The sentence “We selected GraphDB for its exemplary performance” is not appropriate in an academic paper unless it is supported by a citation.
- “By adhering to W3C standards…”, should be W3C recommendations
- “Facilitating advanced knowledge discovery…” How did the authors measure it?
- Section 4.4.2 about the SKG ontology…is it public? How does that ontology differ from other ontologies in the research domain?
- The sentence “The incorporation of external schemas such as RDFS, XSD and OWL…” is dangerous as it is including XSD and RDFS/OWL in the same box, while they are different technologies…XSD is indeed more related to ShEx or SHACL as a data validation language, rather than as an inference or ontology definition language.
- The authors mention “provenance” only briefly: what approach do they use for provenance? This can be an important aspect of this kind of application, and I think it would deserve some more explanation.
- The ontology as it is designed seems to re-invent a lot of concepts which could be reused/mapped from other ontologies. Did the authors follow some guidelines/methodology for developing the SKG ontology? At least, I would advise to try to reuse concepts from other ontologies or map to them.
- Section 4.5, “The RESTful architecture ensures scalability, simplicity and flexibility…” is a non-academic sentence.
- I would argue that the claim “By facilitating data retrieval in XML format…ensures that both raw data and value-added knowledge are structured and readily accessible to end-users…” is not true: XML by itself does not ensure it, and I would even expect a web service to provide data in more formats, like JSON, JSON-LD or RDF.
- “This robust infrastructure is…” is a non-academic sentence.
- I looked at the ScienceON website and at the link to the ScienceON API Gateway, and I found it difficult to find information without Korean language skills. I think it should be possible for machines to access that information more easily, and currently it is not so easy.
- Section 5.1.1. The reference [67] should be before the dot.
- The sentence “Which has demonstrated superior performance” is non-academic.
- In formula 2, the superscript (l) appears both with and without parentheses; I think the authors should unify it.
- In the explanation of formula (3), “h is an embedding vector” should probably be “h_i”?
- “Sigma is a sigmoid function” and “y” (should it be “y_i”?) “is a label vector of training nodes”. (A consistent rendering of the standard forms these symbols suggest is sketched after this list.)
- In general, section 5.1 seems a bit disconnected from the previous sections; I felt it was a jump from the previous one, and some of the formulas are included in a way that makes them difficult to understand. Why those values in the formulas? Are those formulas based on some previous papers? If so, I think it should be made clearer.
- Area under the ROC curve: the acronym ROC is not defined previously.
- The paragraph that starts with “For metapath2vec, we set the window size to 7…” contains several magic numbers like 7, 100, 128, etc. which are not explained. Why those numbers? (A sketch of where such hyperparameters typically enter is given after this list.)
- Sections 5.1.3 and 5.1.4 seem to provide some interesting results, which I am not sure could be presented as the research results of this paper; for that, I think the paper should be rewritten in another style, emphasizing the research problem of document classification more clearly and focusing on that aspect.
- “From table 5, HGT represents the model that learns the entire ScienceON Knowledge Graph and HGT-Subgraph represents the model that learns a subgraph of the ScienceON…” What subgraph? I think the authors could explain better which subgraph they are talking about, as well as what they say later about oversmoothing and expensive computation.
- According to the sentence “Through this analysis, the ScienceON Knowledge Graph emerges not only as a repository of data, but also as a dynamic map of the confluence of scientific disciplines and technological innovation…” it seems that the visualizations that appear in the paper are also available to the users of ScienceOn…if that’s true, maybe adding some information about how they can be obtained would be useful.
- I tried to access the ScienceON website and, as it is mainly in Korean, I was not able to use it. I would suggest adding internationalization support and enabling English at least for international users, although I understand that this option is not mandatory and may be outside the needs of the system if the target audience does not require it.
- Is the SPARQL endpoint public? If not, maybe providing an account for the reviewers to play with it would be a good idea. I tried to run the SPARQL queries, but I did not find information about the SPARQL endpoint (a minimal probe of the kind meant here is sketched after this list).
- Following that line: did the authors try to reuse data from other semantic web portals whose data is already available via SPARQL, like Wikidata, ORKG, etc.?
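
For illustration of the SHACL point above, a minimal validation sketch of the kind of artifact one would expect in the repository; the skg: class and property names are hypothetical, chosen only to match the ontology's namespace, not the authors' unpublished shapes:

    # Hypothetical sketch: validating ScienceON-style RDF against a SHACL shape.
    # The skg: terms below are illustrative assumptions.
    from rdflib import Graph
    from pyshacl import validate

    SHAPES = """
    @prefix sh:  <http://www.w3.org/ns/shacl#> .
    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    @prefix skg: <https://scienceon.kisti.re.kr/ontologies/skg#> .

    skg:PaperShape a sh:NodeShape ;
        sh:targetClass skg:Paper ;
        sh:property [
            sh:path skg:doi ;
            sh:datatype xsd:string ;
            sh:maxCount 1 ;
        ] .
    """

    DATA = """
    @prefix skg: <https://scienceon.kisti.re.kr/ontologies/skg#> .
    <https://example.org/paper/1> a skg:Paper ;
        skg:doi "10.1234/abc", "10.1234/def" .
    """

    data_g = Graph().parse(data=DATA, format="turtle")
    shapes_g = Graph().parse(data=SHAPES, format="turtle")

    conforms, _, report = validate(data_g, shacl_graph=shapes_g)
    print(conforms)  # False: the second DOI violates sh:maxCount
    print(report)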
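
Regarding the notation issues in formulas (2) and (3), a consistent rendering of the standard forms these symbols usually denote would be the following; this is an assumption about the intended notation, and the paper's exact formulas may differ:

    H^{(l+1)} = \sigma\left( \tilde{D}^{-\frac{1}{2}} \tilde{A} \tilde{D}^{-\frac{1}{2}} H^{(l)} W^{(l)} \right)

    \mathcal{L} = - \sum_{i \in \mathcal{V}_{\mathrm{train}}} \left( y_i \log \sigma(h_i) + (1 - y_i) \log\left(1 - \sigma(h_i)\right) \right)

Here the layer index (l) is written uniformly with parentheses, and the per-node subscript i appears on both h and y, as the comments above suggest.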
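
On the metapath2vec hyperparameters, the sketch below shows where numbers such as 7, 100 and 128 typically enter a standard implementation (PyTorch Geometric's MetaPath2Vec); the toy metapath and edge data, and the reading of 7 as the context size and 100 as walks per node, are my assumptions:

    # Where the unexplained hyperparameters could plug in; the metapath
    # (author -> paper -> author) and all data below are illustrative.
    import torch
    from torch_geometric.nn import MetaPath2Vec

    edge_index_dict = {
        ("author", "writes", "paper"): torch.tensor([[0, 1], [0, 0]]),
        ("paper", "written_by", "author"): torch.tensor([[0, 0], [0, 1]]),
    }
    metapath = [
        ("author", "writes", "paper"),
        ("paper", "written_by", "author"),
    ]

    model = MetaPath2Vec(
        edge_index_dict,
        embedding_dim=128,       # the unexplained 128
        metapath=metapath,
        walk_length=50,          # not reported in the paper; placeholder
        context_size=7,          # the "window size" of 7
        walks_per_node=100,      # the unexplained 100, if walks per node
        num_negative_samples=5,  # placeholder
    )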
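
On the endpoint question, a minimal probe of the kind meant here could look as follows; the endpoint URL is a placeholder, since no public endpoint is documented in the paper or the repo:

    # Hypothetical probe of a public ScienceON SPARQL endpoint.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://scienceon.kisti.re.kr/sparql")  # placeholder URL
    sparql.setQuery("SELECT (COUNT(*) AS ?triples) WHERE { ?s ?p ?o }")
    sparql.setReturnFormat(JSON)
    print(sparql.query().convert())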

In my opinion, the paper is interesting, and applying knowledge graphs to better represent information about science is necessary. However, I would recommend that the authors rewrite the paper either as a more research-oriented paper focused on the originality of their results, or as an application report focused on the architecture of the ScienceON system and presenting evidence about the impact of that system.

Review #2
By Michael Färber submitted on 02/Jul/2024
Suggestion:
Major Revision
Review Comment:

The article is reviewed according to the criteria proposed by the SWJ:

1. Originality:
The article deals with integrating and analyzing scientific and technological data in RDF. However, its originality is limited: several similar platforms and knowledge graphs exist (e.g., SemOpenAlex.org, LinkedPapersWithCode.com, KAKEN, MLSea, CS-KG, etc.). The authors insufficiently describe the gap between ScienceON and these initiatives.

2. Significance of the Results:
The authors fail to describe important characteristics of the ScienceON platform and the resulting knowledge graph, such as data quality dimensions (e.g., data coverage, covered domains, update frequency, etc.). In addition, it is unclear whether the aim is to replace or aggregate existing scholarly knowledge graphs, and to what degree URIs are interlinked with other knowledge graphs (which seems to be the case only to a limited degree). While the motivation for creating the ScienceON platform and knowledge graph is obvious, given the data provided by the Korean organizations, it remains unclear what the exact target user group will be and how these users will use or have used the system so far. In addition, the system and the knowledge graph are evaluated on tasks such as link prediction and node classification, but it is not shown that the users of the system actually need these applications. A requirement analysis/study in the first place could help.

3. Quality of Writing:
While the grammar of the writing is fine, the structure and conciseness of the text can be improved. Certain sections could be shortened or restructured for better readability and coherence (see below). The related work section appears incomplete and outdated, missing recent scholarly knowledge graphs.

Overall, the article focuses on an interesting and important topic. However, for acceptance, the authors would need to describe more clearly the gap to related works (also w.r.t. the different initiatives worldwide concerning scholarly platforms) and show in the evaluation why a knowledge graph is needed and how it helps in the actual use cases (instead of hand-picked scenarios evaluated without knowing which scores are needed).

Detailed Comments:

Abstract:
* The expression "scientific and technological data" might be confusing. I would suggest defining or explaining it.

Introduction:
* The list of related scholarly knowledge graphs appears to be incomplete and outdated. Papers such as SemOpenAlex, LinkedPapersWithCode, MLSea, ORKG, and DSKG might provide notable references.
* The mentioned challenges are kept very generic. For the first challenge, "Unexplored potential of KGs," concrete examples might help. For instance, there are already commercial systems (e.g., Dimensions) and free alternatives (e.g., SemOpenAlex). For the second challenge, the authors fail to outline what exact elements are missing in existing knowledge graphs and to suggest possible solutions.
* It remains unclear what the data coverage, covered scientific disciplines, languages, and update frequency are (see knowledge graph data quality dimensions). Describing such key factors is important for a better understanding.
* It is unclear whether the ScienceON knowledge graph is intended to replace existing scholarly knowledge graphs, aggregate them, or focus on knowledge management.

Table 1:
* The selection criteria for the knowledge graphs listed in Table 1 should be clearly defined. The table appears inconsistent with the text in the Introduction, as different knowledge graphs are mentioned. It also remains unclear whether scalability or semantics are important aspects or contributions of ScienceON.

Motivation and Background:
* This section can be written more concisely.
* The mention of "node classification and link prediction" should include explanations of why these tasks are performed and outline specific application use cases where knowledge graphs are needed.
* It remains unclear if ScienceON is a completely new initiative or if previous versions exist. This might be written more clearly.
* Related work concerning other initiatives (e.g., in Japan, Germany, etc.) may be included.

Related Work:
* A clearer structure would help in this section, e.g., with a subsection on scholarly knowledge graphs. It is also necessary to clarify what is missing in existing solutions and why the ScienceON knowledge graph is necessary. The use cases mentioned for scholarly knowledge graphs appear random to some degree.
* The mention of knowledge graph embeddings and GNN embeddings should be relevant to the later content and presented more systematically, considering other existing methods.
* Side note: existing GNNs designed for heterogeneous graphs do not yet automatically use all RDF data (e.g., RDF datatype properties); see the AutoRDF2GML paper.

Section 4:
* It might be interesting to know to what degree entities are interlinked with other knowledge graphs and how many ORCID iDs are provided in the knowledge graph (as a percentage of all authors).
* Given the journal categorization, how are new journals included, and why are the concepts from OpenAlex or other scholarly knowledge graphs not reused?
* Section 4.2 is kept very generic. The authors might describe better how often and how much data is processed.
* The keyword extraction step does not necessarily need LLMs and could use traditional approaches, as done for other knowledge graphs like MAG/MAKG+. A comparative evaluation would be helpful. The authors might also want to consider the use of schemas such as the one from ACM for categorizing papers.
* The reuse of vocabulary, the use of resolvable URIs, and the provisioning of metadata files (e.g., a VoID file; a sketch follows after this list) should be detailed.
* The API format might be justified and the insufficiency of SPARQL for complex data analytics may be discussed.
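
To illustrate the metadata point above, a minimal VoID description of the kind that could be provisioned; all URIs and counts below are placeholders, not actual ScienceON values:

    # Minimal VoID description a dataset like ScienceON could publish;
    # the endpoint URI and triple count are placeholders.
    from rdflib import Graph

    VOID = """
    @prefix void: <http://rdfs.org/ns/void#> .
    @prefix dcterms: <http://purl.org/dc/terms/> .

    <https://scienceon.kisti.re.kr/.well-known/void#dataset>
        a void:Dataset ;
        dcterms:title "ScienceON Knowledge Graph" ;
        void:sparqlEndpoint <https://scienceon.kisti.re.kr/sparql> ;
        void:vocabulary <https://scienceon.kisti.re.kr/ontologies/skg#> ;
        void:triples 1000000 .
    """

    g = Graph().parse(data=VOID, format="turtle")
    print(len(g))  # 5 triples parsed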

Section 5:
* The description of GNNs can be removed, as GNNs are not the focus of the paper.
* The dataset paragraph should clarify that the entire KG is very large, and only a subgraph is used for experiments. The size of this subgraph as a percentage of the whole KG should be mentioned.
* The evaluation should demonstrate the necessity of using RDF/knowledge graphs. The authors do not seem to do this directly.
* R-GCN might be another candidate for a GNN method; a minimal sketch follows after this list.
* What are the specific RDF properties considered for link prediction?
* Given the very high values reported in Table 6, details about the graph (e.g., size) should be mentioned.
* The tasks and evaluation results could be compared with the tasks and evaluation results of other scholarly knowledge graphs.
* The paper should show the benefits for end users and clarify who the target user groups are. So far, it is not clear who should use the system (and how much it is used by them) and in which way.
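
To make the R-GCN suggestion above concrete, a minimal encoder sketch in PyTorch Geometric; all dimensions and the relation count are placeholders, not statistics of the ScienceON graph:

    # R-GCN encoder sketch for link prediction on a multi-relational graph.
    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import RGCNConv

    class RGCNEncoder(torch.nn.Module):
        def __init__(self, num_nodes, hidden_dim=128, num_relations=10):
            super().__init__()
            self.emb = torch.nn.Embedding(num_nodes, hidden_dim)
            self.conv1 = RGCNConv(hidden_dim, hidden_dim, num_relations)
            self.conv2 = RGCNConv(hidden_dim, hidden_dim, num_relations)

        def forward(self, edge_index, edge_type):
            x = F.relu(self.conv1(self.emb.weight, edge_index, edge_type))
            return self.conv2(x, edge_index, edge_type)

    # Candidate links can then be scored with, e.g., a DistMult decoder
    # over the node embeddings returned by this encoder.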

Section 6:
* It would be helpful to include the link to the online demo in this section.
* It seems that the end users need to operate with SPARQL. The authors need to clarify if this is the (primary) retrieval language and if this skill (SPARQL) can be expected from the end users.

Conclusion:
* The conclusion can be shortened and made more concise.

Review #3
By Silvio Peroni submitted on 02/Jul/2024
Suggestion:
Reject
Review Comment:

In this article, the authors introduce ScienceON, a knowledge graph dedicated to publishing scholarly and research information about publications, projects, people involved in the process, etc., in RDF and making such data available to all via the Web. The data contained there come from several sources containing different types of information that focus mainly on South Korean publications and research endeavours. The paper is also enriched by some experiments that show the use of these data for addressing graph-based tasks and a list of applications built upon such data to make them available to non-expert users.

These kinds of works, i.e. creating scholarly knowledge graphs that are open, accessible and reusable, are crucial for the community since they are the steps needed to aim at building a distributed system that, in principle, contains information about all the research information produced around the globe, following the vision that was introduced, probably for the very first time, by Robert Cameron in his work published in First Monday (https://doi.org/10.5210/fm.v2i4.522). In addition, having a knowledge graph on publications of countries and/or disciplines that usually are excluded or not fully represented in well-known proprietary indexes has even more value since it offers more equitable access to science.

While I praise these particular activities and, thus, the topic of this work, after reading the article with tremendous interest I think it has several issues in its present form. The first one is about its narrative. Analysing its current organisation, it seems to merge two different contributions, i.e. the knowledge graph developed and all the experiments done on that knowledge graph (i.e. section 5). The authors justify section 5 by claiming that it supports the claim of the quality of ScienceON - the rationale here is that since we can run experiments on graph-related tasks on the knowledge graph, such a knowledge graph must be qualitatively sound. However, the point is that these experiments could also be run on other knowledge graphs, obtaining similar results. Thus, they do not justify the claim on the quality of ScienceON and appear to be just an exercise with no added value to the existence of ScienceON. Indeed, it would have been better to have two separate papers here: one about ScienceON, the workflow for constructing it, etc., and the other about the experiments on graph-based tasks, where ScienceON is used as one possible knowledge graph for application. I would suggest removing all the parts related to the topic in section 5 - which also affects part of the related works (i.e. section 3.2) that appears suddenly without a clear justification - and focusing entirely on the knowledge graph construction, providing even more details when needed.

As follows, other points that should be considered as well.

* In the abstract, the authors say that the ScienceON Knowledge Graph should be considered a gold standard. This definition is not appropriate for several reasons. First, to claim something is a gold standard, one has to demonstrate its full quality and that it is (in principle) free of issues. However, according to my experience, it is kind of impossible to claim that for a KG of scholarly data due to the heterogeneity, possible mistakes, and coverage that any KG of this kind has. The point here is that no scholarly KG is able to have perfect coverage - actually, all of them are usually incomplete by definition, either because they explicitly put a threshold on what to include and what not to include or because the source material does not allow you to cover everything and/or may contain mistakes. Second, we usually talk about a gold standard in the context of a precise experiment to measure the results of addressing a particular task. In the tasks shown in the paper, how can the authors demonstrate that ScienceON is, for instance, better than OpenAIRE, OpenCitations, and OpenAlex?

* Table 1 in the introduction shows the perceived superiority (in terms of relations and entity types) that ScienceON introduces compared with other KGs. First, I would have expected this table in the related works section. Second, and most importantly, such a comparison lacks several well-known KGs for scholarly data developed and used systematically in the community, for instance, OpenCitations, OpenAIRE Graph, OpenAlex, CORE, and the Ukrainian Open Citation Index, to mention a few. Adding these additional KGs would make the comparison fairer and show how things that are missing in the ScienceON KG (publishers, abstracts, several different types of publication entities, etc.) are handled in those KGs.

* Another aspect that needs to be clearly stated is the coverage of the data compared to other KGs. In the literature, there are KGs that are either multi-disciplinary or mono-disciplinary, based primarily on English scholarly literature, containing information about a specific country, etc. According to my understanding of the text, it seems that ScienceON is primarily dedicated to South Korean scholarly literature, which is good since no other database provides such coverage of the research information for that country. However, saying that the system is "designed to construct a comprehensive and systematic KG to address the challenges of data-driven analysis within the science and technology domains" suggests that the resource is better than all the others available online. Is that the case? I think the claim should be softened a bit here.

* The license associated with the data we can see and download via the APIs needs to be clarified. The authors claim they have "free copyright usage," but it is rather unclear what I can do with them. Can I use them and mix them up with other data for research purposes? Do they allow me to do commercial activities with them? Thus, it is essential to specify a license to clarify that formally.

* Related to the previous point, I would strongly suggest that the authors and the ScienceON infrastructure provide evidence of following shared and international standards for supporting their "openness" and correct (and expected) availability of the data they provide. To this end, my suggestion would be to evaluate ScienceON against, at least, the Principle for Open Scholarly Infrastructure (POSI), the FAIR principles for data management and stewardship (FAIR), and the TRUST Principles for digital repositories (TRUST). Another well-known and complete assessment framework would be the FOREST Framework for Values-Driven Scholarly Communication (FOREST).

* In section 4, the authors introduce the main components of ScienceON. They say that the architecture provided "facilitates the progression from diverse data sources to the extraction of actionable knowledge". However, the various sources used seem to contain different kinds of data – KISTI Data Center refers to papers and authors, AccessON refers to open access information, KIPRIS contains patents, etc. Thus, even if there are several data sources, they seem complementary, i.e., the same information does not come from diverse sources. This essentially simplifies the ingestion of new data since, if multiple sources containing similar information are used, one has to handle the possibility that entities (e.g. papers) are present in different sources and, to ingest such data correctly, one has to develop deduplication approaches that increase the level of complexity of the ingestion process. For instance, this is the case for both OpenCitations and OpenAIRE. Thus, the question is: have the authors developed approaches to handle these situations?

* According to what is described in the text, it seems that only DOIs are considered for identifying papers included in ScienceON. Thus, what happens to all the papers that do not necessarily have a DOI? Are they excluded? Are they handled in some way? If so, how? In addition, is there a plan to also consider other relevant identifiers? Also, are only papers and journals considered, as mentioned in Table 2? What about books (which are still the primary publication in the humanities domain, for instance)? Moreover, how are books identified if only DOIs are used, considering that books usually come with one or more ISBNs associated?

* When you talk about authors and affiliations associated with papers, it is not clear how this is handled in the data. Technically speaking, affiliations are a tripartite relation that connects an article with an author having a particular affiliation specified for such an article. How is this handled in the RDF? According to the diagram in Figure 4, there is only a relation between an author and an organisation. If so, how can I answer a simple question like: what is John Doe's affiliation in the context of paper A? (A sketch of one possible pattern follows below.) For more insight into this problem, I would suggest seeing https://doi.org/10.1145/2362499.2362502.
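
For illustration, one common pattern introduces an intermediate node so that the affiliation is contextualised by the paper; all terms below are hypothetical, not taken from the SKG ontology:

    # One possible RDF pattern for paper-contextual affiliations, using
    # an intermediate "authorship" node; every term here is hypothetical.
    from rdflib import Graph

    DATA = """
    @prefix ex: <https://example.org/> .

    ex:paperA ex:hasAuthorship ex:authorship1 .
    ex:authorship1 ex:author      ex:johnDoe ;
                   ex:affiliation ex:orgX .
    """

    g = Graph().parse(data=DATA, format="turtle")

    # "What was John Doe's affiliation in the context of paper A?"
    QUERY = """
    PREFIX ex: <https://example.org/>
    SELECT ?org WHERE {
      ex:paperA ex:hasAuthorship ?a .
      ?a ex:author ex:johnDoe ;
         ex:affiliation ?org .
    }
    """
    for row in g.query(QUERY):
        print(row.org)  # https://example.org/orgX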

* Looking at the documentation and links provided, I could not find explicit documentation about the data model used to organise and expose the data. Is there an ontology? Is it documented? The only documentation I have seen is that in Figure 4, which is just introductory and does not provide any example of data definition, nor a precise definition of ontological terms. Even following the ontology URL (https://scienceon.kisti.re.kr/ontologies/skg#), I could not find any ontology to look at. Where has it been defined? I want to stress that having an open data model (implemented as an ontology) is crucial for claiming that the provided data are FAIR compliant - a crucial endeavour today, when infrastructures make research information available. In addition, did the authors use a particular methodology for developing the ontology? How can they claim that the ontology developed is of sufficient quality?

* To retrieve the journal categories, the authors said they scraped them from Google Scholar. Supposing these categories are available in the data they publish, it is unclear if they have the legal right to republish this information in their dataset as open material. The authors should carefully check it, and if they have the right to do so, they should also provide information to describe if and when these data can be reused.

* In Section 4.2, the authors say that the "aspect of the ETL process is the assignment of unique identifiers to each document within ScienceON, which is essential for eliminating duplicate". However, it needs to be clarified how this assignment works, what the shape of these identifiers is, and how the authors are sure that two entities (e.g., two papers) mentioned in different sources refer to the same object. The process adopted here should be carefully explained since it is an essential passage in the production of any scholarly KG.

* Why is there the need to pass through a conversion from a relational database to an RDF triplestore? In particular, what was the rationale for handling all the data in a relational database first? Wouldn't it be better to store the data directly in the triplestore?

* I think the claim "ScienceON Knowledge Graph stands as a testament to the potential of semantic web technologies in facilitating advanced knowledge discovery" is a bit overstated. What about all the other efforts developed by several infrastructures and projects in the past years?

* From the diagram in Figure 4, journals may have an ISBN. However, the ISBN is an identifier for books, not journals. If this is confirmed, I honestly think the data model developed is kind of inconsistent with reality.

Minor and typo:
* a SciGraph -> SciGraph

Thus, even if I consider the ScienceON KG a valuable contribution, as explained at the beginning, I think that the authors should go through an extensive rewriting of the text to make it more focussed on the details related to the KG construction and the issues and lessons learnt along the way. They should provide more and better comparisons with existing scholarly KGs to clarify what the added value of ScienceON is compared to the existing literature.

Review #4
Anonymous submitted on 01/Sep/2024
Suggestion:
Major Revision
Review Comment:

This paper introduces the ScienceON Knowledge Graph System, a framework designed to tackle the significant challenges associated with integrating and analyzing scientific and technological data. The paper claims three main contributions: 1) detailing the creation of the ScienceON data ecosystem, 2) describing the design and structure of the ScienceON Knowledge Graph, and 3) exploring the real-world applications of the ScienceON Knowledge Graph.

Overall, the paper is well-written, though certain sections lack the necessary details (see specific comments below).

The pipeline for knowledge graph creation, while not particularly innovative, presents a well-defined use case. Many resources similar to ScienceON have not provided detailed explanations of their frameworks, making this paper a valuable contribution.

The ScienceON resource is especially useful at a national level, with potential for international relevance. However, its resemblance to existing systems raises concerns, particularly because some of the most compelling data, such as information related to R&D projects, is limited to Korean datasets. Additionally, the paper's analysis of the knowledge graph landscape overlooks a few KGs in this space, such as OpenAlex, ORKG, and CS-KG. The Dimensions database, which also includes project-related information, might be relevant to consider, even though it is not open access. Ultimately, it would be advantageous for the authors to articulate why ScienceON could be the best resource for specific use cases, including those at an international level.

The knowledge graph and the ontology developed are well-documented. However, I suggest evaluating the ontology with competency questions, as is standard practice in the community associated with this journal.

Finally, the concluding sections on potential applications effectively illustrate various use cases, even if they do not significantly advance beyond the current state of the art.

In conclusion, the paper does provide a detailed account of the creation of an interesting large-scale resource. As such, it could be of interest both to those interested in practical methodologies for KG construction from heterogeneous sources and to potential users of ScienceON. Therefore, I recommend accepting the paper after the authors address the issues I have outlined in this review.

Below, I provide specific comments and questions.

Section 1:
The statement, “We harness the potential of the ScienceON Knowledge Graph by engaging in graph-level experiments and employing knowledge representation learning models for both comprehensive quantitative and qualitative analysis,” is unclear. The term "graph-level experiments" is particularly vague. I recommend clarifying what is meant here.
I suggest adding OpenAlex and Dimensions to Table 1.

Section 2: I suggest supporting several claims with references, including relevant web pages if no academic articles are available.

Section 3: It would be pertinent to mention that MAG has now been discontinued. I would also discuss some knowledge-centric KGs that detail the content of research papers, such as ORKG, Nanopub, and CS-KG.

Section 4: The provenance of data in the “KISTI Data Center” is unclear. Does KISTI extract data from the aforementioned resources (e.g., OAG)? Is it parsed from the web? Please clarify.

The phrase “Journal’s category from web” is also unclear and lacks detail. Do you use a fixed taxonomy? If so, can you make it available? How many categories are in the taxonomy, and which disciplines does it cover?

Section 4.4: In the semantic web community, ontologies are typically evaluated by defining a set of “competency questions” (questions the ontology should be able to answer) and then demonstrating that the ontology functions as expected by presenting relevant SPARQL queries. Incorporating this type of evaluation would make the paper more robust. The list of competency questions and relevant queries could be included in an appendix; an illustrative example follows below.
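
For example, a competency question and its corresponding query could look as follows; the skg: terms are illustrative assumptions, not confirmed terms of the actual SKG ontology:

    # Hypothetical competency-question check against toy data; the skg:
    # property names are assumptions for illustration only.
    from rdflib import Graph

    DATA = """
    @prefix skg: <https://scienceon.kisti.re.kr/ontologies/skg#> .
    @prefix ex:  <https://example.org/> .

    ex:paper1 skg:producedBy ex:projectP1 ;
              skg:hasKeyword "knowledge graph" .
    """

    # CQ: "Which papers produced by a given R&D project are tagged with
    # a given keyword?"
    QUERY = """
    PREFIX skg: <https://scienceon.kisti.re.kr/ontologies/skg#>
    PREFIX ex:  <https://example.org/>
    SELECT ?paper WHERE {
      ?paper skg:producedBy ex:projectP1 ;
             skg:hasKeyword "knowledge graph" .
    }
    """

    g = Graph().parse(data=DATA, format="turtle")
    for row in g.query(QUERY):
        print(row.paper)  # https://example.org/paper1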