A Step Forward in Cancer Informatics—It Is Mandatory to Make Guidelines Machine Readable

Clinical guidelines are general recommendations for practicing clinicians regarding prevention, diagnosis and treatment of a given disease. Among the most comprehensive and widely used guidelines are those developed and regularly updated by the National Comprehensive Cancer Network (NCCN). The guidelines are readily available for download in Portable Document Format (PDF). A machine-readable representation of the NCCN guidelines is currently not available. In this paper, we argue that clinical guidelines should be published in a machine-readable format. After a review of the available literature, we describe the most important achievements in the field. Publication of guidelines in a machine-readable form may also be beneficial for other scientific and technical disciplines.


Introduction
Clinical guidelines are general recommendations for practicing clinicians regarding prevention, diagnosis, and treatment of diseases. Ideally the recommendations are based on current high-level evidence. Methodological and practical directions for the development of guidelines are described for example in a publication from 2011 of the Institute of Medicine entitled "Clinical Practice Guidelines We Can Trust" [1].

The National Comprehensive Cancer Network (NCCN) Clinical Practice Guidelines
In oncology, among the most comprehensive and widely used guidelines are those developed by the National Comprehensive Cancer Network (NCCN). NCCN is a non-profit alliance of 27 leading cancer centers based in the United States. The guidelines contain sequential management decisions and interventions and are applicable in the majority of clinical situations, covering 97% of cancer types affecting patients in the United States. They are continuously updated and revised to stay current with the latest developments and evidence. They are an indispensable tool assisting physicians in decision making in cancer care.
After registration and acceptance of the terms and conditions, the NCCN guidelines are freely accessible for all interested parties in several forms. Namely, users have access to NCCN Guidelines with NCCN Evidence Blocks™, NCCN International Translations/Adaptations, NCCN Educational Events and Programs and the core NCCN Clinical Practice Guidelines. The newest form is the NCCN Framework for Resource Stratification of NCCN Guidelines (NCCN Framework™), subclassified as basic, core and enhanced. Altogether, the guidelines are available to download in the commonly used portable document format (PDF). The PDF format is designed with the intention to represent textual and graphical data across multiple platforms in the way they are supposed to be viewed or printed. PDF is not intended for structured or semi-structured data exchange, data extraction or mining. Content extraction from a PDF file in a structured way is not impossible, but it is also not a straightforward task. The complete PDF documentation file released from Adobe extends over 900 pages [2]. Although some support for data mining exists, the complexity of the format is prohibitive, if not impossible to manage, for the majority of scientists and scientific software developers.
Data and information contained within the NCCN guidelines are invaluable. They are structured, interconnected and represented with workflow graphics and text. They are properly referenced to the source literature and enriched with a metadata system in the form of evidence levels. As such, they represent a unique resource and it is imperative to make an additional effort and transform this amalgamated knowledge into a form suitable for informatics approaches. There are multiple reasons for this, but the most important ones can be summarized as follows.

Machine readable medical guidelines: a necessity
Firstly, we are confronted with an explosion of information. As average life expectancy increases, the total number of patients is also rising [3]. The introduction of new services will provide more opportunities to collect and analyze data [4,5]. For every patient we will have more and more prognostic and predictive data on which our decisions should be based, especially factors derived from "omics" data [6]. Tumor boards based on molecular profiling of cancers are already being introduced into our practice and will play an ever more important role as we go forward [7-9]. Scientific output is growing significantly [10], followed by a rising number of therapeutic options [11]. This information overload, as well as the inadequate use of modern technology, does not come without a price [12,13]. It is evident that it will be ever harder to keep up with developments. Modern tools for assistance and facilitation of cancer care are mandatory [14,15]. As an example, development and maintenance of decision support tools for oncology would be easier. The implementation of decision tools based on structured NCCN guidelines is almost self-explanatory: significant parts of the NCCN guidelines are expressed with branching logic, where conditional statements are given in the form of information and suggestions or actions interconnected with if-then-else logic.
Secondly, the information growth is accompanied by an even faster expansion of scientific publications [10]. The rise in quantity is usually not accompanied by a rise in quality [16]. Searching for relevant and high-quality information requires extensive time and energy. Text mining of cancer-related data is rather underdeveloped, and efforts in this direction are more incidental than systematic [17]. We could use a well-structured database of evidence-based papers and recommendations for evaluating existing and new literature and for machine learning processes. If we want to develop machine learning techniques for evidence classification based on natural language processing, we will need training material, and it is hard to imagine a better one than the structured NCCN guidelines.
Thirdly, treatment quality and conformance to best practice were and still are an issue [18][19][20]. Quality control and conformance to standards would be facilitated through early feedback on structure and content. It is possible to imagine that the National Guideline Clearinghouse would embrace such an undertaking.
However, the idea of a structured approach to clinical guidelines is not new [21]. Shahar et al. described a text-based language for representation and annotation of clinical guidelines (CG) [22]. A few years later, Shahar et al. published an interesting research paper on efforts to convert guidelines written as free text into annotated, ontology-enriched digital electronic guidelines [23]. This approach should be embraced in terms of research in natural language processing. For practical purposes, however, we should think more straightforwardly and provide the desired results directly from the source. As a matter of fact, more effort is devoted to creating structure from unstructured text with complicated approaches than to establishing a well-defined structure to begin with. The same group has developed the Digital Electronic Guidelines Library (DeGeL), a web-based guideline repository and a suite of tools, to support the use of automated guidelines for medical care, research, and quality assessment [24]. Several authors have evaluated or proposed different approaches for machine-readable implementation or development of guidelines. Johnson et al. describe PRODIGY, a guideline-based decision support system aimed at supporting general practitioners [25]. Tu and Musen have also described a task-oriented approach to guideline modeling developed within the EON project [26]. Guidelines seek to change behavior by making statements involving one or all of the following tasks: setting constraints, setting goals, making decisions, sequencing and synchronizing actions, and interpreting data [26]. Peleg et al. developed the Guideline Interchange Format Language (GLIF), which has evolved through several versions [27]. GLIF consists of three levels, namely the conceptual, computable and implementable levels. The conceptual level of GLIF was described with the Unified Modeling Language. The computable and implementable levels allow some higher-level programming concepts such as macros.
Although interesting, none of the described models and proposals have been broadly implemented in practice. The Arden syntax for medical logic modules is an industry recognized standard for expressing medical knowledge. However, broader utilization in terms of guideline representation may be prohibitive due to technological limitations and complexity [28][29][30][31].
The data in NCCN guidelines should be available in standards intended for data exchange that are readable for both machines and humans. The format should be simple and understandable for a broad audience. It does not have to fulfill any other requirements but should focus solely on representing the information contained in clinical guidelines in a structured way. An excellent candidate is the Extensible Markup Language (XML), developed and managed by the World Wide Web Consortium (W3C). In summary, XML is a markup language intended for processing textual data in a semi-structured way. It is easily understandable and widely implemented in practice. It is used not only for textual data but for any kind of data requiring a structured approach for processing and exchange. The structure of an XML document is described through XML Schemas (XSD). The clinicaltrials.gov portal and the Clinical Data Interchange Standards Consortium (CDISC) use XML as one of their formats for data exchange, and their schemas are free and publicly available. XML facilitates the development of tools such as syntax checkers and editors that can help increase the correctness of the content, which in turn will foster development and production. XML also has invaluable companion languages in the form of XPath and XQuery; XQuery is Turing complete (that is, it can compute every Turing-computable function). Alternatively, one could use other machine-readable formats such as JavaScript Object Notation (JSON) or Hypertext Markup Language (HTML), or develop a new standard. The latter, however, would entail extensive additional work on tool development, syntax checkers and parsers.
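To illustrate how little tooling such a representation would require, the following is a minimal sketch in Python using only the standard library. The element and attribute names (guideline, recommendation, evidence, ref) and all values are hypothetical placeholders chosen for this example; they do not reflect an existing NCCN schema or real clinical content. The sketch parses a small fragment, selects recommendations by evidence level using the limited XPath subset supported by xml.etree.ElementTree, and performs a basic well-formedness check of the kind a syntax checker would start from.

```python
import xml.etree.ElementTree as ET

# Hypothetical guideline fragment; names and values are placeholders only.
GUIDELINE_XML = """
<guideline disease="example-disease" version="0.1">
  <recommendation id="r1" evidence="high" ref="ref-1">Placeholder intervention A</recommendation>
  <recommendation id="r2" evidence="low" ref="ref-2">Placeholder intervention B</recommendation>
  <recommendation id="r3" evidence="high" ref="ref-3">Placeholder intervention C</recommendation>
</guideline>
"""


def recommendations_by_evidence(xml_text: str, level: str) -> list[str]:
    """Return the ids of recommendations carrying the given evidence level."""
    root = ET.fromstring(xml_text)
    # ElementTree supports a small XPath subset, including attribute predicates.
    matches = root.findall(f".//recommendation[@evidence='{level}']")
    return [node.get("id") for node in matches]


def is_well_formed(xml_text: str) -> bool:
    """Basic syntax check: True if the text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


high_ids = recommendations_by_evidence(GUIDELINE_XML, "high")
print(high_ids)
print(is_well_formed(GUIDELINE_XML))
print(is_well_formed("<guideline><recommendation></guideline>"))
```

Validation against a full XML Schema is outside the Python standard library and would need an additional tool; the check above only covers well-formedness.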
Furthermore, NCCN guidelines contain specific workflow rules expressed with IF-THEN control flow (Figure 1).
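Such workflow rules map naturally onto a structured encoding. The sketch below, again in Python with the standard library only, assumes a hypothetical XML layout in which each node either holds an "if" element (with field, equals, then and else attributes) or an "action" element. The layout, the patient fields and all values are illustrative assumptions and carry no clinical meaning; the point is only that a decision path can be resolved mechanically once the branching logic is explicit.

```python
import xml.etree.ElementTree as ET

# Hypothetical encoding of a workflow; tags, attributes and values
# are illustrative placeholders and carry no clinical meaning.
WORKFLOW_XML = """
<workflow start="n1">
  <node id="n1">
    <if field="finding" equals="present" then="n2" else="n3"/>
  </node>
  <node id="n2"><action>Placeholder action X</action></node>
  <node id="n3">
    <if field="stage" equals="early" then="n4" else="n5"/>
  </node>
  <node id="n4"><action>Placeholder action Y</action></node>
  <node id="n5"><action>Placeholder action Z</action></node>
</workflow>
"""


def resolve_action(xml_text: str, patient: dict[str, str]) -> str:
    """Follow IF-THEN-ELSE branches from the start node to an action."""
    root = ET.fromstring(xml_text)
    nodes = {node.get("id"): node for node in root.findall("node")}
    current = root.get("start")
    visited = set()
    while True:
        # Guard against malformed workflows that loop back on themselves.
        if current in visited:
            raise ValueError(f"cycle detected at node {current!r}")
        visited.add(current)
        node = nodes[current]
        rule = node.find("if")
        if rule is None:
            # Leaf node: return the text of the recommended action.
            return node.findtext("action", default="")
        # A missing patient field never matches, so the else branch is taken.
        matched = patient.get(rule.get("field")) == rule.get("equals")
        current = rule.get("then") if matched else rule.get("else")
```

For example, calling resolve_action(WORKFLOW_XML, {"finding": "absent", "stage": "early"}) follows the else branch of the first rule and the then branch of the second.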
The limitations of our proposal should be acknowledged. We have concentrated our research and discussion only on the NCCN guidelines, based on the fact that the consortium was established with the main purpose of guideline development and maintenance. We cannot exclude the possibility that other parties have already developed and established a machine-readable guideline system. However, a brief survey of the major oncological societies which publish clinical practice guidelines on a regular basis did not yield any documents available in the proposed format (ASCO, EORTC, ASTRO, ESTRO, AGO). We did not give any specific recommendation for further development in terms of XML schema content, but this will certainly be the topic of future publications.

Conclusion
Guidelines in general should be available in a machine-readable form. The format should be utilized in scientific efforts and implemented in clinical routine. On a technical level, it is possible to imagine the integration of such resources into decision support systems or quality assurance audits. However, the intention of this paper is not to discuss lower-level technical aspects of implementation, but to raise awareness and motivate the responsible consortia and the scientific community. The same resources, made machine readable, can improve the fight against cancer and would certainly be welcomed by the scientific community.