For the key stakeholders in the learning ecosystem to make data-driven decisions based on accurate analysis, the underlying data must be trustworthy. It is therefore critical that data trustworthiness issues, which include data quality, provenance and lineage, be investigated for organizational data sharing, situation assessment, multi-source data integration and the numerous other functions that support decision makers and analysts (Bertino et al., 2009). In general, providing ‘trustworthy’ data to users and applications is an inherently difficult problem, one that often depends on the application and data semantics as well as on the data collection modalities, context and situation.
Legacy approaches to the issue of integrity in information systems have been based on a hierarchical lattice of integrity levels, where integrity is defined as a relative measure evaluated at the subsystem level of data subjects and objects (Sandhu, 1993). Clark and Wilson (1987) argued that security policies related to integrity, rather than disclosure, are of the highest priority in commercial information systems, and that separate mechanisms are required for the enforcement of these policies. A well-formed transaction is structured so that a subject cannot manipulate data arbitrarily, but only in constrained ways that preserve the internal consistency of the data. Separation of duty, in turn, aims to ensure the external consistency of data objects: the correspondence among the data objects of different subparts of a task. This correspondence is ensured by separating all operations into several subparts and requiring that each subpart be executed by a different subject (Zhou et al., 2009).
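As a concrete sketch of how a well-formed transaction with separation of duty might look in code, consider the following minimal Python illustration; the payment workflow, the initiator and approver roles, and the two-subpart structure are invented for exposition and are not drawn from the cited models.

```python
class SeparationOfDutyError(Exception):
    """Raised when one subject attempts to execute both subparts of a task."""


class PaymentTask:
    """A task split into two subparts: initiation and approval.

    The methods below are the only ways to manipulate the task's state
    (a well-formed transaction), and separation of duty requires that
    each subpart be executed by a different subject.
    """

    def __init__(self, amount: float) -> None:
        self.amount = amount
        self.initiated_by: str | None = None
        self.approved_by: str | None = None

    def initiate(self, subject: str) -> None:
        self.initiated_by = subject

    def approve(self, subject: str) -> None:
        # External consistency: the approver must differ from the initiator.
        if subject == self.initiated_by:
            raise SeparationOfDutyError(
                f"{subject!r} initiated this task and cannot also approve it"
            )
        self.approved_by = subject


task = PaymentTask(amount=250.0)
task.initiate("alice")
task.approve("bob")       # accepted: a different subject approves
# task.approve("alice")   # would raise SeparationOfDutyError
```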
Apart from integrity levels and separation of duties, legacy commercial database management systems have also relied on semantic integrity constraints, which enable users to express a variety of conditions that the data must satisfy. Such constraints are used mainly to ensure data consistency and correctness. However, semantic integrity techniques fall short of the more complex problem of data trustworthiness: they cannot determine whether data correctly reflect real-world scenarios, nor can they ascertain the reliability and accuracy of distributed data sources (Watt & Eng, 2014).
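For illustration, the following minimal Python/SQLite sketch (the `enrollments` table and the grade range are invented for this example) shows a semantic integrity constraint rejecting an out-of-range value, while remaining blind to a value that is in range but does not reflect reality:

```python
import sqlite3

# In-memory database; the table and constraint are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE enrollments (
        student_id INTEGER NOT NULL,
        grade      INTEGER NOT NULL,
        -- Semantic integrity constraint: grades must fall in a valid range.
        CHECK (grade BETWEEN 0 AND 100)
    )
""")

conn.execute("INSERT INTO enrollments VALUES (1, 87)")       # accepted
try:
    conn.execute("INSERT INTO enrollments VALUES (2, 140)")  # rejected
except sqlite3.IntegrityError as exc:
    print("Constraint violation:", exc)

# The limitation: a grade of 55 recorded for a student who actually earned
# 85 satisfies the constraint, yet the data remain untrustworthy.
conn.execute("INSERT INTO enrollments VALUES (3, 55)")
```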
Data trustworthiness is deeply entwined with the concept of data quality, which has been investigated from different perspectives, depending on the precise meaning assigned to the notion. Data can be considered of high quality “if they are fit for their intended uses in operations, decision making and planning” (Kerr et al., 2007). Alternatively, data are deemed of high quality if they correctly represent the real-world construct to which they refer. There are a number of theoretical frameworks for understanding data quality. One framework seeks to integrate the product perspective (conformance to specifications) and the service perspective (meeting consumers’ expectations). Another framework is based on semiotics, evaluating the quality of the form, meaning and use of the data. A third, highly theoretical approach analyzes the ontological nature of information systems to define data quality rigorously (Price & Shanks, 2005). In addition to these theoretical investigations, a considerable amount of research on data quality has been devoted to identifying and describing various categories of desirable attributes (or dimensions) of data, such as accuracy, completeness, consistency and timeliness.
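As a brief sketch of how two such dimensions might be measured over a record set (the sample records, the choice of completeness and validity, and the scoring formulas are assumptions made for illustration, not a standard drawn from the frameworks above):

```python
# Illustrative records: learner profiles with possibly missing fields.
records = [
    {"id": 1, "email": "a@example.org", "age": 21},
    {"id": 2, "email": None,            "age": 34},
    {"id": 3, "email": "c@example.org", "age": -5},  # invalid age
]

def completeness(records: list[dict], field: str) -> float:
    """Fraction of records in which `field` is present and non-null."""
    return sum(r.get(field) is not None for r in records) / len(records)

def validity(records: list[dict], field: str, predicate) -> float:
    """Fraction of non-null values of `field` that satisfy `predicate`."""
    values = [r[field] for r in records if r.get(field) is not None]
    return sum(predicate(v) for v in values) / len(values)

print(f"email completeness: {completeness(records, 'email'):.2f}")  # 0.67
print(f"age validity:       {validity(records, 'age', lambda a: 0 <= a <= 120):.2f}")  # 0.67
```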
Another promising approach to assuring information trustworthiness is based on a comprehensive framework built around two key elements: trust scores and confidence policies (Bertino et al., 2009). Trust scores are associated with all data items to indicate the trustworthiness of each data item. They can be used for data comparison or ranking, and also together with other factors (e.g., information about contexts and situations, past data history) to decide how the data items may be used. The framework’s trust score computation method can be based on the concept of data provenance, since provenance gives important evidence about the origin of the data, that is, where and how the data were generated. The second element of the framework is the notion of a confidence policy. Such a policy specifies the range of trust scores that a data item, or set of data items, must have in order to be used by a given application or task. It is important to note that the required range of trust scores depends on the purpose for which the data are to be used.
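A minimal Python sketch of these two elements follows; the per-source base scores and the minimum-over-provenance rule for derived items are our own simplifying assumptions, not the computation method prescribed by Bertino et al. (2009).

```python
from dataclasses import dataclass, field

@dataclass
class DataItem:
    value: object
    source: str
    provenance: list = field(default_factory=list)  # upstream DataItems

# Assumed base trustworthiness per source (e.g., from past data history).
SOURCE_TRUST = {"sensor_a": 0.9, "sensor_b": 0.6, "manual_entry": 0.4}

def trust_score(item: DataItem) -> float:
    """Score an item from its source; a derived item is taken to be no more
    trustworthy than its least trustworthy input (an assumed rule)."""
    score = SOURCE_TRUST.get(item.source, 0.0)
    if item.provenance:
        score = min(score, min(trust_score(p) for p in item.provenance))
    return score

def satisfies_confidence_policy(item: DataItem, low: float, high: float = 1.0) -> bool:
    """Confidence policy: the item's trust score must fall within [low, high]."""
    return low <= trust_score(item) <= high

raw = DataItem(value=42, source="sensor_b")
merged = DataItem(value=42, source="sensor_a", provenance=[raw])

# A routine report might accept scores >= 0.5; a critical task might not.
print(satisfies_confidence_policy(merged, low=0.5))   # True  (score 0.6)
print(satisfies_confidence_policy(merged, low=0.75))  # False
```

The last two calls illustrate the purpose-dependence noted above: the same data item passes the policy for one task and fails it for another, stricter one.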
In many cases, it is crucial to provide analysts and processing applications not only with the needed data, but also with a universal annotation indicating how much the input data can be trusted. Doing so is particularly challenging when large amounts of data are generated and continuously transmitted across the system. Moreover, solutions for increasing the trustworthiness of data, such as those that specifically target data quality, may be expensive and may require access to data sources that are restricted because of data sensitivity (Bertino & Lim, 2010).
References
Bertino, E., Dai, C., & Kantarcioglu, M. (2009). The challenge of assuring data trustworthiness. In X. Zhou, H. Yokota, K. Deng, & Q. Liu (Eds.), Database Systems for Advanced Applications (Vol. 5463, pp. 22–33). Springer Berlin Heidelberg. https://doi.org/10.1007/978-3-642-00887-0_2
Bertino, E., & Lim, H.-S. (2010). Assuring data trustworthiness—Concepts and research challenges. In W. Jonker & M. Petković (Eds.), Secure Data Management (pp. 1–12). Springer. https://doi.org/10.1007/978-3-642-15546-8_1
Clark, D. D., & Wilson, D. R. (1987). A comparison of commercial and military computer security policies. 1987 IEEE Symposium on Security and Privacy, 184–194. https://doi.org/10.1109/SP.1987.10001
Kerr, K., Norris, T., & Stockdale, R. (2007). Data quality information and decision making: A healthcare case study. ACIS 2007 Proceedings – 18th Australasian Conference on Information Systems.
Price, R., & Shanks, G. (2005). Empirical refinement of a semiotic information quality framework. Proceedings of the 38th Annual Hawaii International Conference on System Sciences (HICSS’05). https://doi.org/10.1109/HICSS.2005.233
Sandhu, R. S. (1993). Lattice-based access control models. Computer, 26(11), 9–19. https://doi.org/10.1109/2.241422
Thiran, P., Houben, G.-J., Hainaut, J.-L., & Benslimane, D. (2004). Updating legacy databases through wrappers: Data consistency management. 11th Working Conference on Reverse Engineering (WCRE 2004). https://www.academia.edu/2728715/Updating_legacy_databases_through_wrappers_Data_consistency_management
Watt, A., & Eng, N. (2014). Chapter 9: Integrity rules and constraints. In Database design (2nd ed.). BCcampus. https://opentextbc.ca/dbdesign01/chapter/chapter-9-integrity-rules-and-constraints/