FAIR Digital Object Application Case for Composing Machine Learning Training Data

Bibliographic Details
Published in: Research Ideas and Outcomes, Vol. 8; pp. 1-4
Main Authors: Blumenröhr, Nicolas; Jejkal, Thomas; Pfeil, Andreas; Stotzka, Rainer
Format: Journal Article
Language: English
Published: Sofia, Pensoft Publishers, 2022-10-01
Abstract The application case for implementing and using the FAIR Digital Object (FAIR DO) concept (Schultes and Wittenburg 2019) aims to simplify access to label information when composing Machine Learning (ML) (Awad and Khanna 2015) training data. Data sets curated by different domain experts usually use non-identical label terms, which prevents images with similar labels from being easily assigned to the same category. Using such data sets together as ML training data therefore comes at the cost of laborious relabeling. To automate this process, the data needs to be machine-interpretable and machine-actionable, which the FAIR DO concept enables. A FAIR DO is a representation of scientific data and requires at least a globally unique Persistent Identifier (PID) (Schultes and Wittenburg 2019), mandatory metadata, and a digital object type. Storing typed information in the PID record requires selecting that information beforehand. This includes the mandatory metadata and the digital object type, which enable machine interpretability and subsequent actionability. The information in the PID record conforms to a PID Kernel Information Profile (PIDKIP) that is defined or selected by the creator of the FAIR DO. A PIDKIP is a standard that facilitates the definition and validation of the mandatory metadata attributes in the PID record. This information is the basis on which a machine decides whether the digital object is reusable for a particular application. It includes the digital object type, which enables a machine to work with the data represented by the FAIR DO. If more information is required, the data itself or other associated FAIR DOs need to be accessed through references in the PID record. The appropriate granularity of the data representation and of the metadata in the information record is not fixed but depends on the objective.
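As a rough illustration of how a machine could decide on reusability from the PID record alone, the following Python sketch checks a record for mandatory attributes and an accepted digital object type. The attribute names, type identifiers, and values are hypothetical placeholders chosen for readability; they are not taken from the Helmholtz KIP or any other real profile.

```python
# Hypothetical, human-readable attribute names chosen for illustration only.
# A real PIDKIP defines its own attribute set; these keys are assumptions.
MANDATORY_ATTRIBUTES = frozenset({
    "kernelInformationProfile",
    "digitalObjectType",
    "digitalObjectLocation",
})


def missing_attributes(record, mandatory=MANDATORY_ATTRIBUTES):
    """Return the mandatory attributes that are absent or empty in a PID record."""
    return {name for name in mandatory if not record.get(name)}


def is_reusable(record, accepted_types):
    """Decide reusability from the record alone: complete profile and an accepted type."""
    if missing_attributes(record):
        return False
    return record["digitalObjectType"] in accepted_types


# Made-up record of a data set FAIR DO (all values are placeholders).
dataset_record = {
    "kernelInformationProfile": "example/profile",
    "digitalObjectType": "example/zipped-png-images",
    "digitalObjectLocation": "https://example.org/data/set-1.zip",
    "hasMetadata": "example/label-doc-1",
}

# A record that lacks two of the assumed mandatory attributes.
incomplete_record = {"digitalObjectType": "example/zipped-png-images"}
```

In this sketch the decision uses only the record content, so the data itself does not have to be downloaded before a data set is accepted or rejected.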
Here, the FAIR DO concept is used to represent image data sets together with their label metadata. Each data set contains multiple images that all refer to the same label term, and one data set associated with a particular label is represented as one FAIR DO. The type of this FAIR DO describes both the packaging format of the images and the image format itself. Further information about the label term and other metadata associated with the data set is provided in the PID record or accessed through references in it. For the PIDKIP, the Helmholtz KIP was chosen, following the RDA Working Group recommendations on PID Kernel Information (RDA 2013). This profile includes mandatory metadata attributes that are used for the machine-actionable decisions required for relabeling. Information about the data labels is not provided directly in the PID record of the data set FAIR DO, but in the PID record of an associated image label FAIR DO. The latter represents a metadata document that contains label information about the data set. Its PID record is based on the same PIDKIP, i.e. the Helmholtz KIP. Both FAIR DOs point to each other: the image label FAIR DO is accessed via the reference in the PID record of the data set FAIR DO, and vice versa. The PID record of the image label FAIR DO contains the information about the labels that is relevant to the relabeling task. Accessing label information in this way means the user does not have to look up each data set, analyze its content, and search for its labels (Fig. 1). The automated procedure for relabeling then looks as follows: A specialized client that can work with PIDs resolves the PID of a FAIR DO that represents an image data set and fetches its record. By analyzing the type, the client validates whether the data is usable for composing an ML training data set. The PID of the image label FAIR DO referenced in the record is then resolved in the same way. By analyzing that PID record, the client identifies that it is relevant for obtaining information about the labels.
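The client procedure described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: an in-memory dictionary stands in for a real PID resolution service, and the PIDs, attribute names (`hasMetadata`, `isMetadataFor`), and type identifiers are hypothetical and do not come from the Helmholtz KIP.

```python
# In-memory stand-in for a PID resolver. A real client would query a
# resolution service instead; all PIDs and attribute names are hypothetical.
REGISTRY = {
    "example/dataset-1": {
        "digitalObjectType": "example/zipped-png-images",
        "digitalObjectLocation": "https://example.org/data/set-1.zip",
        "hasMetadata": "example/labels-1",
    },
    "example/labels-1": {
        "digitalObjectType": "example/label-document",
        "digitalObjectLocation": "https://example.org/meta/labels-1.json",
        "isMetadataFor": "example/dataset-1",
    },
}


def resolve(pid, registry):
    """Fetch the PID record for a PID, or fail if it cannot be resolved."""
    try:
        return registry[pid]
    except KeyError:
        raise LookupError(f"PID not resolvable: {pid}") from None


def fetch_label_record(dataset_pid, registry, accepted_data_types, label_type):
    """Follow the reference from a data set FAIR DO to its image label FAIR DO.

    Returns the label PID record, or None if the data set type is not usable,
    no reference exists, or the referenced object is not a label document.
    """
    data_record = resolve(dataset_pid, registry)
    if data_record.get("digitalObjectType") not in accepted_data_types:
        return None  # data set is not usable for the training data
    label_pid = data_record.get("hasMetadata")
    if label_pid is None:
        return None  # no associated label FAIR DO referenced
    label_record = resolve(label_pid, registry)
    if label_record.get("digitalObjectType") != label_type:
        return None  # referenced object does not carry label information
    return label_record
```

A caller could then read `digitalObjectLocation` from the returned record to locate the label document, without inspecting the image data.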
The document represented by the image label FAIR DO is accessed via the location path provided in its PID record. Working with its content requires a specialized tool that is compatible with its format and schema, i.e. its type. This tool identifies and analyzes the label term of the data set in order to map it to corresponding label terms of other image data sets. This specification of FAIR DOs enables the relabeling of entire image data sets for use in ML. However, the current granularity of the data representation is insufficient for machine-based decisions and actions on single images. A further consideration is to increase the information in the PID record to enable more machine-actionable decisions. This requires reconsidering the granularity of the metadata in the PID record and needs to be balanced against the aim of fast record processing. Changing the content of the PID record also means deriving a new PIDKIP or extending existing ones. Metadata tools that are applied in conjunction with the FAIR DO concept and that use the label information in the documents of the metadata FAIR DOs need further specification. One requirement for their implementation is a standardized data description for the metadata document, using schemas and vocabularies. Using the machine actionability of FAIR DOs as described above enables automated relabeling of data sets, which leaves more time for the ML user to concentrate on model training and optimization. Software development of FAIR DO-specific clients and metadata mapping tools is the subject of current research. The next step is to implement such software in order to carry out the proposed concept on a large scale. This work has been supported by the research program 'Engineering Digital Futures' of the Helmholtz Association of German Research Centers and the Helmholtz Metadata Collaboration Platform (Helmholtz-Gemeinschaft Deutscher Forschungszentren 1995).
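To make the mapping step more concrete, the sketch below groups data sets whose label terms correspond to the same target category. It is only an illustration: the synonym table, label terms, and PIDs are invented, and the abstract does not specify how the actual mapping tool works or which vocabulary it uses.

```python
# Hypothetical synonym table: target category -> label terms used by curators.
LABEL_SYNONYMS = {
    "cat": {"cat", "kitty", "felis catus"},
    "dog": {"dog", "hound", "canis familiaris"},
}


def build_lookup(synonyms):
    """Invert the synonym table into a case-insensitive term -> category lookup."""
    return {
        term.casefold(): category
        for category, terms in synonyms.items()
        for term in terms
    }


def map_label(term, lookup):
    """Map one label term to its target category, or None if it is unknown."""
    return lookup.get(term.strip().casefold())


def group_datasets(dataset_labels, lookup):
    """Group data set PIDs by target category; unmapped terms land under None."""
    groups = {}
    for pid, term in dataset_labels.items():
        groups.setdefault(map_label(term, lookup), []).append(pid)
    return groups


lookup = build_lookup(LABEL_SYNONYMS)

# Label terms as they might be read from the label documents (made-up values).
labels_by_dataset = {
    "example/dataset-1": "Kitty",
    "example/dataset-2": "Felis catus",
    "example/dataset-3": "hound",
    "example/dataset-4": "axolotl",
}
```

Data sets grouped under `None` are those whose terms the table does not cover; in this sketch they would still need manual review.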
Author Stotzka, Rainer
Jejkal, Thomas
Pfeil, Andreas
Blumenröhr, Nicolas
Author_xml – sequence: 1
  givenname: Nicolas
  surname: Blumenröhr
  fullname: Blumenröhr, Nicolas
– sequence: 2
  givenname: Thomas
  orcidid: 0000-0003-2804-688X
  surname: Jejkal
  fullname: Jejkal, Thomas
– sequence: 3
  givenname: Andreas
  orcidid: 0000-0001-6575-1022
  surname: Pfeil
  fullname: Pfeil, Andreas
– sequence: 4
  givenname: Rainer
  surname: Stotzka
  fullname: Stotzka, Rainer
Cites_doi 10.1007/978-3-030-23584-0_1
10.1007/978-1-4302-5990-9_1
ContentType Journal Article
Copyright 2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Copyright_xml – notice: 2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.3897/rio.8.e94113
Discipline Sciences (General)
EISSN 2367-7163
EndPage 4
ExternalDocumentID oai_doaj_org_article_4d606c5a7ec84d01a6eed40dde8bf3ab
10_3897_rio_8_e94113
Genre Conference Proceeding
ISSN 2367-7163
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Language English
ORCID 0000-0003-2804-688X
0000-0001-6575-1022
OpenAccessLink https://doaj.org/article/4d606c5a7ec84d01a6eed40dde8bf3ab
PublicationCentury 2000
PublicationDate 20221001
PublicationDateYYYYMMDD 2022-10-01
PublicationDate_xml – month: 10
  year: 2022
  text: 20221001
  day: 01
PublicationDecade 2020
PublicationPlace Sofia
PublicationPlace_xml – name: Sofia
PublicationTitle Research Ideas and Outcomes
PublicationYear 2022
Publisher Pensoft Publishers
Publisher_xml – name: Pensoft Publishers
References Forschungszentren (94113_B8009257)
94113_B8111217
94113_B8111166
RDA (94113_B8009249)
References_xml – ident: 94113_B8111166
  doi: 10.1007/978-3-030-23584-0_1
– ident: 94113_B8009249
  article-title: PID Kernel Information Profile Management
  contributor:
    fullname: RDA
– ident: 94113_B8111217
  doi: 10.1007/978-1-4302-5990-9_1
– ident: 94113_B8009257
  article-title: Helmholtz Metadata Collaboration (HMC) Platform
  contributor:
    fullname: Forschungszentren
SourceID doaj
proquest
crossref
SourceType Open Website
Aggregation Database
StartPage 1
SubjectTerms Automation
Datasets
Image Data
Kip protein
Label
Learning algorithms
Machine learning
Mapping
Metadata
Persistent Identifier
Software
Title FAIR Digital Object Application Case for Composing Machine Learning Training Data
URI https://www.proquest.com/docview/2724993374
https://doaj.org/article/4d606c5a7ec84d01a6eed40dde8bf3ab
Volume 8