Types of Healthcare Data

There are many definitions and categorizations of healthcare data. This chapter introduces the concepts and methods of categorizing healthcare data used in this whitepaper.

Classification by identifiability

Most major countries, including the U.S., treat personal identifiability as an important criterion for classifying data. While it is important to avoid the risk of privacy breaches, health data in particular can be combined with other valuable data to drive innovations that improve patient and individual health. This calls for a careful approach that balances privacy and utilization.

The most prominent laws that follow this approach are HIPAA and HITECH in the U.S. These two laws set out fundamental principles for the protection and use of health information and classify health information into three categories. Health information that does not fall into one of these categories remains subject to general privacy laws.

| Data type | Identifiability | Consent for use required | Use for research purposes |
| --- | --- | --- | --- |
| PHI (Protected Health Information) | Identifiable | Required | Possible after IRB evaluation |
| DHI (De-identified Health Information) | De-identified | Not required | Possible after IRB evaluation |
| LDS (Limited Data Sets) | De-identified (somewhat relaxed conditions apply) | Required (exempt for research purposes) | Possible after submission of a non-identification agreement and IRB review |

PHI (Protected Health Information)

PHI is defined as individually identifiable health information that is created, collected, transmitted, or maintained by a healthcare provider, payer, or other healthcare-related entity covered by HIPAA, and that includes information about an individual's (1) past, present, or future physical or mental health condition, (2) health insurance, or (3) medical expense status.

PHI may be used, modified, or disclosed for purposes other than treatment only with the patient's consent, except in limited cases such as the public interest. Research organizations and others may use PHI for research purposes after review by an institutional review board (IRB).

DHI (De-identified Health Information)

DHI is recognized under two methods: (1) the Safe Harbor method and (2) the Expert Determination method. The Safe Harbor method removes the 18 types of identifiers listed below. The Expert Determination method relies on a person with appropriate knowledge and expertise in statistics or science regarding identifiability and identification methods. That expert must determine that the information poses a very small risk of identifying an individual, even when combined with other information, and must document the reasoning and findings.

Entities regulated by HIPAA may use and disclose DHI freely. However, if the information turns out to be identifiable in the process, it is treated as PHI.

Identifier types: name; address; dates related to an individual (date of birth, insurance enrollment date, insurance termination date, date of death, etc.); contact number; VIN (Vehicle Identification Number); fax number; device identifiers and serial numbers; e-mail address; online access address (URL); SSN (Social Security Number); Internet access address (IP); medical record number; biometric identifiers such as fingerprints or voiceprints; health plan beneficiary number; photographic image; bank account number; information that suggests re-identification is possible; certificate or license information; and any other information that could identify the individual.
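As a rough illustration of the Safe Harbor idea, the sketch below drops identifier fields from a single record. The field names are hypothetical and chosen only for this example; a real dataset will have its own schema. Removing named fields alone does not establish Safe Harbor compliance, since identifiers can also appear in free text, images, or combinations of other fields.

```python
# Hypothetical field names that stand in for some of the identifier types
# listed above. A real system would need a mapping for its own schema.
IDENTIFIER_FIELDS = {
    "name", "address", "birth_date", "death_date", "phone", "fax",
    "email", "url", "ip_address", "ssn", "medical_record_number",
    "health_plan_beneficiary_number", "bank_account_number",
    "certificate_number", "vehicle_id", "device_serial",
    "biometric_id", "photo",
}


def strip_identifiers(record):
    """Return a copy of the record without any of the identifier fields."""
    return {key: value for key, value in record.items()
            if key not in IDENTIFIER_FIELDS}


# Example record (made-up values, for illustration only).
record = {
    "name": "Jane Doe",
    "ssn": "000-00-0000",
    "birth_date": "1980-01-01",
    "diagnosis": "gastroesophageal reflux disease",
    "hba1c": 6.1,
}

deidentified = strip_identifiers(record)
print(deidentified)
```

Only the clinical fields remain in `deidentified`; the original `record` is left unchanged.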

LDS (Limited Data Sets)

LDS is similar to DHI under the Safe Harbor method in that identifiers have been stripped from the information, but it is subject to more relaxed standards: it may include some dates (date of birth, date of admission, date of discharge, etc.) and location information such as zip code and place of residence (state, city).

In exchange, users of the information, such as researchers, must submit an agreement not to re-identify the data that outlines how they intend to prevent misuse. If the information is used for certain purposes (research, public health, health care delivery), it can be used without the patient's consent after IRB review. In other words, this approach places the responsibility for preventing re-identification on the user and makes it easier to put the information to valuable use.
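The distinction between the three categories can be sketched as a simple check on which fields a dataset contains. This is only a simplified illustration: the field names and the two sets below are assumptions made for the example, and an actual determination depends on the full regulatory criteria, not on column names.

```python
# Hypothetical field names, used only for this illustration.
DIRECT_IDENTIFIERS = {"name", "phone", "email", "ssn", "medical_record_number"}
# Fields the text describes as allowed in an LDS but not in Safe Harbor DHI.
LDS_ONLY_FIELDS = {"birth_date", "admission_date", "discharge_date",
                   "zip_code", "city", "state"}


def classify_dataset(fields):
    """Give a rough category for a dataset based on its field names."""
    fields = set(fields)
    if fields & DIRECT_IDENTIFIERS:
        return "PHI"   # direct identifiers are still present
    if fields & LDS_ONLY_FIELDS:
        return "LDS"   # only dates or coarse location information remain
    return "DHI"       # none of the listed fields remain


print(classify_dataset(["name", "diagnosis"]))
print(classify_dataset(["zip_code", "admission_date", "diagnosis"]))
print(classify_dataset(["diagnosis", "lab_result"]))
```

A result of "DHI" here only means that none of the listed fields are present; it is a starting point for review, not a determination.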

Classification by data content

In addition to identifiability, there are many other ways to categorize data, such as whether it is structured, who created it and how, what it is used for, and what it is about. Rather than applying strict classification criteria or describing every type in detail, this whitepaper introduces representative types that are important for their use value and explains how each data type is utilized.

Clinical Data

Clinical data is the most representative type of healthcare data. It consists of patient information generated when medical institutions such as hospitals perform diagnosis, injections, tests, surgery, and so on. It therefore spans a wide range of items, from structured numerical test results, to medical records written in natural language, to digital screening results and images (X-ray, CT, MRI, sonogram, endoscopy, etc.).

When stored electronically, this information is called an EMR (Electronic Medical Record). The totality of an individual's medical information stored across many places is called an EHR (Electronic Health Record). Clinical data is PHI (Protected Health Information) from the moment it is generated. Medical institutions bear the legal duty and responsibility to store it, and access to and use of it by any party other than the patient is strictly restricted.

Claims data, derived from clinical data, is based on the information a medical institution submits to an insurer when filing an insurance claim. It includes the patient's personal information, diagnoses, and medication information. Korea has adopted a single-insurer system, and HIRA (Health Insurance Review & Assessment Service) and NHIS (National Health Insurance Service) have opened public data based on data covering the entire population (Healthcare Bigdata Hub, the NHIS data sharing service, etc.). The Korean pharmaceutical company HK Inno-N used this data in launching K-CAB, a new drug for gastroesophageal reflux disease (reference).

Omics Data

Omics data is a collective term for datasets on biological material such as the genome, transcriptome, proteome, metabolome, and microbiome. Each has distinctive characteristics, and accumulating and analyzing the related data at scale is expected to enable personalized medical services.

Genome data is the most representative omics data. It represents the genetic code recorded in DNA, which determines an individual's traits, as a sequence of the letters A, T, G, and C, much like a cipher. Analyzing genome data is indeed similar to decoding a cipher, and the main task is to work out which differences between individuals arise from one or more nucleotides at a given position. In particular, more than 80% of rare diseases are caused by genetic mutations, so decoding this cipher is important for identifying the gene that causes a disease.
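The idea of finding the positions where an individual's sequence differs from a reference can be sketched in a few lines. This is a toy illustration that assumes two short, already aligned sequences of equal length; the sequences are made up, and real variant calling works on sequencing reads with dedicated tools and must handle insertions, deletions, and sequencing errors.

```python
def find_variants(reference, sample):
    """List (position, reference_base, sample_base) for each mismatch.

    Positions are 0-based. Both sequences must already be aligned
    and have the same length.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, ref_base, sample_base)
        for position, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


# Made-up sequences for illustration only.
reference = "ATGCGTAC"
sample = "ATGAGTAC"

variants = find_variants(reference, sample)
print(variants)  # [(3, 'C', 'A')]
```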

Recently, advances in machine learning and big data analytics have made it possible to analyze omics data together with clinical data. This enables earlier diagnosis and helps in finding biomarkers used to predict and measure treatment response.

PGHD (Person-generated Health Data)

PGHD is data generated without depending on external institutions. It includes data from the various sensors in wearable devices, cellphones, and other devices owned by patients or individuals, as well as self-reported content such as social media posts and survey responses. A distinguishing feature is that it can be collected frequently in everyday life, without a hospital visit.

PGHD may seem unrelated to disease, but combining it with clinical and other data can lead to new discoveries about disease. In fact, recent clinical trials for new drugs increasingly try to use PGHD as RWD (Real-world Data) (Reference).

SDOH (Social Determinants of Health)

SDOH is data on external factors that affect health and are determined socially, economically, or by nature, such as demographic information, social or political factors, climate, and environment.

The Gravity Project is a real-world example of using SDOH data. The project defines socioeconomic factors (education, employment, housing, income, social safety), physical environment, health (smoking, eating habits, alcohol, sexual life), and public health (access to medical institutions) as the key factors and aims to analyze their influence on health.

Research Data

Research data is generated by laboratories, pharmaceutical companies, and hospitals in medicine, pharmacy, and the life sciences in order to develop new treatments. Typical examples are the results of clinical trials and studies. Existing clinical or omics data can also become research data when it is reused or collected for research.

Research often requires cooperation among multiple institutions to secure enough participants. Just as it is hard to communicate with people who speak a different language, it is hard to cooperate when institutions use different names and units for the same data. For this reason, research data is usually well structured and collected under rules agreed among the institutions conducting the research together. In the same vein, standardization efforts such as the CDM (Common Data Model) continue, with the goal of enabling integrated analysis of patient data held internally by different hospitals or accumulated through various studies.
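The problem of different names and units for the same data can be illustrated with a small harmonization sketch. Everything here is assumed for the example: the two hospitals, their field names, and the target record layout are hypothetical and do not reflect the schema of any actual CDM. The conversion factor for glucose (1 mmol/L is about 18.016 mg/dL) follows from its molar mass.

```python
# Hypothetical per-institution field names and units for blood glucose.
SITE_MAPPINGS = {
    "hospital_a": {"field": "glucose_mg_dl", "to_mg_dl": 1.0},
    "hospital_b": {"field": "GLU_mmol_L", "to_mg_dl": 18.016},
}


def to_common_record(site, raw):
    """Convert a site-specific record into one shared name and unit."""
    mapping = SITE_MAPPINGS[site]
    value_mg_dl = raw[mapping["field"]] * mapping["to_mg_dl"]
    return {
        "patient_id": raw["patient_id"],
        "source_site": site,
        "glucose_mg_dl": round(value_mg_dl, 1),
    }


# Made-up records from two sites describing the same kind of measurement.
record_a = to_common_record("hospital_a", {"patient_id": "A-001", "glucose_mg_dl": 90})
record_b = to_common_record("hospital_b", {"patient_id": "B-042", "GLU_mmol_L": 5.0})

print(record_a)
print(record_b)
```

Once both records share the same field name and unit, they can be pooled and analyzed together without per-site special cases.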

Research data is generally scientifically rigorous, designed for systematic collection, and verified by academia and review bodies. A further distinguishing feature is its high quality: before the research begins, review committees such as an IRB examine the legality and suitability of the data subjects and the collection methods.

Other Data

Some data is not about health itself, such as personal payment information, yet becomes meaningful when combined and analyzed with other healthcare data. For example, an individual's regular payments to a fitness center seem unrelated to health. However, certain health-related figures may improve or worsen alongside them, so we can try to predict changes in an individual's health indicators by linking and analyzing the payment information.

As this example shows, the value of data can be much higher when different data types are combined. Finding worthwhile ways to use such data in combination with established healthcare data is therefore an important task for the future.
