daqapo
Despite the extensive opportunities that process mining techniques provide, the "garbage in, garbage out" principle still applies. Data quality issues are widespread in real-life data and can generate misleading results when used for analysis purposes. DaQAPO (Data Quality Assessment for Process-Oriented data) provides a set of assessment functions to identify a wide array of quality issues.
The table below summarizes the data quality assessment tests available in daqapo, after which each test is briefly demonstrated.
Function name | Description | Output |
---|---|---|
detect_activity_frequency_violations | Function that detects activity frequency anomalies per case | Summary in console + Returns activities in cases which are executed too many times |
detect_activity_order_violations | Function detecting violations in activity order | Summary in console + Returns detected orders which violate the specified order |
detect_attribute_dependencies | Function detecting violations of dependencies between attributes (i.e. condition(s) that should hold when (an)other condition(s) hold(s)) | Summary in console + Returns rows with dependency violations |
detect_case_id_sequence_gaps | Function detecting gaps in the sequence of case identifiers | Summary in console + Returns case IDs which should be expected to be present |
detect_conditional_activity_presence | Function detecting violations of conditional activity presence (i.e. activity/activities that should be present when (a) particular condition(s) hold(s)) | Summary in console + Returns cases violating conditional activity presence |
detect_duration_outliers | Function detecting duration outliers for a particular activity | Summary in console + Returns rows with outliers |
detect_inactive_periods | Function detecting inactive periods, i.e. periods of time in which no activity executions/arrivals are recorded | Summary in console + Returns periods of inactivity |
detect_incomplete_cases | Function detecting incomplete cases in terms of the activities that need to be recorded for a case | Summary in console + Returns traces in which the mentioned activities are not present |
detect_incorrect_activity_names | Function returning the incorrect activity labels in the log | Summary in console + Returns rows with incorrect activities |
detect_missing_values | Function detecting missing values at different levels of aggregation | Summary in console + Returns rows with NAs |
detect_multiregistration | Function detecting the registration of a series of events in a short time period for the same case or by the same resource | Summary in console + Returns rows with multiregistration on resource or case level |
detect_overlaps | Checks if a resource has performed two activities in parallel | Data frame containing the activities, the number of overlaps and average overlap in minutes |
detect_related_activities | Function detecting missing related activities, i.e. activities that should be registered because another activity is registered for a case | Summary in console + Returns cases violating related activities |
detect_similar_labels | Function detecting potential spelling mistakes | Table showing similarities for each label |
detect_time_anomalies | Function detecting activity executions with negative or zero duration | Summary in console + Returns rows with negative or zero durations |
detect_unique_values | Function listing all distinct combinations of the given log attributes | Summary in console + Returns all unique combinations of values in given columns |
detect_value_range_violations | Function detecting violations of the range of acceptable values | Summary in console + Returns rows with value range infringements |
In the examples below, we use the dataset hospital_actlog, which is an artificial event log with data quality issues provided by daqapo.
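The examples assume that the package and the pipe operator are available in the session. A minimal setup sketch, assuming daqapo and dplyr are already installed and that dplyr is used to supply `%>%`:

```r
# Attach daqapo (assessment functions and the example data)
library(daqapo)
# dplyr is attached here only to make the %>% pipe available
library(dplyr)

# Load the artificial activity log shipped with daqapo
data("hospital_actlog")

# Quick look at the log before running any checks
hospital_actlog
```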
hospital_actlog %>%
  detect_activity_frequency_violations("Registration" = 1,
                                       "Clinical exam" = 1)
## *** OUTPUT ***
## For 3 cases in the activity log (13.6363636363636%) an anomaly is detected.
## The anomalies are spread over the following cases:
## # A tibble: 3 x 3
## patient_visit_nr activity n
## <dbl> <chr> <int>
## 1 518 Registration 3
## 2 512 Clinical exam 2
## 3 535 Registration 2
hospital_actlog %>%
  detect_activity_order_violations(activity_order = c("Registration", "Triage", "Clinical exam",
                                                      "Treatment", "Treatment evaluation"))
## Warning in detect_activity_order_violations.activitylog(., activity_order =
## c("Registration", : Some activity instances within the same case overlap. Use
## detect_overlaps to investigate further.
## Warning in detect_activity_order_violations.activitylog(., activity_order
## = c("Registration", : Not all specified activities occur in each case. Use
## detect_incomplete_cases to investigate further.
## Selected timestamp parameter value: both
## *** OUTPUT ***
## It was checked whether the activity order Registration - Triage - Clinical exam - Treatment - Treatment evaluation is respected.
## This activity order is respected for 18 (81.82%) of the cases and not for4 (18.18%) of the cases.
## For cases for which the aformentioned activity order is not respected, the following order is detected (ordered by decreasing frequeny of occurrence):
## # A tibble: 4 x 3
## activity_list n case_ids
## <chr> <int> <chr>
## 1 Registration - Registration - Registration 1 518
## 2 Registration - Registration - Triage - Clinical exam - Treatme~ 1 535
## 3 Registration - Triage - Clinical exam - Clinical exam 1 512
## 4 Triage - Registration 1 521
hospital_actlog %>%
  detect_attribute_dependencies(antecedent = activity == "Registration",
                                consequent = startsWith(originator, "Clerk"))
## *** OUTPUT ***
## The following statement was checked: if condition(s) ~activity == "Registration" hold(s), then ~startsWith(originator, "Clerk") should also hold.
## This statement holds for 12 (85.71%) of the rows in the activity log for which the first condition(s) hold and does not hold for 2 (14.29%) of these rows.
## For the following rows, the first condition(s) hold(s), but the second condition does not:
## # A tibble: 2 x 7
## patient_visit_nr activity originator start complete
## <dbl> <chr> <chr> <dttm> <dttm>
## 1 528 Registr~ Nurse 6 2017-11-21 18:10:17 2017-11-21 18:15:04
## 2 534 Registr~ <NA> 2017-11-22 18:35:00 2017-11-22 18:37:00
## # ... with 2 more variables: triagecode <dbl>, specialization <chr>
hospital_actlog %>%
  detect_case_id_sequence_gaps()
## *** OUTPUT ***
## It was checked whether there are gaps in the sequence of case IDs
## From the 27 expected cases in the activity log, ranging from 510 to 536, 5 (18.52%) are missing.
## These case numbers are:
## case present
## 1 511 FALSE
## 2 513 FALSE
## 3 514 FALSE
## 4 515 FALSE
## 5 516 FALSE
hospital_actlog %>%
  detect_conditional_activity_presence(condition = specialization == "TRAU",
                                       activities = "Clinical exam")
## *** OUTPUT ***
## The following statement was checked: if condition(s) ~specialization == "TRAU" hold(s), then activity/activities Clinical exam should be recorded
## The condition(s) hold(s) for 2 cases. From these cases:
## - the specified activity/activities is/are recorded for 2 case(s) (100%)
## - the specified activity/activities is/are not recorded for 0 case(s) (0%)
hospital_actlog %>%
  detect_duration_outliers(Treatment = duration_within(bound_sd = 1))
## *** OUTPUT ***
## Outliers are detected for following activities
## Treatment Lower bound: 5.06 Upper bound: 22.2
## A total of 1 is detected (1.89% of the activity executions)
## For the following activity instances, outliers are detected:
## # A tibble: 1 x 13
## patient_visit_nr activity originator start complete
## <dbl> <chr> <chr> <dttm> <dttm>
## 1 523 Treatme~ Nurse 17 2017-11-21 18:26:04 2017-11-21 18:55:00
## # ... with 8 more variables: triagecode <dbl>, specialization <chr>,
## # duration <dbl>, mean <dbl>, sd <dbl>, bound_sd <dbl>, lower_bound <dbl>,
## # upper_bound <dbl>
hospital_actlog %>%
  detect_duration_outliers(Treatment = duration_within(lower_bound = 0, upper_bound = 15))
## *** OUTPUT ***
## Outliers are detected for following activities
## Treatment Lower bound: 0 Upper bound: 15
## A total of 1 is detected (1.89% of the activity executions)
## For the following activity instances, outliers are detected:
## # A tibble: 1 x 13
## patient_visit_nr activity originator start complete
## <dbl> <chr> <chr> <dttm> <dttm>
## 1 523 Treatme~ Nurse 17 2017-11-21 18:26:04 2017-11-21 18:55:00
## # ... with 8 more variables: triagecode <dbl>, specialization <chr>,
## # duration <dbl>, mean <dbl>, sd <dbl>, bound_sd <dbl>, lower_bound <dbl>,
## # upper_bound <dbl>
hospital_actlog %>%
  detect_inactive_periods(threshold = 30)
## Selected timestamp parameter value: both
## Selected inactivity type:arrivals
## *** OUTPUT ***
## Specified threshold of 30 minutes is violated 9 times.
## Threshold is violated in the following periods:
## # A tibble: 9 x 3
## period_start period_end time_gap
## <dttm> <dttm> <dbl>
## 1 2017-11-20 10:20:06 2017-11-21 11:35:16 1515.
## 2 2017-11-21 11:22:16 2017-11-21 11:59:41 37.4
## 3 2017-11-21 12:05:52 2017-11-21 13:43:16 97.4
## 4 2017-11-21 14:06:09 2017-11-21 15:12:17 66.1
## 5 2017-11-21 15:18:19 2017-11-21 16:42:08 83.8
## 6 2017-11-21 17:06:10 2017-11-21 18:02:10 56
## 7 2017-11-21 18:15:04 2017-11-22 10:04:57 950.
## 8 2017-11-22 10:32:56 2017-11-22 16:30:00 357.
## 9 2017-11-22 17:00:00 2017-11-22 18:00:00 60
hospital_actlog %>%
  detect_incomplete_cases(activities = c("Registration", "Triage", "Clinical exam",
                                         "Treatment", "Treatment evaluation"))
## *** OUTPUT ***
## It was checked whether the activities Clinical exam, Registration, Treatment, Treatment evaluation, Triage are present for cases.
## These activities are present for 4 (39.62%) of the cases and are not present for 18 (60.38%) of the cases.
## Note: this function only checks the presence of activities for a particular case, not the completeness of these entries in the activity log or the order of activities.
## For cases for which the aforementioned activities are not all present, the following activities are recorded (ordered by decreasing frequeny of occurrence):
## # A tibble: 9 x 3
## activity n case_ids
## <chr> <int> <chr>
## 1 Triage 11 510 - 512 - 517 - 521 - 524 - 525 - 526 - 527 - 528 ~
## 2 Registration 9 512 - 518 - 518 - 518 - 521 - 522 - 527 - 528 - 534
## 3 Clinical exam 5 512 - 510 - 527 - 528 - 512
## 4 Treatment evaluat~ 2 529 - 532
## 5 0 1 533
## 6 registration 1 510
## 7 Trage 1 520
## 8 Treatment 1 532
## 9 Triaga 1 522
hospital_actlog %>%
  detect_incorrect_activity_names(allowed_activities = c("Registration", "Triage", "Clinical exam",
                                                         "Treatment", "Treatment evaluation"))
## *** OUTPUT ***
## 4 out of 9 (44.44% ) activity labels are identified to be incorrect.
## These activity labels are:
## registration - Trage - Triaga - 0
## Given this information, 4 of 53 (7.55%) rows in the activity log are incorrect. These are the following:
## # A tibble: 4 x 7
## patient_visit_nr activity originator start complete
## <dbl> <chr> <chr> <dttm> <dttm>
## 1 510 registr~ Clerk 9 2017-11-20 10:18:17 2017-11-20 10:20:06
## 2 520 Trage Nurse 17 2017-11-21 13:43:16 2017-11-21 13:39:00
## 3 522 Triaga Nurse 5 2017-11-21 15:15:25 2017-11-21 15:18:04
## 4 533 0 <NA> 2017-11-22 18:35:00 2017-11-22 18:37:00
## # ... with 2 more variables: triagecode <dbl>, specialization <chr>
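Besides printing a summary, each check returns the offending rows, so the result can be stored and inspected further. A minimal sketch, assuming the returned object behaves like an ordinary tibble (as the printed output suggests) and that dplyr is attached; the names `allowed` and `incorrect_rows` are illustrative only:

```r
library(daqapo)
library(dplyr)

# The activity labels considered valid for this log
allowed <- c("Registration", "Triage", "Clinical exam",
             "Treatment", "Treatment evaluation")

# Store the rows flagged by the check instead of only printing them
incorrect_rows <- hospital_actlog %>%
  detect_incorrect_activity_names(allowed_activities = allowed)

# Count how often each incorrect label occurs
# (count() is plain dplyr, not a daqapo function)
incorrect_rows %>%
  count(activity, sort = TRUE)
```

Keeping the flagged rows in a variable makes it easier to hand them to a domain expert or to decide which labels to correct before further analysis.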
hospital_actlog %>%
  detect_missing_values()
## Selected level of aggregation:overview
## *** OUTPUT ***
## Absolute number of missing values per column:
## Relative number of missing values per column (expressed as percentage):
## Overview of activity log rows which are incomplete:
##
## patient_visit_nr 0
## activity 0
## originator 2
## start 1
## complete 0
## triagecode 1
## specialization 0
##
## patient_visit_nr 0.000000
## activity 0.000000
## originator 3.773585
## start 1.886792
## complete 0.000000
## triagecode 1.886792
## specialization 0.000000
## # A tibble: 4 x 7
## patient_visit_nr activity originator start complete
## <dbl> <chr> <chr> <dttm> <dttm>
## 1 510 Clinica~ Doctor 7 2017-11-20 11:35:01 2017-11-20 11:36:09
## 2 533 0 <NA> 2017-11-22 18:35:00 2017-11-22 18:37:00
## 3 534 Registr~ <NA> 2017-11-22 18:35:00 2017-11-22 18:37:00
## 4 512 Clinica~ Doctor 7 NA 2017-11-20 11:33:57
## # ... with 2 more variables: triagecode <dbl>, specialization <chr>
hospital_actlog %>%
  detect_missing_values(level_of_aggregation = "activity")
## Selected level of aggregation:activity
## *** OUTPUT ***
## Absolute number of missing values per column (per activity):
## Relative number of missing values per column (per activity, expressed as percentage):
## Overview of activity log rows which are incomplete:
## # A tibble: 9 x 7
## activity patient_visit_nr originator start complete triagecode specialization
## <chr> <int> <int> <int> <int> <int> <int>
## 1 0 0 1 0 0 0 0
## 2 Clinical~ 0 0 1 0 1 0
## 3 registra~ 0 0 0 0 0 0
## 4 Registra~ 0 1 0 0 0 0
## 5 Trage 0 0 0 0 0 0
## 6 Treatment 0 0 0 0 0 0
## 7 Treatmen~ 0 0 0 0 0 0
## 8 Triaga 0 0 0 0 0 0
## 9 Triage 0 0 0 0 0 0
## # A tibble: 9 x 7
## activity patient_visit_nr originator start complete triagecode specialization
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 0 1 0 0 0 0
## 2 Clinical~ 0 0 0.111 0 0.111 0
## 3 registra~ 0 0 0 0 0 0
## 4 Registra~ 0 0.0714 0 0 0 0
## 5 Trage 0 0 0 0 0 0
## 6 Treatment 0 0 0 0 0 0
## 7 Treatmen~ 0 0 0 0 0 0
## 8 Triaga 0 0 0 0 0 0
## 9 Triage 0 0 0 0 0 0
## # A tibble: 4 x 7
## patient_visit_nr activity originator start complete
## <dbl> <chr> <chr> <dttm> <dttm>
## 1 510 Clinica~ Doctor 7 2017-11-20 11:35:01 2017-11-20 11:36:09
## 2 533 0 <NA> 2017-11-22 18:35:00 2017-11-22 18:37:00
## 3 534 Registr~ <NA> 2017-11-22 18:35:00 2017-11-22 18:37:00
## 4 512 Clinica~ Doctor 7 NA 2017-11-20 11:33:57
## # ... with 2 more variables: triagecode <dbl>, specialization <chr>
hospital_actlog %>%
  detect_missing_values(
    level_of_aggregation = "column",
    column = "triagecode")
## Selected level of aggregation:column
## *** OUTPUT ***
## Absolute number of missing values in columntriagecode:1
## Relative number of missing values in columntriagecode(expressed as percentage):1.88679245283019
##
## Overview of activity log rows in whichtriagecodeis missing:
## # A tibble: 1 x 7
## patient_visit_nr activity originator start complete
## <dbl> <chr> <chr> <dttm> <dttm>
## 1 510 Clinica~ Doctor 7 2017-11-20 11:35:01 2017-11-20 11:36:09
## # ... with 2 more variables: triagecode <dbl>, specialization <chr>
hospital_actlog %>%
  detect_multiregistration(threshold_in_seconds = 10)
## Selected level of aggregation: resource
## Selected timestamp parameter value: complete
## *** OUTPUT ***
## Multi-registration is detected for 4 of the 12 resources (33.33%). These resources are:
## Doctor 7 - Nurse 5 - Nurse 27 - NA
## For the following rows in the activity log, multi-registration is detected:
##
## # A tibble: 9 x 7
## patient_visit_nr activity originator start complete
## <dbl> <chr> <chr> <dttm> <dttm>
## 1 512 Clinica~ Doctor 7 2017-11-20 11:27:12 2017-11-20 11:33:57
## 2 512 Clinica~ Doctor 7 NA 2017-11-20 11:33:57
## 3 524 Triage Nurse 5 2017-11-21 17:04:03 2017-11-21 17:06:05
## 4 525 Triage Nurse 5 2017-11-21 17:04:13 2017-11-21 17:06:08
## 5 526 Triage Nurse 5 2017-11-21 17:04:15 2017-11-21 17:06:10
## 6 536 Triage Nurse 27 2017-11-22 15:15:39 2017-11-22 15:25:01
## 7 536 Treatme~ Nurse 27 2017-11-22 15:15:41 2017-11-22 15:25:03
## 8 533 0 <NA> 2017-11-22 18:35:00 2017-11-22 18:37:00
## 9 534 Registr~ <NA> 2017-11-22 18:35:00 2017-11-22 18:37:00
## # ... with 2 more variables: triagecode <dbl>, specialization <chr>
hospital_actlog %>%
  detect_overlaps()
## # A tibble: 7 x 4
## activity_a activity_b n avg_overlap_mins
## <chr> <chr> <int> <dbl>
## 1 Clinical exam Treatment 2 8.17
## 2 Registration Clinical exam 1 1.9
## 3 Registration Triaga 1 2.65
## 4 Registration Triage 1 1.93
## 5 Triage Clinical exam 2 5.63
## 6 Triage Registration 1 0.817
## 7 Triage Treatment 1 9.33
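The table lists detect_related_activities, but it is not demonstrated alongside the other tests. The sketch below is a hedged illustration: it assumes the function accepts `antecedent` and `consequent` arguments naming activities (argument names recalled from the package documentation; verify with `?detect_related_activities`). The rule checked, that a recorded "Treatment evaluation" implies a recorded "Treatment" for the same case, is chosen purely for illustration. No output is shown because it was not part of the original demonstration.

```r
library(daqapo)
library(dplyr)

# Assumed rule: whenever "Treatment evaluation" is registered for a case,
# "Treatment" should be registered for that case as well.
# The argument names antecedent/consequent are an assumption; see
# ?detect_related_activities for the authoritative signature.
violations <- hospital_actlog %>%
  detect_related_activities(antecedent = "Treatment evaluation",
                            consequent = "Treatment")

# Inspect the cases reported as violating the rule
violations
```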
hospital_actlog %>%
  detect_similar_labels(column_labels = "activity", max_edit_distance = 3)
## # A tibble: 5 x 3
## column_labels labels similar_to
## <chr> <chr> <chr>
## 1 activity registration Registration
## 2 activity Registration registration
## 3 activity Triage Trage - Triaga
## 4 activity Trage Triage - Triaga
## 5 activity Triaga Triage - Trage
hospital_actlog %>%
  detect_time_anomalies()
## Selected anomaly type: both
## *** OUTPUT ***
## For 5 rows in the activity log (9.43%), an anomaly is detected.
## The anomalies are spread over the activities as follows:
## Anomalies are found in the following rows:
## # A tibble: 3 x 3
## # Groups: activity [3]
## activity type n
## <chr> <chr> <int>
## 1 Registration negative duration 3
## 2 Clinical exam zero duration 1
## 3 Trage negative duration 1
## # A tibble: 5 x 9
## patient_visit_nr activity originator start complete
## <dbl> <chr> <chr> <dttm> <dttm>
## 1 518 Registr~ Clerk 12 2017-11-21 11:45:16 2017-11-21 11:22:16
## 2 518 Registr~ Clerk 6 2017-11-21 11:45:16 2017-11-21 11:22:16
## 3 518 Registr~ Clerk 9 2017-11-21 11:45:16 2017-11-21 11:22:16
## 4 520 Trage Nurse 17 2017-11-21 13:43:16 2017-11-21 13:39:00
## 5 528 Clinica~ Doctor 1 2017-11-21 19:00:00 2017-11-21 19:00:00
## # ... with 4 more variables: triagecode <dbl>, specialization <chr>,
## # duration <dbl>, type <chr>
hospital_actlog %>%
  detect_unique_values(column_labels = "activity")
## *** OUTPUT ***
## Distinct entries are computed for the following columns:
## activity
## # A tibble: 9 x 1
## activity
## <chr>
## 1 registration
## 2 Registration
## 3 Triage
## 4 Clinical exam
## 5 Trage
## 6 Treatment
## 7 Triaga
## 8 Treatment evaluation
## 9 0
hospital_actlog %>%
  detect_unique_values(column_labels = c("activity", "originator"))
## *** OUTPUT ***
## Distinct entries are computed for the following columns:
## activity - originator
## # A tibble: 22 x 2
## activity originator
## <chr> <chr>
## 1 registration Clerk 9
## 2 Registration Clerk 12
## 3 Triage Nurse 27
## 4 Clinical exam Doctor 7
## 5 Triage Nurse 17
## 6 Registration Clerk 6
## 7 Registration Clerk 9
## 8 Trage Nurse 17
## 9 Clinical exam Doctor 4
## 10 Registration Clerk 3
## # ... with 12 more rows
hospital_actlog %>%
  detect_value_range_violations(triagecode = domain_numeric(from = 0, to = 5))
## *** OUTPUT ***
## The domain range for column triagecode is checked.
## Values allowed between 0 and 5
## The values fall within the specified domain range for 46 (86.79%) of the rows in the activity log and outside the domain range for 7 (13.21%) of these rows.
##
## The following rows fall outside the specified domain range for indicated column:
## # A tibble: 7 x 8
## column_checked patient_visit_nr activity originator start
## <chr> <dbl> <chr> <chr> <dttm>
## 1 triagecode 510 Clinica~ Doctor 7 2017-11-20 11:35:01
## 2 triagecode 529 Treatme~ Doctor 1 2017-11-22 16:30:00
## 3 triagecode 530 Triage Nurse 17 2017-11-22 18:00:00
## 4 triagecode 531 Triage Nurse 17 2017-11-22 18:05:00
## 5 triagecode 532 Treatme~ Nurse 17 2017-11-22 18:15:00
## 6 triagecode 532 Treatme~ Doctor 7 2017-11-22 18:27:00
## 7 triagecode 533 0 <NA> 2017-11-22 18:35:00
## # ... with 3 more variables: complete <dttm>, triagecode <dbl>,
## # specialization <chr>