Next to the more generic data filtering methods, there are filters designed specifically for event data.
The filters for event data subsetting can mostly be divided into two types: event filters and case filters. Event filters subset parts of cases based on criteria applied to the events (e.g. the resource which performed them), while case filters subset complete cases based on criteria applied to the cases (e.g. the trace length).
Each filter has a reverse argument, which makes it easy to invert the selection. Furthermore, each filter has an interface alternative, which can be called by adding an i before the function name.
The filter_activity function can be used to filter events by activity name. It has three arguments: the event log, a vector of activity names, and the reverse argument.
patients %>%
filter_activity(c("X-Ray", "Blood test")) %>%
activities
## # A tibble: 2 x 3
## handling absolute_frequency relative_frequency
## <fct> <int> <dbl>
## 1 X-Ray 261 0.524
## 2 Blood test 237 0.476
As one can see, there are only 2 distinct activities left in the event log.
It is also possible to filter on activity frequency. This filter uses a percentile cut off, and will look at those activities which are most frequent until the required percentage of events has been reached. Thus, a percentile cut off of 80% will look at the activities needed to represent 80% of the events. In the example below, the least frequent activities covering 50% of the event log are selected, since the reverse argument is true.
patients %>%
filter_activity_frequency(percentage = 0.5, reverse = TRUE) %>%
activities
## # A tibble: 4 x 3
## handling absolute_frequency relative_frequency
## <fct> <int> <dbl>
## 1 Check-out 492 0.401
## 2 X-Ray 261 0.213
## 3 Blood test 237 0.193
## 4 MRI SCAN 236 0.192
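To make the percentile cut-off concrete, the following base R sketch redoes the selection by hand. The counts are the absolute frequencies reported in the outputs on this page. The logic (sort by frequency, accumulate until the target share is first reached) is an assumption based on the description above, not edeaR's actual implementation.

```r
# Absolute activity frequencies, as reported in the outputs on this page
counts <- c("Registration" = 500, "Triage and Assessment" = 500,
            "Discuss Results" = 495, "Check-out" = 492,
            "X-Ray" = 261, "Blood test" = 237, "MRI SCAN" = 236)

percentage <- 0.5

# Sort from most to least frequent and compute the cumulative share of events
counts <- sort(counts, decreasing = TRUE)
cum_share <- cumsum(counts) / sum(counts)

# Keep the most frequent activities until the target share is first reached
n_keep <- which(cum_share >= percentage)[1]
selected <- names(counts)[seq_len(n_keep)]

# reverse = TRUE keeps the complement instead
reversed <- setdiff(names(counts), selected)
print(reversed)
```

Under these assumptions, the three most frequent activities already cover about 55% of the events, so the reversed selection consists of Check-out, X-Ray, Blood test and MRI SCAN, the same four activities as in the output above.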
Instead of providing a target percentage, we can provide a target frequency interval. For example, retain only the activities which occur between 300 and 500 times.
patients %>%
filter_activity_frequency(interval = c(300,500)) %>%
activities
## # A tibble: 4 x 3
## handling absolute_frequency relative_frequency
## <fct> <int> <dbl>
## 1 Registration 500 0.252
## 2 Triage and Assessment 500 0.252
## 3 Discuss Results 495 0.249
## 4 Check-out 492 0.248
When we don’t know the maximal frequency (500 in this case), we can use an open interval by setting the upper limit to NA.
patients %>%
filter_activity_frequency(interval = c(300, NA)) %>%
activities
## # A tibble: 4 x 3
## handling absolute_frequency relative_frequency
## <fct> <int> <dbl>
## 1 Registration 500 0.252
## 2 Triage and Assessment 500 0.252
## 3 Discuss Results 495 0.249
## 4 Check-out 492 0.248
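The interval logic, including the NA convention for an open end, can be sketched in base R as follows. The helper in_interval is hypothetical and not part of edeaR, and treating both bounds as inclusive is an assumption (the outputs above only show that the upper bound of 500 is inclusive).

```r
# Absolute activity frequencies, as reported in the outputs on this page
counts <- c("Registration" = 500, "Triage and Assessment" = 500,
            "Discuss Results" = 495, "Check-out" = 492,
            "X-Ray" = 261, "Blood test" = 237, "MRI SCAN" = 236)

# Hypothetical helper: TRUE where x lies inside the interval.
# An NA limit is treated as an open end (assumption: bounds are inclusive).
in_interval <- function(x, interval) {
  lower <- if (is.na(interval[1])) -Inf else interval[1]
  upper <- if (is.na(interval[2])) Inf else interval[2]
  x >= lower & x <= upper
}

closed <- names(counts)[in_interval(counts, c(300, 500))]
open   <- names(counts)[in_interval(counts, c(300, NA))]
print(open)
```

With these counts, both the closed and the open interval select the same four activities, which is consistent with the two outputs above being identical.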
Similar to the activity filter, the resource filter can be used to filter events by listing one or more resources.
patients %>%
filter_resource(c("r1","r4")) %>%
resource_frequency("resource")
## # A tibble: 2 x 3
## employee absolute relative
## <fct> <int> <dbl>
## 1 r1 500 0.679
## 2 r4 236 0.321
Instead of filtering events by the resource that performed the activity, we can also filter events by the frequency of the resource. This works in the same way as the activity frequency filter. The filter below gives us the 80% of activity instances performed by the most common resources.
patients %>%
filter_resource_frequency(perc = 0.80) %>%
resources()
## # A tibble: 5 x 3
## employee absolute_frequency relative_frequency
## <fct> <int> <dbl>
## 1 r1 500 0.222
## 2 r2 500 0.222
## 3 r6 495 0.220
## 4 r7 492 0.219
## 5 r5 261 0.116
Alternatively, using the interval argument, we can select resources who perform between 200 and 300 activity instances.
patients %>%
filter_resource_frequency(interval = c(200,300)) %>%
resources()
## # A tibble: 3 x 3
## employee absolute_frequency relative_frequency
## <fct> <int> <dbl>
## 1 r5 261 0.356
## 2 r3 237 0.323
## 3 r4 236 0.322
The trim filter is a special event filter, as it also takes into account the notion of cases. It trims cases such that they start with a certain activity and end with a certain activity. It requires two lists: one for possible start activities and one for possible end activities. The cases will be trimmed from the first appearance of a start activity until the last appearance of an end activity. When reversed, these slices of the event log will be removed instead of preserved.
patients %>%
filter_trim(start_activities = "Registration", end_activities = c("MRI SCAN","X-Ray")) %>%
process_map(type = performance())
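The trimming rule can be illustrated on a single trace with a small base R sketch. The function trim_trace is hypothetical and only mirrors the description above. How edeaR treats cases in which no start or end activity occurs is not stated here, so the sketch simply assumes that nothing is selected in that situation. The example trace is illustrative.

```r
# Hypothetical helper mirroring the trim rule on one trace (a character vector)
trim_trace <- function(trace, start_activities, end_activities, reverse = FALSE) {
  first_start   <- match(TRUE, trace %in% start_activities)  # NA if absent
  end_positions <- which(trace %in% end_activities)

  # Mark the slice from the first start activity to the last end activity
  keep <- rep(FALSE, length(trace))
  if (!is.na(first_start) && length(end_positions) > 0 &&
      max(end_positions) >= first_start) {
    keep[first_start:max(end_positions)] <- TRUE
  }

  # reverse = TRUE removes the slice instead of preserving it
  if (reverse) trace[!keep] else trace[keep]
}

# Illustrative trace
trace <- c("Registration", "Triage and Assessment", "X-Ray",
           "Discuss Results", "Check-out")

trimmed <- trim_trace(trace, "Registration", c("MRI SCAN", "X-Ray"))
removed <- trim_trace(trace, "Registration", c("MRI SCAN", "X-Ray"), reverse = TRUE)
print(trimmed)
```

Here the trimmed trace runs from Registration up to and including X-Ray, while the reversed variant keeps only the part after X-Ray.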
Instead of trimming cases to a particular start and/or end activity, we can also trim cases to a particular time window. For this we use the function filter_time_period with filter_method trim. This filter needs a time interval, which is a vector of length 2 containing date/datetime values. These can be created easily using lubridate functions, e.g. ymd for year-month-day formats.
This example takes only activity instances which happened (at least partly, i.e. some events) in December of 2017.
library(lubridate)
patients %>%
filter_time_period(interval = ymd(c(20171201, 20171231)), filter_method = "trim") %>%
summary()
## Number of events: 290
## Number of cases: 36
## Number of traces: 13
## Number of distinct activities: 7
## Average trace length: 8.055556
##
## Start eventlog: 2017-11-30 20:29:12
## End eventlog: 2017-12-31 08:00:08
## handling patient employee handling_id
## Blood test :30 Length:290 r1:52 Length:290
## Check-out :48 Class :character r2:52 Class :character
## Discuss Results :54 Mode :character r3:30 Mode :character
## MRI SCAN :30 r4:30
## Registration :52 r5:24
## Triage and Assessment:52 r6:54
## X-Ray :24 r7:48
## registration_type time .order
## complete:145 Min. :2017-11-30 20:29:12 Min. : 1.00
## start :145 1st Qu.:2017-12-06 01:04:43 1st Qu.: 73.25
## Median :2017-12-13 13:12:47 Median :145.50
## Mean :2017-12-13 20:14:51 Mean :145.50
## 3rd Qu.:2017-12-19 18:09:13 3rd Qu.:217.75
## Max. :2017-12-31 08:00:08 Max. :290.00
##
Using a different filter method (start, complete, contained or intersecting), this filter can also act as a case filter (see below).
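Instead of filtering events, or parts of cases, we can also filter event data by taking (or leaving) cases as a whole. Using edeaR, cases can be filtered on throughput time, processing time, trace length, activity presence, start and end activities, precedence relations, trace frequency and time period.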
Filtering on throughput time can be done in an absolute and relative way, just as for many other filters.
For instance, we can filter cases with a throughput time between 50 and 100 hours. Notice that setting the time unit argument appropriately is important in this case.
patients %>%
filter_throughput_time(interval = c(50, 100), units = "hours") %>%
throughput_time(units = "hours")
## min q1 median mean q3 max st_dev iqr
## 50.08389 63.55361 78.62292 76.84821 87.43417 99.95861 13.76536 23.88056
## attr(,"units")
## [1] "hours"
Alternatively, we can select the 50% of cases with the lowest throughput time.
patients %>%
filter_throughput_time(percentage = 0.5) %>%
throughput_time(units = "hours")
## min q1 median mean q3 max st_dev iqr
## 35.90611 81.16660 103.51056 99.84844 120.75403 145.86722 26.59339 39.58743
## attr(,"units")
## [1] "hours"
In both cases, the selection can be negated using the reverse argument. When using an interval, one of the limits can be set to NA to create an open interval.
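Filtering on processing time happens in exactly the same way as the filter on throughput time, as the examples below show. Note that in the first example no case has a processing time between 50 and 100 hours, so the result is an empty log, which explains the warnings and the undefined summary statistics.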
patients %>%
filter_processing_time(interval = c(50, 100), units = "hours") %>%
processing_time(units = "hours")
## Warning: Factor `handling` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `employee` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `handling` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning: Factor `employee` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## Warning in min.default(structure(numeric(0), class = c("POSIXct", "POSIXt": no
## non-missing arguments to min; returning Inf
## Warning in max.default(structure(numeric(0), class = c("POSIXct", "POSIXt": no
## non-missing arguments to max; returning -Inf
## Warning: Factor `handling` contains implicit NA, consider using
## `forcats::fct_explicit_na`
## min q1 median mean q3 max st_dev iqr
## -Inf -Inf -Inf -Inf -Inf -Inf NA NaN
## attr(,"units")
## [1] "hours"
patients %>%
filter_processing_time(percentage = 0.5) %>%
processing_time(units = "hours")
## min q1 median mean q3 max st_dev iqr
## 10.717778 23.043750 24.945000 24.387612 26.423264 27.726944 2.657736 3.379514
## attr(,"units")
## [1] "hours"
Filtering on trace length is similar to filters on processing or throughput time. Only the units argument is not needed here.
patients %>%
filter_trace_length(interval = c(2, 5)) %>%
trace_length()
## min q1 median mean q3 max st_dev iqr
## 2.000000 5.000000 5.000000 4.951128 5.000000 5.000000 0.338502 0.000000
patients %>%
filter_trace_length(percentage = 0.5) %>%
trace_length()
## min q1 median mean q3 max st_dev iqr
## 5.0000000 6.0000000 6.0000000 5.9360000 6.0000000 6.0000000 0.2452439 0.0000000
When looking at control-flow, we can select cases that contain a specific activity, for instance an X-Ray scan.
patients %>%
filter_activity_presence("X-Ray") %>%
traces
## # A tibble: 3 x 3
## trace absolute_frequen~ relative_frequen~
## <chr> <int> <dbl>
## 1 Registration,Triage and Assessment,X-Ray,~ 258 0.989
## 2 Registration,Triage and Assessment,X-Ray 2 0.00766
## 3 Registration,Triage and Assessment,X-Ray,~ 1 0.00383
Or that don’t have a specific activity.
patients %>%
filter_activity_presence("X-Ray", reverse = TRUE) %>%
traces
## # A tibble: 4 x 3
## trace absolute_frequen~ relative_frequen~
## <chr> <int> <dbl>
## 1 Registration,Triage and Assessment,Blood ~ 234 0.979
## 2 Registration,Triage and Assessment,Blood ~ 2 0.00837
## 3 Registration,Triage and Assessment 2 0.00837
## 4 Registration,Triage and Assessment,Blood ~ 1 0.00418
We can also test more than one activity. In this case, we can require “all”, “one_of” or “none” of them to be present by setting the argument method accordingly.
For example, there are no cases that have both X-Ray and MRI SCAN.
patients %>%
filter_activity_presence(c("X-Ray", "MRI SCAN"), method = "all") %>%
traces
## [1] trace absolute_frequency relative_frequency
## <0 rows> (or 0-length row.names)
Almost all cases have one of them.
patients %>%
filter_activity_presence(c("X-Ray", "MRI SCAN"), method = "one_of") %>%
traces
## # A tibble: 5 x 3
## trace absolute_frequen~ relative_frequen~
## <chr> <int> <dbl>
## 1 Registration,Triage and Assessment,X-Ray,~ 258 0.519
## 2 Registration,Triage and Assessment,Blood ~ 234 0.471
## 3 Registration,Triage and Assessment,Blood ~ 2 0.00402
## 4 Registration,Triage and Assessment,X-Ray 2 0.00402
## 5 Registration,Triage and Assessment,X-Ray,~ 1 0.00201
And 3 have none of them.
patients %>%
filter_activity_presence(c("X-Ray", "MRI SCAN"), method = "none") %>%
traces
## # A tibble: 2 x 3
## trace absolute_frequency relative_frequen~
## <chr> <int> <dbl>
## 1 Registration,Triage and Assessment 2 0.667
## 2 Registration,Triage and Assessment,Blood~ 1 0.333
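The three methods can be expressed compactly in base R for a single trace. The helper has_activities below is hypothetical, and reading “one_of” as “at least one” is an assumption. In the patients data no case contains both activities, so the outputs above cannot distinguish “at least one” from “exactly one”. The first example trace is illustrative.

```r
# Hypothetical helper: does a trace satisfy the presence rule?
has_activities <- function(trace, activities,
                           method = c("all", "one_of", "none")) {
  method  <- match.arg(method)
  present <- activities %in% trace   # one logical per requested activity
  switch(method,
         all    = all(present),
         one_of = any(present),      # assumption: "at least one"
         none   = !any(present))
}

# Illustrative traces
xray_trace  <- c("Registration", "Triage and Assessment", "X-Ray",
                 "Discuss Results", "Check-out")
short_trace <- c("Registration", "Triage and Assessment")
acts <- c("X-Ray", "MRI SCAN")

has_activities(xray_trace,  acts, "all")     # FALSE: MRI SCAN is missing
has_activities(xray_trace,  acts, "one_of")  # TRUE: X-Ray is present
has_activities(short_trace, acts, "none")    # TRUE: neither is present
```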
Another way is to select cases with a specific start and/or end activity. In the patients data set, all cases start with “Registration”. Filtering cases that don’t start with Registration therefore gives an empty log.
patients %>%
filter_endpoints(start_activities = "Registration", reverse = TRUE)
## Log of 0 events consisting of:
## 0 traces
## 0 cases
## 0 instances of 0 activities
## 0 resources
## Warning in min.default(structure(numeric(0), class = c("POSIXct", "POSIXt": no
## non-missing arguments to min; returning Inf
## Warning in max.default(structure(numeric(0), class = c("POSIXct", "POSIXt": no
## non-missing arguments to max; returning -Inf
## Events occurred from NA until NA
##
## Variables were mapped as follows:
## Case identifier: patient
## Activity identifier: handling
## Resource identifier: employee
## Activity instance identifier: handling_id
## Timestamp: time
## Lifecycle transition: registration_type
##
## # A tibble: 0 x 7
## # ... with 7 variables: handling <fct>, patient <chr>, employee <fct>,
## # handling_id <chr>, registration_type <fct>, time <dttm>, .order <int>
If we are interested in the “completed” cases, those that start with “Registration” and end with “Check-out”, we can apply the following filter.
patients %>%
filter_endpoints(start_activities = "Registration", end_activities = "Check-out") %>%
process_map()
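Another control-flow filtering approach is to look at precedences between activities. The filter_precedence function uses 5 different inputs: antecedents, consequents, precedence_type (directly_follows or eventually_follows), filter_method (all, one_of or none) and reverse.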
If there is more than one antecedent or consequent activity, the filter will test all possible pairs. The filter_method will tell the filter whether all of the rules should hold, at least one, or none are allowed.
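The pairwise testing described above can be sketched in base R for a single trace. Both functions are hypothetical and only reflect one reading of the description, not the edeaR implementation. In particular, the sketch interprets eventually_follows as “some occurrence of the consequent comes after the first occurrence of the antecedent”, which is an assumption. The example traces are illustrative.

```r
# Hypothetical helper: does one antecedent/consequent rule hold in a trace?
follows <- function(trace, antecedent, consequent, precedence_type) {
  a_pos <- which(trace == antecedent)
  c_pos <- which(trace == consequent)
  if (length(a_pos) == 0 || length(c_pos) == 0) return(FALSE)
  if (precedence_type == "directly_follows") {
    any((a_pos + 1) %in% c_pos)   # consequent immediately after antecedent
  } else {
    min(a_pos) < max(c_pos)       # consequent somewhere after antecedent
  }
}

# Hypothetical helper: test all antecedent/consequent pairs and combine them
check_precedence <- function(trace, antecedents, consequents,
                             precedence_type = c("directly_follows",
                                                 "eventually_follows"),
                             filter_method = c("all", "one_of", "none")) {
  precedence_type <- match.arg(precedence_type)
  filter_method   <- match.arg(filter_method)
  pairs <- expand.grid(a = antecedents, c = consequents,
                       stringsAsFactors = FALSE)
  holds <- mapply(follows, antecedent = pairs$a, consequent = pairs$c,
                  MoreArgs = list(trace = trace,
                                  precedence_type = precedence_type))
  switch(filter_method,
         all    = all(holds),
         one_of = any(holds),
         none   = !any(holds))
}

# Illustrative trace
trace <- c("Registration", "Triage and Assessment", "Blood test",
           "MRI SCAN", "Discuss Results", "Check-out")

check_precedence(trace, "Triage and Assessment", "Blood test",
                 "directly_follows", "all")
check_precedence(trace, "Triage and Assessment", c("Blood test", "X-Ray"),
                 "eventually_follows", "all")
check_precedence(trace, "Triage and Assessment",
                 c("Blood test", "X-Ray", "MRI SCAN"),
                 "eventually_follows", "one_of")
```

For this trace, the first and third calls return TRUE and the second returns FALSE, because X-Ray never occurs.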
For example, take the patients data. The following filter takes only cases where “Triage and Assessment” is directly followed by “Blood test”.
patients %>%
filter_precedence(antecedents = "Triage and Assessment",
consequents = "Blood test",
precedence_type = "directly_follows") %>%
traces
## # A tibble: 3 x 3
## trace absolute_frequen~ relative_frequen~
## <chr> <int> <dbl>
## 1 Registration,Triage and Assessment,Blood ~ 234 0.987
## 2 Registration,Triage and Assessment,Blood ~ 2 0.00844
## 3 Registration,Triage and Assessment,Blood ~ 1 0.00422
The following selects cases where Triage and Assessment is eventually followed by both Blood test and X-Ray, which never happens.
patients %>%
filter_precedence(antecedents = "Triage and Assessment",
consequents = c("Blood test", "X-Ray"),
precedence_type = "eventually_follows",
filter_method = "all") %>%
traces
## [1] trace absolute_frequency relative_frequency
## <0 rows> (or 0-length row.names)
The next filter selects cases where Triage and Assessment is eventually followed by at least one of the three consequents, by changing the filter method to one_of.
patients %>%
filter_precedence(antecedents = "Triage and Assessment",
consequents = c("Blood test", "X-Ray", "MRI SCAN"),
precedence_type = "eventually_follows",
filter_method = "one_of") %>%
traces
## [1] trace absolute_frequency relative_frequency
## <0 rows> (or 0-length row.names)
This final example only retains cases where Triage and Assessment is not followed by any of the three consequent activities. The result is 2 incomplete cases where the last activity was Triage and Assessment.
patients %>%
filter_precedence(antecedents = "Triage and Assessment",
consequents = c("Blood test", "X-Ray", "MRI SCAN"),
precedence_type = "eventually_follows",
filter_method = "none") %>%
traces
## # A tibble: 7 x 3
## trace absolute_frequen~ relative_frequen~
## <chr> <int> <dbl>
## 1 Registration,Triage and Assessment,X-Ray,~ 258 0.516
## 2 Registration,Triage and Assessment,Blood ~ 234 0.468
## 3 Registration,Triage and Assessment,Blood ~ 2 0.004
## 4 Registration,Triage and Assessment,X-Ray 2 0.004
## 5 Registration,Triage and Assessment 2 0.004
## 6 Registration,Triage and Assessment,X-Ray,~ 1 0.002
## 7 Registration,Triage and Assessment,Blood ~ 1 0.002
Filtering on trace frequency is similar to the filters on activity/resource frequency and the performance filters: you can choose between a percentage target and a frequency interval.
Select 80% of the cases that share the most common traces.
sepsis %>%
filter_trace_frequency(percentage = 0.8) %>%
n_cases()
## [1] 1050
Or the 20% least common ones.
sepsis %>%
filter_trace_frequency(percentage = 0.2) %>%
n_cases()
## [1] 266
Or the cases of which the trace frequency is less than 50.
sepsis %>%
filter_trace_frequency(interval = c(0,50)) %>%
n_cases()
## [1] 1050
Filtering cases by time period can be done using the filter_time_period function introduced above. There are four different methods that result in case filters: start, complete, contained and intersecting.
The following four example dotted charts show the impact of the four different methods using the same interval.
sepsis %>%
filter_time_period(interval = ymd(c(20150101, 20150131)), filter_method = "start") %>%
dotted_chart
## Joining, by = "case_id"
sepsis %>%
filter_time_period(interval = ymd(c(20150101, 20150131)), filter_method = "complete") %>%
dotted_chart
## Joining, by = "case_id"
sepsis %>%
filter_time_period(interval = ymd(c(20150101, 20150131)), filter_method = "contained") %>%
dotted_chart
## Joining, by = "case_id"
sepsis %>%
filter_time_period(interval = ymd(c(20150101, 20150131)), filter_method = "intersecting") %>%
dotted_chart
## Joining, by = "case_id"
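Since the charts themselves are not reproduced here, the difference between the four methods can be sketched on a toy log in base R. The data and the helper in_period are made up for illustration. The interpretation of each method (case starts in the period, case completes in the period, all events fall in the period, at least one event falls in the period) is an assumption based on the method names, not a statement about edeaR's implementation.

```r
# Toy event data (made up): three cases with two events each
events <- data.frame(
  case_id = c("A", "A", "B", "B", "C", "C"),
  time = as.POSIXct(c("2014-12-30", "2015-01-05",
                      "2015-01-10", "2015-01-20",
                      "2015-01-25", "2015-02-03"), tz = "UTC")
)

interval <- as.POSIXct(c("2015-01-01", "2015-01-31"), tz = "UTC")

# Hypothetical helper: TRUE where a timestamp lies inside the period
in_period <- function(times) times >= interval[1] & times <= interval[2]

# For every case, evaluate the four (assumed) method definitions
by_case <- split(events$time, events$case_id)
methods <- sapply(by_case, function(times) c(
  start        = in_period(min(times)),   # case starts in the period
  complete     = in_period(max(times)),   # case completes in the period
  contained    = all(in_period(times)),   # all events in the period
  intersecting = any(in_period(times))    # at least one event in the period
))
print(methods)
```

In this toy log, case B satisfies all four methods, case A only complete and intersecting, and case C only start and intersecting.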