Checklist: Prepare Your Own Data Extraction

Now you are ready to prepare your own data extraction. While the IT staff will create the data dump for you, you will have to tell them exactly what kind of data you need, in which format, and so on.

To make sure you are not forgetting something important, here is a checklist that you can go through before your meeting with the administrator (click on the checklist below to download a PDF version that you can print and fill out in the process).

_images/Data-Extraction-Checklist.png

In the following sections, the questions on the checklist are explained in more detail.

What you will learn:

  • What you need to think about when you get started with your own data extraction.
  • How to verify whether the data that you have received works for process mining.

Step 1: Which Process

Determine which process you want to analyze. Indicate the scope of the process (where it starts and where it ends). Often, people have a different process scope in mind while they are using the same process name.

If you are analyzing the process for someone else, make sure all stakeholders agree on the process scope.

If you need to choose a process to start with among multiple candidates, rank all candidates using the following criteria:

  • Availability of data. Can you obtain good process mining data for this process relatively easily? For example, a process that is handled in an ad-hoc manner through email may be possible to analyze to some extent, but it will require a lot of pre-processing. In contrast, a process for which you already have access to some data via the data warehouse may be better than a process that is outsourced and requires the collaboration of an external service provider.
  • Champion support. Equally important to the data availability is a good support from the team that is responsible for the process. You will need at least one strong supporter who knows the process well (domain expert) and can answer questions during your analysis. Without the support of a good champion, you will hit a wall of questions very quickly. Process mining does not happen in a vacuum and an unavailable project sponsor who wants a “surprise me” analysis is a red flag. You also need the support of the process manager in defining the main analysis goals for the process mining project.
  • Improvement potential. With data availability and champion support being equal, pick the process that is most relevant and has the most improvement potential. Unless you just want to play around for learning purposes, there must be an interest from the business in the results of your process mining analysis. You can either assess the improvement potential with a classical business case analysis or base the decision on the gut feeling and anecdotal evidence from the process experts. If you still have a tie between multiple processes, go for the one with the highest volume (for example, based on the number of cases that are processed).

If you are under time pressure to show some results from your process mining analysis within a certain time period, consider initiating the data extraction procedure for two different processes in parallel. You never know in advance which delays might occur in the IT department and which data quality problems the data will contain.

Step 2: Questions About Process

Define 3-5 analysis questions that you want to answer about this process. Try to include questions that relate to the process flow (for example, “Where do we have the most rework in our process?”) to go beyond purely number-based KPIs (for example, percentage of customers that complete a purchase). Include the process manager, or your process champion, in the discussion to make sure that the questions are relevant from a business perspective. Clarify whether the focus is on performance, compliance questions, or both.

There are two main reasons why you should define a few questions upfront:

  1. It will help focus your analysis. Otherwise it is very easy to get lost in the data. Once you start analyzing the data, you will get many more ideas. That’s great. It’s one of the advantages of process mining that it allows for explorative analysis: You can get answers to questions that you did not even know you had just by looking at what is really happening. You can add these questions to your analysis in the process. However, a few main questions as a starting point is critical not to lose sight of your analysis goals.
  2. The questions you want to answer can pose additional requirements on your data extraction. For example, if you want to benchmark the processes in different countries or regions then you need to include the country information as a attribute. Furthermore, as discussed in How Much Data Do You Need?, for some data selection methods the analysis goals (compliance vs. performance) can influence the way in which cases should be chosen.

Defining questions before you start with your first process mining analysis can be a bit of a chicken-and-egg problem: You don’t know yet what kind of questions you can answer with process mining. Don’t worry about it. You can start with your process mining understanding from the Hands-on Tutorial. Process mining is an iterative activity and you can extract more data, or define better questions, the next time you go through the cycle.

Step 3: Which IT Systems

The next step to determine which IT systems are involved in the execution of the process. You can talk to someone who works in the process and ask them which applications they use to perform their tasks. Look out especially for CRM, ERP, workflow and ticketing systems. Their data is often most closely related to the executed process, and contains the most interesting information. However, also custom systems and spreadsheets can be analyzed. Sometimes, even very old legacy systems produce log data that can be used for process mining. Finally, there may be a data warehouse that already collects data from the process you want to analyze.

Depending on the type of involved systems you have identified, your data may be stored in a number of places. Databases are typical for large process-supporting systems and custom or legacy systems. Other systems store event logs in flat file formats. In some situations, you may have to turn on the logging functionality of your systems before data is recorded.

It is possible that not all parts of the process are supported by a system that produces data. For example, there may be manual steps (like a phone call) that are not recorded. That’s OK. You will assess the coverage of the data, and the impact of missing steps, later in the analysis. If there are “blind spots” in the process, you can take this into account when you interpret the results.

If your process is supported by multiple IT systems, consider starting small by extracting the data from just one system first. Even though the full process scope cannot be analyzed, you will be able to show some first results more quickly and increase your understanding of what process mining can do. More data can be included to expand the scope of the process in the next iteration.

Step 4: Case ID

As explained in the The Minimum Requirements for an Event Log section, one of the minimum requirements that your data needs to fulfill is that it must contain a Case ID. Each case is one instance, one execution, of your process. For example, if you look at an ordering process then handling one order is one case. In an administrative process for building permits at a municipality the building permit application number is the case. For every event, you have to know which case it refers to, so that Disco can compare several executions of the process to one another.

Check whether one of the following situations applies to you:

  • The Case ID needs to be combined from multiple fields. This happens if your main case ID does not by itself uniquely identify the process instance. For example, if you analyze the tax return process at the tax office then each citizen is identified by their social security number. However, tax declarations are prepared each year. If you have data from multiple years, then you need to combine the year for which the declaration was filed in addition to the social security number. Make sure you write down all fields that you need to identify a unique case.
  • Different Case IDs are used in different parts of the process. If multiple identifiers are used throughout the process, you need to be able to follow the chain of one case from the beginning to the end. Think of the case as the “streaming object” (like one yellow dot that flows through the process in a process mining animation). For example, if in a purchasing process incoming requests are identified by a PO number and later in the process the payment activities are related to an Invoice number, then there must be a back-reference from the invoice ID to the purchase order. In your data extract, you need one continuous case ID as a reference for all steps in the process.
  • Multiple Case IDs have many-to-many relationships. If different case IDs are used throughout the process they may not always have a 1-to-1 mapping (for example two orders combined in one delivery, or one order delivered in two pieces). Take a look at the example in our article on Many-to-Many Relationships in Process Mining [1] to think through this problem and to determine from which perspective you want to look at the process. You may have to create multiple data extracts if there is more than one perspective that you want to analyze.

In the situation that multiple Case IDs used in different parts of the process, consider starting simple by looking at, for example, the PO handling and the invoice handling process in isolation (as separate processes). You can connect them to look at the end-to-end process later.

If you already know that you want to analyze alternative Case IDs (for example, the service request number as well as the customer ID in a callcenter process) then make sure to write down all the Case ID fields that should be included.

Step 5: Activities

The second minimum requirement is the Activity. The activities are the steps, or status changes, that happen in your process. IT systems may record not only activities you care about, but also less interesting debug information. Make sure you can capture the events that describe the interesting activities (for example, relevant business milestones in your process). If less interesting activities are included, this is not a problem because you can filter them out later.

Determine what you would like to use as an activity name. For example, if you are analyzing the customer journey of visitors on a corporate website, you could use either the URL or the page title of the webpage as the activity name.

If you have multiple candidates for your activity name, keep them all because you may want to change the focus on what an activity is during the analysis. For example, in a loan application process at a bank the system may record changes to both the internal status (showing the stage of the internal process) and the external status (showing the status of the application to the customer or a sales channel partner). Both perspectives can be useful. By keeping both fields in the data, you will have the full flexibility in your analysis later on.

In finding activity names in your data, most likely one of the following situations applies to you:

  1. The IT system keeps a history of steps. In this situation, you will automatically get all the different activities or status changes that have occurred in the process. Depending on the system, the history of status changes can be found in audit trails, history tables, log files, the data warehouse, or is available through a CSV export.

If you have access to the end user interface of the system, you can write down the name of the activity field shown there. If you have multiple kinds of statuses (like an internal and an external status), include them all.

Some audit trails record status changes in the form of ‘Old value’ and ‘New value’ (reflecting the previous and the new status after the change). Include both the old and the new value column in your data set and refer to our article on How To Deal With ‘Old Value / New Value’ Data Sets [2] for more information. If not just status changes but also changes to data fields are recorded in the audit trail, make sure to include an additional column that states which field was updated. This will help you to filter out the relevant steps later on.

  1. Activities are hidden in many different places. In database-centric systems (see also How Difficult Is it to Get Data For Process Mining?) you often need to go into the various business data tables to find the activities for your data set. In this situation, you will not automatically get all the different activities in the process. Instead, you need to determine the most important milestone activities with the process owner and then go look for dates or timestamp information for these activities in the data.
Write down the names of the activities you would like to see in your discovered process in the checklist. If you have access to the end user interface of the system, look up a few cases and write down the timestamps that are shown for each case for these activities. This can help the system administrator to verify whether they found the right fields for the activities that you have requested.

Make sure that the activity names in your data set are human-readable. Most processes are very complex in their behavior and you will not get very far if the steps in the process maps show merely technical status numbers or action codes. You can keep technical action codes in the data, which can be useful to trace back technical questions about the data later on. However, you should also request a mapping to a human-readable name that states what the meaning of the status is. These human-readable status names should be then included as a separate column in the data.

Similarly, if you get your activities from fields in the business database, do not just state the technical name of the data field, which can be very cryptic. Instead, include a description that reflects the name of the activity from a business perspective.

Step 6: Timestamps

The third minimum requirement is the Timestamp. You need at least one timestamp for each activity to be able to bring the events for each case in the right order. If you want to analyze activity durations, you need both a start and a completion timestamp for each activity.

Try to get your timestamps as detailed as possible. For example, if you have a time with hour and minutes then do not just take the date but make sure you include the time as well. If you have seconds or milliseconds that’s great. Even if it seems like you do not need that much detail for your performance analysis in the process, it will be helpful to have detailed timestamps to get the order of the activities right.

Sometimes, you find that many activities have the same timestamp. For example, if your IT system only records the date (and no time) then several steps within the same case may occur on the same day. See whether you can find some sequence information (like a sequence number) in the data that clarifies the order of the steps. If you have a sequence number, make sure you include it as a separate column in the data. You can use it later to sort your rows based on that field. If you don’t have a sequence number don’t worry. There are other ways to deal with same timestamps, which we will get to later. Just include the timestamp.

Furthermore, check whether one of the following situations applies to you:

  • There are more than two timestamps per activity. Sometimes, you do not just have one or two timestamps per activity but more. For example, there may be a timestamp that indicates when a task is ready to be picked up, a second one that indicates when the person started working on it, and a third one stating when the task was completed. Include all these timestamp columns in your data. Later on, you will be able to analyze different questions like, for example, how long it takes for an activity to be picked up after it has been ready, or the average times between being ready and completion.
  • Date and time are in separate columns. Because you can have multiple timestamp columns for your analysis, the date and time for a single timestamp need to be in the same column. You can create a separate column that concatenates the data and time into one field.
  • You have different timestamp patterns. If you get your timestamps from different data sources then these timestamps may have different formats. For example, one activity may have a date timestamp in the format of 01SEP2013 while another activity has a timestamp in the format of 2013-09-10 07.36.51.711899. Luckily, Disco does not force you to provide timestamps in a specific format. Instead, you can simply tell Disco how it should read your timestamps by configuring the timestamp pattern during the import step. However, you can only configure one timestamp pattern per column. Therefore, you need to make sure that you put differently formatted timestamps in separate columns. When they are in different columns, you can simply configure the pattern for each column separately during import. Refer to the article How To Deal With Data Sets That Have Different Timestamp Formats [3] for further information.
  • Your timestamps for repetitions of activities are overwritten. When you keep activity timestamps in a database or Excel sheet, there is often the assumption that the process is followed in the ideal sequence and no rework occurs. Of course this is wrong in most situations: Rework can happen, for example, because that task was not done right the first time and the corresponding process step has to be repeated at a later time. If the system does not capture all timestamps for activity repetitions (often, just the latest timestamp is kept) you lose valuable information. Refer to the example in The Assumed Process is Not Your Real Process [4] to see how you can deal with column-based data formats, which often have this problem.

If you do not have any timestamps at all, then you can still discover your process if you can be sure that the events are correctly ordered. For example, you may know that the sequence of entries in an audit trail reflects the order in which the activities have occurred. Or you may have a sequence number attribute that you can use to sort the data before importing your file. In the absence of timestamps Disco will use the order of the rows in the imported file to construct the sequences for each case. However, your analysis will be limited to the process flow perspective and you will not be able to analyze the performance of your process.

Step 7: Other Attributes

To decide which additional attributes you should include beyond the minimum requirements, it is important to know the main goals for your process analysis. For example, do you want to compare processes for different products, or for different channels? Then you need to make sure to include the relevant product category and channel fields from the source data. You find more examples for additional attributes in Other Columns.

Additional attributes can be very important in your analysis for three main reasons:

  1. They can be used for filtering. Although attributes are not normally shown in the process map itself, you can use them to focus or split your data set to compare the processes for different productions, channels, regions, etc.
  2. They can open up different perspectives. You may be able to change the focus or “unfold” your process map based on some of these attributes as shown in this article on Change in Perspective with Process Mining [5].
  3. They provide context. When you discover interesting patterns in your process, attributes can help you in the root cause analysis. Are delays occurring mostly in one particular region? Is the ideal process followed more frequently in one particular group? You can find out where things work well and use this information to promote best practices. Even a free-text attribute that contains the comment that a service employee made during a phone call can be great context information, when you look at individual cases.

Include anything that might be useful. By looking at the data attributes that are available, you may also get further ideas for your analysis. It is no problem to start out with a data set that contains dozens (e.g., up to 40 or 50 or more) additional columns with process-relevant context information. You can always leave them out later on when you import the data in Disco.

When you include attributes that can change over time (like the organizational role of a person), try to get the right attribute value from the time that the activity was performed. For example, you want to know the role the person had when they did something in the process rather than the role that they have now.

Step 8: Selection Method

Read the chapter on How Much Data Do You Need? and determine whether you can extract all activities in a certain timeframe or whether you will select cases based on a particular start or end activity. If you select cases based on a start or end activity, write down which start or end activity will be used.

Step 9: Timeframe

Read the chapter on How Much Data Do You Need? and determine the timeframe based on your data selection method. Write down the start and the end date of the timeframe you want to extract the data for.

Step 10: Required Format

When you ask someone to extract data for you, they will ask about the required format. With “format” they typically mean everything, including the content, the timeframe, etc.—not just the actual file format.

As for the file format, it is typically best to extract a plain Comma Separated Values (CSV) file, where each row corresponds to an activity that happened in the process and each column gives some information about the context of that activity. CSV files can be extracted from almost any IT system or database and they do not have a data limit like Excel has. Refer to Import for further details about the file formats that can be imported in Disco and take a look at the Required format for CSV, Excel and TXT Files.

To explain what kind of data you need, you can best create a small sample data snippet in the following way. Take one of the example files that come with Disco and open it in Excel. Look up one or a few cases in the operational system and fill in the values from these cases in your data sample. This way, they can better understand what you are asking for and they can cross-check their data extracts based on the sample cases you gave them.

Make sure that:

  • There are multiple rows that all include a reference to the same Case ID
  • There are different activities for one case (so, different activity names appear in different rows for the same case)
  • There are different timestamps for different activities in a case (not just one timestamp that is the same for all activities in each case)

As explained in Required format for CSV, Excel and TXT Files, the separator that is used to set apart the columns in your data (typically a comma, semicolon, pipe or tab) should not appear in the contents of your data values (for example, the activity name) without being enclosed in quotes. Most CSV exports do this automatically. If you are not familiar with the CSV format then you can look up the definitions on Wikipedia here.

To make sure that you preserve special characters in your data, it is best to save the file in Unicode (e.g., UTF-8).

It is typically a good idea to ask for a small test extract first. This way, you can inspect the data to see whether it is suitable before they run the big data export. If you can, also try to get a data dictionary that explains the meaning of the different fields along with the data.

Get Started!

Once you have extracted your data, importing your event log in Disco is really easy. You can just open your file and simply select each column to configure it either as Case ID, Activity, Timestamp, Resource, or Attribute. Then press Start import as shown in Figure 2.

_images/Importing-Disco.png

Figure 2: Once you have extracted your data, importing it in Disco is easy: Simply select each column to configure it either as Case ID, Activity, Timestamp, Resource, or Attribute. Then press Start import.

For further details on how to import your data, you can refer to the Import chapter in the Disco user guide.

Footnotes

[1]Article on Dealing with Many-To-Many Relationships in Process Mining: https://fluxicon.com/blog/2015/04/dealing-with-many-to-many-relationships-in-data-extraction-for-process-mining/.
[2]Article on How To Deal With ‘Old Value / New Value’ Data Sets: https://fluxicon.com/blog/2015/09/how-to-deal-with-old-value-new-value-data-sets/.
[3]Article on How To Deal With Data Sets That Have Different Timestamp Formats: http://fluxicon.com/blog/2015/12/how-to-deal-with-data-sets-that-have-different-timestamp-formats/.
[4]Read the section ‘The Assumed Process is Not Your Real Process’ in the following article http://fluxicon.com/blog/2014/01/why-process-mining-is-better-than-excel-for-process-analysis/ to see an example of how you can deal with column-based data formats, which overwrite timestamps for activity repetitions.
[5]Article on Change in Perspective with Process Mining: http://www.kdnuggets.com/2016/02/change-perspective-process-mining.html.