Import

In the Data Requirements you have learned about what kind of data is needed to do a process mining analysis.

Disco has been designed to make the data import really easy for you by automatically detecting timestamps, remembering your settings, and by loading your data sets with unprecedented speed. In this chapter you find further details about which kind of files you can load in Disco and how the data import works.

Importing Data Sets

Importing is symbolized by the folder icon shown in Figure 1.

_images/icon-open.png

Figure 1: Open symbol in Disco.

Clicking on the folder symbol will open up the file chooser dialog and let you import a data set into your workspace. You find the Open symbol in the top left of your Disco window as shown in Figure 2.

_images/1_EmptyWorkspace.png

Figure 2: Pressing the open symbol brings up the file chooser dialog. (On Windows you will see the standard Windows file chooser dialog.)

Most of the times you will open and configure files with the following extensions:

  • .csv (Comma Separated Values), or .zip files containing a CSV file [1]
  • .txt (Text files), or .zip files containing a TXT file
  • .xls and .xlsx (Microsoft Excel)

If you open an Excel file that contains multiple non-empty sheets, then Disco lets you choose which one you want to import like shown in Figure 3.

_images/2_SelectExcelSheet.png

Figure 3: If you have multiple non-empty sheets in your Excel file, then Disco lets you choose which one you want to import.

Read on to learn how the required format for these files looks like, and how exactly the configuration works in Disco.

Furthermore, Disco reads pre-configured files in various standard formats:

Jump to Importing Pre-configured Data Sets if rather than importing CSV or Excel files what you want is to load files with the extensions above.

Required format for CSV, Excel and TXT Files

You can open files in the format shown in Figure 4. Every line or row is expected to contain the information about one event or executed activity in your process. You need at least one case ID column, an activity, and ideally one or more timestamps.

If you don’t know what a case ID or an activity is, please read the introduction about event logs in the Data Requirements.

_images/PM-Format.png

Figure 4: Example file in Excel: You need to have one row for each activity that was performed in the process, plus a case ID, an activity, and a timestamp.

Note that the rows in your file do not need to be sorted. Disco will sort the activities per case based on the timestamps in your file. Only if you use a log that does not have timestamps (or if some events in the same case have identical timestamps), the events will be imported in the order in which they appear in the file.

The order and the name of the columns do not matter either. For example, it is not necessary that your case ID column is called “CaseID” as in Figure 4. What is important is that you have at least one column that can be used as a case ID and at least one column that can be used as activity name. You can tell Disco which column means what in the configuration step (see Import Configuration Settings).

The timestamp does not need to have a specific format. Instead, Disco reads the timestamps in the format that you provide (see Configuring Timestamp Patterns to learn how timestamp patterns can be configured). If you have multiple timestamp columns that indicate when an activity has been scheduled, started, and completed, then you can make use of that (see Including Multiple Timestamp Columns for further details).

Finally, you can have as many additional data columns as you like. They will be included as attributes that can be used in your analysis later on.

If your data is not in Excel but in a database or some other information system, then the best format to extract the data is as a Comma Separated Values (CSV) file. Logically, a CSV file is very similar to the Excel table from Figure 4: Each line contains one event and the different column cells are separated by a comma (or some other delimiting character). However, a CSV file does not have the limit of 1 million rows that Excel has and, therefore, is also suitable to contain much larger data sets. You can open a CSV file in a standard text editor.

Here is an example excerpt of an event log in CSV format:

case01, Request Quotes, Tom, 2009-06-12 09:45
case01, Authorization, Peter, 2009-06-12 13:11
case23, Request Quotes, Mary, 2009-06-12 15:29
case01, Compile Result, Jonas, 2009-06-13 10:01
case23, Authorization, Peter, 2009-06-13 11:25
case01, Check Result, Amanda, 2009-06-13 15:09
case23, Compile Result, Andy, 2009-06-14 09:15

The first line describes an event:

  • Which has occurred while executing the case “case01”,
  • Which has been triggered by the execution of activity “Request Quotes”,
  • Where this activity had been executed by resource “Tom”,
  • And where the activity had been executed on 12 June 2009, at quarter to ten in the morning.

The delimiting character can be either a comma (”,”), a semicolon (”;”), a tab (“t”), or a pipe (“|”) character. If your delimiting character is contained in the contents of your file, then the content elements need to be grouped by quotes. For example, if you use the comma as delimiting character and your activity name is Request Quotes, Standard (so, the activity name itself contains a comma as well):

case01,Request Quotes, Standard,Tom,2009-06-12 09:45

then the activity name needs to be surrounded by quotes:

case01,"Request Quotes,Standard",Tom,2009-06-12 09:45

All this is pretty standard and will be done automatically by most databases or other export functions.

Import Configuration Settings

Once you have opened your Excel, CSV, or text file in Disco, you see an import configuration screen as shown in Figure 5.

_images/6_ImportSettings.png

Figure 5: Import Configuration screen in Disco.

On the import screen, you see the same columns as if you had opened the file in Excel (see Figure 4 for comparison). Disco starts by guessing what each column might mean (trying to identify case ID, activity, and timestamp), but you can check and correct the configuration before the import proceeds.

  1. Preview of the loaded file. A preview of the file you have loaded helps you through the configuration step. Select each column to configure it.

The following configuration options are available for each column:

  1. Remove. Ignores the selected column (will not be imported at all).
  2. Case. Set the selected column as the case ID.
  3. Activity. Configures the selected column as activity name.
  4. Timestamp. Indicates that the selected column contains the timestamp.
  5. Resource. Configures column as resource column for organizational analysis.
  6. Other. Includes the selected column as an additional attribute.

The current configuration for each column is indicated by the little configuration symbol in the header of each column in the previewed data table.

  1. Exclude/Include all. With this button you can exclude or include all columns at once. This is particularly helpful if you have many different columns and only want to include or exclude a few of them.
  2. Start import. After you have configured your columns, you can start to import the log by pressing this button.

Minimum Requirements

You need to configure at least one case ID column and one activity column before you can start the import of your file.

_images/4_NoActivity.png

Figure 6: Disco tells you if no activity or case ID column has been configured yet.

Disco tells you which configuration step is missing if these minimum requirements are not met: For example, in Figure 6 you see a configuration where no activity column has been defined yet and, therefore, the Start import button is still inactive. In Figure 7, the status column is used for the activity name and the log is ready for import.

_images/5_ReadyForConversion.png

Figure 7: As soon as all required columns are configured, the actual log import can be started.

If you are not sure what the case ID or activity should be for your log, you can learn more about event logs in the introductory chapter The Minimum Requirements for an Event Log.

Import Progress

As soon as you have pressed the Start import button, Disco starts reading the complete file according to the given configuration. The progress of the import is shown along with an indication of how much data has been already read; see (1) in Figure 8.

_images/Inprogress.png

Figure 8: After you have configured your columns, the import can be started. You’ll see the progress (1) and can stop the import if needed (2).

Disco has been designed for speed and usually loads your data really fast. So, most of the times the import will be finished before you can actually read the progress indicator. However, if it should take longer and you want to abort the import process, then you can do that by pressing the x to the right of the progress bar; see (2) in Figure 8.

_images/AfterImport.png

Figure 9: As soon as the import is finished, you are directly taken to the Map view.

After the import is finished, you are directly taken to the Map view, where you can inspect the process flow and start your analysis; see Figure 9. Refer to Analyzing Process Maps to learn more about how to work with process maps in Disco.

Saving the Import Settings

The import configuration is saved automatically and will be restored the next time you load the same or a similar file.

Configuring Timestamp Patterns

Timestamps can come in various formats. For example, the date 4 August 2012 might be represented as 04/08/12 or 08/04/12, as 2012-08-04, or as 4.8.2012, and in many other ways. The same holds for the time of the day. Different conventions regarding the order, separating characters, with or without spaces, etc. often make it a pain for data analysts to deal with timestamps.

Disco makes it as easy for you as possibly possible: When it parses the first rows of your data set to make suggestions for how you might want to configure your data columns (see Import Configuration Settings), different timestamp patterns are tested against your data to see which one gives the best match. This means that in more than 90% of the cases your timestamp format is automatically detected and you do not need to manually configure anything about it at all.

You can verify that everything is in order by selecting your timestamp column in the configuration screen—like in Figure 6: If Disco says that the pattern matches all rows (see Figure 10 below) then everything is OK.

_images/PatternMatchesAllRows.png

Figure 10: Feedback on how many rows of the selected timestamp column match the current timestamp pattern.

To inspect and change the timestamp pattern press the Pattern... button (see (1) in Figure 10). This will bring up the timestamp pattern configuration screen shown in Figure 11.

_images/wrongPattern.png

Figure 11: If the current pattern (1) does not match your data (3) then the interactive preview will give you direct feedback (4).

The timestamp pattern configuration screen allows you to inspect and modify the timestamp pattern to fit your data.

Disco allows you to directly specify the date and time pattern in Java’s Simple Data Format [2]. To specify a custom timestamp pattern, you can follow these steps:

  1. Look at the first couple of example dates from your file in the preview; see (3) in Figure 11. Find out how the pattern starts: Is it the year? The month? For example, in Figure 11 it is the day of the month.
  2. Check the pattern legend; see (2) in Figure 11. Find out which letter represents your date or time entity. For example, the day of the month is represented by the lower case letter “d”.
  3. Start typing your custom pattern in the pattern entry field; see (1) in Figure 11. Type the letter representing your date or time entity as many times as the corresponding entity has digits in the pattern. For example, the date of the month is represented by up to two digits. So, you would start typing dd to match the day at the beginning of your timestamp pattern.
  4. Look at the timestamp examples in your preview (3) again and find out how the pattern separates the different date and time entities. Is it a dot (.) or a dash (-) or just a white space? For example, in Figure 11 the day of the month and the month are separated by a dot. So, you would expand your pattern from dd to dd.MM for matching both day and month.
  5. Verify in each step that the pattern that you have provided so far gives the expected result. The preview window gives you interactive feedback while you are typing your pattern; see (3) in Figure 11. For example, in Figure 11 the pattern dd- is wrong and does not match the timestamps in the data sample.
  6. Continue until all date and time elements are matched completely. For example, the pattern dd.MM.yy HH:mm matches all timestamps in Figure 12.
_images/TimestampPattern.png

Figure 12: Once you have found a matching pattern you can use it.

Once you have matched all date and time elements for the example timestamps in your preview, you can press Use pattern and Disco will remember your custom pattern for the future.

Combining Multiple Case ID, Activity, or Resource Columns

In most situations, there is a clear candidate for the case ID and the activity column. But often there are multiple views that one can take on the data.

In Data Requirements we have explained the mental model that underlies process mining and how, for example, the activity name determines the level of detail for the process steps. If you refer to Combined Activity, then you can see how a more detailed process map for the call center example could be discovered based on using both the Operation and the Agent Position column together as the activity name.

To make it easy for you to explore multiple views, Disco allows you to combine multiple columns for:

  • the case ID,
  • activity, and
  • resource field.

This works by simply assigning not just one but multiple columns as your case ID, activity, or resource. The contents of these columns will be concatenated (joined together).

For example, in Figure 5 only the Status column was configured as the activity. The Service Line column was included as a plain attribute. This leads to the process map shown in Figure 9, with activity names such as Offered, Scheduled, and In progress.

Alternatively, you could configure both Status and Service Line as activity like shown in Figure 13. This then gives you a more detailed process map as you can see in Figure 14, with activity names such as Offered-2nd line, Offered-Specialist, and In progress-Specialist.

_images/CombinedActivity.png

Figure 13: By including the service line (1st line, 2nd line, or Specialist) into the activity name ...

_images/CombinedActivity-Result.png

Figure 14: ... one can distinguish activities that were performed for cases resolved in these different parts of the organization.

In the example in Figure 13 the Service Line is a static attribute—assigned for each case depending on where it was eventually resolved (so, you don’t see the actual transfer of a case taken in by the 1st line and handed over to the 2nd line). This is due to the way the data was recorded (see the call center example in Combined Activity, where the service line handovers are recorded in a more detailed manner).

As a result, you get to see the different process flows for the 1st line, 2nd line, and Specialist process categories in one view, next to each other. You could do the same for different products, departments, or other interesting categories for your process. A similar result can be achieved by using the Attribute Filter to explicitly filter different versions of the process based on the Service Line attribute (or other categories).

Swapping Cases, Activities, and Resources

Similar to combining multiple activity columns (see Combining Multiple Case ID, Activity, or Resource Columns), different views can be taken on the process by changing the column configuration.

For example, in Figure 15 the Resource column from the purchasing example log is configured as the activity. The result is a process map where you can see how work is transferred between people as shown in Figure 16.

_images/ActivitySwapped-1.png

Figure 15: By configuring the resource column as the activity ...

_images/ActivitySwapped-2.png

Figure 16: ... you can see how the process flows between people.

Similarly, in Figure 17 a role-based view on the process is shown. By configuring the Role column—see column to the right of the Resource column in Figure 15—as the activity, one can see how the process flows between different roles in the organization. The performance visualization in Figure 18 highlights that the manager is a bottleneck by taking about seven days on average before forwarding the case to the purchasing agent.

_images/ActivitySwapped-3.png

Figure 17: By configuring the role column as the activity, the process flow between functions is visualized.

_images/ActivitySwapped-4.png

Figure 18: Due to low availability the purchasing agent causes most of the delays in the purchasing process.

To give a case ID swapping example, imagine a healthcare process where one would normally want to look at the diagnosis and treatment process from a patient perspective. So, the patient ID column would be the natural configuration for the case identifier. However, it can also be interesting to compare how different physicians work to identify and promote best practices. In this case, the physician’s name (normally configured as a resource column) would be part of the case ID.

Depending on how you look at your data, multiple views on the process can be taken (see also Data Requirements). By making it easy to interpret columns in different ways as shown here, by combining multiple columns (Combining Multiple Case ID, Activity, or Resource Columns), and by allowing you to go back and adjust the configuration (Adjusting the Import Configuration), Disco encourages the exploration of your process in a multi-faceted way.

Including Multiple Timestamp Columns

At least one timestamp is usually needed to bring the events for each case in the right order, and in many situations one timestamp is all you have (like in the example in Figure 5). A single timestamp is enough to discover the process flows, to measure the time between status changes, and the time to run through the whole process from start to finish etc.

_images/StartAndCompleteTimestamp.png

Figure 19: If you have start and completion timestamps, make sure to include both.

However, sometimes you have multiple timestamps for the start and completion of an activity. This is great because it allows you to measure not only the time between different process steps but also the actual execution time for an activity. If you have two timestamps per activity, you can simply configure both of them as a timestamp column as shown in Figure 19. Disco will interpret the earliest timestamp as the start and the latest timestamp as the completion of the activity for each row.

As a result, the active time (where somebody is actually working on the activity) and passive time (where nothing happens to the case) can be distinguished in the analysis. For example, in Figure 20 you see a fragment of the resulting process map, where average durations are attached both to the activities themselves as well as to the (idle) times between them. Refer to Performance Metrics to learn more about how the process maps in Disco can be annotated with performance information.

_images/StartAndCompleteTimestampResult.png

Figure 20: If start and completion timestamps are available, then execution times (active) and waiting times (passive) can be distinguished in the analysis.

In some situations, you may have even more than two timestamps. For example, in Figure 21 another call center log fragment is shown. There are three columns that hold relevant timestamp information: ACT_CREATED, ACT_START, and ACT_END:

  • The ACT_END column always holds the completion timestamp for the agent’s activity.
  • The ACT_START column holds the actual start time for timed activities; for non-timed activities the corresponding field is empty.
  • The ACT_CREATED column is always filled, but for timed activities the start time does not reflect the actual start time. The time difference between ACT_START and ACT_CREATED for timed activities is significant for call center analyses as timing is very important.
_images/MultipleTimestamps.png

Figure 21: Among multiple timestamp columns per row, the earliest is taken as the start and the latest as the completion of the activity.

With Disco this situation can be resolved very easily. Simply include all three columns as a timestamp column during configuration and the earliest and latest timestamp will be chosen as the start and completion time for each activity.

Adjusting the Import Configuration

As shown in Combining Multiple Case ID, Activity, or Resource Columns and Swapping Cases, Activities, and Resources, there are often multiple views that can be taken on your process. To explore these different views, or perhaps to correct a configuration mistake, you sometimes want to change the configuration of your data set after you have completed the import.

In Disco, you can adjust your import configuration in two different ways:

  • Press the Reload button in the project view and you will be brought back to the configuration screen as shown in Figure 22 and Figure 23.
  • Simply import the same file again and configure it another way. Disco will remember your configuration settings from the last time you imported it. So, a re-import has the same effect as using the Reload button.
_images/ReloadDataSet.png

Figure 22: The Reload button is a quick way to go back to the configuration screen to check how the data looked like, to potentially correct configuration mistakes, or to take alternative views on your process: Pressing Reload in the project view ...

_images/ConfigAfterReload.png

Figure 23: ... brings you back to the configuration screen.

In the case that you already have done some filtering work (see Filtering) that you need to preserve but still want to change the configuration, then re-loading the original file does not help. Instead, you can do the following:

  1. Export the currently filtered data set as a CSV file (see Exporting Data Sets).
  2. Import the exported CSV file and configure it the way you want it.

Importing Pre-configured Data Sets

In most situations, the starting point is data in CSV or XLS format as described before. However, you can also import data sets in standard event log formats such as MXML and XES (MXML and XES). Furthermore, efficient Disco-compressed log files (Disco Log Files) and complete Disco projects (Disco Projects) can be imported, too.

MXML and XES

The Mining XML (MXML) format has been around as a standard format for event logs for several years now. The EXtensible Event Stream (XES) format is the successor of MXML and has been approved by the IEEE Task Force on Process Mining in 2010.

Standard formats are useful to facilitate the interoperability between different tools. For example, logs in MXML or XES format can be loaded by other process mining tools such as the academic toolset ProM. Disco places a high value on interoperability and imports and exports all event log standard formats that are in use. Read more about the supported standard event log types in Exporting Data Sets.

Even if you do not work with other process mining tools, the MXML or XES import can be still useful for you:

No configuration
Data sets exported in MXML or XES already contain the configuration information (about case IDs, activities, timestamps, etc.). So, if you want to exchange a data set with a co-worker who also uses Disco, you can send her an MXML or XES file to let her skip the configuration step entirely. This way, you do not need to explain which columns should be configured how, but instead she opens the file and directly sees the process map. Or similarly, if you want to keep differently filtered or configured versions of your event log as files for yourself, storing them in MXML or XES standard format saves you the configuration step when you next import these files again.
Faster import
Next to skipping the configuration step during import, reading a pre-configured data set is also more efficient because it is already sorted (in contrast, the rows in a CSV file still need to be sorted by Disco). For larger data sets, the difference can be a 10x speedup (and more) during import.

If you should have MXML or XES files that you would like to re-configure in a different way, you can still export them as a CSV and re-import the CSV again (see also Adjusting the Import Configuration).

Disco Log Files

Loading data sets in MXML or XES format is already faster than importing CSV data (see also MXML and XES), but nothing is as fast as loading data in the native Disco Log Files (FXL) format. FXL is a proprietary (no standard) format but the best if you need to exchange really large data sets with another Disco user.

For example, for one of our large benchmark logs the import of the CSV file (including sorting) took 3.5 hours, the MXML file loaded in about 20 minutes, and the FXL file was read in just over one minute.

Files in FXL format have also a much smaller file size compared to compressed XML standard logs and CSV.

Disco Projects

Not only individual log files but complete projects can be exported and imported. This way, multiple data sets including all applied filters and notes can be shared with other Disco users or used to backup and re-load previous project work. Refer to Managing and Sharing Projects to read more about how to export and import Disco projects as DSC files.

Troubleshooting

Here are a few typical problems that may occur during the import of your file, and what to do about them.

“Columns are empty when I load my file”

When you import an XLS file (see Importing Data Sets), it may be the case that the cells of some of the columns in your Excel data sheet are empty because they contain formulas. Formulas or references to other cells cannot be interpreted by Disco.

To import the values (the results of a formula) as your data set into Disco, you can use one of these two alternative solutions:

  • Option 1: Copy and paste values. You can copy the columns that contain cells with formulas and paste them with the Paste special - Values option. Make sure that you only paste the values of your column, because the standard copy / paste functionality in Excel will copy the formulas into your new column!
  • Option 2: Export XLS as CSV file. A file in CSV format is a plain text file that contains no formulas. If you export your data set from Excel as a CSV, then only the resulting values of all your cell formulas will be exported. The exported CSV file can be imported by Disco without problems.

“My date and time are in separate columns”

If you configure multiple columns as timestamp, Disco will parse each of them to determine the start and completion times for the activities. This means that if your source data has one timestamp distributed over two columns (see Date and Time columns in Figure 24), Disco cannot get the complete timestamp.

_images/DateAndTime.png

Figure 24: If you have your date and time in multiple columns, you need to combine them into one column before importing your data set in Disco.

As a solution, you can open your data set in Excel and create a third column (see Date & Time in Figure 24) that combines the two. The CONCATENATE function in Excel can be used to do that. After you have created your combined timestamp column, you can save or export the data set and configure the Date & Time column as your timestamp column in Disco.

In some cases the CONCATENATE formula does not work because the Date and Time values are interpreted by Excel as a number. To solve this problem, you have two options:

  • Make sure that the Date and Time columns are interpreted as text. For example, instead of writing B3 within the formula, write TEXT(B3;"HH:MM:SS").
  • Add the two columns by using the formula =A3+B3. Then right click on cell C3 and select Format Cells, from the format cell dialog box select the Number tab, from the category list select Custom, under Type write the format DD/MM/YYYY HH:MM:SS, and click OK.

“I need to merge multiple files”

Currently, only single files can be imported with Disco. If you have parts of your event log distributed over multiple data sets, you need to do some extra pre-processing work before you can start your process mining analysis. Databases or specialized tools can be used to do the merging.

Contact the IT support staff who extracted the data for you, or get in touch with us to help you with this.

“The Start button is greyed out”

As shown in Figure 6, you can only start the import of your file after you have configured all the required columns (case ID and activity need to be set).

“Disco has problems reading my file”

If not all lines in your CSV or Excel file can be read, then Disco tells you about it (see Figure 25).

_images/Problem_new2.png

Figure 25: If Disco reports the lines that had problems during the import, make sure to check the source file in the given line number to figure out what the problem is.

To find out what the problem is, open the file and compare the line that is reported with the line before. Can you see a difference? Does the problematic line have more or less columns than all the other lines before? Are there wrong quotes?

_images/LinesCouldNotBeRead.png

Figure 26: The source file that produced the error message in Figure 25: Line 8 has one column less than the other rows. Disco fills up the missing cell at the end of the row but warns you about the mismatch.

For example, the error message in Figure 25 was produced when attempting to read the file shown in Figure 26. The file is a modified snippet from the example log to test different non-latin scripts such as Korean, Hebrew, and Chinese. When I added the last line to test latin characters in the mix, I forgot to fill in a value for the Resource column in my source file. As a result, you can see that all the other columns are shifted one to the left, and that line No. 8 is missing a cell at the end of the row.

“I get weird symbols”

Disco automatically detects the encoding of the text in your data set and can also deal with languages that have special symbols (like the “ü” in German). If you find that characters in your data set are not read properly, make sure that you have saved your file in Unicode (UTF-8) format. [3]

If Disco has not correctly auto-detected your file encoding, you can also try other encodings from the list provided in the lower left of the configuration screen (see Figure 27). When you select a new encoding, Disco will reload your CSV file with the chosen encoding and you can see in the preview of the configuration screen whether your characters are now displayed correctly.

_images/ExplicitEncoding.png

Figure 27: You can also reload your CSV file with an explicit file encoding, in case auto-detection fails.

When you open your CSV file in Excel, sometimes not all symbols are properly interpreted. When you run into the problem that special characters are not correctly represented, try to avoid going through Excel and import the CSV file directly into Disco.

Disco provides full Unicode support, which means that data sets from any language (also right-to-left languages like Hebrew and Arabic) can be read and are properly displayed (see Figure 28).

_images/Disco-Unicode.png

Figure 28: Through Disco’s Unicode support, data sets from any language (also right-to-left languages like Hebrew and Arabic) can be read and are properly displayed.

Footnotes

[1]Storing large CSV files in a ZIP archive can be useful if you are working with very large data sets, because the zipped file will take up much less space on your hard disk. To import your file into Disco, you do not need to extract the CSV from the ZIP file but you can directly load the compressed archive (with the CSV file inside it).
[2]Further documentation and examples about the Simple Date Format are available on Oracles’s documentation page here: https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html.
[3]Unicode is a standard that enables the handling of text for most of the world’s writing systems. See the Wikipedia page at https://en.wikipedia.org/wiki/Unicode for further information.