Data Mining with Python on Medical Datasets for Data Mining


The series “Data Mining with Python on Medical Datasets for Data Mining” highlights several data mining techniques. It is written in collaboration with John Snow Labs, which provided the medical datasets. This article covers basic text mining techniques and presents some of the results.

By the way, if you are interested in Deep Learning you should definitely read this article on implementing a GRU in Python using Tensorflow.


We are given a medical dataset that contains written diagnoses of patients. The goal of this article is to extract causal relationships from these diagnoses. For example, a diagnosis could state that Bob broke his leg due to falling from a cliff; the cause of Bob’s broken leg is then the fall from the cliff. To extract such patterns, we need to dive a little into text mining, but first we need to clean the data. A modified sample of the original dataset, which is used in this article, can be downloaded here:

Download the dataset sample here

Data preprocessing

First, we need to load the data into memory. This can be done easily by using pandas (Python Data Analysis Library):

# Imports
import pandas as pd

# Specify the column names
columns = ['id', 'description']

# Load the CSV file
df = pd.read_csv('test_data.csv', quotechar='"', index_col=False, skiprows=1, names=columns)
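
To see what the skiprows=1 and names=columns arguments do, here is a small sketch that reads a CSV from an in-memory string instead of test_data.csv. The header names and the three rows are invented for illustration and are not taken from the real dataset; the variable names sample_csv and sample_df are also assumptions.

```python
import io

import pandas as pd

# Hypothetical file contents; the real test_data.csv may look different.
sample_csv = '''ID,Diagnosis
1,"burn due to hot water"
2,"broken leg due to falling & burn due to coffee"
3,"rain causing wet grass, wet grass due to sprinkler"
'''

columns = ['id', 'description']

# skiprows=1 drops the original header line, names=columns supplies our own.
# quotechar='"' keeps the comma inside the third description from being
# treated as a column separator.
sample_df = pd.read_csv(io.StringIO(sample_csv), quotechar='"',
                        index_col=False, skiprows=1, names=columns)

print(sample_df.columns.tolist())
print(sample_df['description'].iloc[2])
```

If the real file has no header row, skiprows=1 would silently drop the first record, so it is worth checking the first lines of the file before loading it this way.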

Now we need to do some preprocessing. Notice that clauses are separated by the word "and" and by the symbol "&". We normalize the text by lowercasing all characters. This has a downside: words that are not equivalent (for example "windows" and "Windows") are now treated as the same word. Both "and" and "&" are converted to "," (which is parsed later). Notice that an extra space is added to the beginning and the end of each string. This is done for simplicity: the replacement patterns look for words surrounded by spaces, and the padding is stripped again at the end. The following code normalizes all descriptions:

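import re

def preprocess(data):
    """
    Preprocess data.

    :param data: data to normalize.
    :return: Normalized data.
    """
    # Pad with spaces so that space-delimited patterns also match at the edges
    data = ' ' + data + ' '
    # Collapse runs of whitespace into a single space
    data = re.sub(r'\s+', ' ', data)
    # Replace the clause separators "&" and "and" by a comma
    data = re.sub(' & ', ',', data)
    data = re.sub(' and ', ',', data)
    # Remove the spaces around existing commas
    data = re.sub(' , ', ',', data)
    # Lowercase and remove the padding
    data = data.lower().strip()
    return data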

# Normalize the descriptions
df['description'] = df['description'].apply(preprocess)
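
As a quick check of what the normalization does, the sketch below runs the same steps on a single made-up description. The function is repeated so that the example is self-contained, and the sample sentence is an assumption, not a row from the dataset.

```python
import re

def preprocess(data):
    """Same normalization steps as in the article, repeated for a standalone demo."""
    data = ' ' + data + ' '
    data = re.sub(r'\s+', ' ', data)
    data = re.sub(' & ', ',', data)
    data = re.sub(' and ', ',', data)
    data = re.sub(' , ', ',', data)
    return data.lower().strip()

# Invented example with a double space, an "&" and an "and"
sample = 'Burn due to  hot water & broken leg due to falling and rain causing wet grass'
normalized = preprocess(sample)

print(normalized)
# burn due to hot water,broken leg due to falling,rain causing wet grass

print(normalized.split(','))
# ['burn due to hot water', 'broken leg due to falling', 'rain causing wet grass']

# Lowercasing is the last step, so a capitalized separator is not converted
print(preprocess('burn AND cut'))
# burn and cut
```

The last call shows a limitation of this ordering: an uppercase " AND " survives as a word instead of becoming a comma. Lowercasing before the replacements would probably handle that case, at the cost of changing the behavior described above.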

Now the text mining part can begin!

Text mining

Text mining is data mining applied to text. In this article, some text mining techniques are applied to our toy dataset.

Information extraction

The main goal of this article is to extract causal relations from the descriptions. This could be done by using entity recognition and then applying information extraction, but that is outside the scope of this article. Here, we will use a simple template-based alternative.

First, we extract all separate clauses from a description. For example, the description “rain causing wet grass, wet grass due to sprinkler” contains two clauses: “rain causing wet grass” and “wet grass due to sprinkler”. Notice that there are two types of causal relations, “A causing B” and “B due to A”, where A is the cause (the subject of the causal relation) and B is the caused entity (the event). With the relation label “due to”, A and B appear in swapped order, which the information extraction method takes into account through its inverse parameter. For simplicity, the relation labels “causing” and “due to” are both replaced by the relation token “->”. This keeps the information extraction quite simple:

def find_relation(text, relation_token, relation_label, inverse=False):
    """
    Find relations.

    :param text: The text to extract relations from.
    :param relation_token: The relation token (a descriptive symbol for the relation).
    :param relation_label: The label of the relation (this will be translated to relation_token).
    :param inverse: whether to invert A and B (see return).
    :return: A list of tuples (A, r, B) where A is the subject of the relation, B is the event and r is the relation
             token. For example: A="rain", r="->", B="wet grass".
    """
    KB = []
    if relation_label in text:
        parts = text.split(relation_label)
        # Only handle clauses in which the label occurs exactly once
        if len(parts) == 2:
            if not inverse:
                A, B = parts
            else:
                # Labels such as "due to" list the event first: "B due to A"
                B, A = parts
            KB.append((A, relation_token, B))
    return KB

# Knowledge base
KB = []

# Loop through all documents
for doc in df['description']:
    # Find all clauses
    for text in doc.split(','):
        # Extract relations
        KB.extend(find_relation(text, '->', ' due to ', True))
        KB.extend(find_relation(text, '->', ' causing '))
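
To make the two templates concrete, the following sketch applies the same loop to the “rain” example from the text. The function is repeated, with the else branch written out, so that the example runs on its own; the input string is assumed to be already preprocessed (lowercase, no spaces around the comma).

```python
def find_relation(text, relation_token, relation_label, inverse=False):
    """Template-based extraction, repeated here for a standalone demo."""
    KB = []
    if relation_label in text:
        parts = text.split(relation_label)
        if len(parts) == 2:
            if not inverse:
                A, B = parts
            else:
                B, A = parts
            KB.append((A, relation_token, B))
    return KB

# The example description from the text, in its preprocessed form
doc = 'rain causing wet grass,wet grass due to sprinkler'

relations = []
for clause in doc.split(','):
    relations.extend(find_relation(clause, '->', ' due to ', True))
    relations.extend(find_relation(clause, '->', ' causing '))

print(relations)
# [('rain', '->', 'wet grass'), ('sprinkler', '->', 'wet grass')]
```

Both clauses end up in the same (cause, token, event) order, even though the cause comes last in the “due to” clause. A clause in which a label occurs more than once is skipped, because the split then yields more than two parts.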


The results are displayed using the following piece of code:

# Show the results
print('Causes of "burn":', [relation[0] for relation in KB if relation[1] == '->' and relation[2] == 'burn'])
print('Causes of "broken leg":', [relation[0] for relation in KB if relation[1] == '->' and relation[2] == 'broken leg'])

For the toy dataset, the following results were obtained:

Causes of "burn": ['cooking', 'coffee', 'forest fire', 'fishing boat on fire', 'car crash', 'hot water']
Causes of "broken leg": ['falling of a cliff', 'soccer', 'fight', 'fighting', 'falling', 'fighting']
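
Note that “fighting” appears twice in the second list, because the knowledge base is a plain list and keeps duplicates. One possible way to summarize it, sketched below, is to count the causes per event. The small KB here is invented for illustration and only mimics the (A, '->', B) format used above; causes_by_event is a hypothetical name.

```python
from collections import Counter, defaultdict

# Hypothetical knowledge base in the same (cause, token, event) format
KB = [
    ('hot water', '->', 'burn'),
    ('coffee', '->', 'burn'),
    ('fighting', '->', 'broken leg'),
    ('falling', '->', 'broken leg'),
    ('fighting', '->', 'broken leg'),
]

# Map every event to a counter of its causes
causes_by_event = defaultdict(Counter)
for cause, token, event in KB:
    if token == '->':
        causes_by_event[event][cause] += 1

print(causes_by_event['broken leg'].most_common())
# [('fighting', 2), ('falling', 1)]
```

Counting makes frequent causes stand out and avoids scanning the whole list for every query.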

So, our text mining application works on this particular dataset. You can see, for example, that “hot water” causes “burn”. Notice that it will not work in general! The templates only match the phrasings they were written for. Therefore, applications like this are usually developed on a training dataset and evaluated on a separate test dataset (and there are even better testing techniques).
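
The effect of testing on unseen data can be sketched with a handful of hand-labeled examples. Everything below is an assumption made for illustration: the sentences, the expected relations, and the helper names extract and evaluate are invented, and extract is only a compact stand-in for the extraction loop in this article.

```python
def extract(doc):
    """Stand-in for the extraction loop: returns a set of (cause, event) pairs."""
    pairs = set()
    for clause in doc.split(','):
        for label, inverse in ((' due to ', True), (' causing ', False)):
            parts = clause.split(label)
            if len(parts) == 2:
                cause, event = (parts[1], parts[0]) if inverse else (parts[0], parts[1])
                pairs.add((cause.strip(), event.strip()))
    return pairs

def evaluate(examples):
    """Return (precision, recall) of extract() against the expected relations."""
    true_positives = predicted = expected_total = 0
    for description, expected in examples:
        found = extract(description)
        true_positives += len(found & expected)
        predicted += len(found)
        expected_total += len(expected)
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / expected_total if expected_total else 0.0
    return precision, recall

# Invented examples: the templates were written while looking at `train` only
train = [
    ('burn due to hot water', {('hot water', 'burn')}),
    ('rain causing wet grass', {('rain', 'wet grass')}),
]
# Held-out examples that use phrasings the templates have never seen
test = [
    ('broken leg after falling', {('falling', 'broken leg')}),
    ('burn because of coffee', {('coffee', 'burn')}),
]

print('train:', evaluate(train))  # train: (1.0, 1.0)
print('test: ', evaluate(test))   # test:  (0.0, 0.0)
```

In this constructed case the templates score perfectly on the examples they were designed for and find nothing in the held-out ones, which is exactly the kind of gap a separate test set is meant to reveal.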


Extend the given dataset and add a new relation. Extend the application such that it can extract the new relation from the dataset.

Kevin Jacobs


Kevin Jacobs is a certified Data Scientist and blog writer for Data Blogger. He is passionate about any project that involves large amounts of data and statistical data analysis. Kevin can be reached using Twitter (@kmjjacobs), LinkedIn or via e-mail: