“Data Mining on Medical Data” is a series that highlights several data mining techniques. The series is written in collaboration with John Snow Labs, which provided me with the medical datasets. This article highlights basic text mining techniques and presents some of the results.
We are given a medical dataset that contains written diagnoses of people. The goal of this article is to extract causal relationships from these diagnoses. For example, a diagnosis could be that Bob has broken his leg due to falling from a cliff. The cause of Bob’s broken leg is then the fall from the cliff. To extract such patterns, we need to dive a little into text mining. Before that can happen, we need to clean the data. A modified sample of the original dataset which will be used in this article can be downloaded here:
First, we need to load the data into memory. This can be done easily by using pandas (Python Data Analysis Library):
    # Imports
    import pandas as pd

    # Specify the column names
    columns = ['id', 'description']

    # Load the CSV file, skipping its first row and using our own column names
    df = pd.read_csv('test_data.csv', quotechar='"', index_col=False, skiprows=1, names=columns)

Now we need to do some preprocessing. Notice that clauses are separated by the word "and" and by the symbol "&". We normalize the text by lowercasing all characters. This has a downside: words that are not equivalent (for example "windows" and "Windows") are now treated as equivalent. Each "and" and "&" is converted to "," (which will be parsed later). Notice the extra spaces that are added to the beginning and end of each string. This is done for simplicity: the replacement patterns include their surrounding spaces, so the padding lets them match at the edges of a string as well. The following code normalizes all descriptions:
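    import re

    def preprocess(data):
        """
        Preprocess data.
        :param data: data to normalize.
        :return: Normalized data.
        """
        # Pad with spaces so the space-delimited patterns also match at the edges
        data = ' ' + data + ' '
        # Collapse runs of whitespace into a single space
        data = re.sub(r'\s+', ' ', data)
        # Turn the clause separators "&" and "and" into commas
        data = re.sub(' & ', ',', data)
        data = re.sub(' and ', ',', data)
        # Remove the spaces around commas that have a space on both sides
        data = re.sub(' , ', ',', data)
        # Lowercase everything and drop the padding
        data = data.lower().strip()
        return data

    # Normalize the descriptions
    df['description'] = df['description'].apply(preprocess)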
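To make the effect of the normalization concrete, the sketch below applies the same steps to a single made-up description. The function name `normalize_description` and the sample sentence are assumptions for illustration only; the sentence is not taken from the dataset.

```python
import re

def normalize_description(text):
    """Apply the normalization steps described above to one string (illustrative helper)."""
    text = ' ' + text + ' '             # pad so ' and ' / ' & ' can match at the edges
    text = re.sub(r'\s+', ' ', text)    # collapse repeated whitespace
    text = re.sub(' & ', ',', text)     # "&" separates clauses
    text = re.sub(' and ', ',', text)   # "and" separates clauses
    text = re.sub(' , ', ',', text)     # tidy commas that have a space on both sides
    return text.lower().strip()

# Hypothetical description, not from the dataset
sample = 'Burn due to  hot water and Broken leg due to falling & bruise'

normalized = normalize_description(sample)
clauses = normalized.split(',')

print(normalized)  # burn due to hot water,broken leg due to falling,bruise
print(clauses)     # ['burn due to hot water', 'broken leg due to falling', 'bruise']
```

After normalization, splitting on "," is enough to obtain the individual clauses, which is what the extraction step relies on.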
Now the text mining part can begin!
Text mining is data mining applied to text. In this article, some text mining techniques are applied to our toy dataset.
The main goal of this article is to extract causal relations from the descriptions. This could be done by using entity recognition and then applying information extraction, but that is outside the scope of this article. Here, we will use a simple template-based alternative.
First, we extract all separate clauses from a description. For example, the description “rain causing wet grass, wet grass due to sprinkler” contains two clauses: “rain causing wet grass” and “wet grass due to sprinkler”. Notice that there are two types of causal relations: “A causing B” and “B due to A”, where A is the subject of the causal relation (the cause) and B is the caused entity (the event). Note that the relation label “due to” puts A and B in the opposite order. The information extraction method takes this into account. For simplicity, the relation labels “causing” and “due to” are both replaced by the relation token “->”. This keeps the information extraction quite simple:
    def find_relation(text, relation_token, relation_label, inverse=False):
        """
        Find relations.
        :param text: The text to extract relations from.
        :param relation_token: The relation token (a descriptive symbol for the relation).
        :param relation_label: The label of the relation (this will be translated to relation_type).
        :param inverse: whether to invert A and B (see return).
        :return: A list of tuples (A, r, B) where A is the subject of the relation, B is the event and r is the relation type. For example: A="rain", r="causing", B="wet grass".
        """
        KB = []
        if relation_label in text:
            parts = text.split(relation_label)
            # Only accept clauses in which the label occurs exactly once
            if len(parts) == 2:
                if not inverse:
                    A, B = parts
                else:
                    B, A = parts
                # Strip leftover whitespace so that entities can be compared exactly
                KB.append((A.strip(), relation_token, B.strip()))
        return KB

    # Knowledge base
    KB = []

    # Loop through all documents
    for doc in df['description']:
        # Find all clauses
        for text in doc.split(','):
            # Extract relations
            KB.extend(find_relation(text, '->', ' due to ', True))
            KB.extend(find_relation(text, '->', ' causing '))
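As a worked example, the following self-contained sketch runs the two templates on the example description from the text. The helper `extract_causal_pairs` and the `TEMPLATES` table are hypothetical names introduced here for illustration; they follow the same idea as the code above but are not the article's own implementation.

```python
# Each template is (relation label, whether cause and effect are swapped).
# "A causing B" keeps the order, "B due to A" swaps it.
TEMPLATES = [(' causing ', False), (' due to ', True)]

def extract_causal_pairs(description):
    """Return (cause, '->', effect) tuples for every clause that matches a template."""
    pairs = []
    for clause in description.split(','):
        for label, inverse in TEMPLATES:
            parts = clause.split(label)
            if len(parts) != 2:
                continue  # the label is absent or occurs more than once
            cause, effect = (parts[1], parts[0]) if inverse else (parts[0], parts[1])
            pairs.append((cause.strip(), '->', effect.strip()))
    return pairs

pairs = extract_causal_pairs('rain causing wet grass, wet grass due to sprinkler')
print(pairs)
# [('rain', '->', 'wet grass'), ('sprinkler', '->', 'wet grass')]
```

Both clauses end up with “wet grass” as the event, even though the cause appears on different sides of the relation label.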
The results are displayed using the following piece of code:
    # Show the results: each relation is a tuple (cause, token, event)
    print('Causes of "burn":', [relation[0] for relation in KB if relation[1] == '->' and relation[2] == 'burn'])
    print('Causes of "broken leg":', [relation[0] for relation in KB if relation[1] == '->' and relation[2] == 'broken leg'])
For the toy dataset, the following results were obtained:
    Causes of "burn": ['cooking', 'coffee', 'forest fire', 'fishing boat on fire', 'car crash', 'hot water']
    Causes of "broken leg": ['falling of a cliff', 'soccer', 'fight', 'fighting', 'falling', 'fighting']
So, our text mining application works on this particular dataset. You can see, for example, that “hot water” causes “burn”. Notice that it will not work in general! For that reason, applications are often evaluated using separate training and test datasets (and there are even better evaluation techniques).
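A minimal sketch of such a split is shown below, assuming the descriptions are available as a plain Python list. The helper `split_descriptions`, the 25% test fraction and the sample descriptions are all assumptions made for illustration. The idea is to design the templates while looking only at the training part and then check how well they do on the held-out part.

```python
import random

def split_descriptions(items, test_fraction=0.25, seed=0):
    """Shuffle the items reproducibly and split them into (train, test) lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

# Hypothetical normalized descriptions, not taken from the dataset
descriptions = [
    'burn due to hot water',
    'broken leg due to falling',
    'cooking causing burn',
    'soccer causing broken leg',
]

train, test = split_descriptions(descriptions)
print(len(train), 'training descriptions,', len(test), 'test descriptions')
```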
Extend the given dataset and add a new relation. Extend the application such that it can extract the new relation from the dataset.