CERN (part 4) – The final weeks

A touristic view of Geneva.

That is almost it: tomorrow (19-8-2016) my CERN adventure comes to an end, which makes this the last Dutch-language post about my summer internship at CERN. A great deal has happened in the last few weeks: I had visitors from the Netherlands, the project gained momentum and was completed, and I said goodbye to my colleagues at CERN.


Read more · 10 minutes
Data Blogger Courses

Mastering Pandas

In this course, you will learn how to use the Python library Pandas. After the course, you will be able to:

  • Load and transform your data
  • Visualize data using line plots, scatter plots and histograms
  • Merge and store data

The course also includes more advanced topics, such as data parallelization and aggregation.

You can see all course content under “Curriculum” on Data Blogger Courses, and the first three lessons are free. The first free lesson can be found here.


How to scrape a website using Python + Scrapy in 5 simple steps

In this Python Scrapy tutorial, you will learn how to write a simple web scraper in Python using the Scrapy framework. The Data Blogger website is used as the example throughout the article.

Scrapy: An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.

By the way, if you are interested in scraping Tweets, you should definitely read this article.


Read more · 14 minutes

Apache Flink: The Next Distributed Data Processing Revolution?

Disclaimer: The results are valid only in the case when network attached storage is used in the computing cluster.

The logo of Apache Flink.


The amount of data has grown significantly over the past few years, and it is no longer feasible for a single machine to process such large volumes. As a result, the need for distributed data processing frameworks is growing. It all started back in 2011, when the first version of Apache Hadoop was released (version 1.0.0). The Hadoop framework is capable of storing a large amount of data on a cluster. Its storage layer is known as the Hadoop Distributed File System (HDFS), and it is used at almost every company that has to store terabytes of data every day. Then the next problem arose: how can companies process all the stored data? This is where distributed data processing frameworks come into play. In 2014, Apache Spark was released, and it now has a large community. Almost every IT department has implemented at least some lines of Apache Spark code. As companies gather more and more data, the demand for faster data processing frameworks keeps growing. Apache Flink (released in March 2016) is a new face in the field of distributed data processing and is one answer to that demand.


Read more · 13 minutes