A tourist photo of beautiful Geneva.

That is almost it: tomorrow (19-8-2016) my CERN adventure comes to an end, and this is therefore the last Dutch-language post about my summer internship at CERN. A lot has happened in the last few weeks: I had visitors from the Netherlands, the project gained momentum and was completed, and I said goodbye to my colleagues at CERN.
In this course, you will learn how to use Pandas, the Python data analysis library. After the course, you will be able to:
- Load and transform your data
- Visualize data using line plots, scatter plots and histograms
- Merge and store data
The course also includes more advanced topics, such as data parallelization and aggregation.
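As a taste of the topics listed above, here is a minimal Pandas sketch covering loading, transforming, aggregating and merging. The data and column names are made up for illustration; they are not taken from the course material.

```python
import pandas as pd

# Hypothetical sample data (in the course you would load it, e.g. with pd.read_csv)
df = pd.DataFrame({
    "city": ["Geneva", "Amsterdam", "Geneva"],
    "temp": [21.5, 18.0, 23.1],
})

# Transform: derive a Fahrenheit column from the Celsius values
df["temp_f"] = df["temp"] * 9 / 5 + 32

# Aggregate: mean temperature per city
means = df.groupby("city")["temp"].mean()

# Merge: join with a second (made-up) table on the shared "city" column
info = pd.DataFrame({"city": ["Geneva", "Amsterdam"], "country": ["CH", "NL"]})
merged = df.merge(info, on="city")

# Store: write the result to disk
# merged.to_csv("weather.csv", index=False)
```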
In this Python Scrapy tutorial, you will learn how to write a simple web scraper in Python using the Scrapy framework. The Data Blogger website will be used as an example in this article.
Scrapy: An open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.
By the way, if you are interested in scraping Tweets, you should definitely read this article.
Disclaimer: The results are valid only when network-attached storage is used in the computing cluster.
The amount of data has grown significantly over the past few years, and it is no longer feasible for a single machine to process it all. Hence the growing need for distributed data processing frameworks. It all started back in 2011, when the first stable version of Apache Hadoop (1.0.0) was released. The Hadoop framework can store large amounts of data on a cluster in the Hadoop Distributed File System (HDFS), which is used at almost every company that has to store terabytes of data every day. Then the next problem arose: how can companies process all that stored data? This is where distributed data processing frameworks come into play. Apache Spark was released in 2014 and now has a large community; almost every IT department has implemented at least a few lines of Apache Spark code. As companies gather more and more data, the demand for faster data processing frameworks keeps growing. Apache Flink (released in March 2016) is a new face in the field of distributed data processing and one answer to that demand.