Hooray! The Data Blogger blog is listed in this top 75 of Data Science blogs. This is a good moment to give an overview of some of the most influential Data Science blogs. Read more · 7 minutes
In this course, you will learn how to use the Python library Pandas. After the course, you will be able to:
- Load and transform your data
- Visualize data using line plots, scatter plots and histograms
- Merge and store data
The course also includes more advanced topics, such as data parallelization and aggregation. Read more
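The steps in the course outline above (load, transform, merge, aggregate, store) can be sketched in a few lines of Pandas. This is only a minimal illustration, not material taken from the course: the inline CSV data, the column names and the `build_report` helper are all made up for the example, and plotting is left out to keep the snippet free of extra dependencies.

```python
import io

import pandas as pd

# Hypothetical inline CSV data standing in for real input files.
SALES_CSV = """city,product,units,price
Amsterdam,apple,10,0.50
Amsterdam,pear,4,0.75
Utrecht,apple,7,0.50
Utrecht,pear,3,0.75
"""

REGIONS_CSV = """city,province
Amsterdam,Noord-Holland
Utrecht,Utrecht
"""


def build_report(sales_csv: str, regions_csv: str) -> pd.DataFrame:
    """Load two CSV sources, derive a column, merge them and aggregate."""
    # Load: read_csv accepts any file-like object, here an in-memory string.
    sales = pd.read_csv(io.StringIO(sales_csv))
    regions = pd.read_csv(io.StringIO(regions_csv))

    # Transform: derive the revenue per row from units and price.
    sales["revenue"] = sales["units"] * sales["price"]

    # Merge: attach the province of each city (left join on "city").
    merged = sales.merge(regions, on="city", how="left")

    # Aggregate: total units and revenue per province.
    return merged.groupby("province", as_index=False).agg(
        total_units=("units", "sum"),
        total_revenue=("revenue", "sum"),
    )


report = build_report(SALES_CSV, REGIONS_CSV)

# Store: to_csv without a path returns the CSV as a string;
# pass a filename instead to write it to disk.
csv_text = report.to_csv(index=False)
print(csv_text)
```

With real data, the two `io.StringIO(...)` arguments would simply be replaced by file paths.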
These are favorable times for home buyers: house prices are low. The big problem for first-time buyers is that banks do not dare to take much risk and therefore grant relatively small mortgages. In this article, I try to get to the bottom of Rabobank's current mortgage system (August 2016). Is the system actually fair? Read more · 10 minutes
Disclaimer: The results are valid only when network-attached storage is used in the computing cluster.
The amount of data has grown significantly over the past few years. It is not feasible for a single machine to process such large amounts of data, so the need for distributed data processing frameworks is growing. It all started back in 2011, when version 1.0.0 of Apache Hadoop was released. The Hadoop framework is capable of storing large amounts of data on a cluster. Its storage layer is known as the Hadoop Distributed File System (HDFS), and it is used at almost every company that has to store terabytes of data every day. Then the next problem arose: how can companies process all the stored data? This is where distributed data processing frameworks come into play. In 2014, Apache Spark 1.0 was released, and it now has a large community. Almost every IT department has written at least some lines of Apache Spark code. Companies keep gathering more data, and the demand for faster data processing frameworks keeps growing. Apache Flink (version 1.0 released in March 2016) is a new face in the field of distributed data processing and is one answer to that demand. Read more · 13 minutes
Suppose you have a cluster. Suppose you would like to start monitoring it as soon as possible without installing all kinds of tools on the cluster. A new software package named ISA has been created which can do centralized monitoring for you! This article is a walkthrough for ISA and helps you set up monitoring for your cluster in just a few minutes.
- ISA can collect many node statistics such as CPU usage, memory usage and disk I/O.
- It is easy to set up and it has flexible node configuration.
- ISA has minimal influence on the node statistics it measures.
- No setup is required on the nodes; the statistics are managed centrally.
In this tutorial, we will set up ISA and collect cluster statistics in a CSV file. Read more · 12 minutes
Scala is yet another programming language in the world of programming languages. Its design was started in 2001 by Martin Odersky, a professor at EPFL in Switzerland, and the first version followed a few years later. Fun fact: one of his earlier projects was called Pizza and was a superset of the Java language. Later on, he developed Scala. The name and logo are easy to explain. First of all, Scala stands for SCAlable LAnguage. Another explanation: “scala” means “staircase” in Italian. The logo is based on a particular staircase in one of the buildings of EPFL. Now that we have had a decent introduction to Scala, we can start exploring the language itself. So what is Scala all about? Read more · 11 minutes