

Data visualization in under 15 minutes with the ELK stack
Following my previous blog posts on data visualization with Highcharts and AngularJS, this time I’ll show you how easily the same outcome can be achieved with another software stack – ELK (Elasticsearch, Logstash and Kibana).
As a person who loves sports events, I picked a dataset I can relate to: Premier League season 2011/2012 statistics.
Prerequisites
- Download Elasticsearch
- Download Logstash
- Download Kibana
- Git clone https://github.com/boykodimitroff/ELK-blogpost.git
Let’s start
In Business Intelligence terms, Logstash is the ETL (extract, transform, load) tool of the ELK stack. It needs to be configured to read our dataset, transform the fields into proper data types and, at the end, load everything into Elasticsearch (our data warehouse instance).
Let’s check the csv.conf file from the cloned project.
input {
  file {
    path => "/path/to/dataset"
    # read from the beginning of the file
    start_position => "beginning"
    # load the data every time Logstash starts
    sincedb_path => "/dev/null"
  }
}
filter {
  csv {
    columns => ["Date","Player ID","Player Surname","Player Forename","Team",
      "Team Id","Opposition","Opposition id","Venue","Position Id","Appearances","Time Played",
      "Starts","Substitute On","Substitute Off","Goals","First Goal","Winning Goal",
      "Shots On Target inc goals","Shots Off Target inc woodwork","Blocked Shots",
      "Penalties Taken","Penalty Goals","Penalties Saved","Penalties Off Target",
      "Penalties Not Scored","Direct Free-kick Goals","Direct Free-kick On Target",
      "Direct Free-kick Off Target","Blocked Direct Free-kick","Goals from Inside Box",
      "Shots On from Inside Box","Shots Off from Inside Box","Blocked Shots from Inside Box",
      "Goals from Outside Box","Shots On Target Outside Box","Shots Off Target Outside Box",
      "Blocked Shots Outside Box","Headed Goals","Headed Shots On Target",
      "Headed Shots Off Target","Headed Blocked Shots","Left Foot Goals",
      "Left Foot Shots On Target","Left Foot Shots Off Target","Left Foot Blocked Shots",
      "Right Foot Goals","Right Foot Shots On Target","Right Foot Shots Off Target",
      "Right Foot Blocked Shots","Other Goals","Other Shots On Target","Other Shots Off Target",
      "Other Blocked Shots","Shots Cleared off Line","Shots Cleared off Line Inside Area",
      "Shots Cleared off Line Outside Area","Goals Open Play","Goals from Corners",
      "Goals from Throws","Goals from Direct Free Kick","Goals from Set Play","Goals from penalties",
      "Attempts Open Play on target","Attempts from Corners on target",
      "Attempts from Throws on target","Attempts from Direct Free Kick on target",
      "Attempts from Set Play on target","Attempts from Penalties on target",
      "Attempts Open Play off target","Attempts from Corners off target",
      "Attempts from Throws off target","Attempts from Direct Free Kick off target",
      "Attempts from Set Play off target","Attempts from Penalties off target",
      "Goals as a substitute","Total Successful Passes All", ...]
  }
  mutate {
    convert => [
      "Player ID","integer",
      "Position Id","integer",
      "Appearances","integer",
      "Time Played","integer",
      "Starts","integer",
      "Substitute On","integer",
      "Substitute Off","integer",
      "Goals","integer",
      "First Goal","integer",
      "Winning Goal","integer",
      "Shots On Target inc goals","integer",
      "Shots Off Target inc woodwork","integer",
      "Blocked Shots","integer",
      "Penalties Taken","integer",
      "Penalty Goals","integer",
      "Penalties Saved","integer",
      "Penalties Off Target","integer",
      ...
    ]
  }
}
output {
  stdout { codec => rubydebug }
  elasticsearch { index => "csv_index" }
}
Since our dataset is in csv format, we need to configure a csv filter so that Logstash knows how to read the data. The columns property lists the column names of the csv – the so-called header; these field names will later be used in our queries. The mutate filter tells Logstash how to treat the values under each column (here, converting them to integers). Finally, the output configuration ensures that the data is loaded into the local Elasticsearch instance under an index called csv_index.
Let’s run Elasticsearch and then Logstash with the created configuration file:
- ./elasticsearch
- ./logstash -f /path/to/csv.conf
If everything is okay, Logstash will start extracting the data from the csv file, transform the values according to the specified schema and then load everything into Elasticsearch.
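To double-check that the data actually landed in Elasticsearch, you can query the index directly with curl. This is just a quick sanity check and assumes the default Elasticsearch instance on localhost:9200:

# number of indexed documents – one per csv row
curl 'localhost:9200/csv_index/_count?pretty'

# the generated mapping – confirms which fields were converted to integers
curl 'localhost:9200/csv_index/_mapping?pretty'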
Alright, now that the data is loaded, let’s visualize it. Kibana is the part of the ELK stack that will do this for us. I assume you have already started it, so go to http://localhost:5601.
Our data has a “Date” property which represents the date of a particular match. Every value in this field lies between 2011 and 2012, because we have the statistics for Premier League season 2011/2012. A quick Google search tells me that the season started on August 13, 2011 and ended on May 13, 2012.
By default Kibana shows records for the “Last 15 minutes”, so we need to change the time period.
Click on “Last 15 minutes” in the upper right corner of Kibana’s dashboard. Then select the “Absolute” time range and change the calendar to cover the above-mentioned dates.
We are going to prepare charts with the following three indicators:
- Goals per team with Pie chart
- Top 10 scorers with Bar chart
- Comparison between red and yellow cards per footballer with Line chart
Goals per team
Select Visualize -> Create a visualization -> Pie chart. Click on our newly created index csv_index. You should see something like this:
Since we want to visualize goals per team, we need to select Sum from the Aggregation drop-down. After that, Goals needs to be selected as the Field. Click on Split Slices and select Terms for the Aggregation. A terms aggregation lets you display the top or bottom n elements of a given field, ordered by count or by a custom metric. Our term field will be Team. We want to show every team in the league, so we need to increase the default size from 5 to 20. Finally, click on the play button in the header of the input controls section. The result should be this:
Manchester City scored the most goals in the 2011/2012 season of the Premier League.
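For the curious: under the hood Kibana issues an Elasticsearch aggregation for us. A rough equivalent of the pie chart is the following terms + sum query – a sketch, and depending on your Elasticsearch/Logstash versions you may need to target a not-analyzed subfield such as Team.raw or Team.keyword instead of Team:

# sum of Goals per Team, top 20 teams by total goals
curl -H 'Content-Type: application/json' 'localhost:9200/csv_index/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "goals_per_team": {
      "terms": { "field": "Team", "size": 20, "order": { "total_goals": "desc" } },
      "aggs": { "total_goals": { "sum": { "field": "Goals" } } }
    }
  }
}'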
Top 10 scorers
Select Visualize -> Create a visualization -> Vertical bar chart. Click on our csv_index. You should see something like this:
Expand the Y-axis and select Sum for the Aggregation; the selected Field must be Goals. Add an X-axis with Terms for the Aggregation, based on the Player Surname field. We need the top 10 scorers, so the default size should be increased to 10. Click on the play button. The result should be this:
Looks like Van Persie is the top scorer of the Premier League 11/12 with 28 goals.
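The equivalent Elasticsearch query is almost the same as the one for the pie chart – again a sketch, with the same caveat that you may need a not-analyzed subfield (e.g. Player Surname.keyword):

# sum of Goals per player surname, top 10 players by total goals
curl -H 'Content-Type: application/json' 'localhost:9200/csv_index/_search?pretty' -d '
{
  "size": 0,
  "aggs": {
    "top_scorers": {
      "terms": { "field": "Player Surname", "size": 10, "order": { "total_goals": "desc" } },
      "aggs": { "total_goals": { "sum": { "field": "Goals" } } }
    }
  }
}'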
Comparison between red and yellow cards per footballer – easy with the ELK stack
Select Visualize -> Create a visualization -> Line chart. Click on csv_index. You should see a blank chart with the already familiar input controls.
Expand the Y-axis and select Sum for the Aggregation, based on the Red cards field. Click on the Add metrics button and add another Y-axis with Sum for the Aggregation, based on the Yellow cards field. Add an X-axis with Terms for the Aggregation, based on Player Surname. Increase the default size to 11 in order to see a whole team. Since we know that Manchester City scored the most goals in the league, let’s narrow down the results to just this team: in the search bar, replace the * with Team:"Manchester City".
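As before, the chart maps to a plain Elasticsearch query – a filtered terms aggregation with two sum metrics. This is a sketch: the card-related field names (here assumed to be "Red cards" and "Yellow cards") must match the header of your csv, and the usual not-analyzed subfield caveat applies:

# red and yellow card totals per Manchester City player
curl -H 'Content-Type: application/json' 'localhost:9200/csv_index/_search?pretty' -d '
{
  "size": 0,
  "query": { "match_phrase": { "Team": "Manchester City" } },
  "aggs": {
    "players": {
      "terms": { "field": "Player Surname", "size": 11 },
      "aggs": {
        "red":    { "sum": { "field": "Red cards" } },
        "yellow": { "sum": { "field": "Yellow cards" } }
      }
    }
  }
}'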
Conclusion
It’s amazing what can be achieved with the ELK stack in such a short period of time. No need to install heavy reporting platforms or database instances for data warehousing – data modeling and loading are accomplished with ease. I love it.