Overview of Amazon Kinesis & Kinesis Firehose
What is Data Streaming?
‘Data streaming’ is the real-time flow of data generated by one or more sources. Data streams can be moved to a centralized location for ‘data ingestion’.
‘Data ingestion’ is the process of moving a data stream to a destination where it is stored and analyzed in some fashion. The ingested data can then feed analytics for later use.
What is Amazon Kinesis?
Amazon Kinesis is a scalable ‘data streaming’ service for ingesting data: up to thousands of sources, across many different platforms and services, can ‘stream’ data in real time to Kinesis, which then delivers it to a destination for analysis.
A real-world example: a mobile app with millions of users tracks its users’ touch inputs and streams them to Kinesis, where they can be analyzed for useful information, such as the most frequently used areas of the screen.
Architecture of Using Amazon Kinesis
Producers — any device or computing instance that can use the Kinesis APIs to send data; this could be servers, EC2 instances, IoT devices, mobile devices, etc. Producers supply the data that makes up the data stream.
Amazon Kinesis Service — the ‘data stream’ itself; it ‘ingests’ data from ‘producer devices’ and retains it (for 24 hours by default) for ‘consumption’ by ‘consumer devices’.
Shard — the capacity unit that allows the data stream to scale; each stream starts with 1 shard by default, shared by all attached ‘producers’ and ‘consumers’. Each shard supports 1 MB per second of ‘ingestion’ and 2 MB per second of ‘consumption’. Shards can be added and removed as needed to control scaling.
Consumers — any device or computing instance that is able to use Kinesis APIs to consume data streams.
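The producer and consumer sides of this architecture can be sketched with boto3, the AWS SDK for Python. The stream name, shard ID, and the `put_touch_event`/`read_batch` helper names below are illustrative assumptions, not part of any AWS API; the boto3 calls (`put_record`, `get_shard_iterator`, `get_records`) are the real ones. AWS credentials are assumed to be configured.

```python
import json


def encode_event(event: dict) -> bytes:
    # Serialize one producer event to compact JSON bytes for PutRecord.
    return json.dumps(event, separators=(",", ":")).encode("utf-8")


def put_touch_event(stream_name: str, event: dict) -> None:
    # Producer side: records sharing a partition key land on the same shard,
    # so keying by user keeps one user's events in order.
    import boto3  # AWS SDK for Python; assumed installed and configured

    kinesis = boto3.client("kinesis")
    kinesis.put_record(
        StreamName=stream_name,
        Data=encode_event(event),
        PartitionKey=str(event["user_id"]),
    )


def read_batch(stream_name: str, shard_id: str = "shardId-000000000000") -> list:
    # Consumer side: start from the oldest record still inside the
    # stream's retention window, then fetch a batch.
    import boto3

    kinesis = boto3.client("kinesis")
    iterator = kinesis.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]
    return kinesis.get_records(ShardIterator=iterator, Limit=100)["Records"]
```

A producer would call `put_touch_event("touch-events", {"user_id": 7, "x": 120, "y": 480})`; a consumer polls `read_batch` per shard.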
Creating a Data Stream
This can be done at AWS Console>Services>Amazon Kinesis>Data Streams>Create Data Stream.
Data stream configuration:
- Data stream name — the name assigned to the created data stream
A ‘shard estimator’ is available to properly estimate needed capacity:
- Writing to the stream:
  - Average record size in KB — default is 1024
  - Max records written per second — default is 1
- Reading from the stream:
  - Total number of consumers — default is 1
- Estimated number of open shards — default is 1
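The estimator’s arithmetic can be reproduced from the shard limits above (1 MB/s in, 2 MB/s out per shard). This is a sketch of that math, not the console’s exact formula; the function name is made up for illustration.

```python
import math


def estimate_shards(avg_record_kb: int, records_per_sec: int, consumers: int) -> int:
    # Write side: each shard ingests up to 1 MB (1024 KB) per second.
    write_kb = avg_record_kb * records_per_sec
    # Read side: each shard serves up to 2 MB (2048 KB) per second,
    # and every consumer reads the full stream.
    read_kb = write_kb * consumers
    return max(math.ceil(write_kb / 1024), math.ceil(read_kb / 2048), 1)
```

For example, 1 KB records at 1,000 records/s with one consumer fits in a single shard, but adding four more consumers pushes the read side to 5 MB/s and the estimate to 3 shards.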
Data stream capacity:
- Number of open shards — how many shards are applied to the stream
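The same stream creation can be done programmatically with boto3. The `CreateStream` call and the `stream_exists` waiter are real boto3 APIs; the helper names and the name-validation rule (1–128 characters of letters, digits, underscores, hyphens, and periods, per the Kinesis limits) are sketched here for illustration.

```python
import re


def valid_stream_name(name: str) -> bool:
    # Stream names may use letters, digits, underscores, hyphens, and
    # periods, and must be 1-128 characters long.
    return re.fullmatch(r"[a-zA-Z0-9_.-]{1,128}", name) is not None


def create_stream(name: str, shards: int = 1) -> None:
    # Same operation as Console > Kinesis > Data Streams > Create Data Stream.
    import boto3  # AWS SDK for Python; assumed installed and configured

    if not valid_stream_name(name):
        raise ValueError(f"invalid stream name: {name}")
    kinesis = boto3.client("kinesis")
    kinesis.create_stream(StreamName=name, ShardCount=shards)
    # CreateStream returns immediately; wait until the stream is ACTIVE
    # before producers start writing to it.
    kinesis.get_waiter("stream_exists").wait(StreamName=name)
```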
What is Amazon Kinesis Firehose?
Amazon Kinesis Firehose is a service that takes data streams and, instead of the data being read by a ‘consumer’, delivers it to persistent storage such as Amazon S3 or Amazon Redshift.
Along with those storage destinations, Firehose also supports sending data to services like Elasticsearch or Splunk.
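Delivery to Firehose looks much like writing to a plain stream, except the destination is configured on the delivery stream rather than read by a consumer. A minimal sketch, assuming boto3 is installed and a delivery stream already exists; the helper names and newline framing are illustrative (Firehose concatenates records at the destination, so a delimiter is a common convention), while `PutRecordBatch` is the real API.

```python
import json


def frame(event: dict) -> bytes:
    # Firehose concatenates records at the destination, so append a
    # newline delimiter so downstream readers can split them apart again.
    return (json.dumps(event) + "\n").encode("utf-8")


def deliver(delivery_stream: str, events: list) -> None:
    # Firehose buffers these records and writes them to the configured
    # destination (e.g. an S3 bucket or a Redshift table).
    import boto3  # AWS SDK for Python; assumed installed and configured

    firehose = boto3.client("firehose")
    firehose.put_record_batch(
        DeliveryStreamName=delivery_stream,
        Records=[{"Data": frame(e)} for e in events],
    )
```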