Novel QoS-aware TLS Dataset for Encrypted Traffic Classification

Introduction

The TLS working group is about to release the Encrypted ClientHello (ECH) extension. ECH will improve user confidentiality but complicate traffic classification. In particular, ECH will encrypt the TLS Server Name Indication (SNI) field. The SNI identifies the service that the client wants to connect to and is often used for traffic classification.

Traffic Classification in the Encrypted ClientHello Scenario

Modern traffic classifiers that work in the ECH scenario are based on machine learning. They analyze the unencrypted information of the key exchange and flow statistics such as packet lengths and packet inter-arrival intervals. These classifiers are mostly tested on datasets consisting of TLS 1.2 or non-TLS traffic. However, ECH requires TLS 1.3. Moreover, some of the existing open datasets are poorly labeled, while others lack distinct QoS classes. Finally, many of them are simply outdated. Thus, to test the existing and proposed algorithms properly, we collected a new TLS traffic dataset named WNL that satisfies these requirements.
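As an illustration of the flow statistics mentioned above, the following minimal sketch summarizes one flow from its packet lengths and arrival times. The function name, the feature names, and the choice of statistics are assumptions made for this example; they are not the feature set of any particular classifier.

```python
from statistics import mean, pstdev


def flow_features(lengths, timestamps):
    """Summarize one flow from per-packet lengths (bytes) and arrival times (seconds).

    Illustrative only: the feature names and the selection of statistics
    are assumptions, not a feature set prescribed by the dataset.
    """
    if len(lengths) != len(timestamps):
        raise ValueError("lengths and timestamps must have the same size")
    if not lengths:
        raise ValueError("flow is empty")

    # Inter-arrival intervals between consecutive packets.
    gaps = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

    return {
        "n_packets": len(lengths),
        "len_mean": mean(lengths),
        "len_std": pstdev(lengths),
        "len_min": min(lengths),
        "len_max": max(lengths),
        # A single-packet flow has no intervals, so fall back to 0.0.
        "iat_mean": mean(gaps) if gaps else 0.0,
        "iat_max": max(gaps) if gaps else 0.0,
    }
```

A feature dictionary of this kind could then be passed to any standard machine learning model.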

WNL Dataset Description

The dataset we release includes 3547 flows. The flows are divided by service into 12 classes and by traffic type into 4 types. The first traffic type is live video. We streamed video with OBS on two popular platforms, YouTube Live and Facebook Live. All live video data was collected from PCs. The second traffic type is buffered audio. We played music and podcasts on Spotify, Apple Music, SoundCloud, and Yandex Music. On PCs, the client applications were the browsers Mozilla Firefox, Safari, and Google Chrome; on smartphones, we used the corresponding mobile applications. Similarly, buffered video traffic was collected from the same PCs and smartphones, with the same browsers and the corresponding mobile applications, for the following services: Netflix, YouTube, Vimeo, Kinopoisk, and Amazon PrimeVideo. Finally, the dataset contains web flows: downloads of 100 popular web pages and background web traffic from the services mentioned above.

The data was collected in different networks in 3 cities (Moscow, Dolgoprudny, Zelenograd) on devices with various operating systems (Windows, macOS, Linux, iOS, Android). We collected the traffic in several iterations from the end of 2020 to the end of 2021 in order to diversify the dataset with respect to the time at which each service was used. The table below shows the number of flows of each service in the dataset. It also gives the SNI pattern that was used to label each service.
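Labeling by SNI pattern could be sketched as follows. The patterns and service names in this example are hypothetical placeholders, and glob-style matching is an assumption; the actual patterns are the ones listed in the dataset's table.

```python
from fnmatch import fnmatchcase

# Hypothetical patterns for illustration only; the real SNI patterns
# are those given in the dataset's table.
SNI_PATTERNS = {
    "example-video": ["*.video.example.com"],
    "example-audio": ["audio.example.org", "*.audio.example.org"],
}


def label_flow(sni, patterns=SNI_PATTERNS):
    """Return the first service whose pattern matches the SNI, or None."""
    if not sni:
        return None
    # Host names are case-insensitive and may carry a trailing dot.
    host = sni.lower().rstrip(".")
    for service, globs in patterns.items():
        if any(fnmatchcase(host, glob) for glob in globs):
            return service
    return None
```

Flows whose SNI matches no pattern return `None` and can be discarded or inspected manually.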

We truncated each flow to its first 200 packets. The dataset is therefore relatively lightweight, yet it can still be used to test both kinds of classifiers that work in the ECH scenario: those based on flow statistics and those based on the TLS key exchange. The dataset can be downloaded via the following link.

https://drive.google.com/drive/folders/1F2iVgeMJk2gCLcwLjkPI63kc7YuK6t-8?usp=sharing
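The truncation to the first 200 packets per flow can be sketched as below. This is a minimal illustration and not the script used to build the dataset; the input format, an iterable of `(flow_key, packet)` pairs in capture order, is an assumption.

```python
from collections import defaultdict

MAX_PACKETS = 200  # truncation limit stated in the dataset description


def truncate_flows(packets, limit=MAX_PACKETS):
    """Group (flow_key, packet) pairs by key, keeping the first `limit` packets per flow.

    `flow_key` is assumed to be any hashable flow identifier, for example
    a 5-tuple; `packet` can be any object.
    """
    flows = defaultdict(list)
    for key, packet in packets:
        # Packets beyond the limit are dropped; earlier ones keep capture order.
        if len(flows[key]) < limit:
            flows[key].append(packet)
    return dict(flows)
```

Applied to a capture with one long and one short flow, the long flow is cut to 200 packets and the short one is kept whole.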