Training a Language Model to Mimic Twitch Chat: Part 1 - Gathering Chat Logs

Jordan Mryyan

Overview

Twitch is an exciting data source to experiment with due to the unique cultures that arise from channel to channel. The various subcultures on Twitch take on many distinct and interesting ways of speaking, similar to accents in the real world. This communication is not limited to English words; it also includes phrases that correspond to emotes, which represent the emotions a chat participant may be feeling while watching the stream.

Our goal for this series is to train a language model to mimic this unique way of speaking and expressing emotion. We will accomplish this task by transcribing the audio from past Twitch broadcasts using Distil Whisper, a lightweight and speedy variant of OpenAI's Whisper. We will then pair the transcriptions with the chat logs from the corresponding broadcast to create a training data set that accurately reflects the chat's response to each of the streamer's comments. Finally, we will use that training data to fine-tune Meta's Llama-2-7b-chat model with the help of a powerful tool called Axolotl.

Table of Contents

This series will be broken up into the following six parts, with each part building off the last.

  1. Gathering Chat Logs (You are here!)
  2. Downloading Videos
  3. Bulk Data Gathering
  4. Audio Transcription
  5. Building the Training Data Set
  6. Fine Tuning the Model

The code for each of these sections is available on GitHub: Training-a-Language-Model-to-Mimic-Twitch-Chat.

Downloading Chat Messages

The first dataset we need is a log of all the chat messages sent during a stream. To obtain this data, we can leverage Twitch's unofficial GraphQL endpoint. While there are likely packages available to perform this task for us, in this tutorial, we'll write the code ourselves. Since it's a relatively basic task, doing so will help you become more comfortable with finding and using data from Twitch's GraphQL API.

Finding the Request Schema

If you monitor the outgoing requests from your browser when loading a Twitch VOD, you’ll notice various calls to the "https://gql.twitch.tv/gql" endpoint.

GraphQL_Outbound_Requests.png

The request that we are interested in is the VideoCommentsByOffsetOrCursor query, which has two slightly different implementations. Both requests pass some information specific to GraphQL under the extensions field, and a video id under the variables field. What differs is whether the contentOffsetSeconds or cursor variable is also passed.

The first request sent by Twitch will pass the contentOffsetSeconds variable. This variable is an integer and prompts GraphQL to begin returning batches of comments that were made at or after that timestamp.

GraphQL_contentOffsetSeconds.png

The second implementation passes the cursor variable. This variable is a base64 encoded string that comes from the response data of previous VideoCommentsByOffsetOrCursor calls.

GraphQL_Cursor.png
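
In code, the only difference between the two implementations is the shape of the variables dictionary we send. Here is a minimal sketch of the two shapes (the video id is just an illustration, taken from the example response later in this post):

video_id = "2004234722"

# First request: start returning comments at or after this timestamp
offset_variables = {
    "videoID": video_id,
    "contentOffsetSeconds": 0
}

# Follow-up requests: resume from a cursor returned by a previous call
cursor_variables = {
    "videoID": video_id,
    "cursor": "eyJpZCI6..."  # base64 encoded string from a prior response (truncated here)
}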

Now that we have the queries we need, let’s try to manually page through some of the comments before we code up a way to do it automatically.

Postman_Missing_ClientId.png

As you can see from the above image, sending this query with no modifications results in a bad request. That is because we need to pass a valid Client-Id in our headers.

Finding the Client-Id

If we go back to the browser and take a look at the request headers, we should be able to find the Client-Id that Twitch is passing to the GraphQL endpoint.

Website_ClientId.png

Retrying the original request while passing the Client-Id header we found in our browser will result in the first page of comments being returned. In the response, you should also note the cursor value being returned, which we can use to page through all the comments.

GraphQL_Offset_Response.png

Using this cursor to page for the next batch of comments will result in an unexpected "failed integrity checks" error from the GraphQL API as shown below.

Failed_Integrity_Check.png

Your initial reaction might be to grab the Client-Integrity header seen in your browser’s outgoing request to the GraphQL endpoint. If you decode the base64-encoded value of that header, you’ll notice two interesting fields: exp and is_bad_bot.

Decoded_Integrity.png

The presence of the exp, or expiration, field likely means that if we want to use the Client-Integrity token in our requests, we'll need to periodically generate a new one while running our script.

The second field is the is_bad_bot property. I’ll skip over trying to programmatically call Twitch’s integrity API to save you some time. Due to Twitch’s anti-bot implementation, when you request an integrity token programmatically, is_bad_bot will be set to true and any calls made with that token will be invalidated.
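
If you'd like to poke at the token yourself, here is a rough sketch. Which dot-separated segment of the token holds the JSON payload depends on the token format, so this illustrative helper (not part of any Twitch API) simply tries to decode every segment as base64url-encoded JSON and prints whatever succeeds:

import base64
import json

def probe_token_segments(token: str):
    for segment in token.split("."):
        # Restore any base64 padding that was stripped from the segment
        padded = segment + "=" * (-len(segment) % 4)
        try:
            print(json.loads(base64.urlsafe_b64decode(padded)))
        except ValueError:
            continue  # segment wasn't base64url-encoded JSON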

So, how do we get the chats then? It turns out the web, iOS, and Android versions of Twitch all use different Client-Ids. Lucky for us, Twitch only requires the integrity check for requests that use the web’s Client-Id. You could get the other Client-Ids by monitoring the traffic from your mobile device and looking in the headers, though to save some time, we can just snag one from an amazing tool we plan on using later in this project: twitch-dl.

Using the Client-Id found in twitch-dl, you can see that paging through the comments is now successful.

Postman_Paging.png

With manual testing done, we can now move on to writing the script to gather the chat logs for us.

Programmatically Downloading Chats

To start, let’s set up our request information by creating a headers dictionary that contains the Client-Id found above as well as a fake User-Agent generated by the fake_useragent package.

import json
import requests
from fake_useragent import UserAgent
from pathlib import Path
 
# Set up request info
user_agent = UserAgent()
request_headers = {
    "Client-Id": "kd1unb4b3q4t58fwlpcbzcbnm76a8fp",
    "User-Agent": user_agent.random
}
 
graphql_endpoint = "https://gql.twitch.tv/gql"

Next, we will write a basic method to send the GraphQL request to Twitch. It should be fine to hard-code all of the GraphQL-specific fields and pass in the variables, so that we can easily switch between offset and cursor querying. When making the POST request, we will pass the headers defined above.

def send_graphql_request(variables):
    # Twitch batches GraphQL operations, so the payload is a list of queries
    query = [
        {
            "operationName": "VideoCommentsByOffsetOrCursor",
            "variables": variables,
            "extensions": {
                # Persisted query hash captured from the browser's outgoing request
                "persistedQuery": {
                    "version": 1,
                    "sha256Hash": "b70a3591ff0f4e0313d126c6a1502d79a1c02baebb288227c582044aa76adf6a"
                }
            }
        }
    ]
    response = requests.post(graphql_endpoint, json=query, headers=request_headers)
    return response.json()
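
As a quick sanity check, we can fetch the first batch of comments by passing the offset variables. The video id here is the one from the example response below:

response = send_graphql_request({"videoID": "2004234722", "contentOffsetSeconds": 0})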

Now, we need to write a few helper methods to extract some key information from the Twitch response. Below is an example of one such response. The fields that we’re interested in are the cursor, message, and hasNextPage.

[
    {
        "data": {
            "video": {
                "id": "2004234722",
                "creator": {...},
                "comments": {
                    "edges": [
                        {
                            "cursor": "eyJpZCI6IjE3MTE0NjA1LWE1MGEtNDY3OC1hYjU1LTkyZTQxNmVmNDI5MyIsImhrIjoiYnJvYWRjYXN0OjQzMjI5NDMzMDM1Iiwic2siOiJBQUFBYVVmS05BQVhvTlNMaHZnbkFBIn0",
                            "node": {
                                "id": "ec545849-6bd9-45d8-a8df-f2fa5eeba139",
                                "commenter": {...},
                                "contentOffsetSeconds": 257,
                                "createdAt": "2023-12-14T22:49:19.822Z",
                                "message": {
                                    "fragments": [
                                        {
                                            "emote": null,
                                            "text": "oh so kind ",
                                            "__typename": "VideoCommentMessageFragment"
                                        },
                                        {
                                            "emote": {
                                                "id": "555555560;11;12",
                                                "emoteID": "555555560",
                                                "from": 11,
                                                "__typename": "EmbeddedEmote"
                                            },
                                            "text": ":D",
                                            "__typename": "VideoCommentMessageFragment"
                                        }
                                    ],
                                    "userBadges": [...]
                                    ],
                                    "userColor": "#008000",
                                    "__typename": "VideoCommentMessage"
                                },
                                "__typename": "VideoComment"
                            },
                            "__typename": "VideoCommentEdge"
                        }
                    ],
                    "pageInfo": {
                        "hasNextPage": true,
                        "hasPreviousPage": true,
                        "__typename": "PageInfo"
                    },
                    "__typename": "VideoCommentConnection"
                },
                "__typename": "Video"
            }
        },
        "extensions": {...}
    }
]

The code to extract this data is simple enough. Keep in mind that the cursor is consistent across the comment nodes, so we can take the value from any valid index.

def extract_comments(response):
    comments = response[0]['data']['video']['comments']['edges']
    return [comment['node'] for comment in comments]
 
def has_next_page(response):
    return response[0]['data']['video']['comments']['pageInfo']['hasNextPage']
 
def extract_cursor(response):
    return response[0]['data']['video']['comments']['edges'][0]['cursor']
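
Continuing the sanity check from earlier, running these helpers against that first page should print a batch size, a boolean, and a cursor string:

comments = extract_comments(response)
print(len(comments), has_next_page(response), extract_cursor(response))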

With the ability to extract all the information we need from the response, we can now begin to page through the comments. We can achieve this by continually calling send_graphql_request with the new cursor until has_next_page returns false.

Once we have paged through the entire chat log, we will dump the comments to a file. Naming the file after the broadcast's video id will make processing the data easier later on.

def get_video_comments(video_id, output_path):
    all_comments = []
    variables = {
        "videoID": video_id,
        "contentOffsetSeconds": 0
    }
 
    while True:
        # Get the batch of comments
        response = send_graphql_request(variables)
        comments = extract_comments(response)
 
        print(f"Received {len(comments)} comments...")
        all_comments.extend(comments)
 
        # Check if there are more pages
        if not has_next_page(response):
            break
 
        next_cursor = extract_cursor(response)
        print(f"Querying next link: {next_cursor}")
 
        # Update cursor for the next page
        variables = {
            "videoID": video_id,
            "cursor": next_cursor
        }
 
    # Write the comments to disk
    output_file_path = Path(output_path) / f"{video_id}.json"
    output_file_path.parent.mkdir(parents=True, exist_ok=True)  # ensure the output directory exists
    with open(output_file_path, "w") as f:
        json.dump(all_comments, f)

Finally, we’ll write the driver method. This method will handle retries if we fail to connect to Twitch’s GraphQL endpoint (which happens relatively often). We’ll also save the failed video ids so that we can replay them later if we want.

def download_chats(video_id: str):
    chat_output_path = Path("./chats")
 
    if chat_output_path.joinpath(f"{video_id}.json").exists():
        print(f"Chats for video {video_id} already exist. Skipping.")
        return
    
    print(f"Getting chats for video: {video_id}")
 
    retry_count = 0
    while retry_count < 3:
        try:
            get_video_comments(video_id, chat_output_path)
            return
        except Exception as ex:
            print(ex)
            print(f"Failed to get chats for video {video_id}... Retrying.")
            retry_count += 1
    
    print(f"Retries failed for video: {video_id}. Logging error and continuing.")
    with open("chat_error.txt", "a") as f:
        f.write(video_id + '\n')
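
To put it all together, call download_chats with the id of any VOD. Below is a minimal sketch of a driver that also replays any ids previously logged to chat_error.txt; this __main__ block is an illustration, not part of the repository linked above:

if __name__ == "__main__":
    download_chats("2004234722")  # example video id

    # Replay previously failed downloads; repeated failures are simply
    # re-appended to chat_error.txt by download_chats.
    error_log = Path("chat_error.txt")
    if error_log.exists():
        failed_ids = error_log.read_text().splitlines()
        error_log.unlink()  # clear the log so this run starts fresh
        for failed_id in failed_ids:
            download_chats(failed_id)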

Conclusion

And that’s it! We’ve successfully written a script to download the chat log for any given Twitch video. Next, we’ll need to download the audio/video. While you could use a method similar to the above, downloading video is much more involved. To make it easier on ourselves, we will be leveraging twitch-dl and my developer-friendly fork, twitch-dl-API.

© Jordan Mryyan