GitHub - jdepoix/youtube-transcript-api: This is a python API which allows you to get the transcript/subtitles for a given YouTube video. It also works for automatically generated subtitles and it does not require an API key nor a headless browser, like other selenium based solutions do!
Key Points
- The `youtube-transcript-api` is a Python library for retrieving YouTube video transcripts and subtitles, including automatically generated ones, without requiring an API key or a headless browser.
- It offers functionalities to fetch transcripts by video ID, list available languages, translate content, and convert output into various formats like JSON or SRT.
- The tool also includes a command-line interface and robust proxy support to mitigate IP bans, though it leverages an undocumented YouTube API.
This summary describes the `youtube-transcript-api` Python library, an open-source tool for programmatically retrieving transcripts and subtitles for YouTube videos. Its primary goal is to provide a lightweight, efficient, and self-contained solution that requires neither an API key nor a headless browser or Selenium, which other transcript extraction methods commonly rely on.
The library works by reverse-engineering and calling an undocumented part of the YouTube API: the internal endpoints that the YouTube web client itself uses to fetch transcript data. This avoids both official API rate limits and the overhead of browser automation. The library makes HTTP requests directly to these internal endpoints, parses the returned data, and structures it into Python objects. It handles both manually created and automatically generated subtitles and also supports translating them.
Key functionalities include:
- Transcript Fetching: Users can retrieve transcripts for a given `video_id` using `YouTubeTranscriptApi().fetch(video_id)`. By default, it attempts to fetch English transcripts but allows specifying a list of preferred language codes (e.g., `['de', 'en']`) in descending priority. The fetched data is encapsulated in a `FetchedTranscript` object, which behaves like a list of `FetchedTranscriptSnippet` objects, each containing `text`, `start` time, and `duration`. A `to_raw_data()` method provides the data as a list of dictionaries: `[{'text': '...', 'start': 0.0, 'duration': 1.54}, ...]`. The `FetchedTranscript` object also includes metadata such as `video_id`, `language`, `language_code`, and `is_generated`.
- Transcript Listing and Filtering: The `YouTubeTranscriptApi().list(video_id)` method returns a `TranscriptList` object, which allows users to discover all available transcripts for a video. This object supports filtering to find specific transcript types, such as `find_transcript()`, `find_manually_created_transcript()`, or `find_generated_transcript()`. These methods return `Transcript` objects, which contain metadata (e.g., `is_generated`, `is_translatable`, `translation_languages`) and a `fetch()` method to retrieve the actual transcript data.
- Transcript Translation: The library leverages YouTube's automatic translation feature. A `Transcript` object can be translated using its `translate(target_language_code)` method, which returns a new `Transcript` object representing the translated version.
- Proxy Support and IP Ban Workarounds: Recognizing that YouTube actively blocks IPs from cloud providers or those making excessive requests, the library provides robust proxy configuration. It offers specialized integration with Webshare's residential rotating proxies via `WebshareProxyConfig`, allowing users to specify credentials and filter IP locations. Alternatively, a `GenericProxyConfig` class supports any standard HTTP/HTTPS/SOCKS proxy. This mechanism, based on `requests.Session`, helps mitigate `RequestBlocked` or `IpBlocked` exceptions by rotating through a pool of proxy addresses.
- Custom `requests.Session` Integration: For advanced users, the library allows injecting a pre-configured `requests.Session` object into the `YouTubeTranscriptApi` constructor. This enables fine-grained control over HTTP client settings, such as custom headers, SSL verification, or cookie sharing across multiple instances.
- Formatters: A `formatters` submodule provides utilities to convert `FetchedTranscript` objects into various string formats. Built-in formatters include `JSONFormatter`, `TextFormatter`, `WebVTTFormatter`, and `SRTFormatter`. Users can also implement custom formatters by inheriting from the `Formatter` base class.
- Command-Line Interface (CLI): The library exposes its core functionalities via a CLI, allowing users to fetch, list, and translate transcripts directly from the command line, with options for language preferences, exclusion of certain transcript types, output formatting (e.g., JSON), and proxy configuration.
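The fetching, listing, and translation calls listed above can be combined roughly as follows. This is a minimal sketch rather than the library's documented recipe. The helper names `fetch_as_dicts`, `fetch_translated`, and `join_text` are hypothetical, and the `api` argument is assumed to be a `YouTubeTranscriptApi()` instance (or any object exposing the same `fetch` and `list` methods). Passing the object in, instead of importing the library inside the helpers, keeps the sketch runnable without network access.

```python
from typing import Any, Dict, List, Sequence


def fetch_as_dicts(api: Any, video_id: str,
                   languages: Sequence[str] = ("en",)) -> List[Dict[str, Any]]:
    """Fetch a transcript and return it as a list of plain dictionaries.

    `api` is assumed to be a YouTubeTranscriptApi() instance. `languages`
    is tried in descending priority, e.g. ("de", "en").
    """
    fetched = api.fetch(video_id, languages=list(languages))
    # Each entry looks like {'text': '...', 'start': 0.0, 'duration': 1.54}.
    return fetched.to_raw_data()


def fetch_translated(api: Any, video_id: str,
                     source_languages: Sequence[str],
                     target_language: str) -> List[Dict[str, Any]]:
    """Find a transcript in one of `source_languages` and translate it."""
    transcript = api.list(video_id).find_transcript(list(source_languages))
    if not transcript.is_translatable:
        # Hypothetical policy for this sketch: fail early with a clear message.
        raise ValueError(f"transcript for {video_id!r} cannot be translated")
    # translate() returns a new Transcript; fetch() retrieves its data.
    return transcript.translate(target_language).fetch().to_raw_data()


def join_text(snippets: List[Dict[str, Any]]) -> str:
    """Join the 'text' fields of raw snippet dictionaries into one string."""
    return " ".join(snippet["text"] for snippet in snippets)
```

With the library installed, `fetch_as_dicts(YouTubeTranscriptApi(), video_id, ("de", "en"))` would be the intended call. The explicit `is_translatable` check is a design choice of this sketch, not something the library requires.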
Limitations and Warnings:
A significant caveat is the reliance on an undocumented YouTube API: the library can break whenever YouTube alters its internal API structure, and it then depends on prompt updates from the maintainers. Additionally, while cookie-based authentication for age-restricted videos was planned, the current implementation is noted as broken due to recent changes in YouTube's authentication mechanisms. Finally, because YouTube's IP bans are a persistent problem, a proxy alone does not guarantee reliability; rotating proxy pools give the best results.
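Since blocked requests are a recurring failure mode, callers may want to retry a fetch after a short delay. The following is a minimal sketch of such a wrapper, not a feature of the library. The helper `retry_when_blocked` and its parameters are hypothetical, and the exception classes to retry on are passed in by the caller (for example, the `RequestBlocked` and `IpBlocked` exceptions named above) so that the sketch makes no assumption about import paths. Retrying is only likely to help when each attempt can leave through a different address, as with a rotating proxy pool.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_when_blocked(action: Callable[[], T],
                       blocked_errors: Tuple[Type[BaseException], ...],
                       attempts: int = 3,
                       base_delay: float = 1.0,
                       sleep: Callable[[float], None] = time.sleep) -> T:
    """Call `action`, retrying with exponential backoff on `blocked_errors`.

    All names here are illustrative. `sleep` is injectable so the wait can
    be replaced in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except blocked_errors:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error unchanged
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

A call might look like `retry_when_blocked(lambda: api.fetch(video_id), (RequestBlocked, IpBlocked))`, where `api` is a `YouTubeTranscriptApi` instance configured with a proxy.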