  • Language
    Python
  • License
    MIT License

A database of movie scripts from several sources

The Movie Script Database

This is a utility that lets you collect movie scripts from several sources and build a database of 2.5k+ movie scripts as .txt files, along with metadata for the movies.

There are four steps to the whole process:

  1. Collect scripts from various sources - Scrape websites for scripts in HTML, txt, doc or pdf format.
  2. Collect metadata - Get metadata about the scripts from TMDb and IMDb for additional processing.
  3. Find duplicates from different sources - Automatically group and remove duplicates from different sources.
  4. Parse scripts - Convert scripts into lines with just the character and their dialogue.
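
The four steps correspond to the four scripts named in the Usage section (get_scripts.py, get_metadata.py, clean_files.py, parse_files.py). As a minimal sketch, and not part of the repository, the order could be enforced with a small runner. The names `build_commands` and `run_pipeline` are hypothetical helpers introduced here for illustration.

```python
import subprocess
import sys

# Script names as given in the Usage section, in the required order.
STEPS = ["get_scripts.py", "get_metadata.py", "clean_files.py", "parse_files.py"]


def build_commands(python=sys.executable, steps=STEPS):
    """Return one command (argument list) per step, preserving the order."""
    return [[python, step] for step in steps]


def run_pipeline(commands):
    """Run each command in turn and stop at the first failure."""
    for cmd in commands:
        # check=True raises CalledProcessError on a non-zero exit code,
        # so a later step never runs after an earlier one has failed.
        subprocess.run(cmd, check=True)
```

Calling `run_pipeline(build_commands())` from the repository root would run the whole process. Given the 4+ hour download time mentioned in the Usage section, running the steps by hand may still be preferable the first time.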

Usage

The following steps MUST be run in order.

Clone

Clone this repository:

git clone https://github.com/Aveek-Saha/Movie-Script-Database.git
cd Movie-Script-Database

Dependencies

First, read the textract installation instructions.

Then install all dependencies using pip:

pip install -r requirements.txt

Collect movie scripts

Choose the sources you want to download in sources.json. To include a source, set its value to true; to skip it, set its value to false.

python get_scripts.py

This collects scripts from every enabled source. The sources.json below enables all of them:

{
    "imsdb": "true",
    "screenplays": "true",
    "scriptsavant": "true",
    "dailyscript": "true",
    "awesomefilm": "true",
    "sfy": "true",
    "scriptslug": "true",
    "actorpoint": "true",
    "scriptpdf": "true"
}
  • This might take a while (4+ hours), depending on your network connection.
  • The script uses parallel processing to speed up the downloads.
  • If some downloads are missing or incomplete, running the script again downloads only the missing scripts.
  • For scripts in PDF or DOC format, the original file is stored in the temp directory.
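
To check which sources are switched on before starting a long download, the file can be inspected with a few lines of Python. This is a sketch only: `enabled_sources` is a hypothetical helper, and it assumes the flags are stored as the strings "true"/"false" as in the sample above (real JSON booleans are also accepted).

```python
import json


def enabled_sources(config):
    """Return the names of the sources whose flag is switched on.

    Flags are compared as text, so both the string "true" from the
    sample file and a real JSON boolean true count as enabled.
    """
    return [name for name, flag in config.items()
            if str(flag).strip().lower() == "true"]


# Inline sample in the same shape as sources.json.
sample = json.loads('{"imsdb": "true", "sfy": "false", "scriptpdf": "true"}')
print(enabled_sources(sample))  # ['imsdb', 'scriptpdf']
```

For the real file, pass `json.load(open("sources.json"))` instead of the inline sample.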

Collect metadata

Collect metadata from TMDb and IMDb:

python get_metadata.py

You'll need an API key to use the TMDb API; the TMDb documentation explains how to get one. Once you have the key, store it in a file called config.py in this format:

tmdb_api_key = "<Your API key>"
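
Before starting the metadata step, it can be worth confirming that the key is actually set. The following is a minimal sketch, not part of the repository: `read_tmdb_key` is a hypothetical helper that executes config.py with the standard-library `runpy` module and rejects the placeholder value shown above.

```python
import runpy

# The placeholder text from the config.py template.
PLACEHOLDER = "<Your API key>"


def read_tmdb_key(path="config.py"):
    """Execute the config file and return tmdb_api_key, or raise if unset."""
    settings = runpy.run_path(path)  # dict of the file's top-level names
    key = settings.get("tmdb_api_key", "")
    if not key or key == PLACEHOLDER:
        raise ValueError(f"tmdb_api_key is not set in {path}")
    return key
```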

This step will also combine duplicates, and your final metadata will be in this format:

{
    "uniquescriptname": {
        "files": [
            {
                "name": "Duplicate 1",
                "source": "Source of the script",
                "file_name": "name-of-the-file",
                "script_url": "Original link to script",
                "size": "size of file"
            },
            {
                "name": "Duplicate 2",
                "source": "Source of the script",
                "file_name": "name-of-the-file",
                "script_url": "Original link to script",
                "size": "size of file"
            }
        ],
        "tmdb": {
            "title": "Title from TMDb",
            "release_date": "Date released",
            "id": "TMDb ID",
            "overview": "Plot summary"
        },
        "imdb": {
            "title": "Title from IMDb",
            "release_date": "Year released",
            "id": "IMDb ID"
        }
    }
}
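
As an illustration of how this structure can be read, the sketch below lists the script names that still have more than one file attached, together with the source of each copy. `duplicate_groups` is a hypothetical helper, and the sample data is invented to match the format above.

```python
def duplicate_groups(metadata):
    """Map each script name with 2+ files to the sources of those files."""
    return {
        name: [f["source"] for f in entry.get("files", [])]
        for name, entry in metadata.items()
        if len(entry.get("files", [])) > 1
    }


# Invented sample in the combined-metadata format.
sample = {
    "alien": {"files": [{"source": "imsdb"}, {"source": "dailyscript"}]},
    "heat": {"files": [{"source": "scriptslug"}]},
}
print(duplicate_groups(sample))  # {'alien': ['imsdb', 'dailyscript']}
```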

Remove duplicates

Run:

python clean_files.py

This removes as many duplicate files as possible while avoiding false positives. The remaining files are stored in the scripts/filtered directory.

A new metadata file is created where only one file exists for each unique script name, in this format:

{
    "uniquescriptname": {
        "file": {
            "name": "Movie name from source",
            "source": "Source of the script",
            "file_name": "name-of-the-file",
            "script_url": "Original link to script",
            "size": "size of file"
        },
        "tmdb": {
            "title": "Title from TMDb",
            "release_date": "Date released",
            "id": "TMDb ID",
            "overview": "Plot summary"
        },
        "imdb": {
            "title": "Title from IMDb",
            "release_date": "Year released",
            "id": "IMDb ID"
        }
    }
}

The scripts are also cleaned to remove as many of the formatting artifacts introduced by OCR on PDF files as possible.
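
With one file per script name, the metadata can be summarised directly. The sketch below counts how many of the kept scripts came from each source. `scripts_per_source` is a hypothetical helper and the sample entries are invented to match the single-file format above.

```python
from collections import Counter


def scripts_per_source(metadata):
    """Count the kept scripts by the source they were downloaded from."""
    return Counter(entry["file"]["source"] for entry in metadata.values())


# Invented sample in the de-duplicated metadata format.
sample = {
    "alien": {"file": {"source": "imsdb"}},
    "heat": {"file": {"source": "scriptslug"}},
    "se7en": {"file": {"source": "imsdb"}},
}
print(scripts_per_source(sample).most_common())
```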

Parse Scripts

Run:

python parse_files.py

This parses the non-duplicate scripts from the previous step. The parsed scripts are put into three folders:

  • scripts/parsed/tagged: Contains scripts where each line has been tagged. The tags are
    • S = Scene
    • N = Scene description
    • C = Character
    • D = Dialogue
    • E = Dialogue metadata
    • T = Transition
    • M = Metadata
  • scripts/parsed/dialogue: Contains scripts where each line has the character name followed by their dialogue, in the format C=>D
  • scripts/parsed/charinfo: Contains a list of each character in the script and the number of lines they have, in the format C: Number of lines
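
The dialogue and charinfo line formats above can be read back with simple string splitting. This is a sketch under assumptions: the exact whitespace around `=>` and `:` is not specified here, so both helpers (hypothetical names) strip surrounding spaces, and the sample lines are invented.

```python
def parse_dialogue_line(line):
    """Split a 'C=>D' line into (character, dialogue)."""
    character, sep, dialogue = line.rstrip("\n").partition("=>")
    if not sep:
        raise ValueError(f"not a dialogue line: {line!r}")
    return character.strip(), dialogue.strip()


def parse_charinfo_line(line):
    """Split a 'C: Number of lines' line into (character, count)."""
    # rpartition splits on the last colon, in case a name contains one.
    character, sep, count = line.rstrip("\n").rpartition(":")
    if not sep:
        raise ValueError(f"not a charinfo line: {line!r}")
    return character.strip(), int(count)


print(parse_dialogue_line("RIPLEY=>Get away from her."))
print(parse_charinfo_line("RIPLEY: 212"))
```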

A new metadata file is created with the following format:

{
    "uniquescriptname": {
        "file": {
            "name": "Movie name from source",
            "source": "Source of the script",
            "file_name": "name-of-the-file",
            "script_url": "Original link to script",
            "size": "size of file"
        },
        "tmdb": {
            "title": "Title from TMDb",
            "release_date": "Date released",
            "id": "TMDb ID",
            "overview": "Plot summary"
        },
        "imdb": {
            "title": "Title from IMDb",
            "release_date": "Year released",
            "id": "IMDb ID"
        },
        "parsed": {
            "dialogue": "name-of-the-file_dialogue.txt",
            "charinfo": "name-of-the-file_charinfo.txt",
            "tagged": "name-of-the-file_parsed.txt"
        }
    }
}
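
The "parsed" block stores file names only. Assuming each file lives in the folder of the same name under scripts/parsed, as the folder list above suggests, the full paths can be rebuilt as sketched here. `parsed_paths` is a hypothetical helper.

```python
from pathlib import PurePosixPath


def parsed_paths(entry, root="scripts/parsed"):
    """Return {kind: path} for the parsed outputs of one metadata entry."""
    base = PurePosixPath(root)
    # Each key (dialogue, charinfo, tagged) is assumed to match its folder name.
    return {kind: str(base / kind / file_name)
            for kind, file_name in entry["parsed"].items()}


# Invented entry in the final metadata format.
entry = {"parsed": {"dialogue": "alien_dialogue.txt",
                    "charinfo": "alien_charinfo.txt",
                    "tagged": "alien_parsed.txt"}}
print(parsed_paths(entry)["tagged"])  # scripts/parsed/tagged/alien_parsed.txt
```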

Directory structure

After running all the steps, your folder structure should look something like this:

scripts
│
├── unprocessed // Scripts from sources
│   ├── source1
│   ├── source2
│   └── source3
│
├── temp // PDF files from sources
│   ├── source1
│   ├── source2
│   └── source3
│
├── metadata // Metadata files from sources/cleaned metadata
│   ├── source1.json
│   ├── source2.json
│   ├── source3.json
│   └── meta.json
│
├── filtered // Scripts with duplicates removed
│
└── parsed // Scripts parsed using the parser
    ├── dialogue
    ├── charinfo
    └── tagged
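
A quick way to confirm that every step has run is to check for the folders in the tree above. This is a minimal sketch: `missing_dirs` is a hypothetical helper, and it only checks that the folders exist, not what they contain.

```python
from pathlib import Path

# Folders taken from the tree above, relative to the scripts directory.
EXPECTED_DIRS = [
    "unprocessed",
    "temp",
    "metadata",
    "filtered",
    "parsed/dialogue",
    "parsed/charinfo",
    "parsed/tagged",
]


def missing_dirs(root="scripts"):
    """Return the expected folders that do not exist under root."""
    base = Path(root)
    return [d for d in EXPECTED_DIRS if not (base / d).is_dir()]
```

An empty list from `missing_dirs()` means all the expected folders are present.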

Sources

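Metadata: TMDb and IMDb.

Scripts: the sites keyed in sources.json (imsdb, screenplays, scriptsavant, dailyscript, awesomefilm, sfy, scriptslug, actorpoint, scriptpdf).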

Citing

If you use The Movie Script Database, please cite:

@misc{Saha_Movie_Script_Database_2021,
    author = {Saha, Aveek},
    month = {7},
    title = {{Movie Script Database}},
    url = {https://github.com/Aveek-Saha/Movie-Script-Database},
    year = {2021}
}

Credits

The parser for the movie scripts comes from this paper: Linguistic analysis of differences in portrayal of movie characters, in: Proceedings of Association for Computational Linguistics, Vancouver, Canada, 2017. The code can be found here: https://github.com/usc-sail/mica-text-script-parser
