  • Stars: 18,484
  • Rank 1,425 (Top 0.03 %)
  • Language: TypeScript
  • License: ISC License
  • Created about 1 year ago
  • Updated 4 months ago

Repository Details

Crawl a site to generate knowledge files to create your own custom GPT from a URL

GPT Crawler

Crawl a site to generate knowledge files to create your own custom GPT from one or multiple URLs

Gif showing the crawl run

Example

Here is a custom GPT that I quickly made to help answer questions about how to use and integrate Builder.io by simply providing the URL to the Builder docs.

This project crawled the docs and generated the file that I uploaded as the basis for the custom GPT.

Try it out yourself by asking questions about how to integrate Builder.io into a site.

Note: you may need a paid ChatGPT plan to access this feature.

Get started

Running locally

Clone the repository

Be sure you have Node.js >= 16 installed.

git clone https://github.com/builderio/gpt-crawler

Install dependencies

npm i

Configure the crawler

Open config.ts and edit the url and selector properties to match your needs.

E.g. to crawl the Builder.io docs to make our custom GPT you can use:

export const defaultConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
};
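
To give a feel for how a pattern such as the match value above selects links, here is a minimal sketch of glob matching. It is a simplified illustration, not the matcher the crawler actually uses: globToRegExp and matchesGlob are hypothetical helpers, and the rules assumed here are that "**" matches any characters and "*" matches any characters except "/".

```typescript
// Simplified glob matcher for illustration only; the crawler's real matching
// logic may differ. Assumed rules: "**" matches any characters, "*" matches
// any characters except "/".
function globToRegExp(glob: string): RegExp {
  // Escape regex metacharacters, leaving "*" untouched for the next step.
  const escaped = glob.replace(/[.+?^${}()|[\]\\]/g, "\\$&");
  const pattern = escaped
    .replace(/\*\*/g, "\u0000") // temporary placeholder for "**"
    .replace(/\*/g, "[^/]*") // single star stays within one path segment
    .replace(/\u0000/g, ".*"); // double star may cross "/" boundaries
  return new RegExp(`^${pattern}$`);
}

function matchesGlob(url: string, glob: string): boolean {
  return globToRegExp(glob).test(url);
}

const match = "https://www.builder.io/c/docs/**";
console.log(matchesGlob("https://www.builder.io/c/docs/developers", match)); // true
console.log(matchesGlob("https://www.builder.io/blog/post", match)); // false
```

Under these assumptions, only links below /c/docs/ would be followed, which is why the example config stays inside the docs section of the site.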

See config.ts for all available options. Here is a sample of the common configuration options:

type Config = {
  /** URL to start the crawl. If a sitemap is provided, it is used instead and all pages in the sitemap are downloaded */
  url: string;
  /** Pattern to match against for links on a page to subsequently crawl */
  match: string;
  /** Selector to grab the inner text from */
  selector: string;
  /** Don't crawl more than this many pages */
  maxPagesToCrawl: number;
  /** File name for the finished data */
  outputFileName: string;
  /** Optional resources to exclude
   *
   * @example
   * ['png','jpg','jpeg','gif','svg','css','js','ico','woff','woff2','ttf','eot','otf','mp4','mp3','webm','ogg','wav','flac','aac','zip','tar','gz','rar','7z','exe','dmg','apk','csv','xls','xlsx','doc','docx','pdf','epub','iso','dmg','bin','ppt','pptx','odt','avi','mkv','xml','json','yml','yaml','rss','atom','swf','txt','dart','webp','bmp','tif','psd','ai','indd','eps','ps','zipx','srt','wasm','m4v','m4a','webp','weba','m4b','opus','ogv','ogm','oga','spx','ogx','flv','3gp','3g2','jxr','wdp','jng','hief','avif','apng','avifs','heif','heic','cur','ico','ani','jp2','jpm','jpx','mj2','wmv','wma','aac','tif','tiff','mpg','mpeg','mov','avi','wmv','flv','swf','mkv','m4v','m4p','m4b','m4r','m4a','mp3','wav','wma','ogg','oga','webm','3gp','3g2','flac','spx','amr','mid','midi','mka','dts','ac3','eac3','weba','m3u','m3u8','ts','wpl','pls','vob','ifo','bup','svcd','drc','dsm','dsv','dsa','dss','vivo','ivf','dvd','fli','flc','flic','flic','mng','asf','m2v','asx','ram','ra','rm','rpm','roq','smi','smil','wmf','wmz','wmd','wvx','wmx','movie','wri','ins','isp','acsm','djvu','fb2','xps','oxps','ps','eps','ai','prn','svg','dwg','dxf','ttf','fnt','fon','otf','cab']
   */
  resourceExclusions?: string[];
  /** Optional maximum file size in megabytes to include in the output file */
  maxFileSize?: number;
  /** Optional maximum number of tokens to include in the output file */
  maxTokens?: number;
};
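
As an illustration of the optional fields, the sketch below extends the Builder.io docs example with resourceExclusions, maxFileSize and maxTokens. The Config type is repeated locally so the snippet stands alone; the name docsConfig and the specific values (three excluded extensions, 1 MB, 2,000,000 tokens) are assumptions chosen for the example, not recommended settings.

```typescript
// Local copy of the Config shape described above, so the snippet is self-contained.
type Config = {
  url: string;
  match: string;
  selector: string;
  maxPagesToCrawl: number;
  outputFileName: string;
  resourceExclusions?: string[];
  maxFileSize?: number;
  maxTokens?: number;
};

// Hypothetical config: all optional values below are illustrative assumptions.
const docsConfig: Config = {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: `.docs-builder-container`,
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
  // Skip resources with these extensions.
  resourceExclusions: ["png", "jpg", "pdf"],
  // Limit on output file size, in megabytes.
  maxFileSize: 1,
  // Limit on the number of tokens included in the output file.
  maxTokens: 2000000,
};

console.log(docsConfig.outputFileName);
```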

Run your crawler

npm start

Alternative methods

To obtain the output.json with a containerized execution, go into the containerapp directory and modify config.ts as shown above. The output.json file should be generated in the data folder. Note: the outputFileName property in the config.ts file in the containerapp directory is configured to work with the container.

Running as an API

To run the app as an API server, first run npm install to install the dependencies. The server is written in Express JS.

To start the server, run npm run start:server. The server runs on port 3000 by default.

To run the crawler, send a POST request to the /crawl endpoint with the config JSON as the request body. The API docs are served with Swagger at the /api-docs endpoint.
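
A request to the /crawl endpoint could be put together roughly as follows. This is a sketch under stated assumptions: buildCrawlRequest and CrawlConfig are hypothetical names, the body is assumed to have the same shape as the config shown earlier, and the JSON content type header is an assumption. The shape of the server's response is not described here, so the snippet only builds the request.

```typescript
// Assumed to mirror the required fields of the crawler config.
type CrawlConfig = {
  url: string;
  match: string;
  selector: string;
  maxPagesToCrawl: number;
  outputFileName: string;
};

type CrawlRequest = {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
};

// Hypothetical helper: builds the POST request for the /crawl endpoint.
function buildCrawlRequest(baseUrl: string, config: CrawlConfig): CrawlRequest {
  return {
    // Strip trailing slashes so the path is joined exactly once.
    url: `${baseUrl.replace(/\/+$/, "")}/crawl`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(config),
    },
  };
}

const request = buildCrawlRequest("http://localhost:3000", {
  url: "https://www.builder.io/c/docs/developers",
  match: "https://www.builder.io/c/docs/**",
  selector: ".docs-builder-container",
  maxPagesToCrawl: 50,
  outputFileName: "output.json",
});

console.log(request.url); // http://localhost:3000/crawl

// To send it, the server must be running and a global fetch must be available
// (Node.js 18 or later):
// const response = await fetch(request.url, request.init);
```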

To change the server's environment, copy .env.example to .env and set your own values (such as the port) to override the server's default variables.

Upload your data to OpenAI

The crawl will generate a file called output.json at the root of this project. Upload that to OpenAI to create your custom assistant or custom GPT.

Create a custom GPT

Use this option for UI access to your generated knowledge that you can easily share with others.

Note: you may need a paid ChatGPT plan to create and use custom GPTs right now

  1. Go to https://chat.openai.com/
  2. Click your name in the bottom left corner
  3. Choose "My GPTs" in the menu
  4. Choose "Create a GPT"
  5. Choose "Configure"
  6. Under "Knowledge" choose "Upload a file" and upload the file you generated
  7. If you get an error about the file being too large, you can split the output into multiple files and upload them separately by setting the maxFileSize option in config.ts, or reduce the size of the file by limiting the number of tokens with the maxTokens option in config.ts.
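
The idea behind splitting an oversized output into several files can be sketched as below. This is an illustration of the general technique, not the project's implementation: splitBySize is a hypothetical helper, the budget is given in bytes rather than megabytes, and items are grouped greedily with a slightly conservative size estimate.

```typescript
// Illustrative only: greedily groups items so that each chunk, serialized as a
// JSON array, stays within a byte budget. Not the crawler's own implementation.
function splitBySize<T>(items: T[], maxBytes: number): T[][] {
  const encoder = new TextEncoder();
  const chunks: T[][] = [];
  let current: T[] = [];
  let currentBytes = 2; // the enclosing "[" and "]"

  for (const item of items) {
    // Serialized size of the item plus one byte for a separating comma.
    const itemBytes = encoder.encode(JSON.stringify(item)).length + 1;
    if (current.length > 0 && currentBytes + itemBytes > maxBytes) {
      chunks.push(current);
      current = [];
      currentBytes = 2;
    }
    // An item larger than the budget still ends up in a chunk of its own.
    current.push(item);
    currentBytes += itemBytes;
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}

const chunks = splitBySize(["aaaa", "bbbb", "cccc"], 16);
console.log(chunks.length); // 2
```

Each resulting chunk could then be written to its own file and uploaded separately.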

Gif of how to upload a custom GPT

Create a custom assistant

Use this option for API access to your generated knowledge that you can integrate into your product.

  1. Go to https://platform.openai.com/assistants
  2. Click "+ Create"
  3. Choose "upload" and upload the file you generated

Gif of how to upload to an assistant

Contributing

Know how to make this project better? Send a PR!



Made with love by Builder.io

More Repositories

  1. qwik (TypeScript, 20,052 stars): Instant-loading web apps, without effort
  2. partytown (TypeScript, 12,971 stars): Relocate resource intensive third-party scripts off of the main thread and into a web worker. 🎉
  3. mitosis (TypeScript, 12,266 stars): Write components once, run everywhere. Compiles to React, Vue, Qwik, Solid, Angular, Svelte, and more.
  4. builder (TypeScript, 7,258 stars): Visual Development for React, Vue, Svelte, Qwik, and more
  5. ai-shell (TypeScript, 4,068 stars): A CLI that converts natural language to shell commands.
  6. figma-html (TypeScript, 3,117 stars): Builder.io for Figma: AI generation, export to code, import from web
  7. micro-agent (TypeScript, 2,663 stars): An AI agent that writes (actually useful) code for you
  8. gpt-assistant (TypeScript, 512 stars): An experiment to give an autonomous GPT agent access to a browser and have it accomplish tasks
  9. hydration-overlay (CSS, 477 stars): Overlay for hydration errors with explicit diff between renders.
  10. framework-benchmarks (TypeScript, 473 stars): Test each framework for it's performance cost
  11. nextjs-shopify (TypeScript, 441 stars): The ultimate starter for headless Shopify stores
  12. vscode (TypeScript, 174 stars): Builder.io for VSCode - turn designs into code!
  13. SSDiff (TypeScript, 139 stars)
  14. build. (TypeScript, 128 stars): A new visual programming language that reads and writes Typescript and Javascript
  15. builder-shopify-hydrogen (TypeScript, 77 stars): Builder.io Visual CMS + page builder example with Shopify Hydrogen
  16. gatsby-starter-builder (JavaScript, 68 stars): Gatsby example with drag and drop page building
  17. nextjs-edge-personalization-ab-testing (TypeScript, 58 stars): High performance personalization & a/b testing example using Next.js, Edge Middleware, and Builder.io
  18. snap (TypeScript, 50 stars): The fastest web framework
  19. ts-lite (JavaScript, 50 stars): Compiled TypeScript. Generates Go, Swift, Kotlin, WASM, Binary
  20. gatsby-builder-shopify (TypeScript, 32 stars): A starter for Gatsby + Shopify + Builder.io
  21. qwik-city-build (JavaScript, 25 stars): `@builder.io/qwik-city` build artifacts from https://github.com/organizations/BuilderIO/qwik
  22. qwik-tw-vercel-starter-kit (TypeScript, 19 stars): A starter kit for Qwik on Vercel
  23. demo-editor (JavaScript, 12 stars)
  24. nextjs-builder-edge-personalization (TypeScript, 11 stars)
  25. nextjs-builder-starter (TypeScript, 11 stars)
  26. edge-personalize (TypeScript, 9 stars): Personalize and a/b test your static pages at the edge. Static speed with dynamic optimizations!
  27. headlessapp.store (TypeScript, 8 stars)
  28. react-design-system-demo (JavaScript, 8 stars)
  29. sfcc-composable-storefront-example (JavaScript, 8 stars): SFCC + Builder.io Composable Storefront
  30. qwik-city-e2e (HTML, 6 stars): Use to test Qwik City on each server
  31. qwik-docs-es (TypeScript, 6 stars)
  32. this-package-uses-fetch (6 stars)
  33. qwik-build (JavaScript, 6 stars): Build artifacts from https://github.com/organizations/BuilderIO/qwik
  34. blog-example (JavaScript, 6 stars): Builder.io blog example
  35. jsx-qwik-worker-post (JavaScript, 6 stars): A repo showing worker$
  36. perf-experiments (Astro, 6 stars): Performance experiments
  37. http-debug-proxy (JavaScript, 6 stars): This project contains a set up to create a HTTP proxy for the https://cdn.builder.io/ endpoint.
  38. qwik-react-framer-motion (TypeScript, 5 stars): A demo for React Framer Motion inside a Qwik application
  39. resumable-react-post (TypeScript, 5 stars): The code for the blog post: "Resumable React: How To Use React Inside Qwik"
  40. builder-fiddle-demos (TypeScript, 5 stars): Demos of fun stuff for Builder fiddles
  41. nextjs-app-router-example (TypeScript, 5 stars)
  42. vcp-design-systems-examples (TypeScript, 5 stars): Starters for experimenting with VCP and different design systems
  43. builder-qwik-example (TypeScript, 4 stars): An example project using Builder.io's drag and drop headless CMS with Qwik
  44. personalization-utils (TypeScript, 3 stars)
  45. mitosis-build (3 stars): Build artifacts from https://github.com/BuilderIO/mitosis
  46. qwik-create-cli-build (JavaScript, 3 stars)
  47. kibocommerce-nextjs-starter (TypeScript, 3 stars): A KiboCommerce + Builder.io store built on NextJS
  48. builder-swift (Swift, 3 stars): Swift SDK for Builder.io
  49. nextjs-elasticpath (TypeScript, 2 stars)
  50. qwik-raw-data (1 star)
  51. qase-for-qwik (TypeScript, 1 star): A demo repo to showcase Qwik
  52. gatsby-builder-transform-images (JavaScript, 1 star)
  53. block-publish (JavaScript, 1 star): CLI tool to block directly using npm publish
  54. qwik-docs (JavaScript, 1 star): WIP Qwik-docs
  55. unified-demo (TypeScript, 1 star): Unified Demo of builder use cases with next.js app router
  56. qwik-labs-build (JavaScript, 1 star): Continues build artifacts