# PlainTextWikipedia

Convert Wikipedia database dumps into plain text (JSON) files. This can parse literally all of Wikipedia with pretty high fidelity. There's a copy available on Kaggle Datasets.
## Quick Start
- Download and unzip a Wikipedia dump (see Data Sources below); make sure you get a monolithic XML file
- Open up `wiki_to_text.py` and edit the filename to point at your XML file. Also update the savedir location (see the sketch after this list)
- Run `wiki_to_text.py` - it should take about 2.5 days to run, with some variation based on your CPU and storage speed
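For orientation, here is a minimal sketch of the same streaming idea: pull `<page>` elements out of the dump with `iterparse` and write title/text pairs to chunked JSON files. The variable names, chunk size, output layout, and namespace string are illustrative assumptions, not the actual contents of `wiki_to_text.py`; the real script also converts the wikitext markup to plain text, which this sketch skips.

```python
# Minimal sketch of streaming a dump to JSON -- not the actual wiki_to_text.py.
# FILENAME, SAVEDIR, CHUNK, and the namespace are assumptions for illustration.
import json
import os
import xml.etree.ElementTree as ET

FILENAME = "simplewiki.xml"   # path to the extracted monolithic XML dump
SAVEDIR = "wiki_json"         # output directory for the JSON files
NS = "{http://www.mediawiki.org/xml/export-0.10/}"  # namespace varies by dump version
CHUNK = 100_000               # articles per output file

os.makedirs(SAVEDIR, exist_ok=True)

def flush(buf, idx):
    # Write one chunk of articles as a JSON list.
    with open(os.path.join(SAVEDIR, f"wiki_{idx}.json"), "w", encoding="utf-8") as f:
        json.dump(buf, f)

# iterparse streams the file, so an 80 GB dump never has to fit in RAM.
context = ET.iterparse(FILENAME, events=("start", "end"))
_, root = next(context)       # grab the root element so we can clear it as we go
buffer, n_files = [], 0

for event, elem in context:
    if event == "end" and elem.tag == NS + "page":
        buffer.append({
            "title": elem.findtext(NS + "title"),
            "text": elem.findtext(f"{NS}revision/{NS}text") or "",
        })
        root.clear()          # drop processed pages so memory stays bounded
        if len(buffer) >= CHUNK:
            flush(buffer, n_files)
            buffer, n_files = [], n_files + 1

if buffer:                    # write the final partial chunk
    flush(buffer, n_files)
```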
## Data Sources
There are two primary data sources you'll want to use; see the table below for the root URLs.
| Name | Description | Link |
|---|---|---|
| Simplified English Wikipedia | This is only about 1 GB and therefore is a great test set | https://dumps.wikimedia.org/simplewiki/ |
| English Wikipedia | This is all of Wikipedia, so about 80 GB unpacked | https://dumps.wikimedia.org/enwiki/ |
Navigate into the latest dump. You're likely looking for the very first file in the download section. They will look something like this:
```
enwiki-20210401-pages-articles-multistream.xml.bz2      18.1 GB
simplewiki-20210401-pages-articles-multistream.xml.bz2  203.5 MB
```
Download and extract these to a storage directory. I usually shorten the folder name and filename.
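If you'd rather not install a separate bzip2 tool, Python's standard library can do the extraction. This is just one way to do it; the filenames are the example names from above, shortened per the tip:

```python
# Stream-decompress the .bz2 dump to a monolithic XML file.
# Python's bz2 module handles the multistream format these dumps use.
import bz2
import shutil

SRC = "simplewiki-20210401-pages-articles-multistream.xml.bz2"
DST = "simplewiki.xml"  # shortened output name

with bz2.open(SRC, "rb") as src, open(DST, "wb") as dst:
    shutil.copyfileobj(src, dst)  # copies in chunks, so memory use stays small
```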
## Legal
https://en.wikipedia.org/wiki/Wikipedia:Reusing_Wikipedia_content
Wikipedia is published under the Creative Commons Attribution-ShareAlike license (CC BY-SA).
My script is published under the MIT license, but this does not confer the same privileges to the material you convert with it.