  • Stars: 605
  • Rank 74,072 (Top 2 %)
  • Language: Python
  • Created over 9 years ago
  • Updated 11 months ago

Repository Details

📇 Extract social media profiles and more with regular expressions

Regular Expressions to Match Social Media Profiles

This repository lists regular expressions to match and extract information from URLs of social media profiles. For example, if you find a hyperlink to this repo somewhere on the web, e.g. https://github.com/lorey/social-media-profiles-regexs/, the regular expressions in this repo allow you to find out that it's a GitHub link pointing to a repo, and to extract the username lorey and the repo name social-media-profiles-regexs from the URL.
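
As a minimal sketch of that example in Python, the repo pattern from the github section can be applied with the standard re module. The helper name parse_github_repo is made up for illustration and is not part of this repository.

```python
import re

# The github repo pattern, copied verbatim from the github section of this document.
GITHUB_REPO = re.compile(
    r"(?:https?:)?\/\/(?:www\.)?github\.com\/(?P<login>[A-z0-9_-]+)\/(?P<repo>[A-z0-9_-]+)\/?"
)


def parse_github_repo(url):
    """Return (login, repo) if the URL looks like a GitHub repo link, else None.

    Hypothetical helper for illustration; not an API of this repository.
    """
    match = GITHUB_REPO.search(url)
    if match is None:
        return None
    return match.group("login"), match.group("repo")


print(parse_github_repo("https://github.com/lorey/social-media-profiles-regexs/"))
```

The named groups login and repo carry the extracted data, so no request to the URL is needed.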

Features:

  • detect the platform a URL points to (all major platforms supported)
  • extract the information contained within the URL (without opening the URL, of course)
  • extract emails and phone numbers from hyperlinks

Please note: If you want to extract social media links, depending on your case, there are possibly easier ways:

  • I've created a Python library called socials that uses these expressions to automate URL detection and data extraction. You input the URLs, and it extracts the type of platform as well as the contained information, e.g. the linked social media profiles.
  • There's also a Socials API, which makes the socials Python package available via REST and JSON. You can use it for free at socials.karllorey.com or deploy it yourself. You simply input any URL you want to extract profiles from, and it fetches and returns all social media links from the given website.

If you're missing a particular platform, please feel free to add it. Also feel free to add a test that currently fails. An explanation of how this repo works can be found in CONTRIBUTING.md. You can also open an issue, of course; I'm happy to help!

angellist

company

(?:https?:)?\/\/angel\.co\/company\/(?P<company>[A-z0-9_-]+)(?:\/(?P<company_subpage>[A-z0-9-]+))?

Examples:

job

(?:https?:)?\/\/angel\.co\/company\/(?P<company>[A-z0-9_-]+)\/jobs\/(?P<job_permalink>(?P<job_id>[0-9]+)-(?P<job_slug>[A-z0-9-]+))

Examples:

user

(?:https?:)?\/\/angel\.co\/(?P<type>u|p)\/(?P<user>[A-z0-9_-]+)

There are root-level direct links to users, e.g. angel.co/karllorey, that now get redirected to these new user links. Sometimes it's /p/, sometimes it's /u/; I haven't figured out why.

Examples:

crunchbase

company

(?:https?:)?\/\/crunchbase\.com\/organization\/(?P<organization>[A-z0-9_-]+)

Examples:

person

(?:https?:)?\/\/crunchbase\.com\/person\/(?P<person>[A-z0-9_-]+)

Examples:

email

mailto

(?:mailto:)?(?P<email>[A-z0-9_.+-]+@[A-z0-9_.-]+\.[A-z]+)

This matches plain emails and mailto hyperlinks. This regex is intended for scraping, not for validation. See why: "Your email validation logic is wrong".
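
A minimal sketch of both points, assuming Python's re module; find_emails is a hypothetical helper and the addresses are made up. It collects addresses from a snippet of HTML, then shows that an address with a doubled dot in the domain is still accepted.

```python
import re

# The mailto pattern, copied verbatim from this section.
EMAIL = re.compile(r"(?:mailto:)?(?P<email>[A-z0-9_.+-]+@[A-z0-9_.-]+\.[A-z]+)")


def find_emails(text):
    """Collect every address-like string in text (hypothetical helper)."""
    return [m.group("email") for m in EMAIL.finditer(text)]


# Works for mailto hyperlinks and for plain addresses alike.
html = '<a href="mailto:jane.doe@example.com">Mail</a> or john@example.org'
print(find_emails(html))

# Scraping, not validation: a doubled dot in the domain still matches.
print(EMAIL.fullmatch("jane@example..com") is not None)
```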

Examples:

facebook

profile

(?:https?:)?\/\/(?:www\.)?(?:facebook|fb)\.com\/(?P<profile>(?![A-z]+\.php)(?!marketplace|gaming|watch|me|messages|help|search|groups)[A-z0-9_\-\.]+)\/?

A profile can be a page, a user profile, or something else. Since Facebook redirects these URLs to all kinds of objects (users, pages, events, and so on), you have to verify that it's actually a user. See https://developers.facebook.com/docs/graph-api/reference/profile
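
The sketch below, assuming Python's re module, shows how the two negative lookaheads in this pattern behave. The helper facebook_profile and the profile name some.page are made up for illustration.

```python
import re

# The facebook profile pattern, copied verbatim from this section
# (split across lines only for readability).
FACEBOOK_PROFILE = re.compile(
    r"(?:https?:)?\/\/(?:www\.)?(?:facebook|fb)\.com\/"
    r"(?P<profile>(?![A-z]+\.php)"
    r"(?!marketplace|gaming|watch|me|messages|help|search|groups)"
    r"[A-z0-9_\-\.]+)\/?"
)


def facebook_profile(url):
    """Return the profile slug, or None if the URL is not a profile link."""
    match = FACEBOOK_PROFILE.search(url)
    return match.group("profile") if match else None


print(facebook_profile("https://www.facebook.com/some.page"))        # slug is extracted
print(facebook_profile("https://www.facebook.com/marketplace"))      # reserved path, no match
print(facebook_profile("https://www.facebook.com/profile.php?id=4")) # *.php pages, no match
print(facebook_profile("https://www.facebook.com/melinda"))          # starts with "me", no match
```

Note that the reserved-word lookahead only checks how the path begins, so a slug that merely starts with one of the listed words (such as "me") is rejected as well.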

Examples:

profile by id

(?:https?:)?\/\/(?:www\.)facebook.com\/(?:profile.php\?id=)?(?P<id>[0-9]+)

Examples:

github

repo

(?:https?:)?\/\/(?:www\.)?github\.com\/(?P<login>[A-z0-9_-]+)\/(?P<repo>[A-z0-9_-]+)\/?

Excludes subdomains other than www, as these sometimes redirect to GitHub Pages.

Examples:

user

(?:https?:)?\/\/(?:www\.)?github\.com\/(?P<login>[A-z0-9_-]+)\/?

Excludes subdomains other than www, as these sometimes redirect to GitHub Pages.

Examples:

google plus

user id

(?:https?:)?\/\/plus\.google\.com\/(?P<id>[0-9]{21})

Matches profile numbers with exactly 21 digits.

Examples:

username

(?:https?:)?\/\/plus\.google\.com\/\+(?P<username>[A-z0-9+]+)

Matches username.

Examples:

hackernews

item

(?:https?:)?\/\/news\.ycombinator\.com\/item\?id=(?P<item>[0-9]+)

An item can be a post or a direct link to a comment.

Examples:

user

(?:https?:)?\/\/news\.ycombinator\.com\/user\?id=(?P<user>[A-z0-9_-]+)

Examples:

instagram

profile

(?:https?:)?\/\/(?:www\.)?(?:instagram\.com|instagr\.am)\/(?P<username>[A-Za-z0-9_](?:(?:[A-Za-z0-9_]|(?:\.(?!\.))){0,28}(?:[A-Za-z0-9_]))?)

The rules:

  • A single . is matched (disco.dude), but two consecutive .. are not (disco..dude)
  • A trailing period is not matched (discodude.)
  • Underscores are matched (_disco__dude)
  • At most 30 characters are matched (only the first 30 of 1234567890123456789012345678901234567890)
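
These rules can be checked with a short sketch, assuming Python's re module. The helper username is hypothetical; it prepends https://instagram.com/ to the given path and returns what the pattern captures.

```python
import re

# The instagram profile pattern, copied verbatim from this section.
INSTAGRAM_PROFILE = re.compile(
    r"(?:https?:)?\/\/(?:www\.)?(?:instagram\.com|instagr\.am)\/"
    r"(?P<username>[A-Za-z0-9_](?:(?:[A-Za-z0-9_]|(?:\.(?!\.))){0,28}(?:[A-Za-z0-9_]))?)"
)


def username(path):
    """Return the captured username for https://instagram.com/<path>."""
    match = INSTAGRAM_PROFILE.search("https://instagram.com/" + path)
    return match.group("username") if match else None


print(username("disco.dude"))    # single period is kept
print(username("disco..dude"))   # match stops before the double period
print(username("discodude."))    # trailing period is dropped
print(username("_disco__dude"))  # underscores are kept
print(len(username("1234567890123456789012345678901234567890")))  # capped at 30
```

For names that break a rule, the pattern does not reject the URL; it captures the longest valid prefix instead.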

Examples:

linkedin

company

(?:https?:)?\/\/(?:[\w]+\.)?linkedin\.com\/(?P<company_type>(company)|(school))\/(?P<company_permalink>[A-z0-9-À-ÿ\.]+)\/?

This matches companies and schools. Permalink is an integer id or a slug. The id permalinks redirect to the slug permalinks as soon as one is set. Permalinks can contain special characters. Recently, company links that are actually schools get redirected to newly introduced /school/ permalinks, see the university example below.

Examples:

post

(?:https?:)?\/\/(?:[\w]+\.)?linkedin\.com\/feed\/update\/urn:li:activity:(?P<activity_id>[0-9]+)\/?

Direct link to a LinkedIn post, only contains a post id.

Examples:

profile

(?:https?:)?\/\/(?:[\w]+\.)?linkedin\.com\/in\/(?P<permalink>[\w\-\_À-ÿ%]+)\/?

These are the currently used, most common URLs, which contain /in/.

Examples:

profile_pub

(?:https?:)?\/\/(?:[\w]+\.)?linkedin\.com\/pub\/(?P<permalink_pub>[A-z0-9_-]+)(?:\/[A-z0-9]+){3}\/?

These are old public URLs that are no longer used; more info at Quora.

Examples:

medium

post

(?:https?:)?\/\/medium\.com\/(?:(?:@(?P<username>[A-z0-9]+))|(?P<publication>[a-z-]+))\/(?P<slug>[a-z0-9\-]+)-(?P<post_id>[A-z0-9]+)(?:\?.*)?

Examples:

post of subdomain publication

(?:https?:)?\/\/(?P<publication>(?!www)[a-z-]+)\.medium\.com\/(?P<slug>[a-z0-9\-]+)-(?P<post_id>[A-z0-9]+)(?:\?.*)?

These can't be matched with the regular post regex, as redefining a named group is not allowed in Python's regular expressions.
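
The limitation is easy to reproduce with Python's re module. The merged pattern below is a simplified, made-up combination of the two publication forms, used only to trigger the error; it is not a pattern from this repository.

```python
import re

# Two alternatives that both define a group called "publication",
# which is what merging the two medium post patterns would require.
merged = r"(?P<publication>[a-z-]+)\.medium\.com|medium\.com\/(?P<publication>[a-z-]+)"

try:
    re.compile(merged)
    compiled = True
except re.error as exc:
    # Python refuses to compile a pattern that reuses a group name.
    compiled = False
    print(f"re.error: {exc}")
```

Keeping the subdomain form as a separate pattern avoids the clash.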

Examples:

user

(?:https?:)?\/\/medium\.com\/@(?P<username>[A-z0-9]+)(?:\?.*)?

Examples:

user by id

(?:https?:)?\/\/medium\.com\/u\/(?P<user_id>[A-z0-9]+)(?:\?.*)

Now redirects to new user profiles. Follow the redirect with a HEAD or GET request.

Examples:

phone

phone number

(?:tel|phone|mobile):(?P<number>\+?[0-9. -]+)

Should be cleaned afterwards to strip dots, spaces, etc.

Examples:

  • tel:+49 900 123456
  • tel:+49900123456
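
One possible cleanup step, sketched with Python's re module: clean_number is a hypothetical helper that keeps only digits and the plus sign, so both examples above end up in the same form.

```python
import re

# The phone number pattern, copied verbatim from this section.
PHONE = re.compile(r"(?:tel|phone|mobile):(?P<number>\+?[0-9. -]+)")


def clean_number(href):
    """Extract the number from a tel: style link and strip separators."""
    match = PHONE.search(href)
    if match is None:
        return None
    # Drop everything except digits and the plus sign (dots, spaces, dashes).
    return re.sub(r"[^0-9+]", "", match.group("number"))


print(clean_number("tel:+49 900 123456"))
print(clean_number("tel:+49900123456"))
```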

reddit

user

(?:https?:)?\/\/(?:[a-z]+\.)?reddit\.com\/(?:u(?:ser)?)\/(?P<username>[A-z0-9\-\_]*)\/?

Examples:

skype

profile

(?:(?:callto|skype):)(?P<username>[a-z][a-z0-9\.,\-_]{5,31})(?:\?(?:add|call|chat|sendfile|userinfo))?

Matches Skype's URLs to add contact, call, chat. More info at Skype SDK's docs.

Examples:

  • skype:echo123
  • skype:echo123?call

snapchat

profile

(?:https?:)?\/\/(?:www\.)?snapchat\.com\/add\/(?P<username>[A-z0-9\.\_\-]+)\/?

Examples:

stackexchange

user

(?:https?:)?\/\/(?:www\.)?stackexchange\.com\/users\/(?P<id>[0-9]+)\/(?P<username>[A-z0-9-_.]+)\/?

This is the meta-platform above stackoverflow, etc. The username can be changed at any time, while the id is persistent.

Examples:

stackexchange network

user

(?:https?:)?\/\/(?:(?P<community>[a-z]+(?!www))\.)?stackexchange\.com\/users\/(?P<id>[0-9]+)\/(?P<username>[A-z0-9-_.]+)\/?

While there are some "named" communities in the stackexchange network like stackoverflow, many only exist as subdomains, e.g. gaming.stackexchange.com. Again, the username can be changed at any time, while the id is persistent.
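
A minimal sketch, assuming Python's re module, of reading the community from the subdomain. The helper parse_network_user and the user in the URLs are made up. Since the username can change, the sketch compares users by community and id only.

```python
import re

# The stackexchange network user pattern, copied verbatim from this section.
NETWORK_USER = re.compile(
    r"(?:https?:)?\/\/(?:(?P<community>[a-z]+(?!www))\.)?stackexchange\.com"
    r"\/users\/(?P<id>[0-9]+)\/(?P<username>[A-z0-9-_.]+)\/?"
)


def parse_network_user(url):
    """Return the named groups as a dict, or None if nothing matches."""
    match = NETWORK_USER.search(url)
    return match.groupdict() if match else None


before = parse_network_user("https://gaming.stackexchange.com/users/12345/jane-doe")
after = parse_network_user("https://gaming.stackexchange.com/users/12345/jane-smith")
print(before)

# Same community and id: treated as the same user despite the rename.
same_user = (before["community"], before["id"]) == (after["community"], after["id"])
print(same_user)
```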

Examples:

stackoverflow

question

(?:https?:)?\/\/(?:www\.)?stackoverflow\.com\/questions\/(?P<id>[0-9]+)\/(?P<title>[A-z0-9-_.]+)\/?

Examples:

user

(?:https?:)?\/\/(?:www\.)?stackoverflow\.com\/users\/(?P<id>[0-9]+)\/(?P<username>[A-z0-9-_.]+)\/?

The username can be changed at any time, while the id is persistent.

Examples:

telegram

profile

(?:https?:)?\/\/(?:t(?:elegram)?\.me|telegram\.org)\/(?P<username>[a-z0-9\_]{5,32})\/?

Matches for t.me, telegram.me and telegram.org.

Examples:

twitter

status

(?:https?:)?\/\/(?:[A-z]+\.)?twitter\.com\/@?(?P<username>[A-z0-9_]+)\/status\/(?P<tweet_id>[0-9]+)\/?

Examples:

user

(?:https?:)?\/\/(?:[A-z]+\.)?twitter\.com\/@?(?!home|share|privacy|tos)(?P<username>[A-z0-9_]+)\/?

Usernames may contain alphanumeric characters and underscores.

Examples:

vimeo

user

(?:https?:)?\/\/vimeo\.com\/user(?P<id>[0-9]+)

Examples:

video

(?:https?:)?\/\/(?:(?:www)?vimeo\.com|player.vimeo.com\/video)\/(?P<id>[0-9]+)

Examples:

xing

profile

(?:https?:)?\/\/(?:www\.)?xing.com\/profile\/(?P<slug>[A-z0-9-\_]+)

Default slugs are Firstname_Lastname. If several people with the same name exist, a number is appended.

Examples:

youtube

channel

(?:https?:)?\/\/(?:[A-z]+\.)?youtube.com\/channel\/(?P<id>[A-z0-9-\_]+)\/?

Examples:

user

(?:https?:)?\/\/(?:[A-z]+\.)?youtube.com\/user\/(?P<username>[A-z0-9]+)\/?

Examples:

video

(?:https?:)?\/\/(?:(?:www\.)?youtube\.com\/(?:watch\?v=|embed\/)|youtu\.be\/)(?P<id>[A-z0-9\-\_]+)

Matches youtube video links like https://www.youtube.com/watch?v=dQw4w9WgXcQ and shortlinks like https://youtu.be/dQw4w9WgXcQ
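
As a small sketch with Python's re module, both link styles yield the same video id. The helper name video_id is made up for illustration.

```python
import re

# The youtube video pattern, copied verbatim from this section.
YOUTUBE_VIDEO = re.compile(
    r"(?:https?:)?\/\/(?:(?:www\.)?youtube\.com\/(?:watch\?v=|embed\/)|youtu\.be\/)"
    r"(?P<id>[A-z0-9\-\_]+)"
)


def video_id(url):
    """Return the video id from a watch, embed, or short link, else None."""
    match = YOUTUBE_VIDEO.search(url)
    return match.group("id") if match else None


print(video_id("https://www.youtube.com/watch?v=dQw4w9WgXcQ"))
print(video_id("https://youtu.be/dQw4w9WgXcQ"))
```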

Examples:

Monster Regex

If you want to match and extract the data from all URLs with one regex, use this monster. It returns the data for all the platforms above. The regex subgroups are prefixed with the platform and type, e.g. angellist__company__company instead of just company in the angellist company regex. This is necessary because some regex implementations don't support defining a subgroup name more than once, so reusing the same name in two or more platforms would cause errors.

(?P<angellist__company>(?:https?:)?\/\/angel\.co\/company\/(?P<angellist__company__company>[A-z0-9_-]+)(?:\/(?P<angellist__company__company_subpage>[A-z0-9-]+))?)|(?P<angellist__job>(?:https?:)?\/\/angel\.co\/company\/(?P<angellist__job__company>[A-z0-9_-]+)\/jobs\/(?P<angellist__job__job_permalink>(?P<angellist__job__job_id>[0-9]+)-(?P<angellist__job__job_slug>[A-z0-9-]+)))|(?P<angellist__user>(?:https?:)?\/\/angel\.co\/(?P<angellist__user__type>u|p)\/(?P<angellist__user__user>[A-z0-9_-]+))|(?P<crunchbase__company>(?:https?:)?\/\/crunchbase\.com\/organization\/(?P<crunchbase__company__organization>[A-z0-9_-]+))|(?P<crunchbase__person>(?:https?:)?\/\/crunchbase\.com\/person\/(?P<crunchbase__person__person>[A-z0-9_-]+))|(?P<email__mailto>(?:mailto:)?(?P<email__mailto__email>[A-z0-9_.+-]+@[A-z0-9_.-]+\.[A-z]+))|(?P<facebook__profile>(?:https?:)?\/\/(?:www\.)?(?:facebook|fb)\.com\/(?P<facebook__profile__profile>(?![A-z]+\.php)(?!marketplace|gaming|watch|me|messages|help|search|groups)[A-z0-9_\-\.]+)\/?)|(?P<facebook__profile_by_id>(?:https?:)?\/\/(?:www\.)facebook.com/(?:profile.php\?id=)?(?P<facebook__profile_by_id__id>[0-9]+))|(?P<github__repo>(?:https?:)?\/\/(?:www\.)?github\.com\/(?P<github__repo__login>[A-z0-9_-]+)\/(?P<github__repo__repo>[A-z0-9_-]+)\/?)|(?P<github__user>(?:https?:)?\/\/(?:www\.)?github\.com\/(?P<github__user__login>[A-z0-9_-]+)\/?)|(?P<google_plus__user_id>(?:https?:)?\/\/plus\.google\.com\/(?P<google_plus__user_id__id>[0-9]{21}))|(?P<google_plus__username>(?:https?:)?\/\/plus\.google\.com\/\+(?P<google_plus__username__username>[A-z0-9+]+))|(?P<hackernews__item>(?:https?:)?\/\/news\.ycombinator\.com\/item\?id=(?P<hackernews__item__item>[0-9]+))|(?P<hackernews__user>(?:https?:)?\/\/news\.ycombinator\.com\/user\?id=(?P<hackernews__user__user>[A-z0-9_-]+))|(?P<instagram__profile>(?:https?:)?\/\/(?:www\.)?(?:instagram\.com|instagr\.am)\/(?P<instagram__profile__username>[A-Za-z0-9_](?:(?:[A-Za-z0-9_]|(?:\.(?!\.))){0,28}(?:[A-Za-z0-9_]))?))|(?P<linkedin__company>(?:https?:)?\/\/(?:[\w]+\.)?linkedin\.com\/(?P<linkedin__company__company_type>(company)|(school))\/(?P<linkedin__company__company_permalink>[A-z0-9-À-ÿ\.]+)\/?)|(?P<linkedin__post>(?:https?:)?\/\/(?:[\w]+\.)?linkedin\.com\/feed\/update\/urn:li:activity:(?P<linkedin__post__activity_id>[0-9]+)\/?)|(?P<linkedin__profile>(?:https?:)?\/\/(?:[\w]+\.)?linkedin\.com\/in\/(?P<linkedin__profile__permalink>[\w\-\_À-ÿ%]+)\/?)|(?P<linkedin__profile_pub>(?:https?:)?\/\/(?:[\w]+\.)?linkedin\.com\/pub\/(?P<linkedin__profile_pub__permalink_pub>[A-z0-9_-]+)(?:\/[A-z0-9]+){3}\/?)|(?P<medium__post>(?:https?:)?\/\/medium\.com\/(?:(?:@(?P<medium__post__username>[A-z0-9]+))|(?P<medium__post__publication>[a-z-]+))\/(?P<medium__post__slug>[a-z0-9\-]+)-(?P<medium__post__post_id>[A-z0-9]+)(?:\?.*)?)|(?P<medium__post_of_subdomain_publication>(?:https?:)?\/\/(?P<medium__post_of_subdomain_publication__publication>(?!www)[a-z-]+)\.medium\.com\/(?P<medium__post_of_subdomain_publication__slug>[a-z0-9\-]+)-(?P<medium__post_of_subdomain_publication__post_id>[A-z0-9]+)(?:\?.*)?)|(?P<medium__user>(?:https?:)?\/\/medium\.com\/@(?P<medium__user__username>[A-z0-9]+)(?:\?.*)?)|(?P<medium__user_by_id>(?:https?:)?\/\/medium\.com\/u\/(?P<medium__user_by_id__user_id>[A-z0-9]+)(?:\?.*))|(?P<phone__phone_number>(?:tel|phone|mobile):(?P<phone__phone_number__number>\+?[0-9. -]+))|(?P<reddit__user>(?:https?:)?\/\/(?:[a-z]+\.)?reddit\.com\/(?:u(?:ser)?)\/(?P<reddit__user__username>[A-z0-9\-\_]*)\/?)|(?P<skype__profile>(?:(?:callto|skype):)(?P<skype__profile__username>[a-z][a-z0-9\.,\-_]{5,31})(?:\?(?:add|call|chat|sendfile|userinfo))?)|(?P<snapchat__profile>(?:https?:)?\/\/(?:www\.)?snapchat\.com\/add\/(?P<snapchat__profile__username>[A-z0-9\.\_\-]+)\/?)|(?P<stackexchange__user>(?:https?:)?\/\/(?:www\.)?stackexchange\.com\/users\/(?P<stackexchange__user__id>[0-9]+)\/(?P<stackexchange__user__username>[A-z0-9-_.]+)\/?)|(?P<stackexchange_network__user>(?:https?:)?\/\/(?:(?P<stackexchange_network__user__community>[a-z]+(?!www))\.)?stackexchange\.com\/users\/(?P<stackexchange_network__user__id>[0-9]+)\/(?P<stackexchange_network__user__username>[A-z0-9-_.]+)\/?)|(?P<stackoverflow__question>(?:https?:)?\/\/(?:www\.)?stackoverflow\.com\/questions\/(?P<stackoverflow__question__id>[0-9]+)\/(?P<stackoverflow__question__title>[A-z0-9-_.]+)\/?)|(?P<stackoverflow__user>(?:https?:)?\/\/(?:www\.)?stackoverflow\.com\/users\/(?P<stackoverflow__user__id>[0-9]+)\/(?P<stackoverflow__user__username>[A-z0-9-_.]+)\/?)|(?P<telegram__profile>(?:https?:)?\/\/(?:t(?:elegram)?\.me|telegram\.org)\/(?P<telegram__profile__username>[a-z0-9\_]{5,32})\/?)|(?P<twitter__status>(?:https?:)?\/\/(?:[A-z]+\.)?twitter\.com\/@?(?P<twitter__status__username>[A-z0-9_]+)\/status\/(?P<twitter__status__tweet_id>[0-9]+)\/?)|(?P<twitter__user>(?:https?:)?\/\/(?:[A-z]+\.)?twitter\.com\/@?(?!home|share|privacy|tos)(?P<twitter__user__username>[A-z0-9_]+)\/?)|(?P<vimeo__user>(?:https?:)?\/\/vimeo\.com\/user(?P<vimeo__user__id>[0-9]+))|(?P<vimeo__video>(?:https?:)?\/\/(?:(?:www)?vimeo\.com|player.vimeo.com\/video)\/(?P<vimeo__video__id>[0-9]+))|(?P<xing__profile>(?:https?:)?\/\/(?:www\.)?xing.com\/profile\/(?P<xing__profile__slug>[A-z0-9-\_]+))|(?P<youtube__channel>(?:https?:)?\/\/(?:[A-z]+\.)?youtube.com\/channel\/(?P<youtube__channel__id>[A-z0-9-\_]+)\/?)|(?P<youtube__user>(?:https?:)?\/\/(?:[A-z]+\.)?youtube.com\/user\/(?P<youtube__user__username>[A-z0-9]+)\/?)|(?P<youtube__video>(?:https?:)?\/\/(?:(?:www\.)?youtube\.com\/(?:watch\?v=|embed\/)|youtu\.be\/)(?P<youtube__video__id>[A-z0-9\-\_]+))
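
To illustrate the prefixing scheme, here is a small sketch in Python that builds a combined regex from just two of the patterns above. This is one way such a regex could be assembled, not necessarily how this repository generates it; the names PATTERNS, prefixed, COMBINED, and extract, as well as the URLs, are made up for illustration.

```python
import re

# Two single-platform patterns copied from this document. Both use a group
# named "id", which would clash if they were joined without prefixes.
PATTERNS = {
    ("stackoverflow", "user"): (
        r"(?:https?:)?\/\/(?:www\.)?stackoverflow\.com\/users\/"
        r"(?P<id>[0-9]+)\/(?P<username>[A-z0-9-_.]+)\/?"
    ),
    ("vimeo", "user"): r"(?:https?:)?\/\/vimeo\.com\/user(?P<id>[0-9]+)",
}


def prefixed(platform, kind, pattern):
    """Wrap a pattern in a platform__kind group and prefix its inner groups."""
    prefix = f"{platform}__{kind}"
    inner = re.sub(
        r"\(\?P<(\w+)>",
        lambda m: f"(?P<{prefix}__{m.group(1)}>",
        pattern,
    )
    return f"(?P<{prefix}>{inner})"


COMBINED = re.compile(
    "|".join(prefixed(p, k, rx) for (p, k), rx in PATTERNS.items())
)


def extract(url):
    """Return only the groups that took part in the match."""
    match = COMBINED.search(url)
    if match is None:
        return {}
    return {name: value for name, value in match.groupdict().items() if value is not None}


print(extract("https://vimeo.com/user12345"))
print(extract("https://stackoverflow.com/users/1/jane-doe"))
```

The outer group (e.g. vimeo__user) tells you which platform and type matched, and the inner groups carry the extracted values.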
