Twitter scraper selenium
Python package to scrape Twitter's front-end easily with Selenium.
Table of Contents
- Getting Started
- Usage
- Privacy
- License
Prerequisites
Installation
Installing from the source
Download the source code or clone it with:
git clone https://github.com/shaikhsajid1111/twitter-scraper-selenium
Open terminal inside the downloaded folder:
python3 setup.py install
PyPI
Installing with pip:
pip3 install twitter-scraper-selenium
Usage
Available Functions in this Package - Summary
Function Name | Function Description | Scraping Method | Scraping Speed |
---|---|---|---|
scrape_profile() | Scrapes tweets from a Twitter user's profile. | Browser Automation | Slow |
scrape_keyword() | Scrapes tweets using the provided keyword. | Browser Automation | Slow |
scrape_topic() | Scrapes tweets by URL. It expects the URL of the topic. | Browser Automation | Slow |
scrape_keyword_with_api() | Scrapes tweets by query/keywords. For an advanced search, the query can be built from here. | HTTP Request | Fast |
get_profile_details() | Scrapes Twitter user details. | HTTP Request | Fast |
scrape_topic_with_api() | Scrapes tweets by URL. It expects the URL of the topic. | Browser Automation & HTTP Request | Fast |
scrape_profile_with_api() | Scrapes tweets by Twitter profile username. It expects the username of the profile. | Browser Automation & HTTP Request | Fast |
Note: The HTTP Request method sends requests directly to Twitter's API to scrape data, while the Browser Automation method visits the page and scrolls while collecting the data.
To scrape Twitter profile details:
from twitter_scraper_selenium import get_profile_details
twitter_username = "TwitterAPI"
filename = "twitter_api_data"
get_profile_details(twitter_username=twitter_username, filename=filename)
Output:
{
"id": 6253282,
"id_str": "6253282",
"name": "Twitter API",
"screen_name": "TwitterAPI",
"location": "San Francisco, CA",
"profile_location": null,
"description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
"url": "https:\/\/t.co\/8IkCzCDr19",
"entities": {
"url": {
"urls": [{
"url": "https:\/\/t.co\/8IkCzCDr19",
"expanded_url": "https:\/\/developer.twitter.com",
"display_url": "developer.twitter.com",
"indices": [
0,
23
]
}]
},
"description": {
"urls": []
}
},
"protected": false,
"followers_count": 6133636,
"friends_count": 12,
"listed_count": 12936,
"created_at": "Wed May 23 06:01:13 +0000 2007",
"favourites_count": 31,
"utc_offset": null,
"time_zone": null,
"geo_enabled": null,
"verified": true,
"statuses_count": 3656,
"lang": null,
"contributors_enabled": null,
"is_translator": null,
"is_translation_enabled": null,
"profile_background_color": null,
"profile_background_image_url": null,
"profile_background_image_url_https": null,
"profile_background_tile": null,
"profile_image_url": null,
"profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
"profile_banner_url": null,
"profile_link_color": null,
"profile_sidebar_border_color": null,
"profile_sidebar_fill_color": null,
"profile_text_color": null,
"profile_use_background_image": null,
"has_extended_profile": null,
"default_profile": false,
"default_profile_image": false,
"following": null,
"follow_request_sent": null,
"notifications": null,
"translator_type": null
}
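Once the profile details are saved, the JSON can be post-processed with plain Python. The snippet below is a minimal sketch that works on an inlined, truncated sample of the output shown above; in practice you would load the dict from the file written by get_profile_details (the .json extension in the comment is an assumption):

```python
import json

# Inlined (truncated) sample of the get_profile_details() output shown above.
# In practice, load it from the saved file instead, e.g.:
#   with open("twitter_api_data.json") as f:
#       profile = json.load(f)
profile = {
    "screen_name": "TwitterAPI",
    "followers_count": 6133636,
    "friends_count": 12,
    "verified": True,
}

# Derive a simple follower-to-following ratio from the scraped counts.
ratio = profile["followers_count"] / max(profile["friends_count"], 1)
print(f"@{profile['screen_name']}: {profile['followers_count']:,} followers, "
      f"ratio {ratio:.0f}:1")
```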
get_profile_details()
arguments:
Argument | Argument Type | Description |
---|---|---|
twitter_username | String | Twitter username. |
output_filename | String | Filename where the output should be stored. |
output_dir | String | Directory where the output file should be saved. |
proxy | String | Optional parameter if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
Keys of the output: details of each key can be found here.
To scrape a profile's tweets:
In JSON format:
from twitter_scraper_selenium import scrape_profile
microsoft = scrape_profile(twitter_username="microsoft",output_format="json",browser="firefox",tweets_count=10)
print(microsoft)
Output:
{
"1430938749840629773": {
"tweet_id": "1430938749840629773",
"username": "Microsoft",
"name": "Microsoft",
"profile_picture": "https://twitter.com/Microsoft/photo",
"replies": 29,
"retweets": 58,
"likes": 453,
"is_retweet": false,
"retweet_link": "",
"posted_time": "2021-08-26T17:02:38+00:00",
"content": "Easy to use and efficient for all \u2013 Windows 11 is committed to an accessible future.\n\nHere's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW ",
"hashtags": [],
"mentions": [],
"images": [],
"videos": [],
"tweet_url": "https://twitter.com/Microsoft/status/1430938749840629773",
"link": "https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC"
},...
}
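The returned JSON maps each tweet_id to a tweet dict, so it is easy to post-process. A sketch, assuming scrape_profile() with output_format="json" returns a JSON string like the one printed above (if your version returns a dict instead, skip the json.loads step):

```python
import json

# Truncated sample of the scrape_profile() output shown above.
raw = """{
  "1430938749840629773": {
    "tweet_id": "1430938749840629773",
    "username": "Microsoft",
    "likes": 453,
    "is_retweet": false,
    "content": "Easy to use and efficient for all..."
  }
}"""

tweets = json.loads(raw)  # maps tweet_id -> tweet dict

# Rank tweets by like count, most liked first.
most_liked = sorted(tweets.values(), key=lambda t: t["likes"], reverse=True)
for tweet in most_liked:
    print(tweet["tweet_id"], tweet["likes"], tweet["content"][:40])
```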
In CSV format:
from twitter_scraper_selenium import scrape_profile
scrape_profile(twitter_username="microsoft",output_format="csv",browser="firefox",tweets_count=10,filename="microsoft",directory="/home/user/Downloads")
Output:
tweet_id | username | name | profile_picture | replies | retweets | likes | is_retweet | retweet_link | posted_time | content | hashtags | mentions | images | videos | post_url | link |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1430938749840629773 | Microsoft | Microsoft | https://twitter.com/Microsoft/photo | 64 | 75 | 521 | False | | 2021-08-26T17:02:38+00:00 | Easy to use and efficient for all – Windows 11 is committed to an accessible future. Here's how it empowers everyone to create, connect, and achieve more: https://msft.it/6009X6tbW | [] | [] | [] | [] | https://twitter.com/Microsoft/status/1430938749840629773 | https://blogs.windows.com/windowsexperience/2021/07/01/whats-coming-in-windows-11-accessibility/?ocid=FY22_soc_omc_br_tw_Windows_AC |
...
scrape_profile()
arguments:
Argument | Argument Type | Description |
---|---|---|
twitter_username | String | Twitter username of the account. |
browser | String | Which browser to use for scraping. Only two are supported: Chrome and Firefox. Default is Firefox. |
proxy | String | Optional parameter if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
tweets_count | Integer | Number of posts to scrape. Default is 10. |
output_format | String | The output format, whether JSON or CSV. Default is JSON. |
filename | String | If output_format is set to CSV, the filename parameter must be passed. If not passed, the filename will be the same as the username. |
directory | String | If output_format is set to CSV, the directory parameter may be passed. If not passed, the CSV file will be saved in the current working directory. |
headless | Boolean | Whether to run the crawler headlessly. Default is True. |
Keys of the output
Key | Type | Description |
---|---|---|
tweet_id | String | Post identifier (integer cast to string). |
username | String | Username of the profile. |
name | String | Name of the profile. |
profile_picture | String | Profile picture link. |
replies | Integer | Number of replies to the tweet. |
retweets | Integer | Number of retweets of the tweet. |
likes | Integer | Number of likes of the tweet. |
is_retweet | Boolean | Is the tweet a retweet? |
retweet_link | String | If it is a retweet, the retweet link; otherwise an empty string. |
posted_time | String | Time when the tweet was posted, in ISO 8601 format. |
content | String | Content of the tweet as text. |
hashtags | Array | Hashtags present in the tweet, if any. |
mentions | Array | Mentions present in the tweet, if any. |
images | Array | Image links, if any are present in the tweet. |
videos | Array | Video links, if any are present in the tweet. |
tweet_url | String | URL of the tweet. |
link | String | Any external website link present inside the tweet. |
To scrape tweets using keywords with API:
from twitter_scraper_selenium import scrape_keyword_with_api
query = "#gaming"
tweets_count = 10
output_filename = "gaming_hashtag_data"
scrape_keyword_with_api(query=query, tweets_count=tweets_count, output_filename=output_filename)
Output:
{
"1583821467732480001": {
"tweet_url" : "https://twitter.com/yakubblackbeard/status/1583821467732480001",
"tweet_details":{
...
},
"user_details":{
...
}
}, ...
}
scrape_keyword_with_api()
arguments:
Argument | Argument Type | Description |
---|---|---|
query | String | Query to search. The query can be built from here for advanced search. |
tweets_count | Integer | Number of tweets to scrape. |
output_filename | String | Filename where the output should be stored. |
output_dir | String | Directory where the output file should be saved. |
proxy | String | Optional parameter if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
Keys of the output:
Key | Type | Description |
---|---|---|
tweet_url | String | URL of the tweet. |
tweet_details | Dictionary | A dictionary containing the data about the tweet. All fields available inside can be checked here. |
user_details | Dictionary | A dictionary containing the data about the tweet's author. All fields available inside can be checked here. |
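Because each entry nests tweet_details and user_details, flattening the output into rows is a common first step before analysis. A sketch over an inlined sample; the full_text and screen_name fields inside the nested dicts are hypothetical stand-ins for whatever Twitter's API actually returns:

```python
# Inlined sample shaped like the scrape_keyword_with_api() output above.
data = {
    "1583821467732480001": {
        "tweet_url": "https://twitter.com/yakubblackbeard/status/1583821467732480001",
        "tweet_details": {"full_text": "..."},               # hypothetical field
        "user_details": {"screen_name": "yakubblackbeard"},  # hypothetical field
    },
}

# Flatten each nested entry into a single row keyed by tweet_id.
rows = [
    {
        "tweet_id": tweet_id,
        "tweet_url": entry["tweet_url"],
        "author": entry["user_details"].get("screen_name"),
        "text": entry["tweet_details"].get("full_text"),
    }
    for tweet_id, entry in data.items()
]
print(rows)
```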
To scrape tweets using keywords with browser automation:
In JSON format:
from twitter_scraper_selenium import scrape_keyword
# scrape 10 posts by searching the keyword "india" from 30th August to 31st August
india = scrape_keyword(keyword="india", browser="firefox",
tweets_count=10,output_format="json" ,until="2021-08-31", since="2021-08-30")
print(india)
Output:
{
"1432493306152243200": {
"tweet_id": "1432493306152243200",
"username": "TOICitiesNews",
"name": "TOI Cities",
"profile_picture": "https://twitter.com/TOICitiesNews/photo",
"replies": 0,
"retweets": 0,
"likes": 0,
"is_retweet": false,
"posted_time": "2021-08-30T23:59:53+00:00",
"content": "Paralympians rake in medals, India Inc showers them with rewards",
"hashtags": [],
"mentions": [],
"images": [],
"videos": [],
"tweet_url": "https://twitter.com/TOICitiesNews/status/1432493306152243200",
"link": "https://t.co/odmappLovL?amp=1"
},...
}
In CSV format:
from twitter_scraper_selenium import scrape_keyword
scrape_keyword(keyword="india", browser="firefox",
tweets_count=10, until="2021-08-31", since="2021-08-30",output_format="csv",filename="india")
Output:
tweet_id | username | name | profile_picture | replies | retweets | likes | is_retweet | posted_time | content | hashtags | mentions | images | videos | tweet_url | link |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1432493306152243200 | TOICitiesNews | TOI Cities | https://twitter.com/TOICitiesNews/photo | 0 | 0 | 0 | False | 2021-08-30T23:59:53+00:00 | Paralympians rake in medals, India Inc showers them with rewards | [] | [] | [] | [] | https://twitter.com/TOICitiesNews/status/1432493306152243200 | https://t.co/odmappLovL?amp=1 |
...
scrape_keyword()
arguments:
Argument | Argument Type | Description |
---|---|---|
keyword | String | Keyword to search on Twitter. |
browser | String | Which browser to use for scraping. Only two are supported: Chrome and Firefox. Default is Firefox. |
until | String | Optional parameter. End date where the search stops. Date format is YYYY-MM-DD. |
since | String | Optional parameter. Past date from where the search starts. Date format is YYYY-MM-DD. |
proxy | String | Optional parameter if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
tweets_count | Integer | Number of posts to scrape. Default is 10. |
output_format | String | The output format, whether JSON or CSV. Default is JSON. |
filename | String | If output_format is set to CSV, the filename parameter must be passed. If not passed, the filename will be the same as the keyword. |
directory | String | If output_format is set to CSV, the directory parameter may be passed. If not passed, the CSV file will be saved in the current working directory. |
since_id | Integer | After (NOT inclusive) a specified Snowflake ID. Example here. |
max_id | Integer | At or before (inclusive) a specified Snowflake ID. Example here. |
within_time | String | Search within the last number of days, hours, minutes, or seconds. Examples: 2d, 3h, 5m, 30s. |
headless | Boolean | Whether to run the crawler headlessly. Default is True. |
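Since until and since must be YYYY-MM-DD strings and until must not precede since, it can be worth validating them before launching a slow browser-automation run. A small helper sketch (not part of the library):

```python
from datetime import datetime

def day_range(since: str, until: str) -> int:
    """Validate the YYYY-MM-DD format expected by since/until and return
    how many days the search window spans. Raises ValueError on bad input."""
    start = datetime.strptime(since, "%Y-%m-%d")
    end = datetime.strptime(until, "%Y-%m-%d")
    if end < start:
        raise ValueError("until must not be earlier than since")
    return (end - start).days

# e.g. check the window used in the example above before scraping
print(day_range("2021-08-30", "2021-08-31"))
```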
Keys of the output
Key | Type | Description |
---|---|---|
tweet_id | String | Post identifier (integer cast to string). |
username | String | Username of the profile. |
name | String | Name of the profile. |
profile_picture | String | Profile picture link. |
replies | Integer | Number of replies to the tweet. |
retweets | Integer | Number of retweets of the tweet. |
likes | Integer | Number of likes of the tweet. |
is_retweet | Boolean | Is the tweet a retweet? |
posted_time | String | Time when the tweet was posted, in ISO 8601 format. |
content | String | Content of the tweet as text. |
hashtags | Array | Hashtags present in the tweet, if any. |
mentions | Array | Mentions present in the tweet, if any. |
images | Array | Image links, if any are present in the tweet. |
videos | Array | Video links, if any are present in the tweet. |
tweet_url | String | URL of the tweet. |
link | String | Any external website link present inside the tweet. |
To scrape topic tweets with a URL using API:
from twitter_scraper_selenium import scrape_topic_with_api
topic_url = 'https://twitter.com/i/topics/1468157909318045697'
scrape_topic_with_api(URL=topic_url, output_filename='solana_cryptocurrency', tweets_count=50)
Output:
{
"1584979408338632705": {
"tweet_url" : "https://twitter.com/AptosBullCNFT/status/1584979408338632705",
"tweet_details":{
...
},
"user_details":{
...
}
}, ...
}
scrape_topic_with_api()
arguments:
Argument | Argument Type | Description |
---|---|---|
URL | String | Twitter's topic URL. |
tweets_count | Integer | Number of tweets to scrape. |
output_filename | String | Filename where the output should be stored. |
output_dir | String | Directory where the output file should be saved. |
proxy | String | Optional parameter if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
browser | String | Which browser to use for extracting the GraphQL key. Default is Firefox. |
headless | Boolean | Whether to run the browser in headless mode. |
Keys of the output:
Same as scrape_keyword_with_api
To scrape topic tweets with URL using browser automation:
from twitter_scraper_selenium import scrape_topic
# scrape 10 tweets from steam deck topic on twitter
data = scrape_topic(filename="steamdeck", url='https://twitter.com/i/topics/1415728297065861123',
browser="firefox", tweets_count=10)
Keys of the output:
Same as scrape_profile
scrape_topic()
arguments:
Argument | Argument Type | Description |
---|---|---|
filename | String | Filename to write the output to. |
URL | String | Topic URL. |
browser | String | Which browser to use for scraping. Only two are supported: Chrome and Firefox. Default is Firefox. |
proxy | String | If the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
tweets_count | Integer | Number of posts to scrape. Default is 10. |
output_format | String | The output format, whether JSON or CSV. Default is JSON. |
directory | String | Directory to save the output file. Default is the current working directory. |
To scrape a profile's tweets with API:
from twitter_scraper_selenium import scrape_profile_with_api
scrape_profile_with_api('elonmusk', output_filename='musk', tweets_count=100)
scrape_profile_with_api()
Arguments:
Argument | Argument Type | Description |
---|---|---|
username | String | Twitter profile username. |
tweets_count | Integer | Number of tweets to scrape. |
output_filename | String | Filename where the output should be stored. |
output_dir | String | Directory where the output file should be saved. |
proxy | String | Optional parameter if the user wants to use a proxy for scraping. If the proxy requires authentication, the format is username:password@host:port. |
browser | String | Which browser to use for extracting the GraphQL key. Default is Firefox. |
headless | Boolean | Whether to run the browser in headless mode. |
Output:
{
"1608939190548598784": {
"tweet_url" : "https://twitter.com/elonmusk/status/1608939190548598784",
"tweet_details":{
...
},
"user_details":{
...
}
}, ...
}
Using the scraper with a proxy (HTTP proxy)
Just pass the proxy argument to the function.
from twitter_scraper_selenium import scrape_keyword
scrape_keyword(keyword="#india", browser="firefox", tweets_count=10, output_format="csv", filename="india",
               proxy="66.115.38.247:5678")  # in IP:PORT format
Proxy that requires authentication:
from twitter_scraper_selenium import scrape_profile
microsoft_data = scrape_profile(twitter_username="microsoft", browser="chrome", tweets_count=10, output_format="json",
                                proxy="sajid:[email protected]:5678")  # username:password@IP:PORT
print(microsoft_data)
Privacy
This scraper only scrapes public data available to an unauthenticated user and cannot scrape anything private.
License
MIT