Nesta's Skills Extractor Library

Welcome to Nesta's Skills Extractor Library

Welcome to the documentation of Nesta's skills extractor library.

This page explains how to install and use Nesta's skills extraction library. The library lets you extract skill phrases from job advertisement texts and map them onto a skills taxonomy of your choice.

We currently support three taxonomies to map onto: the European Commission’s European Skills, Competences, Qualifications and Occupations (ESCO), Lightcast’s Open Skills, and a “toy” taxonomy developed internally for testing.

If you'd like to learn more about the models used in the library, please refer to the model card page.

You may also want to read more about the wider project by reading:

  1. Our Introduction blog
  2. Our interactive analysis blog

Installation

You can use pip to install the library:

pip install ojd-daps-skills

You will also need to install spaCy's English language model:

python -m spacy download en_core_web_sm

Note that this package was developed on macOS and tested on Ubuntu. Changes have been made for Windows compatibility, but they are untested and compatibility cannot be guaranteed.

When the package is first used, it will automatically download a folder of necessary data and models (~1GB).
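
Before first use, it can help to confirm that both installs succeeded. The following is a minimal sketch using only the standard library. It assumes the pip package imports as ojd_daps_skills (as in the import statements in this README) and that the spaCy model is importable as en_core_web_sm. The helper name is_installed is illustrative and not part of the library.

```python
from importlib.util import find_spec


def is_installed(module_name: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return find_spec(module_name) is not None


if __name__ == "__main__":
    # Module names are assumptions based on the install commands above.
    for name in ("ojd_daps_skills", "en_core_web_sm"):
        status = "found" if is_installed(name) else "missing"
        print(f"{name}: {status}")
```

This only checks that the modules can be found; it does not trigger the first-use data download.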

TL;DR: Using Nesta's Skills Extractor library

The library supports three key skills extraction functions:

  1. Extract AND map skills to a taxonomy of your choice;
  2. Extract skills from job adverts;
  3. Map a list of skills to a taxonomy of your choice.

The option local=False can only be used by those with access to Nesta's S3 bucket.

1. Extract AND map skills

To extract AND map skills in one step, use the extract_skills method.

from ojd_daps_skills.pipeline.extract_skills.extract_skills import ExtractSkills #import the module

es = ExtractSkills(config_name="extract_skills_toy", local=True) #instantiate with toy taxonomy configuration file

es.load() #load necessary models

job_adverts = [
    "The job involves communication skills and maths skills",
    "The job involves Excel skills. You will also need good presentation skills"
] #toy job advert examples

job_skills_matched = es.extract_skills(job_adverts) #match and extract skills to toy taxonomy

The outputs are as follows:

>>> job_skills_matched
[{'SKILL': [('communication skills', ('communication, collaboration and creativity', 'S1')), ('maths skills', ('working with computers', 'S5'))]}, {'SKILL': [('Excel skills', ('working with computers', 'S5')), ('presentation skills', ('communication, collaboration and creativity', 'S1'))]}]
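
The nested output can be awkward to analyse directly. Below is a minimal sketch that flattens it into one row per matched skill, using the sample output above as input. The function flatten_matches and the row keys are illustrative names, not part of the library, and the sketch assumes every entry is a (phrase, (taxonomy name, taxonomy code)) tuple as shown.

```python
# Sample output copied from the example above.
job_skills_matched = [
    {"SKILL": [
        ("communication skills", ("communication, collaboration and creativity", "S1")),
        ("maths skills", ("working with computers", "S5")),
    ]},
    {"SKILL": [
        ("Excel skills", ("working with computers", "S5")),
        ("presentation skills", ("communication, collaboration and creativity", "S1")),
    ]},
]


def flatten_matches(matched):
    """Turn the nested output into one dict per (advert, skill) pair."""
    rows = []
    for advert_index, advert in enumerate(matched):
        # .get() guards against adverts with no "SKILL" key (an assumption).
        for phrase, (taxonomy_name, taxonomy_code) in advert.get("SKILL", []):
            rows.append({
                "advert": advert_index,
                "skill_phrase": phrase,
                "taxonomy_name": taxonomy_name,
                "taxonomy_code": taxonomy_code,
            })
    return rows


rows = flatten_matches(job_skills_matched)
for row in rows:
    print(row)
```

Rows in this shape can be written to CSV or loaded into a dataframe without further unpacking.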

2. Extract skills

You can simply extract skills from a job advert or list of job adverts:

from ojd_daps_skills.pipeline.extract_skills.extract_skills import ExtractSkills #import the module

es = ExtractSkills(config_name="extract_skills_toy", local=True) #instantiate with toy taxonomy configuration file

es.load() #load necessary models

job_adverts = [
    "The job involves communication skills and maths skills",
    "The job involves Excel skills. You will also need good presentation skills"
] #toy job advert examples

predicted_skills = es.get_skills(job_adverts) #extract skills from list of job adverts

The outputs are as follows:

>>> predicted_skills
[{'EXPERIENCE': [], 'SKILL': ['communication skills', 'maths skills'], 'MULTISKILL': []}, {'EXPERIENCE': [], 'SKILL': ['Excel skills', 'presentation skills'], 'MULTISKILL': []}]
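
Each advert yields a dictionary keyed by entity label. As a rough sketch of how this output might be summarised, the snippet below counts extracted phrases per label across all adverts, using the sample output above. The helper count_entities is an illustrative name and not part of the library.

```python
from collections import Counter

# Sample output copied from the example above.
predicted_skills = [
    {"EXPERIENCE": [], "SKILL": ["communication skills", "maths skills"], "MULTISKILL": []},
    {"EXPERIENCE": [], "SKILL": ["Excel skills", "presentation skills"], "MULTISKILL": []},
]


def count_entities(predictions):
    """Count extracted phrases per entity label across all adverts."""
    counts = Counter()
    for advert in predictions:
        for label, phrases in advert.items():
            counts[label] += len(phrases)
    return counts


entity_counts = count_entities(predicted_skills)
print(dict(entity_counts))
```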

3. Map skills

You can map either the predicted_skills output from get_skills or a plain list of skills to a taxonomy of your choice. In this instance, we map a list of skills:

from ojd_daps_skills.pipeline.extract_skills.extract_skills import ExtractSkills #import the module

es = ExtractSkills(config_name="extract_skills_toy", local=True) #instantiate with toy taxonomy configuration file

es.load() #load necessary models

skills_list = [
    "Communication",
    "Excel skills",
    "working with computers"
] #list of skills (and/or multiskills) to be matched

skills_list_matched = es.map_skills(skills_list) #match formatted skills to toy taxonomy

The outputs are as follows:

>>> skills_list_matched
[{'SKILL': [('Excel skills', ('working with computers', 'S5')), ('Communication', ('use communication techniques', 'cdef')), ('working with computers', ('communication, collaboration and creativity', 'S1'))]}]
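
Since map_skills also accepts the output of get_skills, the two steps can be chained while keeping the intermediate predictions for inspection. The following is a sketch only: the wrapper extract_then_map is a hypothetical helper, and it relies solely on the get_skills and map_skills methods shown in this README.

```python
def extract_then_map(es, job_adverts):
    """Run extraction and mapping as two separate steps.

    `es` is expected to be an ExtractSkills instance on which load()
    has already been called.
    """
    predicted_skills = es.get_skills(job_adverts)  # step 1: extract skill phrases
    matched = es.map_skills(predicted_skills)      # step 2: map phrases to the taxonomy
    return predicted_skills, matched


# Usage (requires the library and its downloaded models):
# es = ExtractSkills(config_name="extract_skills_toy", local=True)
# es.load()
# predicted, matched = extract_then_map(es, job_adverts)
```

Compared with calling extract_skills directly, this keeps the raw predictions available, which can be useful when checking why a phrase was mapped to a particular taxonomy entry.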

App

If you would like to demo the library using a front end, we have also built a Streamlit app that lets you extract skills from a given text. You can paste in a job advert of your choice, then extract and map skills using either of two configurations: extract_skills_lightcast or extract_skills_esco.

Development

If you'd like to modify or develop the source code you can clone it by first running:

git clone git@github.com:nestauk/ojd_daps_skills.git

Setup

  • Meet the data science cookiecutter requirements, in brief:
    • Install: direnv and conda
  • Create a blank cookiecutter conda log file:
    • mkdir .cookiecutter/state
    • touch .cookiecutter/state/conda-create.log
  • Run make install to configure the development environment
  • Install spaCy's English language model:
    • python -m spacy download en_core_web_sm

Project structure

The project is split into three core pipeline folders:

  • skill_ner - Training a Named Entity Recognition (NER) model to extract skills from job adverts.
  • skill_ner_mapping - Matching skills to an existing skills taxonomy using semantic similarity.
  • extract_skills - User friendly functionality to extract and map skills from job adverts.

More detail on each of these steps can be found in the README of the corresponding pipeline folder.
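
To give a feel for what matching by semantic similarity means, here is a toy sketch that picks the closest taxonomy entry by cosine similarity. It is not the library's implementation: the three-dimensional vectors are invented for illustration, and a real pipeline would obtain embeddings from a language model. All names in the snippet are hypothetical.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
TAXONOMY_VECTORS = {
    "working with computers": [0.9, 0.1, 0.0],
    "communication, collaboration and creativity": [0.1, 0.9, 0.2],
}


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closest_taxonomy_skill(skill_vector, taxonomy_vectors):
    """Return the (name, similarity) of the most similar taxonomy entry."""
    return max(
        ((name, cosine_similarity(skill_vector, vec))
         for name, vec in taxonomy_vectors.items()),
        key=lambda pair: pair[1],
    )


# A made-up vector standing in for the embedding of "Excel skills".
name, score = closest_taxonomy_skill([0.8, 0.2, 0.1], TAXONOMY_VECTORS)
print(name, round(score, 3))
```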

[Figure: an example of extracting skills and mapping them to the ESCO taxonomy.]

Testing

Some functions have tests. You can run them with:

pytest

Analysis

Various pieces of analysis are done in the analysis folder. These require access to various datasets from Nesta's private S3 bucket and are therefore only designed for internal Nesta use.

Contributor guidelines

The technical and working style guidelines can be found here.

If contributing, changes will need to be pushed to a new branch in order for our code checks to be triggered.


This project was made possible by funding from the Economic Statistics Centre of Excellence.

Project template is based on Nesta's data science project template (Read the docs here).
