RecipeNLG: A Cooking Recipes Dataset for Semi-Structured Text Generation

This is an archive of code which was used to produce dataset and results available in our INLG 2020 paper: RecipeNLG: A Cooking Recipes Dataset for Semi-Structured Text Generation

What's exciting about it?

The dataset we publish contains 2231142 cooking recipes (>2 millions). It's processed in more careful way and provides more samples than any other dataset in the area.

Where is the dataset?

Please visit the website of our project: recipenlg.cs.put.poznan.pl to download it.
NOTE: The dataset contains all the data we gathered including from other datasets. To access only our gathered recipes (with no 12 instead of 1/2 etc), filter the dataset for source=Gathered. It results in approx 1.6M recipes of better quality.

I've used the dataset in my research. How to cite you?

Use the following BibTeX entry:

@inproceedings{bien-etal-2020-recipenlg,
    title = "{R}ecipe{NLG}: A Cooking Recipes Dataset for Semi-Structured Text Generation",
    author = "Bie{\'n}, Micha{\l}  and
      Gilski, Micha{\l}  and
      Maciejewska, Martyna  and
      Taisner, Wojciech  and
      Wisniewski, Dawid  and
      Lawrynowicz, Agnieszka",
    booktitle = "Proceedings of the 13th International Conference on Natural Language Generation",
    month = dec,
    year = "2020",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://www.aclweb.org/anthology/2020.inlg-1.4",
    pages = "22--28",
}

Where are your models?

The pyTorch model is available in HuggingFace model hub as mbien/recipenlg. You can therefore easily import it into your solution as follows:

from transformers import AutoTokenizer, AutoModelWithLMHead
tokenizer = AutoTokenizer.from_pretrained("mbien/recipenlg")
model = AutoModelWithLMHead.from_pretrained("mbien/recipenlg")

You can also check the generation performance interactively on our website (link above).
The SpaCy NER model is available in the ner directory

Could you explain X and Y?

Yes, sure! If you feel some information is missing in our paper, please check first in our thesis, which is much more detailed. In case of further questions, you're invited to send us a github issue, we will respond as fast as we can!

How to run the code?

We worked on the project interactively, and our core result is a new dataset. That's why the repo is rather a set of loosely connected python files and jupyter notebooks than a working runnable solution itself. However if you feel some part crucial for the reproduction is missing or you are dedicated to make the experience smoother, send us a feature request or (preferably), a pull request.

Glorf/recipenlg

Glorf

Reviews

Repository Details