emilgraichen/SwedishLSdataset

SwedishLSdataset

This is the first lexical simplification dataset for Swedish. It was developed as part of a Bachelor's thesis in Cognitive Science at Linköping University. The dataset contains 150 quadruples. Each quadruple comprises a complex word sourced from the Swedish Kelly list, its frequency in the "BloggMix Odat" corpus, replacements for the complex word sourced from SynLex together with their frequencies in the same corpus, and an example sentence from SALDO in which the complex word occurs. The human assessment of each quadruple (regarding quality, coverage, and complexity) is also included in the dataset.

Links

A more detailed description of the work is available at: http://liu.diva-portal.org/smash/get/diva2:1767273/FULLTEXT01.pdf

Other repositories related to this thesis:

Lexical Simplification System for Swedish: https://github.com/emilgraichen/SwedishLexicalSimplifier

Complex Word Identification Dataset: https://github.com/emilgraichen/SwedishCWI

Structure of the Dataset

A picture showing the structure of the dataset
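
As a rough illustration of the quadruple described in the introduction, one entry could be modelled in Python as shown below. This is only a sketch: the class names, field names, and the helper method are hypothetical and do not reflect the actual file layout or column names of the dataset, and the example words and frequencies are invented placeholders rather than values taken from the data. The human assessment scores are left out because their scale is not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Replacement:
    word: str       # candidate replacement (sourced from SynLex in the dataset)
    frequency: int  # its frequency in the BloggMix corpus


@dataclass
class Entry:
    complex_word: str                # complex word (sourced from the Kelly list)
    frequency: int                   # its frequency in the BloggMix corpus
    replacements: List[Replacement]  # candidate replacements with frequencies
    sentence: str                    # example sentence containing the complex word

    def more_frequent_replacements(self) -> List[Replacement]:
        """Return replacements more frequent than the complex word, most frequent first.

        Hypothetical helper: it treats corpus frequency as a rough proxy for
        simplicity, which is an assumption of this sketch.
        """
        candidates = [r for r in self.replacements if r.frequency > self.frequency]
        return sorted(candidates, key=lambda r: r.frequency, reverse=True)


# Invented placeholder values, NOT taken from the dataset.
example = Entry(
    complex_word="erhålla",
    frequency=100,
    replacements=[
        Replacement("få", 5000),
        Replacement("motta", 40),
        Replacement("ta emot", 800),
    ],
    sentence="Du kommer att erhålla ett brev.",
)

for r in example.more_frequent_replacements():
    print(r.word, r.frequency)
```

Keeping each replacement paired with its own frequency makes it easy to compare a candidate against the complex word it is meant to replace.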

Links to the resources used for this dataset:

BloggMix Odat: https://spraakbanken.gu.se/resurser/bloggmix

Kelly Swedish: https://spraakbanken.gu.se/resurser/kelly

SynLex: http://folkets-lexikon.csc.kth.se/synlex.html

SALDO: https://spraakbanken.gu.se/resurser/saldoe
