LukProt - an animal evolution-centric eukaryotic protein database

Bibliographic Details
Main Author:	Sobala, Łukasz F.
Format:	Data Set
Language:	English
Published:	Zenodo 24-09-2024
Subjects:	animal origins EukProt evolution holozoa LukProt metazoa phylogeny

Description
Summary:	LukProt is the EukProt database with additional species added, mostly from undersampled animal taxa and some holozoan taxa. Please report any problems or suggestions to Lukasz Sobala: lukasz.sobala (at) hirszfeld.pl.

The database is composed of sequences translated from annotated genomes, transcriptomes or ESTs. Its main purpose is to be a convenient resource for looking up proteins of interest and to provide a single place where sequences from undersampled animal taxa can be found. The current version of the database (v1.5.1) is based on EukProt v3. The home of all public versions of LukProt is this page (Zenodo).

Proteomes that are novel in LukProt are denoted LPXXXXX and those coming from AniProtDB are denoted APXXXXX. The sequence IDs from EukProt are conserved in LukProt. Each sequence is therefore assigned an ID in the following format: (A/E/L)PXXXXX_Species_epithet_(strain)_PYYYYYY, where XXXXX is a number from 00001 to 99999 and YYYYYY is a number from 000001 to 999999 that is unique to each sequence within a taxon. All IDs are compatible with the BLAST v5 "-parse_seqids" option, and the database can be readily deployed, for example on a server running SequenceServer. Within each of the source FASTA files, the source sequence identifier was kept after a blank space so that it can still be retrieved if needed.

A public BLAST server providing LukProt search is available at: https://lukprot.hirszfeld.pl/.
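The ID convention described above can be checked mechanically. The following is a minimal sketch, not part of the LukProt distribution: the function name parse_lukprot_headers and the example header are hypothetical, and the pattern is only a reading of the format stated in this record. It splits each FASTA header into the LukProt ID, the proteome ID (the first seven characters) and the original source identifier kept after the blank space.

```shell
# Print "LukProt ID <TAB> proteome ID <TAB> source identifier"
# for each FASTA header read from stdin (hypothetical helper).
parse_lukprot_headers() {
  awk '
    /^>/ {
      header = substr($0, 2)                 # drop the leading ">"
      sp = index(header, " ")                # first blank separates the two identifiers
      id  = (sp ? substr(header, 1, sp - 1) : header)
      src = (sp ? substr(header, sp + 1) : "")
      # (A/E/L)P + 5 digits, species/strain text, then _P + 6 digits.
      # Digits are spelled out because some awk builds lack {n} repetition.
      if (id ~ /^[AEL]P[0-9][0-9][0-9][0-9][0-9]_.+_P[0-9][0-9][0-9][0-9][0-9][0-9]$/) {
        printf "%s\t%s\t%s\n", id, substr(id, 1, 7), src
      } else {
        print "unexpected ID format: " id > "/dev/stderr"
      }
    }'
}
```

It could be run as parse_lukprot_headers < some_proteome.fasta, where the file name is a placeholder. Headers that do not fit the pattern are reported on standard error rather than silently dropped.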
Comparison of EukProt v2/v3, LukProt v1.4.1 and LukProt v1.5.1 in their main areas of difference:

Taxogroup                     EukProt v2   EukProt v3   LukProt v1.4.1   LukProt v1.5.1
Holozoa (excluding Metazoa)   31           40           39               43
Ctenophora                    2            2            35               38
Porifera                      4            5            30               47
Placozoa                      2            2            3                6
Cnidaria                      3            5            65               88
Bilateria                     51           51           94               142

Included with the database are:

Ready-to-use main database files:
- LukProt_v1.5.1_single_species_FASTA.7z – FASTA files with the sequences (7-zipped), uncompressed size: 17.6 GB. To concatenate all of them into one file, run this in the parent directory:
    for file in $(find . -type f -name "*.fasta"); do awk 'FNR==1{print ""}1' "$file" >> LukProt_v1.5.1.fa; done
  This will create a single FASTA file with all the sequences in the parent directory. awk is used to insert a line break between files because cat would sometimes merge the last sequence of one file with the header of the first sequence of the next.
- LukProt_v1.5.1_full_BLAST_db.7z – a preformatted, full BLAST database (NCBI BLAST database format version: v5, masked with segmasker), uncompressed size: 28.3 GB
- LukProt_v1.5.1_taxogroup_BLAST_db.7z – a collection of BLAST databases where each taxogroup is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.3 GB
- LukProt_v1.5.1_single_species_BLAST_db.7z – a collection of BLAST databases where each proteome is one BLAST database and is placed within the eukaryotic tree of life directory structure, uncompressed size: 26.4 GB

Auxiliary database files:
- LukProt_v1.5.1.cdhit70.7z – the full database clustered at 70% identity using CD-HIT with the following command:
    cd-hit -g 1 -d 0 -T 20 -M 90000 -c 0.7 -uL 0.2 -uS 0.9 -s 0.2
  Uncompressed sizes: fasta file - 11 GB, clstr file - 2.5 GB
- LukProt_IDs_mapped.txt.gz – a text file mapping the LukProt IDs to the AniProtDB and EukProt IDs where these differ
- BUSCO_tables.ods – a spreadsheet with full result tables generated by BUSCO analysis
- OMAmer_output.zip – a folder with full results of OMAmer analyses (includes per-sequence taxonomy classification)
- OMArk_output.zip – a folder with the results of all OMArk analyses

Metadata:
- README.md – a README file describing the metadata
- LukProt_metadata_sheet.ods – main metadata file. A spreadsheet with information about each proteome (in an open .ods format, most compatible with LibreOffice)
- LukProt_metadata_other.zip – an archive with other metadata files, documented in the README. Contents include:
    - the LukProt taxonomy in various formats
    - supporting scripts for data manipulation and visualization
    - a recoloring script (modified by LFS, originally by Dr. Celine Petitjean). The script is in the public domain and is reuploaded here only for convenience.
    - other files - see README
- changelog.md – database changelog

Words of caution:
- The database was synchronized to EukProt v3 in version v1.5.1. This means that identifiers were modified in comparison to LukProt v1.4.1. The convention is not expected to change again in future updates.
- Many proteomes, especially transcriptome-based ones, may contain contamination from different species. In addition, the translation algorithms often introduce errors (e.g. the transcript may not represent a full-length protein). For this reason, to get accurate sequences from each organism, users are directed to the source data and to the included OMAmer, OMArk and BUSCO data for details.
- The taxonomy differs from UniEuk/EukMap, but UniEuk data were integrated where possible.
- A few NCBI taxids are missing.
- A number of proteomes that appear in some metadata files are unpublished and were held back.
- While the database contains metadata that present a particular phylogeny of animals, holozoans and other eukaryotes, no particular claims or hypotheses are made by the author(s). However, efforts will be made in the future to name clades officially once they are more firmly established.

Acknowledgements:
- Andrew E. Allen Lab for creating the original PhyloDB.
- Daniel Richter et al. for creating EukProt and keeping it updated.
- Members of the Multicellgenome Lab, especially Michelle Leger (for donating her database), for the bioinformatics support and for doing great science.
- All the authors of the original data.
- National Science Centre of Poland for funding the project 2020/36/C/NZ8/00081, "The role of glycosylation in the emergence of animal multicellularity", which enabled the creation of this database.
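The concatenation step described for the single-species FASTA archive can also be written so that it tolerates spaces in file names and can be verified afterwards. This is a sketch under assumptions, not the author's script: concat_fasta and count_headers are hypothetical helper names, the output file is overwritten rather than appended to, and the output path should not itself be a *.fasta file inside the input directory.

```shell
# Merge every *.fasta file below in_dir into out_file (both arguments are placeholders).
concat_fasta() {
  in_dir=$1
  out_file=$2
  # find hands the paths straight to awk, so spaces in file names are harmless.
  # FNR==1 is true on the first line of each input file; an empty line is printed there.
  # The trailing 1 prints every input line unchanged.
  find "$in_dir" -type f -name '*.fasta' -exec awk 'FNR==1{print ""}1' {} + > "$out_file"
}

# Count header lines in a FASTA file.
count_headers() {
  grep -c '^>' "$1"
}
```

After running, for example, concat_fasta . LukProt_v1.5.1.fa, the value printed by count_headers LukProt_v1.5.1.fa should equal the total number of header lines across the input files. A lower number would suggest that a header was merged into the preceding sequence line.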
Bibliography:	Related identifiers, grouped by relation type:
IsDescribedBy: 10.1101/2024.01.30.577650; 10.1093/gbe/evae231
IsCitedBy: 10.1093/glycob/cwad041
IsReferencedBy: 10.24072/pci.genomics.100368
HasPart: 10.6084/m9.figshare.12417881.v3
HasVersion: 10.5281/zenodo.7089121; 10.5281/zenodo.10522407; 10.5281/zenodo.11321046; 10.5281/zenodo.13765164; 10.5281/zenodo.13829058
Cites: 10.5282/UBM/DATA.202; 10.5061/DRYAD.JDFN2Z3CV; 10.5061/DRYAD.50DC6; 10.6084/m9.figshare.6124802.v1; 10.5524/100483; 10.7910/DVN/25071; 10.7939/R3794177K; 10.6084/m9.figshare.7108433.v1; 10.7910/DVN/INLEPM; 10.6084/m9.figshare.5848068.v1; 10.7910/DVN/24737; 10.5281/zenodo.10654583; 10.6084/m9.figshare.6233573.v1; 10.5061/DRYAD.DNCJSXM47; 10.6084/m9.figshare.5426494.v1; 10.6084/m9.figshare.6819812; 10.5061/DRYAD.TN0F3; 10.5061/DRYAD.R2N70; 10.6084/m9.figshare.6125030.v1; 10.5061/DRYAD.7717Q; 10.7939/R3S000; 10.6084/m9.figshare.10001870.v3; 10.5061/DRYAD.6CM1166; 10.6084/m9.figshare.22126232.v1; 10.6084/m9.figshare.20497143.v1; 10.6084/m9.figshare.8299529.v2; 10.6084/M9.FIGSHARE.6771635.V2; 10.6084/M9.FIGSHARE.1334306.V3; 10.7939/R30R9M73W
Relation type not stated: https://research.nhgri.nih.gov/aniprotdb/; 1759-6653
ISSN:	1759-6653
DOI:	10.5281/zenodo.7089120