3 months ago
You will design and build future-proof databases, large-scale processing systems and APIs in collaboration with Bioinformatics, Machine Learning and modeling experts, by developing, constructing, testing and maintaining data acquisition and dissemination methods.
Deciding on the best methods to acquire, curate, store and retrieve many primary and secondary data types, along with metadata pertaining to various data domains.
Analysing characteristics of data sets (-omics, imaging, structural) required by Bioinformatics, Machine Learning and Science team members, and using that understanding to discover and develop methods to make them available.
Developing and implementing optimal methods for regular extraction, curation, transformation, storage, retrieval and delivery of large and complex scientific datasets for Research and Product Development.
Recommending and implementing ways to improve data reliability, efficiency, and quality, through systems integration methods, automation of acquisition and quality control/assurance processes
Actively identifying patterns and anomalies in datasets using data surveillance tools as part of data performance reviews, and identifying methods to improve existing processing pipelines.
A Bachelor's degree (Computer Science/Mathematics/Statistics) followed by a minimum of five years' experience developing and working with a variety of databases and data sets
At least three years of deep experience, with demonstrated evidence of:
Performing analysis on at least a couple of types of data sets to understand their properties and advising end-user teams on their value
Developing/optimising high-volume data pipelines, large datasets and big-data architectures
Successfully building processes for transforming data, creating unique data structures to suit end uses, ensuring sufficiency of metadata, and developing methods for automated delivery of data sets (software tools, APIs)
Working on building and using data stores in AWS
Big data tools and stream-processing systems: Hadoop, Spark, Kafka, Storm, Spark Streaming
Relational SQL and NoSQL databases, including Postgres and Cassandra.
Data pipeline and workflow management tools: Luigi, Airflow, etc.
AWS cloud services: EC2, S3, Glue, Athena, API Gateway, Redshift
Designing and building APIs (RESTful, etc.)
Ontologies such as Gene Ontology, and ontological modelling tools and editors such as W3C Wiki, Basic Formal Ontology, etc.
Remote working possible