Full-Time Cloud Data Engineer Based in Boston, MA
We’re looking for a Data Engineer to help us transform our data systems and architecture to support greater variety, volume, and velocity of data and data sources. You might be a good fit if:
You enjoy extracting data from a variety of sources, finding ways to connect them, and making them suitable for use in software systems and for the development of models and algorithms.
You enjoy interacting with new database systems and learning new data technologies, and are interested in developing your knowledge of new tools and techniques.
You are interested in automating data engineering efforts to minimize human interaction and optimize data quality.
You have an interest in developing your knowledge of practical data science techniques and technologies in addition to your data engineering knowledge and experience.
This role requires comprehensive data engineering skills and is not a SQL developer role, though SQL is a required skill.
We’re looking for an experienced data engineer to help us:
- Build and maintain serverless data ingestion and refresh pipelines at terabyte scale using AWS cloud services: AWS Glue, Amazon Redshift, Amazon S3, Amazon Athena, Amazon DynamoDB, and others
- Incorporate new data sources from external vendors using flat files, APIs, web scraping, and databases
- Maintain and provide support for the existing data pipelines using Python, Glue, Spark, and SQL
- Develop and enhance the database architecture of the new analytic data environment, including recommending optimal choices between relational, columnar, and document databases based on requirements
- Identify and deploy appropriate file formats for data ingestion into various storage and/or compute services via Glue for multiple use cases
- Develop real-time or near-real-time ingestion of web and web service logs from Splunk
- Maintain existing processes and develop new methods to match external data sources to Homesite data using exact and fuzzy methods
- Implement and use machine-learning-based data wrangling tools like Trifacta to cleanse and reshape third-party data to make it suitable for use
- Develop and implement tests to ensure data quality across all integrated data sources
- Serve as an internal subject matter expert and coach, training team members in the use of distributed computing frameworks for data analysis and modeling, including AWS services and Apache projects
Requirements:
- Master’s degree in Computer Science or Engineering, or equivalent work experience
- Two to four years’ experience working with datasets with hundreds of millions of rows using a variety of technologies
- Intermediate to expert level programming experience in Python and SQL in Windows and Mac/Linux environments
- Intermediate level experience working with distributed computing frameworks, especially Spark
- Intermediate level experience working with relational databases including PostgreSQL and Microsoft SQL Server
- Experience working with contemporary data file formats like Apache Parquet and Avro, and with columnar databases like Redshift
- Experience working with distributed SQL query engines like Presto and Athena
- Experience with Amazon Web Services including Redshift, S3, Kinesis, Glue, and DynamoDB
- Experience analyzing data for data quality and supporting the use of data in an enterprise setting
Nice to have:
- Some experience working with clustering and classification models
- Some experience working with Trifacta
- Some experience working with Google Analytics
- Some familiarity with RDF and SPARQL and some experience working with graph databases
- Experience with enterprise search engine systems including Elasticsearch and Apache Solr