It also doesn't even categorize the products it competes with correctly[0].
Why not contribute some of your resources to one of the many active open source libraries already trying to solve some of these problems, and focus your engineering efforts on your core product?
[0] Fivetran is only listed under "Orchestrate" but actually competes directly with Alooma in Extract and Load. Also, there are DOZENS of companies in that space. https://gitlab.com/meltano/meltano/blob/master/README.md#dat...
I agree Fivetran also belongs in extract and load and updated it https://gitlab.com/meltano/meltano/commit/1df9813f5ab42c4479... Do you think it should be removed from Orchestrate? Any other suggestions for proprietary products in that category?
Consider how you trust using dbt more than rolling your own transformation tool. Why wouldn't this apply to the rest of your stack? The 10+ companies that offer data extraction and loading are likely a better choice. Again with Analytics - the dozens of companies that offer BI tools are probably going to be the better choice.
Maybe you can build all these tools better than the hundreds of companies with thousands of employees and millions of dollars. It just seems like the odds that you build the best of each are so low.
I would have been more impressed if your team had designed some API that other tools/platforms could plug in to coordinate a lot of the above jobs with your CI system. There is a SERIOUS need for that and I've had a lot of conversations with companies about what that would look like.
To answer your question, no, Fivetran does not currently belong in the orchestration area, IMO. I've heard they are soon to release some sort of orchestration tooling to compete with dbt, but it isn't the type of orchestration you get with Airflow.
I'm not 100% familiar with all the tools you are using, but stringing together random SaaS tools and having to survey a random assortment of open source tools in order to assemble a sensible platform makes way less sense.
At the very least, what we end up with is a group of folks working together in the open to surface some of the limitations and challenges and attempt to work out some of the alternative solutions to the problems that arise in this space.
So, I applaud your effort. Ignore the salesmen and the haters.
A lot of the solutions out there are fantastic but aren't up to the tasks we are looking for. Why shouldn't the whole life cycle be in one tool, be open source, and be version controllable? That's what we are looking for in a tool.
That's by no means a bad thing though. While yes, there are downsides to tightly coupled tools, there are also advantages. If GitLab is trying to do the same thing for data analytics that they've already done for source control, they may very well succeed.
If the CEO is following this, please improve basic user stories like:
* As a user, I want to easily know who has approved my merge request. Note the word "easily". The UI lists the people who did not approve next to the label "Approved" and the people who did approve next to the label "Approved by". That makes absolutely no sense
* As a user, I want to see all the merge requests that I need to review because I am listed as an approver (it boggles my mind that this doesn't exist)
* As a user, I want to be notified only by todos that have pending actions on them
* As a user, I want to disapprove a merge request
There are so many basic areas of the core product that are almost unusable. All of our engineers who regularly switch between GitHub and GitLab prefer the GitHub UI.
And while some integration is good... A lot of recent stuff is just "we try to grab the easy money"
Can you explain further what you mean by "pending actions on them"? We are working to simplify and streamline our notifications and todos in GitLab. In particular, the current thinking is that they are very similar. A "notification" is an email, and a "todo" is something that GitLab calls your attention to in the Web UI to take action on. So mechanically, they are very similar and we would like to harmonize them.
Our latest discussion is in https://gitlab.com/gitlab-org/gitlab-ce/issues/48787.
We've improved a number of confusing approval widget states in GitLab 11.2 (https://gitlab.com/gitlab-org/gitlab-ee/issues/5439), which will ship later this month, and the ability to filter merge requests by approver is in development by a community member (https://gitlab.com/gitlab-org/gitlab-ee/issues/1951).
This is just the beginning though – code reviews and approvals are at the heart of the daily workflows of writing software and we'll be continuing to make them even better. I'm particularly excited about more structured code reviews with batch comments in 11.3 (https://gitlab.com/gitlab-org/gitlab-ee/issues/1984), better navigation between files in merge request diffs with a file tree in 11.4 (https://gitlab.com/gitlab-org/gitlab-ce/issues/14249), and our first iteration of code owners (https://gitlab.com/gitlab-org/gitlab-ee/issues/5382) also in 11.4.
Thanks for the disapprove merge request idea. We're considering this idea in https://gitlab.com/gitlab-org/gitlab-ee/issues/761 where further feedback would be much appreciated, or on any other issue.
This doesn't mean anything; maybe the customers are simply tired of reporting issues. For example, last year we didn't do any updates for 6 months because we were afraid an update would break something, and we were too busy to be willing to spend the time reporting problems.
We also don't report issues that are already open on gitlab.com. Reporting an issue means your customer is willing to spend time reporting, following up on, and testing your bug; this is your job, not the customer's. At the moment we are only reporting issues that are either blocking our work or slowing down our development. The majority of issues we are facing are performance problems.
I just wrote a script to plot the number of issues on gitlab-ce over time, the percentage of open vs. closed issues, and how long they have been open. You are accumulating issues with the `backend`, `UX`, `technical debt`, `performance`, `CI/CD`, ... labels; a lot of them don't have a milestone and have been open for a long time.
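For anyone who wants to reproduce this, the aggregation is small. A minimal sketch (the field names `created_at` and `closed_at` match the shape GitLab's issues API returns; fetching, pagination, and plotting are left out):

```python
def open_counts_by_month(issues):
    """Count issues opened and closed per month.

    `issues` is a list of dicts with ISO-8601 date strings under
    'created_at' and 'closed_at' (None while still open), mirroring
    GitLab's GET /projects/:id/issues payload.
    """
    opened, closed = {}, {}
    for issue in issues:
        opened_month = issue["created_at"][:7]  # "YYYY-MM"
        opened[opened_month] = opened.get(opened_month, 0) + 1
        if issue.get("closed_at"):
            closed_month = issue["closed_at"][:7]
            closed[closed_month] = closed.get(closed_month, 0) + 1
    return opened, closed

def percent_still_open(issues):
    """Share of issues with no closed_at timestamp."""
    if not issues:
        return 0.0
    still_open = sum(1 for i in issues if not i.get("closed_at"))
    return 100.0 * still_open / len(issues)
```

Feeding the paginated API output through these and plotting the two dicts with matplotlib gives the trend lines described above.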
I am not sure how emailing you would help us, it's not like the problems are not reported or you don't already know about them. It just appears that the priority of GitLab, as a company, is not shipping a quality product anymore.
EDIT: I work in the aerospace industry, and one of the stages of our pipelines runs stress tests on our product. I would suggest running a stress test on an instance of GitLab; that would be an amazing place to start looking for performance problems.
Couldn't that just be because you have more silent customers now? Probably from the people who moved their projects over from GitHub.
GitLab is constantly growing, and Meltano is adding to GitLab's capabilities, not subtracting from them. We've hired 2 very awesome Python developers specifically for Meltano. They each have tons of experience in the ELT space.
All this to say that no one at GitLab has turned their eyes away from GitLab; it's the opposite. This business is here to help GitLab as our first customer. Rather than having GitLab struggle to get its data tools together and make business decisions based on that data, we've devoted a whole team to providing a solution while helping the community at the same time.
- Studying each source to figure out the right data model
- Chasing down a million weird corner cases
- Working around dumb bugs in the data sources
This is the kind of problem where paying for software really works better. When people build data pipelines in-house, they tend to hack at it until it works for their use case and then stop. When we build data pipelines, we map out every feature of the data source, implement the whole thing at once, and then put it through a beta period with multiple real users. This is easy to do when you have a tight-knit dev team; much harder for a group of part-time open-source contributors.

Personally, I work as a "lone wolf" (much to my own chagrin) because I'm in a small company that can't afford a huge team. Most of my (ETL) transforms are done in SQL, which happens to be pretty standardized, as opposed to many ETL products I've seen so far.
This solution is probably far from ready, but I find the approach quite interesting, because it looks like a code-based ETL that uses SQL for transforms (so I might be biased). Overall this might result in a more maintainable/versionable data pipeline model than GUI-first ETL tools, which usually generate spaghetti code. Because you are usually forced to regularly adapt data pipelines to unstable external inputs, being able to easily diff the ETL process would be a blessing.
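To make the "easily diff the ETL process" point concrete: if every transform lives as a plain `.sql` file in the repo, the runner can stay trivial and every pipeline change shows up as an ordinary diff in code review. A minimal sketch (not Meltano's actual runner, which delegates to dbt):

```python
import sqlite3
from pathlib import Path

def run_transforms(conn, transform_dir):
    """Apply each .sql transform in filename order (01_, 02_, ...).

    Because the transforms are plain files under version control,
    any change to the pipeline is reviewable as a normal diff.
    """
    for sql_file in sorted(Path(transform_dir).glob("*.sql")):
        conn.executescript(sql_file.read_text())
```

A change to a `02_clean.sql` file is then a one-line diff in a merge request, instead of an opaque edit buried inside a GUI tool's generated artifacts.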
One thing that gets me really excited about it is the way we want to build version control in from the start. To give you an example of where that's really powerful - we have a bunch of dashboards in Looker. Right now, figuring out what Looks/Dashboards rely on a given field is very challenging. If I change a column in my extraction, right now I can fairly easily propagate it to my final transformed table (thanks to dbt!) and even to the LookML. But knowing what in Looker is going to change / break if I change the LookML is way harder.
But if everything was defined in code from extraction, loading, transformation, modeling, _and_ visualization, that'd be really powerful from my perspective.
The Meltano team has several user personas they're focusing on. Data engineers are definitely one of them, but data analysts/BI users are as well, and we want the product to work well for the whole data team.
IMHO, if you want to make a dent in the space, figure out better debugging tools!
In particular: tools that explain how a certain (specific) value was calculated in the system, tools that let you bisect the source data in some way and focus on the records that are likely to have a problem, tools that help you figure out that a certain intermediate value in a calculation is an outlier, and tools that let you test certain assumptions about data over the whole pipeline.
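The last item, testing assumptions about data over the whole pipeline, doesn't need much machinery to prototype. A toy sketch (the check names and fields are made up):

```python
def check_assumptions(rows, checks):
    """Evaluate every named assumption against every row and report
    which rows violate which assumption, so you can bisect straight
    to the suspect source records."""
    failures = []
    for index, row in enumerate(rows):
        for name, predicate in checks.items():
            if not predicate(row):
                failures.append((index, name))
    return failures

# Example assumptions over a hypothetical orders feed
checks = {
    "amount_non_negative": lambda r: r["amount"] >= 0,
    "currency_known": lambda r: r["currency"] in {"USD", "EUR"},
}
```

Running such checks after every pipeline stage, not just at the end, is what makes it a debugging tool rather than a final gate: the first stage whose output fails tells you where the value went wrong.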
I'd love for a more robust way to test data pipelines and the data within them generally. I was at DataEngConf earlier this year and many people were talking about this problem exactly. One way we're trying to address it a bit is by using the Review Apps feature on Merge Requests within GitLab. Right now, when you open an MR on our repo it will create a clone of the data warehouse that's completely isolated from production. This, obviously, can't scale once the DW is beyond a certain size, but I think there are ways to keep this sort of practice going.
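As an illustration of the per-MR clone idea (with SQLite's online backup API standing in for whatever the real warehouse offers, e.g. snapshot or zero-copy clone features):

```python
import sqlite3

def clone_warehouse(source, clone_path):
    """Copy the production warehouse into an isolated database that a
    merge request's review app can query and mutate freely; throwing
    the clone away after the MR merges costs production nothing."""
    clone = sqlite3.connect(clone_path)
    source.backup(clone)  # sqlite's online backup API
    return clone
```

As noted above, a full copy stops scaling past a certain warehouse size; schema-level cloning or copy-on-write snapshots are the usual next step.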
The idea is to give users a set of default extractors (which are the ones we use internally, so they are battle-tested), along with loaders, transformers, etc., with documentation on how to build their own. For our MVP, and possibly into the future, it will work similarly to WordPress plugins: there is an extractor directory where you place your extractor, written following our protocol, and the UI will recognize it and give you choices of extractors to run; same for loaders, and so on.
We do not want to be chasing down every last corner case for extractors (except for our own), because that's just not a good long-term solution; it needs constant maintenance (as we've seen already). With user contributions, I believe it can work.
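A WordPress-style extractor directory can be prototyped with nothing but stdlib module discovery; the `extract()` protocol below is hypothetical, purely to illustrate how a UI could enumerate the choices:

```python
import importlib
import pkgutil

def discover_extractors(package):
    """Scan an extractor directory (a Python package) and collect
    every module that follows the protocol, i.e. exposes a callable
    named `extract`. The returned mapping is what a UI would render
    as the list of available extractors."""
    extractors = {}
    for info in pkgutil.iter_modules(package.__path__):
        module = importlib.import_module(f"{package.__name__}.{info.name}")
        if callable(getattr(module, "extract", None)):
            extractors[info.name] = module.extract
    return extractors
```

Dropping a new `salesforce.py` with an `extract()` function into the directory would then make it show up automatically, and loaders/transformers could be discovered the same way.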
Once you take VC funding, you gotta go where the money is. Everyone wants/expects "fast, stable, like Github" for free unless you have special needs. So, you do analytics on what people are doing with your free site, you offer enterprisey features, you get into the "platform" business etc.
I think GitLab distracts itself, spreads itself thin, and isn't great at partnering; its ambition to do it all knows no bounds, which is both commendable and a smh moment. It's not likely sustainable or scalable. They're definitely trying to "go big or go home" as a company, which is not how most people originally felt about GitLab (a fast, stable OSS alternative to GitHub).
At the same time, I can't blame them. I think it comes down to: Don't hate the player, hate the game.
We have hired 3 times as many people in our security team for GitLab.com (not our product team for security) as are working on Meltano.
We have hired 3 times as many people in our SRE teams as are working on Meltano.
And we still have a lot of vacancies for both https://about.gitlab.com/jobs/
BTW We don't call it a family https://about.gitlab.com/handbook/leadership/#management-tea...
Thanks for the link - we'll definitely keep an eye on it.
I was very glad to see this is Python! Python has some of the best data tools out there, and a mature ecosystem for solving all the engineering problems that go along with a great data stack.
I fully expect we'll have a use case for the "cool" machine learning stuff, but there's a lot of groundwork to cover with the basics first. Meltano is focusing on those basics for right now.
I think this market is not being served properly; most of these tools still seem to require the heavy lifting to be done by the ML practitioner.
I suppose I would even be okay with a service that just saves all my graphs from tensorboard for later reviewing.
Extraction/Loading: Dell Boomi, SAP, SAS, Pentaho, Domo, Oracle, IBM, Microsoft, Informatica, Talend, JitterBit, SnapLogic, MuleSoft, SyncSort, Information Builders, Actian, Attunity, Datameer, Alteryx, Striim, Treasure Data, Cask, StreamSets, Snowplow, DataTorrent, Astronomer, Panoply, Apache NiFi, Stitch Data, FlyData, Bedrock Data, Alooma, ETLeap, Fivetran, Xplenty, MethodMill, Celigo, TerraSky, DBSync, Youredi, Scribe, Civis Analytics, DataScience, Dataloader.io, Datorama, Astera
Analyze: MicroStrategy, GoodData, Sisense, Looker, Power BI, Wagon, Birst, Tableau, Qlik, Domo, Hue, Mode, Chartio, Periscope, Pentaho
The amount of hype and BS in the Notebook space would require me to spend some time combing through that again.
1. Do we have enough money / budget for a tool like this?
2. Can we derive enough insights from this product fast enough to make a good ROI?
3. Does this tool use a proprietary language that no one wants to learn, or can I code in a language that is relevant?
4. In all honesty, can I get insights faster in a spreadsheet than in these tools?
5. What is the learning curve?
6. Can I answer the business question that was originally asked?
Open to more discussion around the topic, as it is a lot harder to answer than a few philosophical questions, but it certainly resonates with many data & analytics professionals. A nice goal would be to have a project where you can stand up a business, turn your data pipelines on, ingest the data, and view the insights needed to make a business decision, all within a short timeframe of when the business goes live.