(by unusable, I mean I can never find the page I am looking for when I search. Basically, I have to maintain my own wiki of important links I may need to reference in the future)
* First, you only have a few occurrences of the same search query in your search history (because only a few people have searched for similar words in the past)
* You also can't use synonyms or remove stop words to recommend better content ("IT" can mean "information technology" or be the pronoun; "THE" can be an acronym, ...).
So basically the only thing you can do is match words. Confluence is worse than that, because it tries to remove stop words and does things that break exact-match search. But this is a difficult job. Ways to improve search: allow multiple titles, index with tags and attributes, only do exact word matches, allow users to suggest content for a specific search query, search autocompletion, live search while typing... (many things that Confluence doesn't care about). You also have to respect rights when returning documents: each document can have rights from its folder or from the document itself, inherited from team access or user access, so this is really computation-intensive too, unless you pre-compute rights.
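The bare-bones approach described above (exact tokens only, with per-document rights applied at query time) can be sketched in a few lines; the documents and the ACL shape here are hypothetical:

```python
from collections import defaultdict

index = defaultdict(set)  # exact token -> doc ids; no stemming, no stop words
acl = {"doc1": {"alice"}, "doc2": {"alice", "bob"}}  # pre-computed rights

def add_doc(doc_id, text):
    for token in text.lower().split():
        index[token].add(doc_id)

def search(query, user):
    # Intersect the posting lists for each query token...
    tokens = query.lower().split()
    docs = set.intersection(*(index[t] for t in tokens)) if tokens else set()
    # ...then filter by the pre-computed per-document rights.
    return {d for d in docs if user in acl.get(d, set())}

add_doc("doc1", "quarterly planning notes")
add_doc("doc2", "planning the IT migration")
print(search("planning", "bob"))  # {'doc2'}
```

Pre-computing the rights table is what keeps the per-search cost down; computing inherited folder/team/user rights at query time is the expensive alternative.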
(Working on a competitor [0] of Confluence and I have put plenty of hours of work on that specific issue, and I can tell you this is really hard)
Even early Google had more power user features than a typical B2B product search bar.
Boolean expressions (NOT, OR, AND), exact match strings, links-to, linked-from, in-folder/category, etc. should be mandatory for these workflows. Better if you can include search queries as live page content, as in Notion & Height.
Power users are a small share of the users of knowledge management software, so it is difficult to build a system only for them. Most people just type a few words and give up if they don't find what they want in the first 5 results.
I wrote a custom search engine that ran on a cron job, pulling in all of the content from Confluence and writing it into a SQLite table with SQLite full-text search enabled (using https://sqlite-utils.datasette.io/en/stable/python-api.html#...), then sticking a https://datasette.io/ interface in front of it.
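A minimal sketch of the same idea, using only the stdlib sqlite3 module with an FTS5 virtual table instead of sqlite-utils and Datasette; the rows are hypothetical stand-ins for content pulled from the Confluence API:

```python
import sqlite3

# In-memory DB for illustration; the real thing would be a file
# the cron job rewrites on each run.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE pages USING fts5(title, body)")

# Hypothetical pages standing in for exported Confluence content.
pages = [
    ("Deploy guide", "How to deploy the billing service to production"),
    ("Oncall notes", "Escalation contacts for the billing team"),
]
db.executemany("INSERT INTO pages VALUES (?, ?)", pages)

# Full-text query over title and body, ranked by FTS5's built-in bm25.
rows = db.execute(
    "SELECT title FROM pages WHERE pages MATCH ? ORDER BY rank",
    ("deploy",),
).fetchall()
print(rows)  # [('Deploy guide',)]
```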
On another level, and bearing in mind that Confluence is a paid product, this absolutely should not be necessary and competent search is something that Atlassian should provide out of the box.
(Yes, I have beef with Confluence, but in my case it's primarily due to the historically awful editing experience.)
The docs for the content POST endpoint literally say to write what you want in Confluence's WYSIWYG editor, then do a GET API call to see what the payload should look like.
This methodology works
https://ccc.inaoep.mx/~villasen/bib/AN%20OVERVIEW%20OF%20EVA...
and I used it to tune up the relevance of a search engine for patents to the point where users could immediately perceive that it worked better than other products.
After I worked on that I wound up talking to the developers and/or marketing people for many enterprise search engines and few of them, if any, did any kind of formal benchmarking of relevance.
People at one firm told me that they used to go to TREC conferences because they thought it got them visibility but that they decided it didn't so they quit going.
A message I got repeatedly was that these firms thought that the people who bought the search engines didn't care much about relevance, but they did care about there being 200 or more plug-ins to import data from various sources.
In principle the tuning is unique to the text corpus. One reason for that is that there is a balancing act of having a search engine that prefers small documents (they have spiky vectors that look more like query vectors) or large documents (they have so many words they match everything.) Different corpuses have different distributions of document sizes, not to mention different distributions of words that appear.
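This balancing act is exactly what the length-normalization parameter b controls in BM25-style scoring (the family modern Lucene defaults to); a minimal sketch with made-up corpus statistics:

```python
import math

def bm25_term_score(tf, doc_len, avg_len, n_docs, df, k1=1.2, b=0.75):
    """One term's BM25 contribution; b controls document-length normalization."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    return idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))

# Same term frequency, but one document is 10x the average length.
short = bm25_term_score(tf=2, doc_len=100, avg_len=500, n_docs=10_000, df=50)
long_ = bm25_term_score(tf=2, doc_len=5_000, avg_len=500, n_docs=10_000, df=50)
print(short > long_)  # True: b > 0 penalizes the long document

# b=0 disables length normalization: both documents score identically.
print(bm25_term_score(2, 100, 500, 10_000, 50, b=0.0)
      == bm25_term_score(2, 5_000, 500, 10_000, 50, b=0.0))  # True
```

Tuning b (and k1) per corpus is one concrete form of the corpus-specific tuning described above.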
Few organizations are willing to do the work to tune up a search engine (you have to decide about the relevance of 10,000+ document hits), but I've had the experience that you can beat the pants off the defaults even using a generic tuning. For instance that patent search engine was tuned up against the GOV2 corpus instead of a patent corpus. A small patent corpus showed us we were on the right track, however.
Aside from the organizational issues, I think there's a problem where basically no search system can be good for every org, with any kind of internal info and different queries from perhaps several distinct types of users with different goals. To get good, a system needs to improve through at least rudimentary ML. At its simplest: if Alice searches for X today and clicks doc3, then when Bob searches for X tomorrow, doc3 should rank higher. This requires collecting and aggregating clickstream data and using those counts (with cardinality #docs x #queries) at search time. But sometimes it requires a richer model relating search terms to terms in relevant (clicked) docs, optimizing for some measure of search quality (NDCG, etc.). All of this requires detailed access to docs, search/click histories, and a fair amount of computation and storage. But customers have legitimate reasons for wanting these docs to be accessible only by their own employees, and they don't want to dedicate their own staff to improving such a system. No one wants to hear that their model retraining ran out of memory. So shipping a simple system that doesn't improve but has no moving parts becomes a local optimum.
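The Alice/Bob loop above can be sketched as a simple click-count re-ranker; the boost weight and document names are arbitrary:

```python
from collections import Counter

click_counts = Counter()  # (query, doc_id) -> number of past clicks

def record_click(query, doc_id):
    click_counts[(query, doc_id)] += 1

def rerank(query, scored_docs, weight=0.1):
    """scored_docs: list of (doc_id, base_score). Boost by historical clicks."""
    return sorted(
        ((d, s + weight * click_counts[(query, d)]) for d, s in scored_docs),
        key=lambda pair: pair[1],
        reverse=True,
    )

# Alice searches "X" today and clicks doc3 twice...
record_click("X", "doc3")
record_click("X", "doc3")

# ...so when Bob searches "X" tomorrow, doc3 outranks the base-score leader.
ranked = rerank("X", [("doc1", 1.0), ("doc3", 0.9)])
print(ranked[0][0])  # doc3
```

A real system would store these counts per tenant and fold them into the ranking function, which is where the storage and access concerns in the comment come from.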
Search has been an area of focus on and off for most of the last 15 years. It actually has gotten a lot better, and Atlassian has an entire team focused on improving the search experience across their suite of products (they started with Confluence). And from what I hear, they are focusing on all the right things.
To your point, no search system can be a good fit for every possible use-case. Confluence has a number of different use-cases, but let's just pick "documentation" and "intranet" as an example here.
Intranets are, to a large degree, about keeping up with what's new in a company. Therefore recent content is likely more relevant than older content.
When used for documentation, recency doesn't matter at all. If a document was written 2 years ago, but the content is still accurate, it's just as relevant as it was on day one.
That means no single relevance configuration will work well for all use-cases. Leveraging ML is essential. But even a single ML model across an entire Confluence instance is not going to work as different spaces are used for different use-cases. What's really required here is to build different models for different spaces to create a tailored relevancy for each space. It's not an easy problem to solve, but I'm confident they will get there with time.
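As a toy illustration of space-specific relevance, the intranet-vs-documentation difference could be expressed as a per-space recency half-life (the numbers here are made up):

```python
import math

def recency_score(base, age_days, half_life_days):
    """Exponential decay on the base relevance score.

    A finite half-life prefers fresh pages (intranet use-case);
    math.inf disables the decay entirely (documentation use-case).
    """
    return base * 0.5 ** (age_days / half_life_days)

# Intranet space: a 30-day half-life halves a page's score every month.
print(recency_score(1.0, 60, 30))        # 0.25
# Documentation space: a 2-year-old accurate page keeps its full score.
print(recency_score(1.0, 730, math.inf))  # 1.0
```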
Seeing the challenges with search at Atlassian, despite having a large, dedicated team of engineers working on the problem, is what motivated me to join http://sajari.com. We've been doing a lot of work on reinforcement learning and neural search. Our focus right now is on public content websites and e-commerce, but eventually we will get around to enabling products like Confluence to create a great search experience without needing an entire team. Search is a hard problem, but there is so much opportunity to improve the experiences that are available today. Exciting times.
OTOH I'm also a believer that you should be able to navigate to the right information.
People seem to think that writing pages is sufficient. A library works because pages are gathered into books, organised into sections, with an army of librarians keeping it all running smoothly.
I treat documentation like code - DRY and refactoring apply just the same. E.g. I might split a page up so that some common part can be re-used. I'll cull obsolete information or mark it obsolete. I'll _also_ update headings to help pages show up in searches.
I can get far more functionality with a properly implemented self-hosted MediaWiki server (the same code that runs Wikipedia itself) with a number of useful plugins installed and enabled.
It doesn't require rocket-science-level apache2+php7+mariadb knowledge to set up. The instructions are really quite straightforward.
Confluence's users are enterprise companies, where getting a self-hosted server up and running is too much pain to be worth the bother.
This is a process problem. The steps to get one would be something like:
- try and find the “provision a server” option in the corporate service portal (there probably isn’t one)
- ask someone if they know how to provision one. Get a link to a separate system where you can make the request
- you need to associate the instance with a cost centre, or maybe you literally need a credit card number, don’t forget to attach written manager approval
- update the project’s budget to include the unexpected cost of this internal service. Hopefully there’s actually some margin to afford it.
- wait a day or two for the request to go through
- get the instance details, RDP in and try and set everything up. Realise you need to make a separate request for admin rights to install non-base software if you don’t want to use IIS and MSSQL server
- wait a day for admin rights. Don’t forget to add written manager approval to the request or else it will be denied
- realise you need to make a separate DNS request to get a friendly url for the team to access it. Also, how are you going to secure access to just your team members? Need to integrate with the corporate AD
- …about a dozen more steps
Compare all of that with:
- Go to the corporate confluence instance
- click “Create”, add your team members with edit rights.
- done
Confluence itself may not be a great experience to use, but it’s solving the problem of getting to the point of having a wiki setup in the first place.
And yet many of them self-host Confluence. And many other things. And provision servers all the time. And you have to provide a CC (or maybe PO) for Confluence in any case. And you can't just associate Confluence with a cost centre. And you have to budget it. And... literally every single one of your arguments applies just as much to Confluence.
So off we went to Atlassian. It has many flaws, but nobody is pining for the old days of MediaWiki. And the hooks Confluence has into Jira are something you don't get with plain MediaWiki, and that has real use for us.
My bigger question though is why the average user is important. Most large companies have employees whose entire job is ... knowledge management. If they can't figure out how to write wikitext then maybe they're not a good fit for the role?
Honestly I wish I knew more but it was like pulling teeth trying to get people there to speak openly about why it's so hard when it is solved in so many other products.
Some existing tooling:
Google cloud search has a confluence connector https://developers.google.com/cloud-search/docs/connector-di...
Elastic workplace search has a connector. https://www.elastic.co/guide/en/workplace-search/current/wor...
Lessonly has / had a thing called Obie https://www.lessonly.com/blog/how-to-search-better-in-conflu...
Raytion https://www.raytion.com/connectors/raytion-confluence-connec...
I think neural-network powered search will be the long-term solution for Wiki search specifically, and SaaS search more generally.
Keyword has too many failure cases, and works poorly when there's not a lot of data, or when searching through content authored by others.
I'll contact you offline. Would love to hear more about your experience in this area.
"Atlassian Tools" is on my list of automatic rejections for companies I'm thinking of working at for this reason.
This concludes, and fully encompasses, everything good that I have to say about Atlassian products.
It includes the language breakdown and such, making it much easier to know whether you should be blocking out 5 minutes or 5 hours to review something, and whether you should even be the one reviewing it, depending on the language.
Alas, Atlassian does not offer PVA for Bitbucket (it was meant to be there in July) so I cannot release it since it costs me money to host. I really wish they would invest more time into Bitbucket.
- Atlassian probably
The desktop people want the latest and greatest software ASAP if not sooner.
The server people want nothing to change, ever.
I'm sure enterprise software has similar rules and incentives.
Today I use one that they host and there is nothing wrong with it.
Virtually every page load took upwards of 5 seconds.
The organization of most teams' documentation is horrendous at my company. There are at least 3 different pages I have to go to for how-to articles and that's just within my current team's space. Not to mention there's limited information on those pages.
Documentation is an afterthought. We've also seen a lot of attrition this year. I'm the senior person on my team as a mid-level. I have one contractor whose term is up in a couple of months and one junior. They can't fill the 4 positions that have been open for 2-3 months.
What’s the prevailing wisdom these days on the best solution for an internal knowledge base/wiki platform?
My colleagues and I have been grumbling for ages that our instance of Confluence must be really badly configured. If you put in a single-word search term, there will be lots of results, but no guarantee that any pages containing that word in the title (or body) will appear above ones where it doesn't.
The search problem was solved long ago by Apache Solr/Lucene. Although this may not be true for multiple languages.
1. give pages labels. This lets you insert a label-based index, and also makes it possible to narrow search by label
2. use spaces. Separate the content into spaces based on who is likeliest to need that information. You can narrow search by space, and put a search box on the page in the space.
3. use the hierarchy. You have to put the pages somewhere in the hierarchy anyway, so try to make it reasonable.
4. Make useful index pages. Obviously, this doesn't scale, but if you can provide people with useful starting points, it will help them. For example, at Khan Academy we have a space for the whole org with a front page to get you to every team's front page. The engineering team has a front page with a small collection of useful & commonly-used links
5. if you have a page in your hierarchy with a lot of content underneath it, add a search box on that page that constrains the search to that set of pages.
The biggest problem Confluence search has is that it's terrible with relevance, and using its tools to narrow down the search can improve the relevance of the results considerably.
It does partial matches anywhere in a word, supports every language even within the same document, and even has regex support for those who need it. It updates instantly, with instant filters.
It can find things like 168.0 in 192.168.0.1, for example, which the existing Confluence search cannot. Or search for AKIA credentials with /AKIA[A-Z0-9]{16}/. I have heard people describe it as Algolia for Confluence, which makes me happy.
https://marketplace.atlassian.com/apps/1225034/better-instan...
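The partial-match behaviour described above amounts to substring and regex scans rather than word-level inverted-index lookups; a tiny illustration with invented documents (the AKIA key is fake):

```python
import re

docs = {
    "net-notes": "the gateway is at 192.168.0.1",
    "incident": "leaked key AKIAABCDEFGHIJKLMNOP",
}

# A substring scan finds "168.0" inside "192.168.0.1" -- a word-based
# index that tokenizes on whitespace and punctuation cannot.
hits = [name for name, text in docs.items() if "168.0" in text]
print(hits)  # ['net-notes']

# Regex scan for AWS-style access key IDs.
pattern = re.compile(r"AKIA[A-Z0-9]{16}")
regex_hits = [name for name, text in docs.items() if pattern.search(text)]
print(regex_hits)  # ['incident']
```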
As for why their search is so bad? It's probably due to how they apply permissions. Every permission needs to be applied per search, per user. That makes the system complex and hard to change, which makes it hard to improve. I imagine it's one of those parts of Confluence that is a major pain to work with.
I think a lot of this is also due to their cloud migration. With the self-hosted server version, you could store the index on disk. With cloud, they suddenly need to keep the index state somewhere persistent, while also wanting to scale up and down dynamically.
Lastly, they also apply stop words, stemming and such, using out-of-the-box Lucene. Lucene is a great tool, but it can also be a pain to work with. You can see problems when you mix languages on a page too, such as having Thai, Chinese and English on a single page, which confuses the Lucene tokeniser.
When choosing 3 years ago, we used the following criteria:
* WYSIWYG editor. Any user must have a minimum effort to write documentation
* Flexible access permissions to various parts of the documentation. Public documentation is open to anonymous users, the internal one is divided into many sections with access for certain groups
* Multilingual support. Not out of the box, but possible with plugins
* Multilingual pdf export. In some markets, some customers prefer to have exported manuals
* The ability to inherit articles. We need to be able to make edits once, instead of duplicating the same articles
* Have a relatively modern appearance. Wiki engines are familiar to many because the whole world uses Wikipedia, but this does not make them more pleasing to the eyes, if I can say so
3 years have passed, and I periodically look at alternatives. So far only wiki.js seems like a good solution, but it's not even close yet.
MediaWiki?
I use Confluence and Jira because, again, we use them at work. So I guess I'm using them because I have to. I also understand it's a pain to move our company from one to another (oh we've had discussions to move to Coda and others) but again, I'm not taking on that project. Again, UI/UX, search - all meh - they are working and I got used to it.
The inconvenience of using them doesn't justify the amount of time I'd need to spend to overcome it. Some things you just have to let slide.
I think most search engine designers want to make the index as broad as possible, but the problem seems to be that people rarely want such broad searches. What they really want are very detailed indices and metadata implications over well trodden folders.
Mutt, Pine, grep, awk, etc. I don't understand why throwing a GUI interface on top automatically seems to make email search absolutely awful, this includes Gmail. I so often need to find a specific old email using a hazy match criteria that I am half tempted to pipe my email into Splunk (I run a small Splunk cluster at home for other needs) and use it (as then I don't need a local copy of every email on all devices or to need to SSH into a central box to do a TUI based search)
Maybe this is something Google should take on: a search plugin for Confluence where Google's crawlers log in from time to time for internal crawling, enabling non-public search requests on that data. That would boost knowledge workers' efficiency a lot. I hope somebody from Google reads this and takes on the challenge. I'm sure companies would pay a lot for this.
https://workspace.google.com/products/cloud-search/
https://marketplace.atlassian.com/apps/1212945/google-cloud-...
If you want the best of both worlds, you can use the "Favorite Pages Macro" on any page to reference all of the pages that you have saved for later, which makes keeping that page in sync with your latest saved pages trivial.
For a start, it interprets multiple words in a query as an OR. You search for "hello world", you get "hello nobody" and "goodbye world" in the search results.
It also always applies stemming, which mangles technical terms. At Cloudflare we have a daemon called "cloudflared" and it's impossible to find it in the damn wiki.
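A toy suffix stripper (standing in for a real stemmer like Porter, not Confluence's actual implementation) shows how this happens: the daemon name and the product name collapse to the same index term, so a search for one matches every mention of the other.

```python
def naive_stem(token):
    # Toy suffix stripper for illustration only.
    for suffix in ("ing", "ed", "e", "s"):
        if token.endswith(suffix):
            return token[: -len(suffix)]
    return token

# "cloudflared" and "cloudflare" stem to the same term, so an index
# built over stems can no longer distinguish the daemon from the company.
print(naive_stem("cloudflared"))  # cloudflar
print(naive_stem("cloudflare"))   # cloudflar
```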
If it even tries to do any prioritization, it's indistinguishable from random. I search for a project's name and get fragments of meeting notes from 7 years ago, not the project's homepage.
And the UI is unusably awful too. The fancy ajaxy JS overlay breaks the Back button, so if you click on an irrelevant result (and all of them are irrelevant), pressing Back doesn't return you to the search results; it makes you lose the document you were on.
The people who make the decision to buy Confluence aren't the ones who have to use it.