kintamanimatt, there are definitely a few other things to look out for. One caveat is that my perspective comes mostly from an escalation standpoint, so I don't necessarily get a complete view of the ecosystem. Also, without knowing what sort of apps you develop, it's hard to be too specific, since some items are more relevant to VoIP than they are to web, etc.
1. Power States
On UMTS, if the radio is "idle", a small amount of data isn't enough to trigger a transition to a higher power state. These packets generally won't be lost, but they can take 2 seconds or more to transit the RF network. If a transition is required from the lowest power state, that alone can take 500-800ms. The important lesson is that a new TCP/IP connection could be very slow for the first little bit, until the phone and network realize there is more data passing.
On LTE, this is less of an issue, as there are only two power states: idle and active. As such, you always have to transition to the active power state to send any data, and the transitions have been significantly optimized. However, the transition can still take a few hundred ms.
This mostly becomes an issue when you want to do something in real time. We see lots of reports and have many investigations into things like call setup time on VoIP connections. From an app perspective, this can be pretty hard to account for, but with the move towards LTE it will keep getting better for you, since LTE is both faster and simpler.
2. Packet Loss
This point is hard to get across internally sometimes, but packet loss will occur. It's a radio network: there are interference sources, and there are situations where we can't deliver every packet within a guaranteed time. You'll also get temporary losses of connectivity (the user goes into a tunnel). It can get even weirder, for example when the uplink is working but the downlink has very high loss.
We have seen many DPI, IDS and firewall vendors have problems with this packet loss, which can cause inspection to fail or the connection to be delayed. One thing that often happens is that these vendors don't use a large enough reassembly buffer for a mobile network. With the higher latencies we get on wireless networks, there tends to be more in-flight data at any one time. On CDMA, when a retransmission was required, more than 100 packets of in-flight data could pass before the ACK/SACK indicated the packet was to be retransmitted.
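To put a rough number on "more in-flight data", you can estimate it as bandwidth times round-trip time. This is only a back-of-the-envelope sketch; the function name is mine, and the link figures in the note below are made-up inputs, not measurements from any real network.

```python
def in_flight_packets(bandwidth_bps: int, rtt_ms: int, packet_bytes: int) -> int:
    """Estimate how many packets are in flight on a link.

    Uses the bandwidth-delay product: bits/s * ms / 1000 gives bits in
    flight, and dividing by 8 gives bytes. Integer math avoids rounding
    surprises.
    """
    bdp_bytes = bandwidth_bps * rtt_ms // 8000
    # Ceiling division: a partially filled packet still counts as one.
    return -(-bdp_bytes // packet_bytes)
```

With hypothetical inputs of 8 Mbit/s, 200 ms RTT and 1000-byte packets, in_flight_packets(8_000_000, 200, 1000) comes out at 200 packets, so a reassembly buffer sized for a low-latency wired link would be too small.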
3. Initial Window
On the server side, it should be fine to set the initial window to 10 packets, as per the IETF draft: http://tools.ietf.org/html/draft-ietf-tcpm-initcwnd-00
I haven't had an opportunity to test this out, but my gut feeling is that this should actually help, as more pending traffic will help indicate what's happening from a power state standpoint. There are guides for many different OSes on how to change this setting, as it won't be applied by default.
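To see why a larger initial window matters on a high-latency link, here is an idealized slow-start model. It assumes the window doubles every round trip with no loss and no delayed ACKs, which real stacks won't match exactly; the function name is just for illustration.

```python
def round_trips(segments: int, initial_window: int) -> int:
    """Count round trips needed to send `segments` packets under an
    idealized slow start (window doubles each RTT, no loss)."""
    sent, cwnd, rtts = 0, initial_window, 0
    while sent < segments:
        sent += cwnd   # send a full window this round trip
        cwnd *= 2      # slow start: double the window for the next one
        rtts += 1
    return rtts
```

In this model a 100-segment response needs round_trips(100, 3) = 6 round trips with an initial window of 3, versus round_trips(100, 10) = 4 with an initial window of 10. When each round trip is a few hundred ms, that difference is noticeable.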
4. Exponential Backoff
This probably isn't so important at your level, but we've had issues, more at a device stack level, where if some piece of the network is lost, we get two problems: fixing the piece of the network, and fixing the "mass calling event", which is all the devices trying to reconnect to that resource at the same time. This is mostly out of your control, but perhaps keep it in mind for your own server's benefit. If your server goes down, don't have all your clients go nuts trying to re-establish connections as fast as they possibly can.
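A minimal sketch of what a client-side reconnect delay could look like. The random jitter is my addition (it keeps clients from retrying in lockstep), and the base and cap values are arbitrary assumptions you'd tune for your own service.

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 300.0,
                  rng=random) -> float:
    """Seconds to wait before reconnect attempt number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`; the actual delay is
    picked uniformly below the ceiling so clients spread out over time.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0, ceiling)
```

A client would call backoff_delay(0), backoff_delay(1), and so on between failed attempts, and reset the counter after a successful connection.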
5. SMS
Just in case you use SMS in any way, perhaps as part of push: one property of SMS is that if it's not successfully delivered on the first try and gets queued for redelivery, it could be delivered several hours later. Also, at least on our network with the newest technology, we've had to look into some bugs where the same SMS was delivered multiple times. One property of SMS you can take advantage of is that if the device is out of coverage (powered off, tunnel, etc.), the pending messages should be delivered when it comes back into coverage.
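If your app acts on SMS, it's worth guarding against both late and duplicate delivery. This sketch assumes you put your own message ID and send timestamp into the SMS payload; those fields, the class name and the one-hour limit are all hypothetical choices, not something the network gives you.

```python
from typing import Optional, Set
import time

class SmsInbox:
    """Drops SMS triggers that are duplicates or too old to act on."""

    def __init__(self, max_age_s: float = 3600.0) -> None:
        self.max_age_s = max_age_s
        # Grows without bound in this sketch; a real app would expire entries.
        self._seen: Set[str] = set()

    def accept(self, message_id: str, sent_at: float,
               now: Optional[float] = None) -> bool:
        """Return True only for a first-time, still-fresh message."""
        if now is None:
            now = time.time()
        if message_id in self._seen:
            return False              # same SMS delivered again
        self._seen.add(message_id)
        if now - sent_at > self.max_age_s:
            return False              # redelivered hours later; too stale
        return True
```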
6. Reject Codes
HTTP and many other services have the concept of transient failures and permanent failures. Make sure not to retry on permanent failures. I got at this in my previous topic, but we've seen more than once where something like a large email gets stuck in an outbox because the mail relay has a maximum size limit. However, the rejection occurs after the upload, so the device sits there constantly uploading the same email over and over again. I'd even be careful with transient failures: if a transient failure persists for more than 3 or 4 tries, or an hour, give up.
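Roughly what I mean, as a sketch for HTTP. Treating 408, 429 and all 5xx as transient and everything else outside 2xx as not retryable is a simplification I'm assuming here; the function names, the fixed 30-second delay and the `send` callable (which returns a status code) are illustrative only.

```python
import time
from typing import Callable

def classify(status: int) -> str:
    """Map an HTTP status code to 'ok', 'transient' or 'permanent'."""
    if 200 <= status < 300:
        return "ok"
    if status in (408, 429) or 500 <= status < 600:
        return "transient"
    return "permanent"   # e.g. 413 Payload Too Large: retrying won't help

def send_with_limits(send: Callable[[], int], max_tries: int = 4,
                     max_elapsed_s: float = 3600.0, delay_s: float = 30.0,
                     sleep: Callable[[float], None] = time.sleep,
                     clock: Callable[[], float] = time.monotonic) -> str:
    """Retry transient failures only, giving up after a few tries or an hour."""
    start = clock()
    for attempt in range(1, max_tries + 1):
        outcome = classify(send())
        if outcome != "transient":
            return outcome            # success, or a failure not worth retrying
        out_of_time = clock() - start + delay_s > max_elapsed_s
        if attempt == max_tries or out_of_time:
            break
        sleep(delay_s)
    return "gave_up"
```

In the oversized email case, the first 413 would come back as "permanent" and the upload would never be repeated.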
7. Compression
If you can, you may as well use compression to the best of your ability, to keep resources smaller and transfers faster. Even though the network is faster, it'll still deliver a smaller payload faster than a bigger payload.
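As a small example, here is one way to gzip a response body, only keeping the compressed form when it is actually smaller (tiny payloads can grow). The function name and the returned flag are my own invention for illustration; how you signal compression to the client depends on your protocol.

```python
import gzip
from typing import Tuple

def maybe_compress(body: bytes, level: int = 6) -> Tuple[bytes, bool]:
    """Return (payload, compressed_flag), compressing only when it helps."""
    packed = gzip.compress(body, compresslevel=level)
    if len(packed) < len(body):
        return packed, True
    return body, False   # gzip overhead outweighed the savings
```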
8. Packet Size
This one is often missed, but can be hugely important, and I almost forgot to mention it. In a mobile network, you have IP traffic that is destined for the mobile. However, the mobile's position and pathway aren't fixed, so our network has to track the user and update the path. The way this is done is that network equipment encapsulates the IP traffic with additional IP headers for delivery within our network. However, our network has the same limitations on maximum packet size. So when we do our encapsulation, we have to chop the packet into pieces and then put it back together on the air interface. For a while, we also used to just IP fragment your packet into smaller pieces, but unfortunately this is also problematic, as devices don't necessarily do the best job of joining the fragments either, especially if they get delivered out of order.
So on my network, we use something called MSS clamping to limit the maximum size of TCP payload data in a single frame. We do this so that when we encapsulate the packet, it will fit into one packet with our headers. However, MSS is a negotiation that only happens in TCP, so it cannot happen for UDP traffic. I also know for a fact that not all carriers do this, and I have talked to one or two developers about why their app works on our network and not on others. This is something you can adjust server side, to be consistent across carriers. On the public interface of the server, I would recommend something like an MSS of 1350 bytes, which leaves enough room for our internal headers and IPsec without requiring the packet to be chopped and then reassembled by the network.
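If you'd rather do this per socket than at the interface or firewall level, something like the following sketch may work. It relies on the TCP_MAXSEG socket option, which isn't exposed on every platform and whose exact behaviour varies by OS, so the helper (name is mine) reports whether the option was accepted rather than assuming it was.

```python
import socket

def clamp_mss(sock: socket.socket, mss: int = 1350) -> bool:
    """Try to cap the TCP maximum segment size on `sock`.

    Returns True if the OS accepted the option, False if TCP_MAXSEG is
    unavailable or the call was rejected.
    """
    opt = getattr(socket, "TCP_MAXSEG", None)
    if opt is None:
        return False   # platform does not expose the option
    try:
        sock.setsockopt(socket.IPPROTO_TCP, opt, mss)
    except OSError:
        return False
    return True
```

I'd set it on the listening socket before calling listen() and then check with a packet capture that the SYN/ACK really advertises the smaller MSS, since I'm assuming, not guaranteeing, that accepted connections pick it up on your OS.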
As for lobbying the higher-ups, I'll see what I can do. I think it's a great idea, and carriers worldwide aren't doing enough in this space. However, this is a really tough one: at a large company, communications and branding are tightly controlled, and it doesn't help to do all this work and have no one read it anyway, because app stores today seem to be about volume, not quality. Really, what I'm seeing on my side of the network (the Packet Core) is that a lot of the technology development is aimed at making the network more flexible and robust towards the way it is used, rather than attempting to control the ecosystem. What happens more and more is that we get just straight PCs connecting to our network, and we want to offer a superior experience in this space. Also, the most efficient devices are losing: BlackBerry used to be an order of magnitude more efficient than everybody else, but that advantage has largely been eroded by faster networks and by competing devices that, although less efficient, deliver a superior experience.
However, a little bit of searching did turn up some initiatives. They're not so much best practices, but in Canada the major carriers put together consistent API access for location, SMS, and billing.
http://canada.oneapi.gsmworld.com/
Also, it looks like ATT has some information, but it looks like it's more geared at enterprise, and most of the content is locked:
http://developer.att.com/developer/forward.jsp?passedItemId=...
Hopefully we will continue to see more partnerships among carriers, more standardization from the 3GPP and GSMA, and better APIs to make the ecosystem more mature, which will be better for everyone. I'm also hoping to see more work on SCTP or multipath TCP, so we can start to see connection-level handoffs between different access technologies, e.g. wifi offload when you're at home. There are some technologies for call continuity from your wifi today, but they're amazingly complicated and only work in very specific scenarios.
--
* my views in this post are my own, and do not reflect those of my employer.