Extracting all the data

Well, I stumbled upon the problem with the default 1000 entries being sent from database. I need to recover all of the data (our server is fully functioning for just about a week and we have only a few participants for testing purposes). I tried using this script:

participant_accel_tr = LAMP.SensorEvent.all_by_participant(participant_1, 
                                                           origin="lamp.accelerometer",
                                                           _from=1577735460618,
                                                           to=1577735460737)

I tried to use it in my script for downloading data in one category:

begin = datetime(2020,1,1).timestamp()
end = datetime(2021,2,2).timestamp()
events = LAMP.SensorEvent.all_by_participant(participant['id'], origin='lamp.accelerometer', 
                                                 _from=begin, to=end)['data']

But it does not seem to be working, I tried extending the time range, but I am only able to extract the last thousand each time. I looked up the timestamp info few times, but only problem might be the UTC conversion but I think that would just make the list start hour later/sooner etc…

Is there any other way how to extract all the SensorEvents from the database?

Some more details:
When I use the attributes _from, to with the timestamps I posted and count the events that come, it is zero entries for each patient. But it does not throw any error.

Today I tought about the possibility that my timestamp might be a different format. I tried googling a few things about UTC timestamps etc. Because I remembered how I had to translate datetime from UTC Timestamp.

After trying to normalize my datetime the method threw an error that it is recieving already normalized datetime, therefore I tried to convert it to UTC time with the following method:

def dt2ts(dt):
    return calendar.timegm(dt.utctimetuple())

Before you ask, yes it’s from stack.

The results were interesting:

begin = datetime(2021,1,1).timestamp()
end = datetime(2021,1,2).timestamp()
begin2 = dt2ts(datetime(2021,1,1))
end2 = dt2ts(datetime(2021,1,2))

print(begin)
print(begin2)
print(end)
print(end2)

This printed out the following outputs:
1609455600.0
1609459200
1609542000.0
1609545600
They are slightly different to their original non UTC counterparts, but do not look like they should have different time beginings (date from which they count).

And yeah it still made a query that did not retrieve any data… maybe I am blind and just can not see the solution…

@sakac We’re looking into this on our end since it’s quite peculiar! You can always download from CouchDB directly to validate things - check out the LAMP-server source code in the src/repository/ActivityEvent.ts file on what CouchDB query is made. (Also cc’ing @rhays from our team who may be able to help quicker with Python!)

1 Like

Well this is strange. Maybe the problem was on my side but it looks like the from and to attributes are somehow switched or something.

I still could not change the limit of events per query so I modified my script to ask for next 1000 events and I am changing only ‘to’ parameter to be always the same as the last timestamp in the retrieved data.

When and if I manage to retrieve the data successfully I will try to understand and report the problem that I encountered.

Ok so I was able to retrieve data with these steps:

  1. create timestamp of the ending (most recent event) I want to retrieve, lets call this timestamp end
  2. then sending this timestamp as a to parameter - to=end
events = LAMP.SensorEvent.all_by_participant(participant['id'],
                                             origin='lamp.gps',
                                             to=end)['data']
  1. after the 1000 most recent events are retrieved I checked the last event that I recieved,
    that means the least recent one
  2. I assigned timestamp of the least recent event as a new value of the end variable

    Rince and repeat until the number of events retrieved is 0

    I was not able to achieve similar outcome by using _from timestamp value, neither I was able to change the limit of retrieved events.

    It looks like whilts retrieving the last 1000 default events from db, they are being retrieved in “reverse” order. By that I mean, that the first event written into the cluster of retrieved events is the last to actually occur.

    If the goal of setting from-date - to-date is to retrieve larger number of entries, there might be problem with constant value of limit variable, and that your algorithm propably is really counting and writing the events starting from to value and not starting from from, which would also explain my failed attempts to do the same thing with retrieving 1000 entries by continually moving the starting from timestamp and "iterating’ through the data.

    I tried to look at the source code, but I was not able to understand it as a whole in the little time I had got. And for a quite some time after looking at Activity and Sensor Events in repository I became suspicious that the problem might be in the service part where the attributes from me are being recieved and the queries should be generated.

    In conclusion I think either mindLAMP server is creating “bad” queries (if and only if this is not by design ofcourse) or I just misunderstood the documentation again:

Query responses are limtied to 1000 entries. In the event that there are more than 1000 events for a given sensor, the most recent 1000 events are returned. To access more data—or to query during a specific time range—you must use the “_to” and “from” parameters

This is cited from the oficial documentation.

To access more data - or to query specific time range

The start of the sentence might left me expecting other functionalities embedded in the backend and also I was expecting that using _from parameter would set at least where the start of timeframe of the 1000 events will be.

There is a lot of things that might be wrong on my end and I do not want to criticize, just to help if anything, but there is one thing that I am 100% certain of and that is: There is a small typo at the end of the paragraph. :sweat_smile:

“_to” and “from” parameters

I really don’t know if I am only one who encountered this problem.

So I can safely say I had overcome my problem and understood how querying works. I am sorry if I wasted anyone’s time. And hopefully these posts will help to understand these queries to people like me.

1 Like

Thank you @sakac for elaborating so well on your inquiry! It definitely is not a waste of time – in fact, I think it’s really useful for other users (who almost certainly come across the same ambiguity) and for everyone on the admin/dev team.

To address your main points: (1) data is always returned from most to least recent, so the “_from” and “to” are, in a way, reversed; and (2) queries are always limited to a max of 1000 response objects.

Regarding (2), we as a team are considering how to best provide users the ability to query all of their data efficiently while also preventing unintentional massive queries that hammer the backend servers. This is an especially important consideration for sensor events, as due to their high resolution (5 readings/s for some), there can be gigabytes and gigabytes of data for just one sensor for one user. Your approach of iteratively setting the “to” param is a correct one, and is the approach that I use as well.

1 Like