Stack Exchange receives thousands of questions per day across all of their sites. Not all of these are high quality posts. Fortunately, users of the Stack Exchange network are given tools to help keep that low quality stuff to a minimum. One of these tools is the chat network that spans the Stack Exchange sites.
In the chat rooms, a convention has arisen to tag a message as cv-pls for questions that need to be closed for one reason or another. Over time, this evolved to include other tags such as:
- del-pls for a deletion request
- spam for notification that spam made it through the already impressive spam filters
- reopen for a reopen request
- a few others to cover specific flag types (eg. Not an answer, Very Low Quality or Offensive)
The problem with these is that the requests are only seen by users active in the specific room where it was posted. Other users across the network miss the request. Zephyr was built to resolve this problem. Zephyr monitors several rooms where these types of requests are frequent. These requests all all posted into a single chat room. This provides users with a single room to monitor to see requests for multiple questions and sites across the network.
Here is an example of what Zephyr's chat activity looks like during a spam wave:
How it works¶
Zephyr utilizes the ChatExchange package to join and read the chat rooms. To do this, Zephyr required a dedicated account. I decided to run Zephyr with a dedicated account to completely separate the bot that would sit and watch multiple chat rooms 24/7 from my account. Zephyr maintains a small SQLite database of all the posts that it records. The idea behind this, is that eventually this data will be utilized to train other systems on unwanted content. This information is pulled via the API.
Zephyr watches the chat rooms for specific string patterns. If these patterns are matched, a message is posted if
True for the matched pattern.
Overall, a nice simple application. It performs some pattern matching and a couple API calls.
In addition to watching user activity, Zephyr also watches two other quality bots that patrol Stack Exchange for low quality content: SmokeDetector and Phamhilator. If either of these bots detect spam, Zephyr takes note of the information by recording it to the database, but not reposting. Since both of those bots post their reports, it didn't make sense for Zephyr to add a second (or third, if both of the others detected spam) message to the chat room. The information is recorded, though, to help future training for other systems.
Updated May 8, 2015
Over time Zephyr has been updated to include new rooms to monitor or new patterns to match. Those changes are small (and simple). There are, however, a few larger changes that I'd like to note below.
The other bots that Zephyr monitors respond to user input. Zephyr has very little that requires user interaction since all of it's
posts are generated by user input. However, there have been times where I, as the bot owner, would like to be able to issue
certain commands to it. My most common desire is to see a report of how many spam posts Zephyr has seen. Thus, Zephyr now responds
to the command
spamreport from me. It then prints out a nice summary of information. This information has been utilized in
SmokeDetector to watch for commonly spammed domains.
Upgrade from SQLite to MariaDB¶
Zephyr was originally built against an SQLite database. This worked, but was getting slower as more data was being added. This slow down was beginning to affect performance. I started seeing this error more and more frequently:
Traceback (most recent call last): File "H:\python-virtualenvs\zephyr-se-voterequests\lib\site-packages\sqlalchemy\pool.py", line 255, in _close_connection self._dialect.do_close(connection) File "H:\python-virtualenvs\zephyr-se-voterequests\lib\site-packages\sqlalchemy\engine\default.py", line 418, in do_close dbapi_connection.close() ProgrammingError: SQLite objects created in a thread can only be used in that same thread.The object was created in thread id 4824 and this is thread id 4660
After spending a lot of time troubleshooting and not resolving it to my satisfaction, I decided to upgrade to a more robust database. I'd used MySQL/MariaDB before and I happened to have another application utilizing MariaDB at the moment so that is the solution I picked.
The first step was transferring data. I learned that there isn't a decent utility to do a straight migration. So, I took these steps to transfer the data:
- Export table structures and data from SQLite
- Convert the SQLite dump to MySQL format. Though both systems use SQL, there are slight differences in dialect. I utilized this Python script as a starting point. It got me most of the way there, but not completely.
- Data clean up. Ugh. The dreaded part of the job for anyone who handles data. Fortunately, the script above did most of the work. I ended up fixing a couple stray back ticks that didn't convert properly, escaping a very extra quotation marks, and replacing a few "smart quotes" (of both the left and right variety). I wish data at the office job was this easy to clean...
- Import into MariaDB
Since the transfer to MariaDB, I've noticed no performance degradation. The error about threads has been eliminated as well.
Upgrade to utilize web sockets¶
Originally, Zephyr used the
watch method when monitoring a room. This method would long poll the room. It turns out that this is
pretty unreliable. I'd get multiple errors through out the week, ranging from
Connection Aborted errors to random
404 messages. The
solution has been to switch to
watch_socket. The only time I've had problems since this switch is when the Stack Exchange
web sockets go down. This saves a lot of restarts to get everything up and running again.